Wasm is the new CGI | Roborooter.com

I read a wonderful Twitter thread about CGI and the birth of the web, and it triggered a thought I've been kicking around.

Wasm is the new CGI

And to be clear, I don't mean the Common Gateway Interface as a protocol. I mean what CGI and the cgi-bin application model brought to the web. They allowed people to easily write code that makes websites interactive. This shifted the web from an archive of documents to a vast network of applications. It was the first "web application model". I think Wasm (WebAssembly) is set up to bring the next "web application model" to the industry.

Every shift in common web application models that I've observed over the last 20 years has been towards one goal: high-performance applications that are easier to build and maintain. I almost titled this essay "Wasm is the new serverless", but "serverless" is just the current hot iteration of web application models and not the goal.

I'm going to give a history of web application models and then I'm going to talk about what Wasm brings to the table. I'm seeing a convergence of Wasm related technologies maturing that I think will change some of the fundamental constraints of web application development today. When you change the constraints in a system you enable things that were impossible before.

From CGI to Serverless

There was a time (and honestly it's still true today) when you could throw a script or any executable in a folder named cgi-bin and visit it at a URL. The web server would execute the program on demand and return the output of the program to your web browser. This is roughly how the dynamic web worked for its first decade. Make a request, run a process to respond to the request. It was slow by today's standards, as starting a new process (and in some cases parsing scripts too) is a lot of work.
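To make the model concrete, here is a minimal sketch of a cgi-bin script in Python. It assumes the web server passes request details through environment variables such as REQUEST_METHOD and QUERY_STRING, as the CGI convention does, and relays whatever the program prints. The function name `render_response` is made up for illustration.

```python
import os
import sys
from urllib.parse import parse_qs


def render_response(environ):
    """Build a full CGI response (headers, blank line, body) from the request environment."""
    method = environ.get("REQUEST_METHOD", "GET")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    body = f"Hello, {name}! You sent a {method} request.\n"
    # A CGI program writes headers, then a blank line, then the body.
    return "Content-Type: text/plain\r\n\r\n" + body


if __name__ == "__main__":
    # The server starts one fresh process per request and relays whatever it prints.
    sys.stdout.write(render_response(os.environ))
```

Every request pays for interpreter startup and script parsing before the first line of `render_response` runs, which is the cost the later models try to remove.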

The official CGI logo from the spec announcement, yes really

FastCGI was developed as a response to the performance problems. It was a new web application model where long-lived processes would listen for CGI requests. Your web server would talk to one or more of these processes instead of invoking a new one for every request. It had its own follies, as a lot of applications were naively ported from the previous model and would leak resources like a sieve. They were built to have their processes terminated after each request and now had to stay alive for a long time. This led to a long period of transition where languages and frameworks adjusted to the new reality.

FastCGI

During this time, language-based web servers emerged as a convention. (Possibly inspired by the Apache Tomcat era of Java application servers.) These applications would be built around a request/response model, and have different web servers available for different execution strategies. They were usually designed to go behind a "battle-tested", feature-rich web server like Apache or nginx that would isolate the application from slow requests and HTTP details.

The general application models were centered around process management with regard to requests. Some of these servers would fork the process on every request, some would use OS or language-based threads, and others would use an event or reactor model.

The WSGI specification from the Python community (the interface underneath frameworks like Flask) inspired the Rack web server interface in the Ruby community. To simplify the specifications: you receive an HTTP method, a headers hashmap, and a string or stream of input bytes. In response you send a status code, a headers object, and a string or stream of response bytes.
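A minimal sketch of that request/response shape, using the WSGI calling convention in Python. The `call_app` helper is hypothetical and exists only so the application can be exercised without a real server.

```python
def app(environ, start_response):
    """A WSGI application: request data in, status + headers + body out."""
    method = environ["REQUEST_METHOD"]
    path = environ.get("PATH_INFO", "/")
    body = f"{method} {path}\n".encode("utf-8")
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings


def call_app(application, method, path):
    """Hypothetical helper: invoke a WSGI app directly, without a server."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    chunks = application({"REQUEST_METHOD": method, "PATH_INFO": path}, start_response)
    return captured["status"], captured["headers"], b"".join(chunks)
```

Because the application is just a callable, the same `app` can be handed to any WSGI server (the standard library's `wsgiref.simple_server` is one) regardless of whether that server forks, threads, or runs an event loop.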

With all these approaches you have to manage the number of physical or virtual servers you run. With physical servers, application servers would often get slow when traffic was high, or you would have lots of computers sitting idle when traffic was low. With the rise of cloud computing, autoscaling allowed the number of application servers to be based upon CPU or memory load or even the time of day. This allowed you to adapt to changes in traffic over time, turning on machines when you needed them and off when you didn't. Scaling up new computers can take 2-20 minutes depending on many factors of your application and configuration. Additionally, it's dependent on the resources available in your cloud hosting region.

The introduction of "Serverless compute" with AWS Lambda changed the game. Instead of managing servers you now managed "functions". When paired with API Gateway you now had a web server that would guarantee a single process, isolated CPU and isolated memory for every request. Processes might be reused for up to a couple of hours but would be suspended or destroyed when not in use. This approach removes the concept of servers from application management and allows AWS to scale up and down based upon request volume in seconds.

Amazon started the serverless age of compute with Lambda

This of course has its tradeoffs. New processes, as we know from CGI, are expensive, which leads to a "cold start" that some requests observe while scaling up concurrent requests. Each platform has different strategies to mitigate this penalty, with varying success. Additionally, since processes can be suspended after a response, maintaining persistent TCP connections between requests can be troublesome. In practice you can tune database or cache server connections to stay alive (high timeouts on the server, low timeouts on the client), but you need to be more tolerant of reconnecting to external services, or simply reconnect for every request. HTTP-based database APIs (see Azure or DynamoDB) are popular in the serverless application model as they tolerate massive numbers of connections and are easier to scale up and down with functions than traditional RDBMS solutions.

Another tradeoff is the dedicated CPU and memory that you get with each request. Some workloads thrive in this model: you don't have to manage these resources if they're guaranteed, and you might lower costs and mitigate scaling concerns. Other workloads perform horribly in this model. The CPU and memory go to waste when a single process could have served a considerable number of requests, or when the shared memory of a server model would allow for batching or caching that makes processing significantly more performant.

Anecdotally, I've both brought costs down 90% by moving a CMS based web application to a serverless model, and reduced costs 90% moving an event analytics based service to a server based model.

There are many variations of "Serverless", such as Google Cloud Run or Google Cloud Functions, which let you have a single process take any number of concurrent requests. They model this with a Docker container but the tradeoffs are mostly the same. (Lambda these days has Docker support too, but traditionally supported a zip file with an executable or script.)

Lastly, the Rack and WSGI specifications heavily influenced the request/response models we see in serverless environments today, with the request and response tuples becoming function or API signatures. Initially most function services did not provide streaming requests or responses (not to mention WebSockets or Server-Sent Events), but this has started to change and is worth exploring as application frameworks are starting to demand it.
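As a rough illustration of how that tuple becomes a function signature, here is a minimal sketch of a Lambda-style handler in Python. It assumes an event shaped like an API Gateway proxy request (`httpMethod`, `headers`, `body`) and a response dict with `statusCode`, `headers`, and `body`. Other platforms and payload versions use different field names.

```python
import json


def handler(event, context):
    """Function-as-a-service entry point: one event in, one response dict out."""
    method = event.get("httpMethod", "GET")
    headers = event.get("headers") or {}
    payload = json.loads(event["body"]) if event.get("body") else {}
    greeting = {
        "method": method,
        "hello": payload.get("name", "world"),
        "agent": headers.get("User-Agent", "unknown"),
    }
    # Status code, headers, body: the same three parts Rack and WSGI return.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(greeting),
    }
```

Note that the whole request body arrives up front and the whole response is returned at once, which is why streaming had to be added to these platforms separately.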

Wasm on the Server

With that context, you might ask:

Why on earth are we talking about Wasm? Isn't it for the browser?

And I really hope even my mention of that question becomes dated, but I still hear it quite often, so it's worth talking about. Wasm was initially developed to run high-performance code in the web browser. There's a history that traces to asm.js and other attempts to get code to run really fast in the browser. I'll let webassembly.org speak for itself.

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications. The Wasm stack machine is designed to be encoded in a size- and load-time-efficient binary format. WebAssembly aims to execute at native speed by taking advantage of common hardware capabilities available on a wide range of platforms

What we have today is the ability to compile many languages to Wasm instructions that can be run in both the browser and the server. While running CPU-intensive processes in the browser (like Doom) is valuable, the isolation and security model that the browser demanded is incredible for server-side applications. It's now possible to have a significantly lighter-weight form of isolation for untrusted code than a VM or Docker container.

Web Assembly Logo

Additionally, since V8-based serverless environments (like Node.js, Cloudflare Workers and Deno) are common, we already have some very mature Wasm execution capabilities thanks to the work in the browser. Wasm-native environments are few (Fastly, Shopify and Suborbital), but I think we'll see many more in the coming years thanks to advances in tooling.

If a Wasm module is a set of instructions for a virtual machine, then you need a virtual machine to execute these instructions. This comes in the form of "runtimes" that will take generic Wasm, compile it for your local architecture, and provide it an execution environment. Some of these environments look like the POSIX APIs that you'd find on any Linux system, and some will do nothing but provide specific functions from the "host" system and allow you to execute exported functions in the module itself. These runtimes are available via libraries in many languages and run in process (or threads or anywhere you like).

Regardless of your runtime, WebAssembly programs are organized into modules and the VM running the module is called a "host".

Modules are the unit of deployment, loading, and compilation. A module collects definitions for types, functions, tables, memories, and globals. In addition, it can declare imports and exports and provide initialization in the form of data and element segments, or a start function.
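To show what "a module collects definitions" looks like on disk, here is a small sketch in Python that walks the sections of a hand-assembled module exporting a single `add` function. The bytes follow the standard binary format (a magic number, a version, then a sequence of sections, each with an id and a LEB128-encoded size). The helper names are made up for illustration, and nothing here executes the module.

```python
# Hand-assembled module exporting add(i32, i32) -> i32.
ADD_MODULE = bytes([
    0x00, 0x61, 0x73, 0x6D,  # magic: "\0asm"
    0x01, 0x00, 0x00, 0x00,  # binary format version 1
    0x01, 0x07, 0x01, 0x60, 0x02, 0x7F, 0x7F, 0x01, 0x7F,  # type: (i32, i32) -> i32
    0x03, 0x02, 0x01, 0x00,  # function: one function using type 0
    0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,  # export: "add" = function 0
    0x0A, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6A, 0x0B,  # code: get 0, get 1, i32.add
])

SECTION_NAMES = {
    0: "custom", 1: "type", 2: "import", 3: "function", 4: "table", 5: "memory",
    6: "global", 7: "export", 8: "start", 9: "element", 10: "code", 11: "data",
    12: "data count",
}


def read_uleb128(data, pos):
    """Decode an unsigned LEB128 integer; return (value, next position)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def list_sections(module):
    """Return (section name, payload size in bytes) for each section in a module."""
    if module[:4] != b"\0asm":
        raise ValueError("not a WebAssembly binary")
    pos, sections = 8, []  # skip the magic number and version
    while pos < len(module):
        section_id = module[pos]
        size, pos = read_uleb128(module, pos + 1)
        sections.append((SECTION_NAMES.get(section_id, "unknown"), size))
        pos += size  # skip over the payload
    return sections
```

Calling `list_sections(ADD_MODULE)` reports the type, function, export, and code sections. This module has no imports, so a host could instantiate it without providing anything.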

"Memories" are represented as an uninterrupted contiguous array of bytes, which are allocated by the host at instantiation time, giving each guest module memory isolation. They function as the RAM for your virtual machine. You can provide them empty or pre-fill them with data segments. One of the many effects of how modules are isolated is that you can "pause" a module and save its memory as a data segment, a similar concept to a snapshot of a virtual machine. You can then start as many copies of the paused module as you like. (As I tell friends, it's like saving your game in an emulator.)

The snapshotted module has no extra startup time. The leading utility to perform this is called Wizer, which describes the process like so:

First we instantiate the input Wasm module with Wasmtime and run the initialization function. Then we record the Wasm instance's state: What are the values of its globals? What regions of memory are non-zero? Then we rewrite the Wasm binary by initializing its globals directly to their recorded state, and removing the module's old data segments and replacing them with data segments for each of the non-zero regions of memory we recorded.
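The idea can be sketched with a toy model in Python. This is not Wizer's API or implementation. It treats linear memory as a `bytearray`, and the names `expensive_init`, `snapshot`, and `instantiate` are made up for illustration. Globals are left out to keep it short.

```python
PAGE_SIZE = 65536  # one WebAssembly page of linear memory


def expensive_init(memory):
    """Stand-in for slow startup work (parsing config, warming caches)."""
    memory[16:21] = b"ready"
    memory[1024:1028] = (42).to_bytes(4, "little")


def snapshot(memory):
    """Record each non-zero region of memory as an (offset, bytes) data segment."""
    segments, start = [], None
    for offset, byte in enumerate(memory):
        if byte and start is None:
            start = offset  # a non-zero region begins
        elif not byte and start is not None:
            segments.append((start, bytes(memory[start:offset])))
            start = None  # the region ended
    if start is not None:
        segments.append((start, bytes(memory[start:])))
    return segments


def instantiate(segments, pages=1):
    """Create fresh memory pre-filled from data segments; no init code runs."""
    memory = bytearray(pages * PAGE_SIZE)
    for offset, data in segments:
        memory[offset:offset + len(data)] = data
    return memory
```

Run `expensive_init` once on a fresh `bytearray(PAGE_SIZE)`, take a `snapshot`, and every later `instantiate(segments)` call yields an independent copy that is already initialized.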

If we go back to thinking about our application server models, this allows us to have a fresh process without paying the startup costs of a new process. It essentially gives us CGI without the downsides of CGI. Or, in more recent terms, serverless without cold starts. This is how Wasm is the new CGI.

Tradeoffs of Wasm

Like any tech choice, it does have some tradeoffs.

Threads are not a native construct, which forces any blocking operation onto host methods. This could mean the host handles the bulk of IO operations, providing wrappers for reading and writing to files or network interfaces, pausing the module as convenient or providing callback handlers. It is possible to build a reactor model (e.g., Tokio, Node.js) or a blocking model for your application. Until the threads proposal lands, the constraint of not being able to have threads moves a significant amount of this design to your execution environment.

A process with two threads of execution, running on one processor from Wikipedia

Just in Time (JIT) compilation is not possible, as dynamic Wasm code generation is not allowed for security reasons. In fact, code itself is not addressable at runtime, which is required for traditional JIT compilers as they generate code to replace "hot paths" of an interpreted script. This means systems like V8 and CRuby (and many other scripting environments) which rely on a JIT compiler for performance aren't able to run in a Wasm VM or have to abandon their JIT. There are alternative approaches that borrow from JIT. For example, a "pre-JIT" build step that outputs an optimized runtime for your script has been proposed. But they are not in wide use yet.

Since Wasm runs in a VM, there is a simple interface between a module and its host: memory. As a result, moving data between a Wasm module and its host may require a copy. It is possible to share chunks of memory, but depending on how the runtime models memory this may or may not be possible or recommended. (Wasmtime in Rust, for example, uses a vector of bytes which may change out from under you.) As far as I can tell, with most runtimes you cannot have zero-copy communication between a module and IO operations. This means streaming data into and out of a Wasm module may be slower than doing it on the host.
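A toy model of that boundary in Python, using JSON as the wire format. `ToyModule` is a stand-in for a Wasm instance, not a real runtime API, and the fixed address `0` is an assumption made to keep the sketch short.

```python
import json


class ToyModule:
    """Stand-in for a Wasm instance: a linear memory plus one exported function."""

    def __init__(self, size=65536):
        self.memory = bytearray(size)

    def handle(self, ptr, length):
        """'Guest' code: read a JSON request at (ptr, length), write a JSON reply after it."""
        request = json.loads(bytes(self.memory[ptr:ptr + length]))
        reply = json.dumps({"greeting": f"hello {request['name']}"}).encode("utf-8")
        out_ptr = ptr + length
        self.memory[out_ptr:out_ptr + len(reply)] = reply
        return out_ptr, len(reply)


def call_with_json(module, payload):
    """Host side: copy the request in, call the export, copy the reply out."""
    data = json.dumps(payload).encode("utf-8")
    ptr = 0  # assumption: a real guest would export an allocator to pick this address
    module.memory[ptr:ptr + len(data)] = data  # copy 1: host -> guest memory
    out_ptr, out_len = module.handle(ptr, len(data))
    reply = bytes(module.memory[out_ptr:out_ptr + out_len])  # copy 2: guest -> host
    return json.loads(reply)
```

The only things that cross the boundary are integers (a pointer and a length). Everything else is encoded, copied into the guest's memory, and decoded again on the other side, which is where the overhead comes from.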

Wasm VMs do, however, provide a much higher level of control. Some runtimes can enforce CPU limits by counting CPU instructions (see the Wasmtime fuel concept), which is really cool. All of the VMs are able to limit memory and wall clock time. So if you're looking to control usage limits, it's rather trivial to do.
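The idea behind fuel can be sketched with a toy stack machine in Python that charges one unit per instruction. This is an illustration of instruction counting in general, not Wasmtime's API. The instruction set and the `run` function are invented for the example.

```python
class OutOfFuel(Exception):
    """Raised when a program exceeds its instruction budget."""


def run(program, fuel):
    """Execute a tiny stack-machine program, charging one unit of fuel per instruction.

    Returns (top of stack, fuel remaining).
    """
    stack, pc = [], 0
    while pc < len(program):
        if fuel == 0:
            raise OutOfFuel(f"stopped at instruction {pc}")
        fuel -= 1
        op, *args = program[pc]
        pc += 1
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "jump":
            pc = args[0]  # unconditional jump, enough to write an infinite loop
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack[-1], fuel
```

A program like `[("jump", 0)]` would loop forever, but under a fuel budget it raises `OutOfFuel` instead. Because the host controls the interpreter loop, the guest has no way to opt out of the accounting.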

Upcoming Wasm features such as Interface Types (usable today but not a ratified standard) and module linking (functioning prototypes but no standard approach) help reduce module size and improve IO speed and ergonomics, but neither is standard yet. Wizer currently can interfere with module linking, and while there are custom approaches to solving that problem, there isn't a clear winner. Interface Types provide language-agnostic objects that can move through the Wasm memory boundary without expensive encoding and decoding. Today a common approach is to copy JSON into and out of memory.

I've hand-waved over security, but by default Wasm modules only have access to what they're given. It's generally safe to run untrusted Wasm code and the surface area of the VM is quite small. This is not the case with Docker or other isolation models. Timing attacks are possible since the VM will be running translated instructions on the host's hardware, but there are mitigations for such attacks. It is also possible to compile Wasm to a native binary. The attack surface there is much higher, but it is still safer than running untrusted native code. (I'm not up on the tradeoffs of this approach.)

The Future

Already we're seeing the rise of Wasm execution environments. As they (and their dev tools) get more popular, they will drive scripting languages to have Wasm runtimes and "Wizer-like" preboots. In theory your application could be faster to run as a snapshotted copy that resumes on each request than with other models available today. Even on our own computers we could, in theory, take a CLI written in Ruby and ship it in a snapshotted Wasm module that links to a Wasm Ruby runtime, have it start up nearly as fast as a C++ utility, and ensure it only ever operates within the confines of a project directory.

Right now I'm seeing enhancements to our existing models and new platforms that are experimenting with what's possible.

If we go way back this was your application server. The Jacquard machine (from Wikipedia)

The first major enhancement I'm observing is moving "functions" to the "edge", allowing for compute near your users instead of near your database. Edge functions are a new primitive that application frameworks can take advantage of. (Vercel Edge Functions being the one I'm working on, which is V8-based but shares many principles.) For example, next-auth can control access to pre-rendered pages based upon JWT login tokens. You're able to get dynamic personalized content that is composed of data cached at the CDN.

Another enhancement to existing models (and maybe a natural experiment of a new model) is replacing process-based functions with Wasm-based functions for serverless applications. Suborbital (which allows you to build your own Wasm-based function execution environment) handles function execution but also allows for innovative chaining of Wasm functions into workflows. Most execution platforms seem to encourage a single module per request, ignoring the ability to share memory or quickly invoke multiple modules (even from different languages!). As interface types become standard (very soon, as they're part of the Wasm 2.0 spec) I expect to see a "middleware" data model (similar to Rack?) emerge.

It's worth noting an interesting confluence of technologies is that you are able to execute Wasm inside a Lambda function. You're able to use a snapshotted version of your app on existing infrastructure. While it is an interesting mix of technologies that shouldn't be ignored, I imagine it will disappear eventually.

I don't know what the next web application model looks like, but I think we're at a tipping point. Wasm improves performance, makes process-level security much easier, and lowers the cost of building and executing serverless functions. It can run almost any language, and with module linking and interface types it lowers the latency between functions incredibly. When you change the constraints in a system you enable things that were impossible before. All this is very exciting to me and I'm quite eager to help us find out where it goes.