Let’s talk about performance
I want to discuss some of the performance trade-offs that come with different approaches, but the moral of the story is that performance is complex and very dependent on the use case. One of the main trade-offs is CPU versus memory, and the memory side of that trade-off can be complex and confusing.
One of the most rewarding (for me at least) parts of being a software engineer is mentoring junior developers and helping them learn new concepts and the wider implications of technical decisions. It’s also fun to create a learning environment, occasionally letting the cocky developer fall on his face in the dirt, a sort of “paying it forward” for the times when I was the young, cocky developer.
A perfect example is when a green developer challenges your recommendation (in reality, as a senior developer, you always make the wrong choice in the eyes of others) and bets that his approach is the better one. I don’t know everything, but I’ve been in the business long enough to spot a sucker bet. How could I resist? I accepted the bet. And years later, I’m writing a post about it.
I honestly don’t remember the specifics (a few years have passed), but I do remember recommending Node.js, primarily because of the existing development team’s knowledge base, the available libraries, and other technical needs. The junior developers wanted to show off the “mad” skills of freshly minted computer science graduates. Maybe they knew I wasn’t very good at computer science and assumed that I just didn’t have a clue how computers actually work (to be honest, after ~20 years I’ve come to the conclusion that it’s just magic).
And as a result…
Wait a minute, but how?
If you can’t guess why, don’t worry. In my experience, most developers don’t know why either. The result contradicts the general rule that compiled languages are faster than interpreted ones, and that static languages are faster than dynamic ones. But this is just a rule of thumb.
“Optimized” is the key word in my answer above, since a naive C++ program can quickly go off the rails. On the other hand, Node.js (using V8 and C++/C based libuv) has made great strides in optimizing stupid JS to run fast, meaning there are cases where naive JS can outperform naive C++. But it is clearly more difficult.
Ah yes, memory…
Most developers should be familiar with the ideas of stacks and heaps, but many don’t delve beyond superficial characteristics such as the stack being linear and the heap being reached through pointers (or something like that). They have also probably overlooked that these are just concepts (and there are other approaches) with multiple implementations. Low-level hardware usually doesn’t know what the hell a heap is, because software determines how memory is managed*, and the choice made can have a huge impact on the performance characteristics of the final program.
* There is a whole rabbit hole here that you can (and perhaps should) go down. Kernels can be complex, and modern hardware is far from stupid: it often includes special-purpose optimizations that take advantage of higher-level memory layouts. This may mean that software can (or is forced to) delegate some memory management functions to the hardware. And that is before we even get to virtualization…
Today: Enter Rust
Rust is one of my favorite languages today. It has a lot of great modern features, it’s fast, and it has a great memory model, which allows you to write generally safe code. Of course it has flaws: compile time is still an issue and there are weird semantics here and there. Overall, though, I highly recommend it.
One of the projects I’m currently working on is a FaaS (Function as a Service) host written in Rust that runs WASM (WebAssembly) functions. It is designed to run sandboxed functions very quickly while minimizing the overhead of using FaaS. And it’s pretty fast, capable of handling 90K net requests per second per core. What’s more, it can do so with a total memory footprint of ~20MB.
What does this have to do with Node.js and C++? Well, I use Node.js as a benchmark for “reasonable” performance (Go is the “dream” target; it’s hard to compete with a language designed for web services once you add the performance overhead of FaaS), and early versions of the program weren’t promising (although they used less than 10% of the memory Node.js did).
However, the bottleneck was pretty obvious right from the start: memory management. A block of memory was allocated for each function, but there was a lot of overhead both from allocating inside the function and from copying data back and forth between function memory and host memory. With data moving dynamically in both directions, the allocator was being hammered from all sides. Solution: cheat (well, sort of).
I love heaps, I’ll take two (or three)!
Essentially, a heap is just some memory for which an allocator controls the mapping. The program requests N units of memory, and the allocator finds them in its pool of available memory (or asks the host for more), records that those units are in use, and returns a pointer to their location. When the program is done with that memory, it tells the allocator, which then updates its map to mark those units as available again. Simple, right?
Problems start to arise when you allocate many units of memory of different sizes with different lifetimes. You end up with a lot of fragmentation, which increases the cost of allocating new memory. This is where you start to notice the performance hit, because the allocator is essentially its own program whose only job is figuring out where to store things. There is no single solution to this problem; there are many different allocation algorithms, from the buddy system to slabs and blocks. Each approach has trade-offs, meaning you can choose the one that works best for your use case (or just go with the default, as most people do).
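To make the bookkeeping and the fragmentation problem concrete, here is a minimal sketch of a first-fit allocator over a fixed pool. It is not the allocator from my project; `ToyAllocator` and its methods are hypothetical names made up for this illustration, and it tracks offsets in a `Vec<bool>` map rather than handing out real pointers.

```rust
/// A toy allocator over a fixed pool of units (illustrative only).
/// `used[i]` is the allocator's map: true means unit `i` is handed out.
pub struct ToyAllocator {
    used: Vec<bool>,
}

impl ToyAllocator {
    pub fn new(units: usize) -> Self {
        Self { used: vec![false; units] }
    }

    /// First-fit: scan for `n` contiguous free units, mark them as used,
    /// and return the offset of the first one.
    pub fn alloc(&mut self, n: usize) -> Option<usize> {
        if n == 0 || n > self.used.len() {
            return None;
        }
        let mut run = 0; // length of the current streak of free units
        for i in 0..self.used.len() {
            run = if self.used[i] { 0 } else { run + 1 };
            if run == n {
                let start = i + 1 - n;
                self.used[start..=i].fill(true);
                return Some(start);
            }
        }
        None // enough free units may exist, but not next to each other
    }

    /// Mark `n` units starting at `offset` as available again.
    /// Panics if the range lies outside the pool.
    pub fn free(&mut self, offset: usize, n: usize) {
        self.used[offset..offset + n].fill(false);
    }
}
```

With a pool of 8 units, allocating 3 + 3, freeing the first block, and then allocating 2 leaves three free units in total, but a request for 3 fails because they are not contiguous. The linear scan is also exactly the kind of "program within a program" cost mentioned above.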
Now for the cheat: you don’t have to choose only one approach. For FaaS, you can choose not to free individual allocations during a run and simply wipe the entire heap after each run. You can also use different allocators for different parts of a function’s lifecycle, e.g. initialization vs. execution. This allows you to support either a pure function (resetting to the same memory state on each run) or a stateful function (keeping state between runs) and optimize each case with a different memory strategy.
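A rough sketch of the "wipe everything after each run" idea, assuming a simple arena in which initialization data sits below a watermark and per-run data sits above it. `RunArena`, `finish_init`, and `reset_run` are hypothetical names for this post, not the project’s actual API.

```rust
/// An arena with two lifecycle phases (illustrative only).
/// Offsets below `init_end` belong to initialization and survive runs;
/// everything above it is per-run scratch space.
pub struct RunArena {
    memory: Vec<u8>,
    init_end: usize,
    next: usize,
}

impl RunArena {
    pub fn new(capacity: usize) -> Self {
        Self { memory: vec![0; capacity], init_end: 0, next: 0 }
    }

    /// Bump allocation: hand out the next `size` bytes, never free them
    /// individually.
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let end = self.next.checked_add(size)?;
        if end > self.memory.len() {
            return None;
        }
        let offset = self.next;
        self.next = end;
        Some(offset)
    }

    /// Call once initialization is done: freeze everything allocated so far.
    pub fn finish_init(&mut self) {
        self.init_end = self.next;
    }

    /// Call after every run: drop all per-run allocations in one step.
    pub fn reset_run(&mut self) {
        self.next = self.init_end;
    }
}
```

Resetting is a single assignment, regardless of how many allocations the run made. A truly pure function would also need the bytes below the watermark restored to their post-initialization contents, which this sketch leaves out.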
For my FaaS project, we created a dynamic allocator that chooses an allocation algorithm based on usage, and that choice persists between runs. For “low-usage” functions (apparently most functions at the moment), a naive stack allocator is used, which simply maintains one pointer to the next free slot. When dealloc is called, if the block is the last one on the stack, the allocator simply rolls the pointer back; otherwise the call is a no-op. When the function finishes, the pointer is reset to 0 (much like a Node.js process exiting before garbage collection ever runs). If the function reaches a certain number of failed deallocations and a certain usage threshold, a different allocation algorithm is used for the rest of the calls. The result is very fast memory allocation in most cases.
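Here is a minimal sketch of that stack allocator, under a few assumptions of mine: the thresholds are constructor parameters, the failed-deallocation counter resets after each run, and the "different algorithm" is represented only by a `Strategy` flag rather than a real second allocator. All names (`StackAllocator`, `Strategy`, `finish_run`) are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Stack,   // naive bump/stack allocation
    General, // placeholder for "some other algorithm" (not implemented here)
}

pub struct StackAllocator {
    capacity: usize,
    next: usize,                // the single pointer to the next free slot
    failed_deallocs: usize,     // deallocs that could not roll the pointer back
    max_failed_deallocs: usize, // assumed threshold
    usage_threshold: usize,     // assumed threshold, in bytes
    strategy: Strategy,
}

impl StackAllocator {
    pub fn new(capacity: usize, max_failed_deallocs: usize, usage_threshold: usize) -> Self {
        Self {
            capacity,
            next: 0,
            failed_deallocs: 0,
            max_failed_deallocs,
            usage_threshold,
            strategy: Strategy::Stack,
        }
    }

    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let end = self.next.checked_add(size)?;
        if end > self.capacity {
            return None;
        }
        let offset = self.next;
        self.next = end;
        Some(offset)
    }

    pub fn dealloc(&mut self, offset: usize, size: usize) {
        if offset.checked_add(size) == Some(self.next) {
            // The block is on top of the stack: roll the pointer back.
            self.next = offset;
        } else {
            // Otherwise a no-op, but remember that we wasted the space.
            self.failed_deallocs += 1;
            if self.failed_deallocs >= self.max_failed_deallocs
                && self.next >= self.usage_threshold
            {
                self.strategy = Strategy::General;
            }
        }
    }

    /// End of a run: reset the pointer to 0. The chosen strategy persists.
    pub fn finish_run(&mut self) {
        self.next = 0;
        self.failed_deallocs = 0;
    }

    pub fn strategy(&self) -> Strategy {
        self.strategy
    }
}
```

Both `alloc` and `dealloc` are a handful of integer operations with no searching, which is where the speed comes from. The price is that out-of-order frees leak space until the run ends, and that is what the thresholds are there to detect.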
There is also another “heap” in play at runtime: the memory shared between the host and the function. It uses the same dynamic allocation strategy and lets the host write directly into the function’s memory, skipping the copy step that earlier versions needed. This means that I/O is copied directly from the kernel to the guest function, bypassing the host runtime and greatly improving throughput.
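As a very simplified illustration of skipping the intermediate copy, the sketch below reads from any `std::io::Read` source straight into a slot inside a byte slice standing in for the guest’s memory. The function name `read_into_guest` and the idea that the slot’s offset and length come from the shared allocator are my assumptions for this example; a real WASM host would go through its runtime’s memory API.

```rust
use std::io::{self, Read};

/// Read from `src` directly into `guest_memory[offset..offset + len]`,
/// with no host-side staging buffer. Returns the number of bytes read.
pub fn read_into_guest<R: Read>(
    src: &mut R,
    guest_memory: &mut [u8],
    offset: usize,
    len: usize,
) -> io::Result<usize> {
    // Validate the slot so a bad offset cannot touch memory outside the guest.
    let end = offset
        .checked_add(len)
        .filter(|&e| e <= guest_memory.len())
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "slot outside guest memory"))?;
    src.read(&mut guest_memory[offset..end])
}
```

When `src` is a socket or a file, the `read` call is what asks the kernel to fill the buffer, so the data lands in the guest’s slot without first passing through a separate host buffer.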
Node.js vs Rust
After optimization, the Rust FaaS runtime is > 70% faster while using > 90% less memory than our Node.js reference implementation. But the key phrase is “after optimization”: the original implementation was slower. Getting there also required imposing some restrictions on how the WASM functions operate, although these are applied transparently at compile time and rarely cause incompatibilities.
The main advantage of the Rust implementation is its low memory footprint: all the extra RAM can be used for things like caching and distributed in-memory storage. This means it can be even faster in a production environment by reducing I/O overhead, which is probably a bigger win than a modest gain in CPU performance.
We have additional optimizations planned, but they mostly involve changes at the host level that have serious security implications. They’re also not directly related to memory management performance, but they would provide a lot of ammunition for the “Rust is faster than Node” camp.
Not sure. I’m guessing a couple of things:
Memory management is interesting, and each approach has trade-offs. By playing with different strategies, you can get a huge performance boost.
In the end, you need to choose the best technology for your situation and this is rarely an easy answer, but understanding the different characteristics of different stacks can certainly help.
Thank you for your attention!