Why is it so slow to spawn a new process in Node?

At Val Town, we run your code in Deno processes. Recently, we noticed that under load, a single Val Town Node server couldn't spawn more than about 40 processes per second, and calls to spawn were blocking the server's main thread for 30% of its CPU time. Why is it so slow? And is there any way to speed it up?

To reproduce this pattern, let's write an HTTP server that spawns a new process to handle each request. Something like this:

import { spawn } from "node:child_process";
import http from "node:http";
http
  .createServer((req, res) => spawn("echo", ["hi"]).stdout.pipe(res))
  .listen(8001);

We've also written similar implementations in Go (here) and Rust (here), and we'll run this example with Node, Deno, and Bun.

I run all these examples on a Hetzner CCX33 instance with 8 vCPUs and 32 GB of RAM. For load generation I use bombardier, running on the same machine. To benchmark each server, I run bombardier -c 30 -n 10000 http://localhost:8001: 10,000 requests over 30 connections. I "warm up" each server before measuring. Versions: Go v1.22.2, Rust v1.77.2, Node v22.3.0, Bun v1.1.20, and Deno v1.44.2.

Here's what I got:

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 651          | node baseline.js
Deno             | 2,290        | deno run --allow-all baseline.js
Bun              | 2,208        | bun run baseline.js
Go               | 5,227        | go run go/main.go
Rust (Tokio)     | 5,466        | cd rust && cargo run --release

Okay: Node is slow, Deno and Bun have found ways to make this faster, and the compiled languages with thread pools are much faster still.

It's striking how slow spawn is in Node. It was interesting to read this thread, and while in my experience things have improved since that discussion, Node still blocks the main thread for a surprisingly long time on every spawn call.
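You can get a rough feel for this yourself. Here's a minimal sketch of mine (not our production profiling): spawn() returns immediately, but the underlying process-creation work happens synchronously on the calling thread, so the loop's elapsed time approximates how long the main thread is tied up.

import { spawn } from "node:child_process";

// Time how long the main thread spends just issuing spawn calls.
// Numbers vary by machine; this is illustrative only.
const start = process.hrtime.bigint();
for (let i = 0; i < 100; i++) spawn("echo", ["hi"]);
const blockedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`100 spawn calls held the main thread for ~${blockedMs.toFixed(1)} ms`);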

Switching to Bun or Deno would improve things considerably. That's great, but let's try to improve the situation with Node as well.

The cluster module in Node

The easiest way out is to spawn more processes, each running its own HTTP server, using Node's built-in cluster module. Like this:

import { spawn } from "node:child_process";
import http from "node:http";
import cluster from "node:cluster";
import { availableParallelism } from "node:os";

if (cluster.isPrimary) {
  for (let i = 0; i < availableParallelism(); i++) cluster.fork();
} else {
  http
    .createServer((req, res) => spawn("echo", ["hi"]).stdout.pipe(res))
    .listen(8001);
}

Here Node shares the network socket between processes, so all of our processes can listen on port :8001 and requests are routed between them round-robin.

The main problem with this approach, in my opinion, is that each HTTP server is isolated in its own process. That gets complicated if you need in-memory caching or global state shared across requests. Ideally, you'd keep JavaScript's single-threaded execution model and still get fast spawns.
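To make the isolation concrete, here's a small sketch of mine: each cluster worker keeps its own copy of any in-memory state, so a naive counter (or cache) silently diverges across workers.

import cluster from "node:cluster";
import http from "node:http";

if (cluster.isPrimary) {
  for (let i = 0; i < 4; i++) cluster.fork();
} else {
  // This counter lives in one worker's memory; every worker has its own
  // copy, so the value jumps around depending on which worker answers.
  let hits = 0;
  http
    .createServer((_, res) => res.end(`pid ${process.pid}: ${++hits} hits\n`))
    .listen(8001);
}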

Here are the results:

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 1,766        | node cluster.js
Deno             | 2,133        | deno run --allow-all cluster.js
Bun              | n/a          | "node:cluster is not yet implemented in Bun"

Well, that's weird. Deno got slower, Bun doesn't support cluster yet, and Node improved a lot, though we might have expected even more.

Still, that's a real speedup for Node. Let's build on it.

Passing spawn calls to worker threads

If spawn calls are blocking the main thread, let's move them to worker threads.

Here's the code for worker-threads/worker.js. We listen for messages containing an id and a command, execute the command, and post the results back with postMessage. For convenience we use execFile here, but it's just a wrapper built on top of spawn.

import { execFile } from "node:child_process";
import { parentPort } from "node:worker_threads";

parentPort.on("message", (message) => {
  const [id, cmd, ...args] = message;

  execFile(cmd, args, (_error, stdout, _stderr) => {
    parentPort.postMessage([id, stdout]);
  });
});

And here's worker-threads/index.js. We create 8 worker threads. To handle a request, we send a message to a thread telling it to make the spawn call and send the output back. When the response arrives, we answer the HTTP request.

import assert from "node:assert";
import http from "node:http";
import { EventEmitter } from "node:events";
import { Worker } from "node:worker_threads";

const newWorker = () => {
  const worker = new Worker("./worker-threads/worker.js");
  const ee = new EventEmitter();
  // Route messages from the worker thread to the EventEmitter by id.
  worker.on("message", ([id, msg]) => ee.emit(id, msg));
  return { worker, ee };
};

// Spawn 8 worker threads
const workers = Array.from({ length: 8 }, newWorker);
const randomWorker = () => workers[Math.floor(Math.random() * workers.length)];

const spawnInWorker = async () => {
  const worker = randomWorker();
  const id = Math.random();
  // Send and await the response
  worker.worker.postMessage([id, "echo", "hi"]);
  return new Promise((resolve) => {
    worker.ee.once(id, (msg) => {
      resolve(msg);
    });
  });
};

http
  .createServer(async (_, res) => {
    let resp = await spawnInWorker();
    assert.equal(resp, "hi\n"); // no cheating!
    res.end(resp);
  })
  .listen(8001);

Results!

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 426          | node worker-threads/index.js
Deno             | 3,601        | deno run --allow-all worker-threads/index.js
Bun              | 2,898        | bun run worker-threads/index.js

Node is slower! Evidently Node's bottleneck can't be sidestepped just by using threads: we're doing the same work as before, plus the overhead of coordinating the worker threads. Bummer.

Deno likes this approach, and Bun likes it a little less. Either way, it's nice to see that both benefit here: they clearly don't burden the calling thread with the syscall overhead.

Onward.

Moving spawn calls to child processes

If threads don't help, let's move the work into child processes instead. Yes, we're spawning processes just so we can spawn processes, but we fork only a handful of worker processes from the main thread and distribute the spawn calls among them. That way, the main thread pays the spawning cost only once, at startup.

The change is small. Instead of worker threads, we now have processes created with child_process.fork, and the send/receive calls change accordingly.

$ git diff --unified=1 --no-index ./worker-threads/ ./child-process/
diff --git a/./worker-threads/index.js b/./child-process/index.js
index 52a93fe..0ed206e 100644
--- a/./worker-threads/index.js
+++ b/./child-process/index.js
@@ -3,6 +3,6 @@ import http from "node:http";
 import { EventEmitter } from "node:events";
-import { Worker } from "node:worker_threads";
+import { fork } from "node:child_process";

 const newWorker = () => {
-  const worker = new Worker("./worker-threads/worker.js");
+  const worker = fork("./child-process/worker.js");
   const ee = new EventEmitter();
@@ -21,3 +21,3 @@ const spawnInWorker = async () => {
   // Send and await the response
-  worker.worker.postMessage([id, "echo", "hi"]);
+  worker.worker.send([id, "echo", "hi"]);
   return new Promise((resolve) => {
diff --git a/./worker-threads/worker.js b/./child-process/worker.js
index 5f025ca..9b3fcf5 100644
--- a/./worker-threads/worker.js
+++ b/./child-process/worker.js
@@ -1,5 +1,4 @@
 import { execFile } from "node:child_process";
-import { parentPort } from "node:worker_threads";

-parentPort.on("message", (message) => {
+process.on("message", (message) => {
   const [id, cmd, ...args] = message;
@@ -7,3 +6,3 @@ parentPort.on("message", (message) => {
   execFile(cmd, args, (_error, stdout, _stderr) => {
-    parentPort.postMessage([id, stdout]);
+    process.send([id, stdout]);
   });

Okay. Here are the results:

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 2,209        | node child-process/index.js
Deno             | 3,800        | deno run --allow-all child-process/index.js
Bun              | 3,871        | bun run child-process/index.js

There are good speedups across the board, but I'm really curious what bottleneck keeps Deno and Bun from reaching Go/Rust speeds. Please let me know if you have ideas on how to figure that out!

One interesting thing here: you can mix Node and Bun. Bun implements Node's IPC protocol, so you can have Node spawn Bun child processes. Let's try that.

Let's update the fork options so that the bun binary is used instead of node:

const worker = fork("./child-process/worker.js", {
  execPath: "/home/maxm/.bun/bin/bun",
});

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node + Bun       | 3,853        | node child-process/index.js

Hah, cool. Now I can use Node on the main thread and still benefit from Bun's performance.

Stdio

Logs. The previous implementations assume each process produces minimal output, but what if there's a lot of it? We could ship logs with process.send, but that gets expensive if the output bytes are serialized to JSON.
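A quick illustration of why: with Node's default "json" IPC serialization, a Buffer is expanded into an object with one array element per byte before being sent.

const chunk = Buffer.from("hi\n");
console.log(JSON.stringify(chunk));
// => {"type":"Buffer","data":[104,105,10]}
// A 163 KB stdout chunk balloons into far more JSON text, plus the
// stringify/parse work on both sides of the channel.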

I spent a lot of time exploring this rabbit hole. Here's a quick summary of what I tried:

  1. Passing file descriptors between processes, e.g. handing stdout/stderr back to the parent process. I tried this a few ways, but I couldn't reliably capture every byte that was written.

  2. Just using process.send. This works, but performs well only with the serialization: "advanced" option, which lets raw bytes be sent without JSON serialization. That option doesn't work with Deno or Bun.

  3. Creating a pair of abstract sockets on each spawn call and sending logs over them. The socket setup takes too long for this to be worth it.

By the way, abstract sockets are crazy stuff. I was familiar with Unix domain sockets, where you have a file named (for example) something.sock that you can listen on and connect to much like a network address. It turns out that if the name of a Unix socket starts with a null byte, like \0foo, the socket never exists in the filesystem and is automatically cleaned up when no longer in use. Weird! Cool!
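Here's a tiny demo sketch of what that looks like in Node (Linux only; the socket name is arbitrary):

import net from "node:net";

// The leading null byte puts the socket in the abstract namespace:
// nothing appears on disk, and cleanup is automatic.
const server = net.createServer((conn) => conn.pipe(conn)); // echo server
server.listen("\0abstract-demo", () => {
  const client = net.connect("\0abstract-demo");
  client.on("data", (data) => {
    console.log(data.toString()); // "hello"
    client.end();
    server.close();
  });
  client.write("hello");
});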

After all these experiments, I settled on two approaches that worked quite well:

  1. Use .fork() to set up a pool of processes, and give each process its own abstract socket over which to send logs.

  2. Simply use process.send, but with the serialization: "advanced" option (sketched just below).
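Here's a sketch of what option 2 looks like (the file name is hypothetical). The serialization: "advanced" option switches the IPC channel from JSON to the structured clone algorithm, so Buffers cross as raw bytes:

import { fork } from "node:child_process";

// "advanced" serialization uses the structured clone algorithm, so Buffers
// are transferred without being expanded into JSON arrays.
const worker = fork("./worker.js", { serialization: "advanced" });

worker.on("message", ([id, kind, data]) => {
  if (kind === "stdout") process.stdout.write(data); // data arrives as a Buffer
});
worker.send([1, "cat", "main.c"]);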

Let's see how all this works in practice.

We need something that produces a lot of output, so I took the file main.c from the SQLite sources; it's 163 KB. We'll run cat main.c and stream the output.

Let's go back to baseline.js, now updated:

import { spawn } from "node:child_process";
import http from "node:http";
http
  .createServer((_, res) => spawn("cat", ["main.c"]).stdout.pipe(res))
  .listen(8001);

I also updated the Go and Rust code. Let's see how they are doing:

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 374          | node baseline.js
Deno             | 667          | deno run --allow-all baseline.js
Bun              | 1,374        | bun run baseline.js
Go               | 2,757        | go run go/main.go
Rust (Tokio)     | 3,535        | cd rust && cargo run --release

Exciting. It's cool to see Bun and Rust pull ahead here compared to previous benchmarks. Node is still very slow, and Deno looks surprisingly bad on this workload.

Next, let's try my implementation where communication happens over abstract sockets. It turned out fairly complex, so I won't reproduce it in full, but you can take a look at it here.
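The overall shape is roughly this, though (a heavily simplified sketch; the names are mine, not the real code):

import net from "node:net";
import { fork } from "node:child_process";

// Parent: one abstract log socket per worker, so raw stdout bytes bypass
// the IPC channel entirely.
const name = `logs-${process.pid}`;
net.createServer((conn) => conn.pipe(process.stdout)).listen(`\0${name}`);

// Environment values can't contain null bytes, so pass the bare name and
// let the worker prepend the "\0" itself.
const worker = fork("./child-process-comm-channel/worker.js", {
  env: { ...process.env, LOG_SOCKET: name },
});

// Worker side (sketch):
//   const sock = net.connect(`\0${process.env.LOG_SOCKET}`);
//   spawn(cmd, args).stdout.pipe(sock, { end: false });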

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 1,336        | node child-process-comm-channel/index.js
Node + Bun       | 2,635        | node child-process-comm-channel/index.js
Deno             | 862          | deno run --allow-all child-process-comm-channel/index.js
Bun              | 1,833        | bun child-process-comm-channel/index.js

Haha. I saw runs where plain Bun beat Node+Bun, but it never panned out in the final numbers.

Deno produced some mind-bending results here. My first implementation had a "bug" where the response was buffered as a string. Here's how I fixed it:

@@ -88,9 +88,8 @@ const spawnInWorker = async (res) => {
   worker.child.send([id, "spawn", ["cat", ["main.c"]]]);
-  let resp = "";
   worker.ee.on(id, (msg, data) => {
     if (msg == MessageType.STDOUT) {
-      resp += data.toString();
+      res.write(data);
     }
     if (msg == MessageType.STDOUT_CLOSE) {
-      res.end(resp);
+      res.end();
       worker.requests -= 1;

With this "fix", Deno slows down a lot while Node and Bun speed up significantly! What's going on here: is Deno's toString() unusually fast, or are its res.write calls unusually expensive?

Language/Runtime     | Requests/sec | Command
-------------------- | ------------ | -------
Deno + string buffer | 1,453        | deno run --allow-all child-process-comm-channel/index.js

Strange!

Finally, there's the process.send implementation. It's delightfully simple to implement, but I'm not too excited about it: it's slower than I'd like, it doesn't work with Deno or Bun, and there's very little room left to improve it. Still, it's deeply practical and easy to understand, which has a beauty of its own. Here's the worker.js source; the rest of the code is located here.

import { spawn } from "node:child_process";
import process from "node:process";

process.on("message", (message) => {
  const [id, cmd, ...args] = message;
  const cp = spawn(cmd, args);
  cp.stdout.on("data", (data) => process.send([id, "stdout", data]));
  cp.stderr.on("data", (data) => process.send([id, "stderr", data]));
  cp.on("close", (code, signal) => process.send([id, "exit", code, signal]));
});
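For context, here's my simplified sketch of the parent side (the real index.js is linked above): it keeps a map from id to the pending HTTP response and routes worker messages back to it.

import http from "node:http";
import { fork } from "node:child_process";

const worker = fork("./child-process-send-logs/worker.js", {
  serialization: "advanced", // stdout/stderr chunks arrive as Buffers
});

const pending = new Map();
worker.on("message", ([id, kind, data]) => {
  const res = pending.get(id);
  if (!res) return;
  if (kind === "stdout" || kind === "stderr") res.write(data);
  if (kind === "exit") {
    res.end();
    pending.delete(id);
  }
});

let nextId = 0;
http
  .createServer((_, res) => {
    const id = nextId++;
    pending.set(id, res);
    worker.send([id, "cat", "main.c"]);
  })
  .listen(8001);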

Language/Runtime | Requests/sec | Command
---------------- | ------------ | -------
Node             | 1,179        | node child-process-send-logs/index.js

Very nice and probably the most practical if you're only going to work with Node.

Load balancing

Let's also briefly touch on load balancing between processes. Go and Rust both use sophisticated schedulers that distribute tasks efficiently. Up to this point, we've simply been picking a worker at random:

const workers = await Promise.all(Array.from({ length: 8 }, newWorker));
const randomWorker = () => workers[Math.floor(Math.random() * workers.length)];

But we could also implement round-robin, or least-connections-style balancing. Here's a great article on the topic.

let count = 0;
const pickWorkerInOrder = () => workers[(count += 1) % workers.length];
const pickWorkerWithLeastRequests = () =>
  workers.reduce((selectedWorker, worker) =>
    worker.requests < selectedWorker.requests ? worker : selectedWorker
  );

Unfortunately, even with these changes, I couldn't consistently improve performance; all the variants perform roughly the same. Perhaps they'd matter more under realistic workloads, where the spawn calls aren't so evenly distributed.

Library?

Given everything we've found, it seems possible to build a child_process library that exposes the same API as node:child_process but hands the spawn calls off to a pool of worker processes under the hood. Maybe I'll write it someday.
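Something like this hypothetical surface (the module and function names are made up for illustration):

import { createSpawnPool } from "spawn-pool"; // hypothetical package

// Same API shape as node:child_process, but the actual spawning happens in
// a pool of pre-forked worker processes.
const pool = createSpawnPool({ size: 8 });

const child = pool.spawn("echo", ["hi"]);
child.stdout.on("data", (chunk) => process.stdout.write(chunk));
child.on("close", (code) => console.log(`exited with code ${code}`));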

Conclusion

That's as far as my knowledge and these experiments have taken me. If you have ideas for pushing performance further, I'd love to hear them.
