From love to hate with process.send

Hello everyone, I am the creator of https://vatsim-radar.com/ and today I almost died.

In general, here's the deal. We are a map of virtual airplanes. We recently got an official promotion, and now we serve thousands of people daily – the footer of the site shows how many are on it right now (“in Radar”).

Previously, we updated every 15 seconds, with a real delay from the game of about 30 seconds. All of that was cached on Cloudflare, and we lived perfectly well – until at some point we started getting updates with almost no delay.

I decided to use websockets and push data out as soon as it came in. Everything was going smoothly until that promotion happened – and we started serving terabytes of data per day. I got the traffic down to a manageable 300 GB per day, but that's still too much: in the future we plan to move to hosting abroad, and most providers there have traffic limits, so we would end up paying more for traffic than for the servers themselves.

I decided something had to be done about this, and after an unsuccessful attempt to migrate the sockets to Cloudflare Workers + Durable Objects (still too expensive), I gave up and ditched the sockets in favor of the good old CF cache, while continuing to serve heavily stripped-down data with a very short cache TTL.
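
The idea itself is nothing fancy: an ordinary HTTP endpoint with a tiny shared-cache TTL, so the edge absorbs most of the requests. A minimal sketch, assuming a Cloudflare cache rule that caches this endpoint and respects origin headers (not our actual code, and the TTL values are illustrative):

import { createServer } from 'node:http';

let slimData = '{}'; // stripped-down snapshot, refreshed elsewhere by the update loop

createServer((req, res) => {
  res.setHeader('Content-Type', 'application/json');
  // Tiny TTL: the edge serves repeat requests for a second or two,
  // then comes back to the origin for a fresh copy.
  res.setHeader('Cache-Control', 'public, max-age=1, s-maxage=2');
  res.end(slimData);
}).listen(3000);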

The result speaks for itself:

[Traffic graph: see what happened after the 3pm surge]

Now we consume ~72 GB per day at peak, and we can live with that. But, of course, dropping the sockets increased the update delay.

To offload the main process, our data-processing script runs as a separate process via fork; on each update it sends the data to the parent via process.send, and the parent then distributes it via the API.
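
Schematically, it looks something like this (a minimal sketch – the file names and the processing function are made up for illustration, not our actual code):

// parent.ts
import { fork } from 'node:child_process';

const worker = fork('./data-worker.js');
let latestSnapshot: unknown; // what the API handlers read from

// every update from the child arrives over the IPC channel
worker.on('message', payload => {
  latestSnapshot = JSON.parse(payload as string); // the child always sends a JSON string
});

// data-worker.ts (the forked child)
function processVatsimData() {
  return { pilots: [], updatedAt: Date.now() }; // placeholder for the heavy processing
}

setInterval(() => {
  // fire-and-forget: no callback, no waiting – this is what eventually bit us
  process.send!(JSON.stringify(processVatsimData()));
}, 1000);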

The script took 3 seconds to process the data, which, with the cache on top, added up to a delay of up to 9 seconds – and that did not suit me. I optimized the script and got it down to one run per second, which was already better (commenters, please take the chase for seconds of delay as a given).

The delay dropped threefold, but…

Then a bug showed up in production. At some point, updates… just slowed down. Right after a deployment everything was fine, and then users saw a delay of a minute… two… ten… And the worker's RAM on the server grew at a frightening rate.

I figured, okay, I get it: I must have introduced a memory leak along the way. I combed through the entire code, attached a debugger to the worker – couldn't find anything. Locally everything is fine, no delays; after a deployment they appear. I started digging into the CF cache, hitting the origin directly – same thing. Nonsense.

I covered the code with logs and deployed to the staging environment. And to my surprise…

Everything is perfect. Works like clockwork. No delays. However, users do not receive this data, and it gets worse with each passing minute.

There are no errors in the logs. Neither in the child nor in the parent. The worker's memory itself does not grow. The processes do not grow. Nothing in the code.

And then it dawns on me.

If the worker is updated, but the parent is not, what does this mean?

That the data isn't reaching the parent at all.

How process.send works

I'll admit honestly: in my entire career I had never really used child Node processes beyond one-off spawns, and I had certainly never communicated with them via process.send.

So off to Google – maybe it doesn't run synchronously? And I found this.

The issue has since been fixed, and functionality was added that allows making it “asynchronous”. Strange, I thought – the bug isn't about that at all; oh well. I open the documentation.

Nothing about synchronicity. There is no description of what sendHandle does, nor whether this method needs a callback in 2024.

I decided to check how that ill-fated callback behaves. I look locally, and sure enough: the script finishes before the callback fires, though only by some 10 milliseconds.
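
The check itself is trivial – roughly something like this (a sketch; radarStorage is the object holding our processed data):

const start = Date.now();
const payload = JSON.stringify(radarStorage.vatsim);

process.send!(payload, undefined, undefined, err => {
  // fires only once the message has actually been handed off to the IPC channel
  console.log(`callback after ${Date.now() - start}ms`, err ?? '');
});

console.log(`send() returned after ${Date.now() - start}ms`); // returns almost immediately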

Making allowances for the fact that our cloud has crappy disks, plus two instances running on one machine, I rolled out a promise wrapper around process.send to that same staging environment.

And – bam – the leaks stopped, as if they had never been there.

// Wrap process.send in a promise and wait for its callback, i.e. until the
// message has actually been handed off to the IPC channel, before moving on
await new Promise<void>((resolve, reject) => {
  process.send!(JSON.stringify(radarStorage.vatsim), undefined, undefined, err => {
    if (err) return reject(err);
    resolve();
  });
});

How did I manage to hit this? Why hadn't it happened before?

  1. Sending too much data through process.send.

  2. Migrating to a cloud where the disks are slower.

  3. Speeding the script up from 3 seconds to 1, which no longer left enough time for the previous process.send to finish (see the repro sketch below).
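
A minimal repro of the failure mode, on the child side, looks roughly like this (a sketch – the payload size and interval are made up; the boolean that send() returns is just a hint that the channel can't keep up):

const hugePayload = JSON.stringify({ data: 'x'.repeat(20_000_000) });

setInterval(() => {
  // No callback and no waiting: every message that hasn't been flushed yet
  // just sits in memory, and delivery lags further and further behind.
  const ok = process.send!(hugePayload);
  if (!ok) console.warn('IPC backlog is getting too large'); // send() can return false once the backlog of unsent messages grows too big
}, 1000);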

What is the purpose of this article?

While googling I couldn't find anything about this problem with process.send.

The problem is very hard to run into if you exchange small amounts of data, and not too often. But if it does occur, it is extremely hard to google and to guess the cause from the Node documentation.

This error drove me crazy: I had changed a lot in the code, and I was sure that either I had introduced a memory leak, or something was wrong with the new hosting, or Cloudflare was misbehaving. The strangest thing was the time of the last update: it kept jumping around and seemed stuck in the past, while the rest of the server worked exactly as before.

Sorry for such a rambling article about a problem with a single method – hopefully it helps someone out there scouring the internet for their weird problem.

Thank you for your attention =)
