How we replaced the ssh2 module with OpenSSH and reduced event loop delays by a factor of 15


Why SSH?

We need to execute shell and SQL commands on PostgreSQL servers: reading a log file, resetting statistics, finding locks. Console access to most servers is already set up via SSH, while direct access to the PostgreSQL instances is not as easy – you would need to establish new connections to every instance, which means opening network ports and maintaining pg_hba.conf configs with the monitoring server's IP addresses in them. Besides, transmitting data over the network in the clear is a bad idea, and SSL requires its own separate configuration.

Therefore, it is logical to perform all operations through SSH, using the ability to launch multiple sessions through a single connection.
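The same multiplexing can also be enabled declaratively through ssh_config instead of command-line flags; a minimal sketch (the host alias and socket path here are illustrative placeholders):

```
# ~/.ssh/config — host alias and socket path are placeholders
Host pg_hostname
    ControlMaster auto
    ControlPath /tmp/ssh-%r@%h:%p.sock
    ControlPersist 10m
```

With ControlMaster auto, the first ssh invocation becomes the master and subsequent invocations to the same host reuse its connection automatically.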

Working with the SSH2 module

All ssh traffic goes through the ssh2 module


The architectural decisions are described in detail in this article and its sequel.

When using the ssh2 module under heavy load, an unpleasant side effect can show up: the module is written in JavaScript, which means it runs on the main thread along with the rest of the application code. Besides increasing event loop delays, this puts extra load on the GC and the thread pool.

On some servers, event loop delays during peak hours grew to indecent values:

ssh2 module receiving data results in high latency


To solve this problem, we decided to move all ssh operations into a separate process and, instead of the ssh2 module, use the console ssh client from the OpenSSH package.

Trying to switch to OpenSSH

This ssh client supports connection sharing. First we establish a master connection and create a control socket file /tmp/ssh.sock through which slave processes will interact with it:

ssh -M -S /tmp/ssh.sock -i id_rsa -l username pg_hostname

and then we start the process in slave mode, for example, to execute console commands:

ssh -S /tmp/ssh.sock pg_hostname command

or to set up a tunnel and forward a local socket connection to a remote host:

ssh -S /tmp/ssh.sock -O forward -L /tmp/postgresql.sock:127.0.0.1:5432 pg_hostname

We start the ssh process with child_process.spawn and get the incoming data from its stdout as a Readable stream.

For the sake of a uniform interface, we created a new module, system-ssh, with the same methods and parameters as ssh2 – connect, end, exec, forwardOut – plus an additional forwardOutLocalSocket, and published it in the npm registry.

Scheme of work with the new module:

Removed ssh from the main process


Thus, in the main process, only the function of launching the child process remained, in which all operations for servicing ssh connections are carried out.

With a small number of ssh connections, this option is quite efficient.

But we have about 2000 PostgreSQL instances under monitoring, and launching that many child processes is not rational. To unload the event loop of the main process, it is enough to move only the servers with a relatively large data flow to the new model. As the threshold for moving to system-ssh we set 1.25 Mbps, and for returning to the old in-process model – 0.75 Mbps. This keeps the number of child processes small, while most of the traffic is handled outside the main process:
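The two thresholds form a simple hysteresis band, which keeps a server whose traffic hovers around a single threshold from flapping between models. A sketch of such a filter (the function and constant names are ours, not from the actual code):

```javascript
// Sketch of the hysteresis filter described above: a server moves to the
// system-ssh model above 1.25 Mbps and back to the in-process model below 0.75 Mbps.
const UPPER_MBPS = 1.25;
const LOWER_MBPS = 0.75;

function nextModel(currentModel, trafficMbps) {
  if (trafficMbps > UPPER_MBPS) return 'system-ssh';
  if (trafficMbps < LOWER_MBPS) return 'inproc';
  return currentModel; // inside the band the server stays where it is
}

console.log(nextModel('inproc', 1.5));     // 'system-ssh'
console.log(nextModel('system-ssh', 1.0)); // stays 'system-ssh'
console.log(nextModel('system-ssh', 0.5)); // 'inproc'
```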

only servers with high ssh traffic work on the new model


The catch is that spawn in Node.js introduces delays of its own due to synchronous operations on the heap, and the more memory the process uses, the larger they are:

On Unix-like operating systems, the child_process.spawn() method performs memory operations synchronously before decoupling the event loop from the child. Applications with a large memory footprint may find frequent child_process.spawn() calls to be a bottleneck. For more information, see V8 issue 7381.

In our case, the worker process keeps various caches in memory and servers are often transferred between workers for balancing, so the impact of spawn delays is quite large.

SSH proxy

To solve this problem, we moved the launch of all console ssh clients into a separate proxy process and, as a bonus, gained the ability to use a single ssh connection for all PostgreSQL instances on a server.

Moved the launch of ssh to a proxy process with a connection via net.Socket


The data stream from the process's stdout was redirected to a Unix domain socket like this:

Sample code for passing via socket
// sshproxy
const fs = require('fs');
const net = require('net');
const { Client } = require('system-ssh');

const sshConnection = new Client();
const socketPath = '/tmp/host_1/stream_1.sock';

sshConnection.exec('tail -F postgresql.log', (error, stream) => {
    const server = net.createServer({noDelay: true}, (socket) => {
        stream.pipe(socket).pipe(stream);
    })
    server.on('error', (err) => {
        if (err.code === 'EADDRINUSE') {
            // the socket file already exists, try connecting to it
            const clientSocket = new net.Socket();
            clientSocket.on('error', (clientError) => {
                if (clientError.code === 'ECONNREFUSED') {
                    // nobody is using the socket, so remove it and try again
                    fs.unlinkSync(socketPath);
                    server.listen(socketPath);
                }
            });
            clientSocket.connect({path: socketPath}, () => {
                // we connected, so someone is already using this socket
                clientSocket.destroy();
                server.close();
            });
        }
    })
    server.on('listening', () => {
        // the socket is ready, the worker side can connect
    })
    server.listen(socketPath);
})

// worker
const net = require('net');

const clientSocket = new net.Socket();
clientSocket.connect({path: socketPath}, () => {
    // connected to the socket
    clientSocket.on('data', (data) => {
        // process the incoming data
        // or simply clientSocket.pipe(dataHandler)
    })
})

And we got a significant increase in delays on the worker.

It looks like net.Socket does not suit us here, so we try replacing it with a named pipe (FIFO):

replaced net.Socket with FIFO


The FIFO code is simpler: in sshproxy we create a write stream on top of it, and in the worker – a read stream.

Code example for FIFO
// sshproxy
const { exec } = require('child_process');
const fs = require('fs');
const { Client } = require('system-ssh');
const sshConnection = new Client();

const fifo = '/tmp/host_1/stream_1.fifo';
exec(`test -p ${fifo} || mkfifo ${fifo}`, (error, stdout, stderr) => {
    fs.open(fifo, 'w', (error, fd) => {
        sshConnection.exec('tail -F postgresql.log', {stdio: ['pipe', fd, 'pipe']}, (error, stream) => {
            const fifoStream = fs.createWriteStream(fifo, {fd});
            stream.pipe(fifoStream);
        })
    })
})

// worker
const fs = require('fs');

const stream = fs.createReadStream(fifo);
stream.on('data', (data) => {
    // process the incoming data
    // or simply stream.pipe(dataHandler)
})

And we get a 15-fold reduction in latency on loaded workers:

Worker latency has decreased


and as a result – the almost complete absence of write queues in the database:

DB write queues


UV_THREADPOOL

As usual, there was a fly in the ointment: when working with a FIFO, you have to keep in mind that all asynchronous file operations in Node.js are performed in the uv_threadpool, whose size is limited and defaults to 4. If all 4 threads in the pool are busy, other operations will hang in the queue.

Considering that dns.lookup() calls and the asynchronous crypto APIs run in the same pool – and we have quite a lot of those in the worker, since most servers still work via the old scheme with the ssh2 module – a large number of loaded FIFOs can cause performance degradation.

Another reason is that the file wrap on top of the FIFO performs blocking operations: for example, fs.open(fifo) will wait, occupying a uv_threadpool thread, until the FIFO is opened on the other side. A way out, as suggested here, could be non-blocking I/O, but the fs module in Node.js does not support it, and the one that does – net.Socket – does not suit us.

In our case, one sshproxy serves 20-30 PostgreSQL servers simultaneously and there is no negative impact on performance.
