Hash + cache: stream processing optimization

What should I do if I want to write down a lot of “facts” in the database of a much larger volume than it can withstand? First, of course, we bring the data to a more economical normal form and get the “dictionaries” in which we will write once. But how to do it most efficiently?

This is exactly the question we faced when developing monitoring and analysis of PostgreSQL server logs, when other methods of optimizing the record in the database were exhausted.

Immediately make a reservation that our collectors are running Node.jsTherefore, we do not interact with processor registers and caches. And the option of using “hundred” or external caching services / databases gives too much delay when incoming streams of several hundred Mbps.

Therefore we try cache everything in RAM, specifically, in the memory of the JavaScript process. About how to organize this more efficiently, and we will go further.

Availability Caching

Our main task is to make sure that the only instance of any object gets into the database. These are repeatedly repeated original texts of SQL queries, templates of plans for their implementation, nodes of these plans – in short, some text blocks.

Historically, as an identifier, we used UUID-value obtained as a result of direct calculation MD5 hash from the text of the object. After that, we check for the presence of such a hash in the local “Dictionary” in the process memory, and if it’s not there, only then we write in the database in the “dictionary” table.

That is, the original text value itself is not required for us to be stored (and sometimes it takes tens of kilobytes) – just the most the fact of the corresponding hash in dictionary.

Key Dictionary

Such a dictionary can be kept in Array, and use Array.includes() to check for availability, but this is redundant enough – the search degrades (at least in previous versions of V8) linearly from the size of the array, O (N). And in modern implementations, despite all the optimizations, it loses at a speed of 2-3%.

Therefore, in the era before ES6, storage was a traditional solution. Objectwhose keys are stored values. But everyone assigned what they wanted with key values ​​- for example, Boolean:

var dict = {};

function has(key) {
  return dict[key] !== undefined;

function add(key) {
  dict[key] = true;

But it’s quite obvious that we are clearly storing the excess here – the very value of the key that no one needs. But what if it is not stored at all? And so it appeared set object.

Tests show that search with Set.has() approximately 20-25% faster than key verification in Object. But this is not his only advantage. Since we store less, then we need less memory – And this directly affects performance when it comes to hundreds of thousands of such keys.

So, Object, which contains 100 UUID keys in a text representation, is in memory 6216 bytes:

Set with the same content – 2632 bytes:

I.e Set faster and at the same time takes 2.5 times less memory – the winner is obvious.

We optimize storage of UUID keys

In general, in the nature of distributed systems, UUID keys are quite common – in our VLSI at a minimum, they are used to identify documents and regulations in electronic document management, persons in messaging, …

Now let’s take a closer look at the picture above – each UUID key stored in the hex representation is “worth” to us 56 bytes of memory. But we have hundreds of thousands of them, so it’s reasonable to ask: “Is it possible to have less?”

First, recall that the UUID is a 16-byte identifier. Essentially a piece of binary data. And for transmission by email, for example, binary data is encoded in base64 – try to apply it:

let str = Buffer.from(uuidstr, 'hex').toString('base64');

Already 48 bytes each is better, but imperfect. Let’s try translating the hexadecimal representation directly into a string:

let str = Buffer.from(uuidstr, 'hex').toString('binary');

Instead of 56 bytes per key – 40 bytes, almost 30% savings!

Master, worker – where to store dictionaries?

Given that the vocabulary data from the workers intersect quite strongly, we made the storage of dictionaries and writing them to the database in the master process, and the transfer of data from the workers through IPC messaging engine.

However, a significant portion of the master’s time was spent on channel.onread – that is, the processing of receiving packets with “dictionary” information from child processes:

Dual Set Write Barrier

Now let’s think for a second – the workers send and send the master the same vocabulary data (basically these are the plan templates and the repeated request bodies), he parses them in the sweat of his face and … does nothing, because they have already been sent to the database before !

So if we Set– the dictionary “protected” the base from re-recording from the master, why not apply the same approach to “protect” the master from being transferred from the worker? ..

Actually, that was done, and tripled direct costs exchange channel service:

But now the workers are doing a bit more work — storing dictionaries and filtering by them? Or not? .. In fact, they began to work significantly less, since the transfer of large volumes (even via IPC!) Is not cheap.

Nice bonus

Since the wizard now began to receive a much smaller amount of information, it began to allocate much less memory for these containers – which means that the time spent on the Garbage Collector’s work decreased significantly, which positively affected the latency of the system as a whole.

Such a scheme provides protection against repeated entries at the collector level, but what if we have several collectors? Only a trigger with INSERT ... ON CONFLICT DO NOTHING.

Speed ​​up hash calculation

In our architecture, the entire log stream from one PostgreSQL server is processed by one worker.

That is, one server is one task for the worker. At the same time, the loading of workers is balanced by the purpose of the server tasks so that the CPU consumption by the workers of all collectors is approximately the same. This is a separate service dispatcher.

“On average,” each worker handles dozens of tasks that produce approximately the same total load. However, there are servers that significantly surpass the rest in the number of entries in the log. And even if the dispatcher leaves this task the only one on the worker, its download is much higher than the others:

We removed the CPU profile of this worker:


On the top lines is the calculation of MD5 hashes. And they are really calculated a huge amount – for the entire stream of incoming objects.


How to optimize this part, except for these hashes, we can not?

We decided to try another hash function – xxHashimplementing Extremely fast non-cryptographic hash algorithm. And the module for Node.js is xxhash-addon, which uses the latest version of xxHash 0.7.3 with the new XXH3 algorithm.

Check by running each option on a set of rows of different lengths:

const crypto = require('crypto');
const { XXHash3, XXHash64 } = require('xxhash-addon');
const hasher3 = new XXHash3(0xDEADBEAF);
const hasher64 = new XXHash64(0xDEADBEAF);

const buf = Buffer.allocUnsafe(16);
const getBinFromHash = (hash) => buf.fill(hash, 'hex').toString('binary');

const funcs = {
  xxhash64 : (str) => hasher64.hash(Buffer.from(str)).toString('binary')
, xxhash3  : (str) => hasher3.hash(Buffer.from(str)).toString('binary')
, md5      : (str) => getBinFromHash(crypto.createHash('md5').update(str).digest('hex'))

const check = (hash) => {
  let log = [];
  let cnt = 10000;
  while (cnt--) log.push(crypto.randomBytes(cnt).toString('hex'));




xxhash64 : 148.268ms
xxhash3  : 108.337ms
md5      : 317.584ms

Like was expected, xxhash3 turned out to be much faster MD5!

It remains to check for resistance to collisions. Sections of dictionary tables are being created for us every day, so outside the boundaries of the day we can safely allow the hash intersection.
But just in case, they checked with a margin in the interval of three days – not a single collisionthat more than suits us.

Hash Replacement

But we simply cannot take and exchange old UUID fields in the dictionary tables for a new hash, because both the database and the existing frontend wait for objects to continue to be identified by UUID.

Therefore, we add in the collector another cache – for already calculated MD5. Now it will be Mapwhere the keys are xxhash3, the values ​​are MD5. For identical lines, we do not recalculate the “expensive” MD5 again, but take it from the cache:

const getHashFromBin = (bin) => Buffer.from(bin, 'binary').toString('hex');
const dictmd5 = new Map();
const getmd5 = (data) => {
  const hash = xxhash(data);
  let md5hash = dictmd5.get(hash);
  if (!md5hash) {
    md5hash = md5(data);
    dictmd5.set(hash, getBinFromHash(md5hash));
    return md5hash;
  return getHashFromBin(md5hash);

We remove the profile – the fraction of the time for computing hashes has decreased markedly, cheers!

So now we count xxhash3, then check the MD5 cache and get the desired MD5, and then check the dictionary cache – if this md5 is not there, then send it to the database for writing.

Something too many checks … Why check the dictionary cache if you have already checked the MD5 cache? It turns out that all dictionary caches are no longer needed and it is enough to have only one cache – for MD5, with which all the basic operations will be performed:

As a result, we replaced the check in several caches of “object” dictionaries with one MD5 cache, and the resource-intensive operation of calculating the MD5 hash is performed only for new entries, using a much more efficient xxhash for the incoming stream.

Thanks to Kilor for helping with this article.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *