Choosing a hash function in the data sharding problem

Introduction

We at Miro are working on the process of sharding Postgres databases and use different approaches depending on business requirements. Recently we were faced with the task of sharding new databases, during which we chose a new approach to sharding for us, based on consistent hashing (consistent hashing).

During the implementation of this approach, one of the central questions was which implementation of the non-cryptographic hash function we should choose and use. In this article, I will describe the criteria and the comparison algorithm that we developed and used in practice to find the best implementation.

About the architectural approach

There are many products (mongo, redis, etc.) that use consistent hashing for sharding, and our implementation will be very similar to them.

Let, at the input, we have a set of entities with selected sharding keys, of a string type. For these keys, using the hash function, we get a hash code of a certain length, for which we define the required slot through the modulo operation. The number of slots and the correspondence of entities to slots is fixed. It is also necessary to keep the correspondence between the ranges of slots and shards, which is not a difficult task, and a configuration file is quite suitable for the storage location.

The advantages of this approach are:

even distribution of entities across shards;
determining the correspondence of entities and shards without additional storage with a minimum of resource costs;
the ability to add new shards to the cluster.

Cons:

inefficiency of some search operations, in which it is necessary to make queries for all shards;
rather complicated resharding process.

Requirements

Central to the decision is choosing a java implementation of a hash function.

The function takes as input a key – a string object, up to 256 characters in size, and produces a hash code – an unsigned integer up to 4 bytes. In fact, we will be comparing implementations that generate hash codes of 2 and 4 bytes.

Comparison criteria

Consider four common criteria for comparing implementations of hash functions:

Speed, the function should work quickly, on any input data;
Distribution of results. It is very important that the function outputs hashes that are uniformly distributed;
~~Resistance to collisions (of the first and second kind);~~
~~Compliance with the avalanche effect. Reflects the dependence of all output bits on each input bit, on any input data.~~

For our task, only the first two criteria will be important to us: first – since the hash calculation operation will be very frequent; and second – because it is extremely important that the data is evenly distributed across shards.

The inability to attack the characteristics of a function makes it unimportant for us third criterion.

In case of discrepancy fourth criterion, we can get only single outliers from a uniform distribution, which we do not care much about.

Implementation

We will be looking at the most popular java implementations of non-cryptographic hash functions:

DJB2 (32-bit);
SDBM (32-bit);
LoseLose (32-bit);
FNV-1 / FNV-1a (32-bit);
CRC16 (16-bit);
Murmur2/Murmur3 (32-bit).

Testing

Input data

We will use the following datasets as input

A real dataset composed of 216,553 English words;
A synthetic dataset composed of randomly generated UTF-8 encoded characters.

In both test suites we will have groups of strings with certain lengths (number of characters) – “2”, “4”, “8”, “16”, “32”, “64”, “128”, “256” …

Metrics

We will use the following metrics to compare different criteria:

For the first criterion, speed – ops / ms (number of operations per millisecond of work);
For the second criterion – the fact that the Pearson goodness of fit criterion is satisfied for a uniform distribution… To do this, we have to introduce a hypothesis about the type of distribution of results and test it. However, such a metric will be binary, and in order to visually assess how close the distribution of hash codes of each of the implementations is to a uniform distribution, we will use the construction of histograms of relative frequencies for each series of tests.

Tools

Estimating the speed of work

To evaluate the speed of work, we will use load tests and the JMH library. The general scheme of the test iteration is as follows:

We will group the words from each test set by length, with a maximum value of 256 characters. Then, in each iteration, we will feed words from each group to the input of the hash function with the same probability.

For benchmarks we will use the following settings

Number of warmup iterations – 50;
Number of measurement-iterations – 100;
Mode – throughput
Add memory limit -Xms1G, -Xmx8G
To estimate memory consumption, add GCProfiler

The full test code can be viewed here…

Assessment of the distribution of results

To check the correspondence of the output values of the function to our expectations, let us test the hypothesis that the sample of results at the significance level α = 0.05 is distributed according to a uniform law. We will use Pearson’s goodness-of-fit test for verification.

The algorithm for testing the hypothesis is as follows:

We split the sample into partial intervals, the number of which is found by Sturges formula, and we find their length according to the rule of equal-interval grouping;
For each interval, let’s calculate its characteristics – average value, frequencies, relative frequencies;
Let’s calculate the sample mean $overline {x_ {b}}$ , standard deviation $sigma_ {b} = sqrt {D_ {b}}$ and theoretical frequencies
$hat {n_ {i}} = np_ {i}$ ,
Where n Is the number of elements in the sample, and $p_ {i}$ – the probability of a random variable hitting partial intervals, in our case it is equal to –
$p_ {i} = frac {x_ {length}} {b - a}$ ,
Where $x_ {length}$ – the same length of intervals, and the parameters a and b – $a = overline {x_ {b}} - sqrt {3 sigma_ {b}}$ , $b = overline {x_ {b}} + sqrt {3 sigma_ {b}}$ ;
We can start calculating the consent criterion, according to the formula
$chi_ {obs} ^ 2 = sum frac {n_ {i} - hat {n_ {i}}} { hat {n_ {i}}}$ ,
Where $n_ {i}$ – empirical frequencies obtained from the sample, $hat {n_ {i}}$ – theoretical frequencies found by the formulas above;
Determined by the table of critical distribution points $chi_ {cr} ^ 2 ( alpha, k)$ , at a given level of significance α and the number of degrees of freedom k ;
If $chi_ {obs} ^ 2 < chi_ {cr} ^ 2$ , then we accept the hypothesis, but if this condition is not met, we reject it.

Code for calculating goodness of fit and probabilistic characteristics of samples here…

The general scheme of the test iteration is similar to the scheme in the previous section and looks like this:

We will group words from each test set by length, with a maximum character value of 256. Then we create input test samples of different sizes in the range of 16384, 8192, 4096, 2048, 1024, and put words from each group into the samples with equal probability.

All elements of each of the groups will be fed to the input of the hash function and we will receive output samples consisting of integer hash codes. After that, using the algorithm above, we calculate the goodness of fit for them and determine whether it satisfies the hypothesis of a uniform distribution.

The full test code can be viewed here…

results

Estimating the speed of work

Let’s consider the speed of work (the number of operations per millisecond) for various implementations, depending on the length of the input lines.

In the range of two to eight characters:

It can be seen that in this range, almost all algorithms work at the same speed, slightly ahead of all loseLose, and the only obvious outsiders are crc16 and sdbm…

In the range from 16 to 256 characters:

Function murmur2 a clear favorite, she is slightly inferior murmur; crc16 and sdbm remained outsiders in this sample as well.

Assessment of the distribution of results

Consider a table of results of compliance with the Pearson criterion

It can be seen that the implementation crc16, murmur2, murmur3 satisfy Pearson’s uniform distribution criterion in almost all samples.

Consider the histograms of relative frequencies, in the context of different samples.

On the histograms below, for loseLose, Djb2, Sdbmthat did not pass the test, it can be seen that the distribution is far from uniform and looks more like geometric:

For those who fail the test Fnv1 and Fnv1a the situation is similar, the distributions vaguely resemble normal:

…

We look at the top three winners:

Except for some splashes, crc16, murmur2, murmur3 satisfy the Pearson criterion, which is consistent with the characteristics of their histograms of relative frequencies.

conclusions

Consider the choice of the most appropriate implementation, which we evaluate according to two selected criteria: speed of work and the satisfaction of the hypothesis of a uniform distribution.

Work speed. Functions murmur2/murmur3 have the best running time for input strings longer than 8 characters.

Satisfy the Uniform Distribution Hypothesis. We can distinguish three functions for which the hypothesis is accepted for most data sets: crc16, murmur2/murmur3… The graphs of the distribution of histograms of relative frequencies confirm the form of a uniform distribution for the functions crc16, murmur2/murmur3…

Thus, based on two criteria, the best choice is the implementations murmur2/murmur3.