# Choosing a hash function in the data sharding problem

## Introduction

We at Miro are working on sharding Postgres databases and use different approaches depending on business requirements. Recently we faced the task of sharding new databases, and in the process we chose an approach that was new for us, based on *consistent hashing*.

While implementing this approach, one of the central questions was which implementation of a non-cryptographic hash function to choose. In this article, I describe the criteria and the comparison algorithm that we developed and used in practice to find the best implementation.

## About the architectural approach

There are many products (__mongo__, __redis__, etc.) that use consistent hashing for sharding, and our implementation will be very similar to theirs.

Suppose that as input we have a set of entities with selected sharding keys of string type. For these keys, using the hash function, we compute a hash code of a fixed length, and from it determine the target slot via the modulo operation. The number of slots and the mapping of entities to slots are fixed. We also need to keep the mapping between ranges of slots and shards; this is not a difficult task, and a configuration file is a perfectly suitable place to store it.
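The scheme above can be sketched in a few lines of Java. This is an illustration, not Miro's actual code: the slot count, the range table and the stand-in `hash` function are all assumptions, and any of the hash implementations compared below would take the place of `hash`.

```java
import java.util.TreeMap;

public class Sharding {
    // Fixed number of slots (illustrative value).
    static final int SLOT_COUNT = 1024;

    // Stand-in for the hash function under comparison.
    static int hash(String key) {
        return key.hashCode();
    }

    // Slot = hash(key) mod SLOT_COUNT; mask first so the modulo is non-negative.
    static int slot(String key) {
        return (hash(key) & 0x7fffffff) % SLOT_COUNT;
    }

    // Slot-range -> shard mapping, as it would be kept in a configuration file.
    // floorEntry(slot) returns the shard owning the range the slot falls into.
    static final TreeMap<Integer, String> RANGES = new TreeMap<>();
    static {
        RANGES.put(0, "shard-1");    // slots 0..511
        RANGES.put(512, "shard-2");  // slots 512..1023
    }

    static String shardFor(String key) {
        return RANGES.floorEntry(slot(key)).getValue();
    }
}
```

Adding a shard then only means editing the range table; the entity-to-slot mapping stays fixed, which is what keeps resharding local to the moved slot ranges.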

The advantages of this approach are:

- even distribution of entities across shards;
- determining the correspondence between entities and shards without additional storage, with minimal resource costs;
- the ability to add new shards to the cluster.

Cons:

- inefficiency of some search operations, for which queries have to be sent to all shards;
- a rather complicated resharding process.

## Requirements

Central to the solution is the choice of a Java implementation of a hash function.

The function takes as input a key (a string of up to 256 characters) and produces a hash code (an unsigned integer of up to 4 bytes). In practice, we will be comparing implementations that generate 2-byte and 4-byte hash codes.

## Comparison criteria

Consider four common criteria for comparing implementations of hash functions:

- Speed: the function should work quickly on any input data;
- Distribution of results: it is very important that the function outputs uniformly distributed hashes;
- Resistance to collisions (of the first and second kind);
- Compliance with the avalanche effect: every output bit should depend on every input bit, for any input data.

For our task, only the first two criteria matter: the *first*, because the hash computation will be a very frequent operation, and the *second*, because it is extremely important that the data be distributed evenly across shards.

Since there is no possibility of an attack on the function, the *third* criterion is unimportant for us.

If the *fourth* criterion is not met, we can only get isolated outliers from the uniform distribution, which does not concern us much.

## Implementation

We will be looking at the most popular Java implementations of non-cryptographic hash functions:

- __DJB2__ (32-bit);
- __SDBM__ (32-bit);
- __LoseLose__ (32-bit);
- __FNV-1 / FNV-1a__ (32-bit);
- __CRC16__ (16-bit);
- __Murmur2 / Murmur3__ (32-bit).
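For reference, two of the compared functions have simple textbook definitions that can be sketched as follows. The class and method names are illustrative, and this is not Miro's code; the 32-bit results are kept in a `long` masked to 32 bits, since Java has no unsigned `int`.

```java
public class Hashes {

    // DJB2: h = h * 33 + c, seeded with 5381.
    static long djb2(String key) {
        long h = 5381;
        for (int i = 0; i < key.length(); i++) {
            h = ((h << 5) + h + key.charAt(i)) & 0xffffffffL; // h * 33 + c, kept to 32 bits
        }
        return h;
    }

    // FNV-1a (32-bit): xor in each byte, then multiply by the FNV prime.
    static long fnv1a(byte[] data) {
        long h = 0x811c9dc5L;                                 // FNV offset basis
        for (byte b : data) {
            h = ((h ^ (b & 0xff)) * 0x01000193L) & 0xffffffffL;
        }
        return h;
    }
}
```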

## Testing

### Input data

We will use the following datasets as input:

- a real dataset composed of __216,553 English words__;
- a synthetic dataset composed of randomly generated UTF-8 encoded characters.

In both test suites, the strings are grouped by length (number of characters): 2, 4, 8, 16, 32, 64, 128 and 256.

### Metrics

We will use the following metrics for the two criteria:

- For the first criterion, speed: ops/ms (the number of operations per millisecond of work);
- For the second criterion: __whether Pearson's goodness-of-fit test for a uniform distribution is satisfied__. To apply it, we put forward a hypothesis about the distribution of the results and test it. Such a metric is binary, however, so to visually assess how close the distribution of each implementation's hash codes is to uniform, we will also build histograms of relative frequencies for each series of tests.

### Tools

#### Estimating the speed of work

To evaluate speed, we will run load tests using the JMH library. The general scheme of a test iteration is as follows:

We group the words from each test set by length, with a maximum value of 256 characters. Then, in each iteration, we feed words from each group to the input of the hash function with equal probability.
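The grouping-and-sampling scheme might look like this. This is a sketch under stated assumptions: the class and method names are illustrative, not the actual benchmark code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class KeyGroups {

    /** Group keys by length, capping the length at 256 characters. */
    public static Map<Integer, List<String>> groupByLength(List<String> words) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String w : words) {
            int len = Math.min(w.length(), 256);
            groups.computeIfAbsent(len, k -> new ArrayList<>()).add(w);
        }
        return groups;
    }

    /** Pick a group uniformly at random, then a key from it, so every
     *  length group is fed to the hash function with equal probability. */
    public static String nextKey(List<List<String>> groups, Random rnd) {
        List<String> group = groups.get(rnd.nextInt(groups.size()));
        return group.get(rnd.nextInt(group.size()));
    }
}
```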

For the benchmarks we will use the following settings:

- number of warmup iterations: 50;
- number of measurement iterations: 100;
- mode: `throughput`;
- memory limits: `-Xms1G -Xmx8G`;
- GCProfiler added to estimate memory consumption.
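In JMH these settings are expressed as annotations on the benchmark class. The fragment below is a configuration sketch, not the article's actual test code: the class name and the benchmark body are stand-ins, and GCProfiler is attached separately via the runner options (`addProfiler(GCProfiler.class)`).

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

// Illustrative JMH configuration matching the settings listed above.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 50)
@Measurement(iterations = 100)
@Fork(value = 1, jvmArgs = {"-Xms1G", "-Xmx8G"})
@State(Scope.Benchmark)
public class HashBenchmark {

    private String key;

    @Setup
    public void setup() {
        // In the real test, keys are drawn from the length groups with equal probability.
        key = "example-sharding-key";
    }

    @Benchmark
    public void hash(Blackhole bh) {
        bh.consume(key.hashCode()); // stand-in for the hash implementation under test
    }
}
```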

The full test code can be viewed __here__.

#### Assessment of the distribution of results

To check that the function's output values match our expectations, let us test the hypothesis that, at the significance level α = 0.05, the sample of results follows a uniform distribution. We will use Pearson's goodness-of-fit test for verification.

The algorithm for testing the hypothesis is as follows:

1. Split the sample into partial intervals, the number of which, *k*, is found by __Sturges' formula__, and find their common length *h* according to the rule of equal-interval grouping;
2. For each interval, calculate its characteristics: the midpoint, the frequency and the relative frequency;
3. Calculate the sample mean x̄, the standard deviation σ and the theoretical frequencies

   nᵢ′ = n · pᵢ,

   where *n* is the number of elements in the sample and pᵢ is the probability of the random variable falling into the *i*-th partial interval; in our case it is equal to pᵢ = h / (b − a), where *h* is the common length of the intervals and the parameters *a* and *b* are estimated as a = x̄ − √3·σ, b = x̄ + √3·σ;
4. Calculate the goodness-of-fit statistic according to the formula

   χ² = Σᵢ (nᵢ − nᵢ′)² / nᵢ′,

   where nᵢ are the empirical frequencies obtained from the sample and nᵢ′ are the theoretical frequencies found by the formulas above;
5. From the table of critical points of the χ² distribution, determine χ²cr for the given significance level *α* and the number of degrees of freedom *k*;
6. If χ² < χ²cr, we accept the hypothesis; otherwise, we reject it.
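The core of this algorithm can be sketched in Java. This is a simplified illustration, not the article's actual code: it bins the sample over its observed range, so every interval gets the same expected frequency n/k (rather than estimating *a* and *b* from the sample moments), and the resulting statistic would then be compared against the tabulated critical point χ²cr.

```java
public class ChiSquareUniform {

    /**
     * Pearson chi-square statistic for the hypothesis that the sample is
     * uniformly distributed over its observed range. The number of intervals
     * k comes from Sturges' formula; the caller compares the returned value
     * against the critical point of the chi-square distribution for the
     * chosen significance level and degrees of freedom.
     */
    public static double statistic(long[] sample) {
        int n = sample.length;
        int k = (int) Math.round(1 + Math.log(n) / Math.log(2)); // Sturges' formula

        long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
        for (long v : sample) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double h = (double) (max - min) / k;   // equal-interval grouping

        int[] freq = new int[k];               // empirical frequencies n_i
        for (long v : sample) {
            int i = (int) ((v - min) / h);
            freq[Math.min(i, k - 1)]++;        // the maximum value falls into the last bin
        }

        double expected = (double) n / k;      // theoretical frequency n_i' = n * p_i, p_i = 1/k
        double chi2 = 0.0;
        for (int f : freq) {
            chi2 += (f - expected) * (f - expected) / expected;
        }
        return chi2;
    }
}
```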

The code for calculating the goodness-of-fit statistic and the probabilistic characteristics of the samples is __here__.

The general scheme of the test iteration is similar to the scheme in the previous section and looks like this:

We group words from each test set by length, with a maximum of 256 characters. Then we create input test samples of different sizes – 16384, 8192, 4096, 2048 and 1024 elements – putting words from each group into the samples with equal probability.

All elements of each group are fed to the input of the hash function, giving output samples of integer hash codes. Then, using the algorithm above, we calculate the goodness-of-fit statistic for them and determine whether the hypothesis of a uniform distribution holds.

The full test code can be viewed __here__.

## Results

### Estimating the speed of work

Let's consider the speed of work (the number of operations per millisecond) for the various implementations, depending on the length of the input strings.

In the range of two to eight characters:

It can be seen that in this range almost all the algorithms work at the same speed, with **loseLose** slightly ahead of the rest, and the only obvious outsiders are **crc16** and **sdbm**.

In the range from 16 to 256 characters:

The **murmur2** function is the clear favorite, with **murmur3** slightly behind it; **crc16** and **sdbm** remained outsiders in this sample as well.

### Assessment of the distribution of results

Consider the table of results for compliance with the Pearson criterion.

It can be seen that the **crc16**, **murmur2** and **murmur3** implementations satisfy Pearson's criterion for a uniform distribution on almost all samples.

Now consider the histograms of relative frequencies across the different samples.

On the histograms below for **loseLose**, **djb2** and **sdbm**, which did not pass the test, it can be seen that the distribution is far from uniform and looks more like a geometric one.

For **fnv1** and **fnv1a**, which also failed the test, the situation is similar; their distributions vaguely resemble a normal one.

Now look at the top three winners: apart from a few spikes, **crc16**, **murmur2** and **murmur3** satisfy the Pearson criterion, which is consistent with the shape of their histograms of relative frequencies.

## Conclusions

Let's choose the most suitable implementation, evaluated against the two selected criteria: speed of work and satisfaction of the uniform-distribution hypothesis.

**Speed of work.** The **murmur2**/**murmur3** functions have the best running time for input strings longer than 8 characters.

**Satisfying the uniform-distribution hypothesis.** There are three functions for which the hypothesis is accepted for most data sets: **crc16** and **murmur2**/**murmur3**. The histograms of relative frequencies confirm the shape of a uniform distribution for **crc16** and **murmur2**/**murmur3**.

Thus, based on the two criteria, the best choice is the **murmur2**/**murmur3** implementations.