Counting unique field values ​​in ClickHouse

Interested in solving the problem of finding unique values ​​in ClickHouse? Welcome 🙂

Let's say there are a billion records in a table, and you want to roughly calculate the number of unique records for a column. Since the approximate value is satisfactory, it is clear that I would not want to save all the unique records, because this will waste a lot of memory. It looks like a problem solved using statistical methods.

To illustrate the capabilities of statistical methods, it is interesting to recall, for example, GOST and ISO standards related to selective quality control.

Without going into details of the formulation statistical hypothesischoice statistical criterion and definitions confidence interval taking into account errors of the first and second kindI would like to briefly give an example from GOST for a similar problem, so that it is clear what solution one would like to obtain in the original problem of approximate estimation of the number of unique values ​​in a field.

So, in the example from GOST Appendix D.1 Supplier control Random quality control of a batch of 2120 watches is carried out. Without going into details, for control on the supplier’s side it is enough to check the quality by attribute on a sample of n=239 watches (from the entire batch of 2120 pieces), and to accept the batch it is necessary that the number of watches with deviations does not exceed A=3 with a probability of acceptance P =0.9503 and level of discrepancies qα=0.6%. In other words, with A=3 units of watches with deviations in a sample of n=239 watches from a batch of 2120 pieces, with probability P=0.9503 we can say that in the entire batch the number of defective watches will be qα=0.6%. By the way, in the example Appendix D.2 Consumer control from batch 2120, it is enough to check n=73 units with acceptance number A=4 to decide whether to accept or return a batch of watches with a level of discrepancies qβ8.0% and a probability of at least 1-β=0.80. These examples show that even with small samples, several times smaller than the batch under study, it is possible to successfully solve important problems (in this case, quality control) with a given accuracy and probabilities.

In fact, even if we don’t think about it, statistical methods are used in the production and sale of all goods in the world because of their efficiency, therefore, returning to the original task of approximately counting the number of unique values, I would like to get results in a similar form, i.e. e. with probability and confidence interval (or, for example, relative error).

It seems that in the task of calculating the approximate number of unique elements in ClickHouse, it makes sense to use the condition SAMPLEto take, for example, 10% of the rows of a table using SQL of the form FROM table SAMPLE 0.1, estimate the observed value of some statistical criterion, determine the confidence interval for a given probability using the theoretical value of the criterion, etc.

Fortunately, an approximate determination of the number of unique values ​​in ClickHouse is already implemented in the function uniqTheta. It is worth noting that ClickHouse has other functions for counting unique values, for example, uniq, uniqHLL12But uniqTheta gives results from the point of view of the statistical approach. For example, for a table size of 4096 with a confidence level of 0.95, the relative error is 3.125%, which is given in documentation And theoretical table errors, which is encouraging and speaks of a statistical approach to solving the problem. Moreover, with increasing table size, the relative error only decreases, as can be seen from error tableswhich once again confirms that to obtain a relatively accurate result there is absolutely no need to look through the entire billion records of the table.

Under the hood, uniqTheta uses KVM (or kth Minimum Value).

Scheme of the KWM algorithm for k=3 from https://datasketches.apache.org/docs/Theta/KMVbetterEst.html

You can see that a hash function is used that returns a uniform distribution of Uniform Random Hash from 0 to 1. It is calculated to sample from the table field values. The resulting hash values ​​are sorted. Next, we obtain an estimate of the number of unique records through the distance from 0 to k records (an example is also given for k=3 and distance d=0.195 – corresponds to the figure):

The relative error can be taken from error tables by number of lines (Conf: K) and error (#Std Dev: 2).

The documentation also describes the memory costs for uniqTheta (41 kilobytes).

4096(2^12) 64-bit sketch are used. The size of the state is about 41 KB.

Thus, using uniqTheta we can approximately determine the number of unique records. If necessary, based on the total number of records in the table, check by error tablewhat will be the confidence probability and error.

I would like to note that within the framework of this article it is difficult to definitively recommend this or that ClickHouse function; the article is of a review nature, and the effectiveness of this or that function is tested for each specific case, for example, using a benchmark.

I hope the dive into statistical approaches was interesting, good luck with Big Data processing 🙂

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *