Estimating the cardinality of table fields

First, let's determine how cardinality is measured and what kind of cardinality is needed.

The cardinality of a field (or multiple fields) in a table depends not only on the number of unique records in the field, but also on the number of records in the table. Therefore, the cardinality of a table field (or fields) can be measured by the ratio of the number of unique records in the field (fields) to the total number of records in the table.

Next, what cardinality value is desired so that adding GROUP BY reduces query execution time? This depends on the DBMS, but in general you can accept a low cardinality threshold, for example 0.1, i.e. when the cardinality value for a field(s) is from 0 to 0.1, such a field(s) is low cardinality and adding such a field(s) to GROUP BY will reduce the query execution time. Of course, 0.1 can be considered a parameter and changed if necessary depending on the DBMS.

Now that the cardinality and its desired value have been quantified, it is convenient to consider the calculation of cardinality using the example of two cases. We will consider as the “first field of the table” the tens place of a two-digit number (i.e., for example, 1 from the number 12), and as the “second field of the table” the units place of a two-digit number (i.e., for example, 2 from the number 12 ). We will assume that the low cardinality threshold is 0.1, and consider 100 records (i.e., the “number of records in the table” will be 100).

Case 1 – maximum cardinality

100 комбинаций:
00
01
02
03
..
98
99

Cardinality of the tens place: 0.1 (corresponds to the threshold of 0.1), cardinality of the ones place: 0.1 (corresponds to the threshold of 0.1), cardinality of the two places: 100 / 100 = 1 (outside the threshold of 0.1).

When calculating the cardinality of two digits, we can recall placements with repetitions from combinatorics, namely, placement with repetitions from n to k:

\mathrm{\overline{A}}_{n}^{k}={n}^{k}

In our case, for 2 digits and 10 digits, the number of placements from 10 to 2 with repetitions will be:

\mathrm{\overline{A}}_{10}^{2}={10}^{2}=100

Thus, one can once again be convinced from the point of view of combinatorics that even if the cardinality of 0.1 is observed for the digits of tens and units separately, in the presented data the number of unique values ​​of two-digit numbers is maximum and equal to 100, so the cardinality of the two digits is equal to 1.

Case 2 – minimum cardinality

100 комбинаций:
00 (10 раз 00)
00
00
00
00
00
00
00
00
00
11 (10 раз 11)
11
..

99 (10 раз 99)
99

Cardinality of the tens place: 0.1 (corresponds to the threshold of 0.1), cardinality of the ones place: 0.1 (corresponds to the threshold of 0.1), cardinality of the two places: 10 / 100 = 0.1 (corresponds to the threshold of 0.1).

Here you can see that the digits of tens and units are duplicated, there is a linear relationship, the correlation coefficient is 1.

Accordingly, the cardinality of two digits is equal to the cardinality of one digit.

It can be noted that in both cases, the cardinality of the tens and units digits separately is 0.1 and corresponds to the low cardinality threshold, however, the cardinality of two-digit numbers varies from 0.1 (the cardinality of one digit) in the second case to 1 (the maximum possible cardinality) in the first case.

By the way, the cardinality error for two digits when comparing case 1 with case 2 is as much as 1 / 0.1 = 10 = 1000%.

Thus, even for two columns, each of which has a cardinality within the threshold of 0.1, the cardinality of two columns varies from 0.1 to 1, i.e. even for two columns it is possible to get all unique records.

It is also worth noting that the cardinality for a large number of columns is high, and as the number of columns increases, the cardinality tends to 1 (provided that there are no duplicates in the table).

Therefore, the cardinality of several columns can be calculated accurately only by additional processing of the source data (for example, SQL queries with GROUP BY).

Without additional calculations on the source data and without SQL queries, you can also successfully estimate the cardinality, but not in the general case, but in particular ones (which is also useful).

Special cases may be as follows.

  1. There are few unique values ​​in the field (for example, a boolean value of 0 and 1, a year with a cardinality of 1/365, a month with a cardinality of 1/30).

  2. High correlation coefficient (close to 1 or -1, for example, 0.99) – for “duplicate” fields with “almost identical” data, for linear dependencies (for example, year in YY and YYYY format – dependency of the form YYYY = 2000 + YY) . This is also true not only for the case of pairwise correlation, but also for multiple correlation, for example, for the connection of the final price with VAT Price Total, VAT Amount and price excluding VAT Price, i.e. for Price Total = VAT Amount + Price.

As a result, how to estimate the cardinality for a given set of table fields column1, column2, column3, …, columnn without additional queries to the DBMS, and taking into account possible correlations?

We consider that we periodically calculate a correlation matrix with pair correlation coefficients for table fields (for example, in ClickHouse based on corrMatrix). We also periodically calculate approximate values ​​of the number of unique records for each field column1, column2, column3, …, columnn in the table (for example, one of the uniq functions in ClickHouse, i.e. uniq(column1), etc.), denote them as uniq1, uniq2, …, uniqn.

Checking that the fields column1, column2, column3, …, columnn together have low cardinality may involve the following steps.

  • Check that all fields have low cardinality and meet the threshold of 0.1; if at least one field does not, then the cardinality of all fields is not low and stop the analysis.

  • Eliminate columns with correlation coefficients close to 1 or -1 (pairwise and multiple correlations) by a threshold, for example, 0.99, i.e. for pairwise correlation, remove one of two linearly dependent columns; for multiple correlation, remove all dependent columns except one (for example, leave one of three). As a result, we get column1, column2, column3, …, columnm, with m ≤ n.

  • Finally, estimate the cardinality of the remaining columns after removing “duplicates” (in the sense of linearly dependent columns) taking into account the negative scenario (which, for example, corresponds to case 1 discussed above), i.e. maximum cardinality of the remaining columns:

    \frac{uniq(column1)\cdot uniq(column2)\cdot ...\cdot uniq(columnm)}{\text{number of records in the table}}

    Next, compare the obtained value with the low cardinality threshold of 0.1.

As an example, you can calculate the cardinality of the month and year columns for records for 365 days of one year: 1 12 / 365 = 0.03 < 0.1 - low cardinality

It can be seen that you can also add a column with half a year to GROUP BY, and there will still be a low cardinality of 0.03 * 2 = 0.06 < 0.1, or a “duplicate” column with high correlation - for example, “duplicate” years in different formats YY and YYYY - linear relationship between YYYY and YY: YYYY = 2000 + YY.

You can also give an example for the month and week number where the low cardinality threshold is violated: 12 52 / 365 = 1.7 > 0.1

It is also worth noting the features of filtering columns to calculate cardinality based on the multiple correlation coefficient.

For connections of three columns, it is calculated using the formula:

r_{yx1x2}=\sqrt{\frac{r_{x1y}^{2} + r_{x2y}^{2} - 2 \cdot r_{x1y} \cdot r_{x2y} \cdot r_{x1x2}}{ 1 - r_{x1x2} ^{2}}}

In general, it is calculated based on a matrix of paired correlation coefficients of the form:

R=\begin{bmatrix} 1 & r_{x1x2} & r_{x1x3} & ... & r_{x1xn} \\ r_{x1x2} & 1 & r_{x2x3} & ... & r_{x2xn} \ \ r_{x1x3} & r_{x2x3} & 1 & ... & r_{x3xn} \\ ... & ... & ... & ... & ... \\ r_{x1xn} & r_ {x2xn} & r_{x3xn} & ... & 1 \end{bmatrix}

The multiple correlation coefficient is calculated based on the determinant of the correlation matrix |R| And algebraic complement R11 element of the first element of this matrix:

r_{x1x2x3...xn}=\sqrt{1-\frac{\left| R\right|}{R_{11}}}

The calculation of the correlation coefficient itself (pairwise or multiple) does not present any great difficulties, but difficulties may arise with the assessment of cardinality for a relatively large number of fields and for a small number of related fields in multiple correlation (for example, 3).

So, for 10 fields and assessing multiple correlation between 3 fields, the number of combinations of three columns will be equal to the number of combinations of 10 by 3:

\mathrm{C}_{10}^{3}=\frac{10!}{3!\cdot (10-3)!}=120

Thus, searching for a connection between 3 fields among 10 fields “head-on” implies 120 calculations of the multiple correlation coefficient to find a possible combination of 3 linearly dependent columns, which makes sense to take into account from a performance point of view.

I wish you fast queries and successful dashboards 🙂

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *