Combinators in ClickHouse

As I worked, I often encountered the fact that not all colleagues were familiar with aggregate function combinators in ClickHouse or limited themselves to using the -If combinator. This prompted me to write an article. I hope you find it useful.

The article will not cover all combinators, but if you have questions about any of them, feel free to ask in the comments. The full list can be found at link.

All examples in the article are implemented using built-in ClickHouse tools or using the UK property price dataset. Link for instructions on creating and loading a table, and a detailed description of the fields can be found in source.

Table creation script
CREATE TABLE default.uk_price_paid
(
    `price` UInt32,
    `date` Date,
    `postcode1` LowCardinality(String),
    `postcode2` LowCardinality(String),
    `type` Enum8('other' = 0, 'terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4),
    `is_new` UInt8,
    `duration` Enum8('unknown' = 0, 'freehold' = 1, 'leasehold' = 2),
    `addr1` String,
    `addr2` String,
    `street` LowCardinality(String),
    `locality` LowCardinality(String),
    `town` LowCardinality(String),
    `district` LowCardinality(String),
    `county` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)
SETTINGS index_granularity = 8192

-If combinator

Let's start with the simplest, but no less useful combinators in ClickHouse. Such a combinator is -If. In my practice, it is the leader in terms of frequency of use. It allows you to specify an additional condition for the aggregate function. In addition to the field by which we aggregate, we add a condition or we can use a field from the table, provided that it takes the value 1/0.

For example, we want to know the ratio of the average sales price in a particular area of ​​London, say the City, to the average sales price in all of London since the start of 2024.

SELECT avgIf(price, district="CITY OF LONDON") / avg(price) AS ratio_
FROM uk_price_paid
WHERE (date >= '2024-01-01') AND (town = 'LONDON')


   ┌────────────ratio_─┐
1. │ 9.245274507120163 │
   └───────────────────┘

This combinator greatly simplifies writing SQL queries and allows you to avoid using joins and subqueries.

-Distinct

The combinator allows you to select only unique values ​​for the aggregate function.

Script for creating and filling a table
CREATE TABLE random
(
    `column1` UInt8
)
ENGINE = Memory
INSERT INTO 
  random  
SELECT 
  * 
FROM 
  generateRandom(
    'column1  UInt8', -- нейминг столбца и требуемый тип данных
    42 -- random seed
  ) 
LIMIT 10000

Now we have tables with 10,000 random values ​​ranging from 0 to 255.

SELECT
    sum(column1) AS sum_total,
    sumDistinct(column1) AS sum_distinct,
    sumDistinctIf(column1, column1 < 10) AS sum_distinct_cond
FROM random



   ┌─sum_total─┬─sum_distinct─┬─sum_distinct_cond─┐
1. │   1266960 │        32640 │                45 │
   └───────────┴──────────────┴───────────────────┘

We got the sum of the arithmetic progression in the cases sum_distinct and sum_distinct_cond.

State and derived combinators

The following combinators allow you to save and work with the intermediate state of aggregate functions, these are combinators such as -State, -Merge and -MergeState.

Let's start with -State. This combinator allows you to save an aggregate function in an intermediate state, from which you can later obtain the final result. Now, most likely, you have a question, why might this be necessary? There are several situations when this combinator is used:

Inside, AggregatingMergeTree is a table engine from the MergeTree family, which you can read more about link. In short, the main user case of this engine is updating the calculation of metrics as new data appears. The diagram can be seen in the figure below.

In this case, the materialized view acts as a trigger for inserting data, but it should be remembered that increasing the number of materialized views connected to one source table leads to an increase in the time it takes to insert data into the source table.

Another case is space saving (pre-aggregated functions in most cases take up significantly less space compared to the original data) and the ability to build a flexible data mart. Metrics in a pre-aggregated state can also be stored in a regular table from the MergeTree family.

Let's move on to the logic of the State combinator. For each function, the pre-aggregated state content is different: if for the avg function this will be the value of the numerator and denominator, then for the uniq function this will be a hash table.

Example of stored values ​​for sum and avg

Example of stored values ​​for sum and avg

Let's check how the combinator works and how it fits with other combinators using a dataset we generated randomly.

SELECT
    sumState(column1) AS sum_total,
    sumDistinctState(column1) AS sum_distinct,
    sumDistinctIfState(column1, column1 < 10) AS sum_distinct_cond
FROM random

If we execute this query, we will not get the final numbers, but only see binary data. To complete the aggregation, we will need the -Merge combinator or the finalizeAggregation function, and we will get the result as we calculated before.

Request in two versions
SELECT
    sumMerge(sum_total) AS sum_total,
    sumDistinctMerge(sum_distinct) AS sum_distinct,
    sumDistinctIfMerge(sum_distinct_cond) AS sum_distinct_cond
FROM
(
    SELECT
        sumState(column1) AS sum_total,
        sumDistinctState(column1) AS sum_distinct,
        sumDistinctIfState(column1, column1 < 10) AS sum_distinct_cond
    FROM random
)


   ┌─sum_total─┬─sum_distinct─┬─sum_distinct_cond─┐
1. │   1266960 │        32640 │                45 │
   └───────────┴──────────────┴───────────────────┘
; -- или
SELECT
    finalizeAggregation(sum_total) AS sum_total,
    finalizeAggregation(sum_distinct) AS sum_distinct,
    finalizeAggregation(sum_distinct_cond) AS sum_distinct_cond
FROM
(
    SELECT
        sumState(column1) AS sum_total,
        sumDistinctState(column1) AS sum_distinct,
        sumDistinctIfState(column1, column1 < 10) AS sum_distinct_cond
    FROM random
)

   ┌─sum_total─┬─sum_distinct─┬─sum_distinct_cond─┐
1. │   1266960 │        32640 │                45 │
   └───────────┴──────────────┴──────────────────

Also, when designing tables, we will need a function such as toTypeName, which will tell us what type of data our pre-aggregated functions have (AggregateFunction), which will help minimize the likelihood of making a mistake with the data type.

SELECT
    toTypeName(sumState(column1)) AS sum_total,
    toTypeName(sumDistinctState(column1)) AS sum_distinct,
    toTypeName(sumDistinctIfState(column1, column1 < 10)) AS sum_distinct_cond
FROM random

   ┌─sum_total─────────────────────┬─sum_distinct──────────────────────────┬─sum_distinct_cond──────────────────────────────┐
1. │ AggregateFunction(sum, UInt8) │ AggregateFunction(sumDistinct, UInt8) │ AggregateFunction(sumDistinctIf, UInt8, UInt8) │
   └───────────────────────────────┴───────────────────────────────────────┴────────────────────────────────────────────────┘

It is also worth remembering that for aliases such as median and others, the function responsible for the calculation will be used. For example, for median it is quantile. We can see this by running the following query:

SELECT toTypeName(medianState(column1))
FROM random


   ┌─toTypeName(medianState(column1))───┐
1. │ AggregateFunction(quantile, UInt8) │
   └────────────────────────────────────┘

If we talk about the -MergeState combinator, it combines intermediate states of aggregate functions into one. For example, this can be useful when combining State functions of the uniq type.

Comparison demonstration
SELECT uniqMerge(vs)
FROM
(
    SELECT uniqMergeState(column1) AS vs
    FROM
    (
        SELECT uniqState(number) AS column1
        FROM numbers(1, 5)
        UNION ALL
        SELECT uniqState(number) AS column1
        FROM numbers(4, 10)
    )
)

   ┌─uniqMerge(vs)─┐
1. │            13 │
   └───────────────┘

We get the same result as with a simple call to the uniq function.

SELECT uniq(number) AS vs
FROM
(
    SELECT number
    FROM numbers(1, 5)
    UNION ALL
    SELECT number
    FROM numbers(4, 10)
)
   ┌─vs─┐
   │ 13 │
   └────┘

Let's consider a situation where -MergeState might be useful. Let's say we have two tables: one that stores the pre-aggregated timespent of users interacting with video, and one that stores the pre-aggregated timespent of users interacting with audio. We want to get the overall pre-aggregated state for users. We can do avgMergeState over the pre-aggregated timespent and calculate how much time users spend in our product on average, taking into account both audio and video data.

Let's move on to the data presented at the beginning of the article and see how we can save space and build a flexible showcase. First, let's figure out what we want to calculate. Let it be the average price. We also need to select the dimensions by which we want to aggregate the information. This will be the city, region and date, so we get the following request:

SELECT
    date,
    town,
    district,
    avgState(price)
FROM uk_price_paid
GROUP BY
    date,
    town,
    district
Let's define the required data type for the table columns

The previously discussed toTypeName function will help us with this.

SELECT
    toTypeName(date),
    toTypeName(town),
    toTypeName(district),
    toTypeName(avgState)(price)
FROM uk_price_paid
GROUP BY
    date,
    town,
    district
limit 1

Now we need to create a suitable table for this data:

CREATE TABLE default.uk_price_paid_agg
(
    `date` Date,
    `town` LowCardinality(String),
    `district` LowCardinality(String),
    `avg_price` AggregateFunction(avg, UInt32)
)
ENGINE = MergeTree
ORDER BY (town, district, date)
SETTINGS index_granularity = 8192

Let's see how long our table took:

SELECT
    database,
    `table`,
    formatReadableSize(sum(data_compressed_bytes + data_uncompressed_bytes)) AS table_size,
    sum(rows) AS rows
FROM system.parts
WHERE (active = 1) AND (database LIKE '%') AND (`table` LIKE '%')
GROUP BY
    database,
    `table`
ORDER BY table_size DESC


   ┌─database─┬─table─────────────┬─table_size─┬─────rows─┐
1. │ default  │ uk_price_paid_agg │ 129.76 MiB │  6469138 │
2. │ default  │ uk_price_paid     │ 1.02 GiB   │ 29145919 │
   └──────────┴───────────────────┴────────────┴──────────┘

The table takes up almost 8 times less space than the source. Thanks to this showcase, we can flexibly calculate the average price value without accessing the data in the source table, which will speed up the query and allow us to build flexible dashboards using only the aggregated showcase. That is, if we filter data by cities or dates with districts, we will always get the correct average value. This allows us to build dashboards over sources with a large number of dimensions for more convenient visualization.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *