What does ClickHouse do while we sleep or unexpected OOM
Due to certain circumstances, while developing up the career ladder, it became increasingly necessary for me to come into contact with such a well-known tool in the IT world as ClickHouse. Although my positions for the last N-years have been more related to team management, it is very important for me to understand the essence of technological solutions and the problems that my teams face. Despite all its friendliness and design flexibility, ClickHouse sometimes looks like a very capricious technology. One of these features that I happened to encounter was the fight against sudden OOM. In this article, together with you, we will try to consider the reasons why this sudden OOM, in fact, can come from.
ClickHouse is one of the most famous and efficient column-based analytics databases available for use in self-hosted mode. In addition, it is open source (unlike the same Vertica). Standard ClickHouse configurations may well have from 128 to 256 GB of RAM. Nevertheless, OOM, or Out of Memory exception – lack of RAM to execute a request, is a fairly common error when working with this database. Of course, quite often the reason lies in the suboptimal schema of this table, an inefficient query, or the settings of ClickHouse itself. In this article, we will take a step further and talk about the background processes in ClickHouse that are necessary for its operation and at the same time consume precious RAM.
What are the background processes?
Yes, ClickHouse works not only when a request comes in, but in general when it pleases! We are used to the fact that, say, in PostgreSQL, the increase in resource consumption follows the request that caused it and, after sending the response, ends. Even operations like
REINDEX are also performed on request, manually. WAL replication, avtovacuum, and the like usually don’t affect performance, or at least these processes don’t cause spikes in disk, memory, or processor usage.
At the same time, ClickHouse can launch a set of background tasks at a random time, using resources comparable to a regular analytical query. In my experience, there have been two types of cases where this brings problems:
one) “my analytical query, which was always working out, suddenly fell in OOM, although nothing really changed in the table”;
and a little more complex:
2) “a uniform sharding key was chosen, but one of the shards is constantly underloaded”. To answer this question, let’s briefly discuss how ClickHouse stores data.
How does ClickHouse store data?
First, it’s still worth mentioning that ClickHouse is a column database, and grouping records by columns allows you to reduce the number of random access reads in aggregation queries.
Secondly, this database is designed for high throughput for adding new records. Let’s dwell on this point in more detail.
If in relational databases during
INSERT If a transaction occurs, the record immediately goes to disk, then ClickHouse basically has no transactions, and the record first goes to RAM. Thus, incoming records live in the mempool for some time before being sent as a batch to a slow disk. This approach deprives us of the D (Durability) property from the ACID principles (Atomicy, Сonsistency, Isolation, Durability) that are in relational databases. It is worth noting that ClickHouse began to support
WAL+fsync, but large companies in the IT world are characterized by a long transition to new versions of the product. Those. in case of abnormal shutdown – data from the mempool may be lost, but this reduces the number of random access writes to disk.
Batches from the mempool are then grouped by keys (columns) with other batches into larger groups, and so on. To understand which group the batch will go to, the probabilistic data structure is calculated in the group – bloom filter with the keys it contains. Bloom filters can have false positives, but not false negatives. Therefore, for example, it is difficult to check if a key is missing in a group, but ClickHouse does not need it.
And finally, the batch groups themselves are also combined into larger groups and sorted by size into layers. It is this merge process that probably gave the name to the MergeTree engine in ClicklHouse.
The main idea behind the engines of the MergeTree family is the following: if you have a huge amount of data that needs to be inserted into a table, we should be able to quickly write it down in parts, and then merge the parts according to some rules in the background. Thus, this method is much more efficient than endlessly overwriting data in storage on insertion.
Merging large groups can use a significant amount of resources, but the larger the group, the less often such mergers occur.
These mergers just happen in the background. At some arbitrary point in time, when ClickHouse decides that it is time to write the accumulated mempool to disk, and the new batch will trigger a chain of merges of large groups, we can get an unexpected load peak.
At the same time, in a distributed table
ENGINE=Distributed still more interesting. Mergers can cause uneven shard loading even with a good shard key. The fact is that the standard balancer from the open-source version of ClickHouse does not take into account the loading of shards and sends inserted records to the first available one. But if a shard merges, its throughput drops and throughput degrades. But we will talk more about this another time.