Infrastructure for a Data Engineer: ClickHouse

In this article I want to show how you can use ClickHouse in data engineering and how to get hands-on with it.

I would not like to repeat existing material, so you can learn most of the information about ClickHouse from the videos below:

  • I highly recommend watching this video: it covers the entire ClickHouse architecture, with lots of tricks and other subtleties. Don't worry that the video is old; it is still relevant, because the ClickHouse architecture has not changed in many years.

  • I recommend watching this video after the first one. Some things are repeated or explained differently, but either way it will help you better understand how to work with ClickHouse.

  • And for a deeper understanding of this technology, I recommend watching the video Effective use of ClickHouse / Alexey Milovidov (Yandex)

All further code will be available in my repositories.

Service deployment

First, let's create a docker-compose.yaml with the following code:

services:
  ch_server:
    image: clickhouse/clickhouse-server:24.8.4
    ports:
      - "8123:8123"   # HTTP interface (used by DBeaver and clickhouse-connect below)
    environment:
      CLICKHOUSE_USER: click
      CLICKHOUSE_PASSWORD: click

Then in the terminal run the command: docker-compose up -d.

And after that you will be able to connect to ClickHouse with the following parameters:

  • host: localhost

  • port: 8123

  • user: click

  • password: click

I will use DBeaver to connect, but you can use any other third-party client.

ClickHouse used to promote its own client, but now it recommends solutions from third-party developers.

Important: try to look for information in the English documentation, since the Russian documentation lags far behind in content and freshness. From here on, I will link only to the EN version of the documentation.

ClickHouse features

Before describing the capabilities, I would like to give a small disclaimer.

ClickHouse is a specific database: it is aimed primarily at OLAP tasks.

I would like to point out here that in order to be called a DBMS, ClickHouse must support the SQL standard. What does this mean?

It means that some things in ClickHouse are implemented only to satisfy the SQL standard, for example UPDATE and DELETE.

As you study ClickHouse further, you will repeatedly see warnings like "UPDATE is not recommended". The same applies to many other features.

But this does not mean that they cannot be used; there are simply some restrictions.
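For illustration, here is a minimal sketch of what updates and deletes actually look like in ClickHouse (reusing the test_table created later in this article): they are asynchronous "mutations" that rewrite data parts in the background, not row-level transactions.

-- Mutations: asynchronous background rewrites of data parts,
-- not transactional row-level operations.
ALTER TABLE test_table UPDATE name = 'renamed' WHERE id = 1;
ALTER TABLE test_table DELETE WHERE id = 10;

-- You can check the progress of mutations in a system table:
SELECT mutation_id, is_done FROM system.mutations WHERE table = 'test_table';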

It is also worth noting that ClickHouse has its own take on each "standard" SQL feature. For example, instead of "classic" window functions, in ClickHouse it is often better to use arrays, and so on.

Columns vs rows

Each task has its own tool. Therefore, for analytical tasks it is worth using databases optimized for OLAP workloads.

To put it with examples: OLAP and OLTP serve different tasks.

For OLTP tasks, we need point lookups by a specific field (a user, a particular event, etc.).

With OLAP tasks, we want to look at the big picture. So instead of a specific record, we look at a selection of rows (by range, by some condition, etc.).

There is a good example I once heard: imagine you are a forester and you need to find one specific tree, say a particular birch, to check its status and update it in your database. An OLTP DBMS (PostgreSQL, MySQL, etc.) is well suited for such tasks.

And if you want to see the overall state of the forest – how many sick trees there are, how many birches and firs you have, or the average height of the trees – then you are looking at the forest as a whole rather than at a single tree, and OLAP DBMSs (ClickHouse, Greenplum, etc.) are suited for such tasks.

And to meet the needs of OLAP loads, ClickHouse uses Column-oriented data storage.

The advantages of this approach are described in the videos above, and the ClickHouse website also has a good visualization that clearly shows the advantages of the different approaches.
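To make the contrast concrete, here is a hedged sketch of the two query styles; the orders table and its columns are purely illustrative:

-- OLAP-style query: scans many rows but touches only two columns --
-- exactly the pattern columnar storage is built for.
SELECT
    toStartOfMonth(order_date) AS month,
    avg(amount) AS avg_bill
FROM orders
GROUP BY month;

-- OLTP-style lookup: one specific row, all columns --
-- the pattern row-oriented databases are optimized for.
SELECT * FROM orders WHERE order_id = 42;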

Engines

One of the advantages of ClickHouse is the ability to create tables with different engines.

ClickHouse engines help you implement certain logic without writing your own code.

Below we will look at a couple of engines that you are likely to encounter most often in practice:

  • MergeTree – the most common and most "basic" engine for creating tables in ClickHouse. It provides fast inserts, supports large volumes of data, and demonstrates all the main advantages of ClickHouse.

  • ReplacingMergeTree – in my opinion, the second most used engine, because it lets you deduplicate rows by sorting key (a short sketch follows after this list).

    This engine allows you to avoid duplicate rows with the same sorting key and keep the table in an up-to-date state.

    It is worth noting right away that the "collapse" happens according to ClickHouse's internal logic. It can happen almost immediately or after some time; it all depends on the volume of inserted data and the size of the table itself.

    When querying such a table, you can use the keyword FINAL, and the duplicates will be collapsed during query execution. But this is not recommended, because it creates extra load on the database and does not deduplicate the rows in the table itself.

    You can also trigger the collapse yourself with the command OPTIMIZE TABLE table_name FINAL. But this is not recommended either; I suggest leaving such operations to ClickHouse itself.
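Here is the promised sketch of ReplacingMergeTree; all names are illustrative, and the version column is optional but makes the behavior predictable:

-- Keep only the latest row per user_id.
CREATE TABLE user_state
(
    user_id UInt64,
    status  String,
    updated DateTime
)
ENGINE = ReplacingMergeTree(updated)  -- 'updated' is the version column
ORDER BY user_id;                     -- rows with the same user_id collapse on merge

INSERT INTO user_state VALUES (1, 'active', now());
INSERT INTO user_state VALUES (1, 'banned', now());

-- Until a background merge happens, both rows may be visible;
-- FINAL forces the collapse at query time (at extra cost).
SELECT * FROM user_state FINAL;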

More details about all the engines and their capabilities can be found in the documentation.

In addition to its own engines, ClickHouse has a number of engines for working with other DBMSs. Don't forget about them: thanks to them you can connect to various databases and combine data for your tasks.

MPP

MPP (Massively Parallel Processing) – you will often come across this term in articles/talks/videos about data processing. It is a popular approach that distributes computation across multiple workers.

Let's sort it out in order. First, it is worth understanding what a shard is. It is described in more detail here.

To put it with an example: sharding lets you distribute the load. Take a typical task – calculating the average order value over the entire period. One machine (computer, server) may not handle the load, so we can add another shard to our ClickHouse cluster, and the load of analytical queries will be distributed between the shards, letting requests complete faster.

Important: when working with several shards, it is worth choosing the sharding key (distribution key) correctly. If you select the wrong key, adding shards will not speed up your queries and may even slow them down. More subtleties are described in the documentation.

It is also worth knowing and understanding what a replica is. You can find the definition in the documentation. But to simplify: a replica is a copy of your data that you can switch to if one of your shards has problems.
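As a minimal sketch (assuming a cluster named my_cluster is already defined in the server config), sharding is usually expressed through a local table plus a Distributed table on top of it:

-- A local table exists on every shard...
CREATE TABLE orders_local ON CLUSTER my_cluster
(
    order_id UInt64,
    amount   Decimal(18, 2)
)
ENGINE = MergeTree()
ORDER BY order_id;

-- ...and a Distributed "umbrella" table fans inserts and queries out.
CREATE TABLE orders_dist ON CLUSTER my_cluster AS orders_local
ENGINE = Distributed(my_cluster, currentDatabase(), orders_local, rand());
-- rand() is the sharding key: it spreads inserts evenly across shards,
-- and SELECTs against orders_dist run in parallel on every shard.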

MATERIALIZED VIEW

Here I would just like to point out that MATERIALIZED VIEW in ClickHouse does not work the way you might expect. If you are used to MATERIALIZED VIEW in PostgreSQL, the logic here is different.

The basic logic of a MATERIALIZED VIEW in ClickHouse is automatic updating of the embedded calculation. Accordingly, if you insert data into the "main" table on which the MATERIALIZED VIEW was built, the MATERIALIZED VIEW is updated automatically after the insert.

Described in more detail in the documentation.
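A minimal sketch of this trigger-like behavior (table and view names are illustrative):

CREATE TABLE events
(
    event_date Date,
    user_id    UInt64
)
ENGINE = MergeTree()
ORDER BY event_date;

-- The materialized view is updated on every INSERT into 'events',
-- not on a schedule as you might expect coming from PostgreSQL.
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree()
ORDER BY event_date
AS SELECT event_date, count() AS events
FROM events
GROUP BY event_date;

INSERT INTO events VALUES (today(), 1), (today(), 2);  -- events_daily updates automatically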

Dictionary

A powerful tool in ClickHouse that you definitely need to get acquainted with is dictionaries. This is one of the table engines that lets you create a hash table (key-value) inside ClickHouse.

Thanks to this, query execution can become noticeably faster. You can find implementation examples in the documentation.

When creating dictionaries, I would follow the advice of ClickHouse itself:

Use smaller datasets on the right side of JOIN: This query may seem more verbose than is required, with the filtering on PostIds occurring in both the outer and sub queries. This is a performance optimization which ensures the query response time is fast. For optimal performance, always ensure the right side of the JOIN is the smaller set, and as small as possible. For tips on optimizing JOIN performance and understanding the algorithms available, we recommend this series of blog articles.

Dictionaries can also be created from external sources (S3, PostgreSQL, etc.). Thanks to this feature, you can build dictionaries over "small" and frequently used tables.

Instead of copying entire tables from S3/PostgreSQL/etc., we can use dictionaries, which give us fast data retrieval while reducing the load on both the source and ClickHouse; a sketch follows below.
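Here is that sketch: a dictionary over a small external PostgreSQL reference table. The connection details, table, and column names are all assumptions for illustration:

CREATE DICTIONARY country_dict
(
    country_id UInt64,
    name       String
)
PRIMARY KEY country_id
SOURCE(POSTGRESQL(
    host 'localhost' port 5432
    user 'postgres' password 'postgres'
    db 'ref' table 'countries'
))
LAYOUT(HASHED())             -- in-memory hash table: fast key -> value lookups
LIFETIME(MIN 300 MAX 600);   -- re-read the source every 5-10 minutes

-- A point lookup replaces a JOIN entirely:
SELECT dictGet('country_dict', 'name', toUInt64(7));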

S3

ClickHouse also provides a good interface for interacting with S3. All S3 integration capabilities of ClickHouse are described in the documentation.
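As a quick taste (the bucket path and credentials are placeholders), the s3 table function lets you query files in place:

-- Query a file in S3 without loading it into ClickHouse first.
SELECT count()
FROM s3(
    'https://my-bucket.s3.amazonaws.com/events/*.parquet',
    'ACCESS_KEY',
    'SECRET_KEY',
    'Parquet'
);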

If you want to try S3 + ClickHouse, you can use my article Infrastructure for data engineer S3. It will help you integrate S3 into your project.

Kafka

If you don’t know what Kafka is, then I recommend reading my article Infrastructure for data engineer Kafka.

In my opinion, integrating Kafka through ClickHouse is one of the simplest options: a minimum of actions, and the result is visible immediately. An example of Kafka + ClickHouse integration is presented in my article CDC on Primitives.

But if you don't want to read my materials and would rather dig into the Kafka engine in ClickHouse yourself, you can use the official documentation.
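For a taste of the Kafka engine, here is a minimal hedged sketch (broker, topic, and schema are assumptions): a Kafka "pipe" table, a regular storage table, and a materialized view moving data between them.

-- The Kafka engine table consumes messages from a topic...
CREATE TABLE kafka_events
(
    id      UInt64,
    payload String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'localhost:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'ch_consumer',
         kafka_format      = 'JSONEachRow';

-- ...a regular MergeTree table stores them...
CREATE TABLE events_store AS kafka_events
ENGINE = MergeTree() ORDER BY id;

-- ...and a materialized view continuously moves rows from one to the other.
CREATE MATERIALIZED VIEW events_pipe TO events_store
AS SELECT * FROM kafka_events;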

ARRAY

Arrays are a very powerful ClickHouse tool. To write effective queries, it is worth knowing and understanding them, as they can significantly speed up your SQL code.

Good documentation on arrays is available on the ClickHouse website: link.

You can also watch the video Working with arrays in ClickHouse | Maria Somova | karpov.courses. It provides common examples of analytical problems that can be solved using arrays.
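A typical example of the array approach (the user_events table is illustrative): collect rows into an array per key and analyze it in one pass instead of using window functions.

SELECT
    user_id,
    groupArray(event) AS events,                        -- aggregate rows into an array
    arrayFilter(e -> e = 'purchase', events) AS purchases,
    length(purchases) AS purchase_count
FROM user_events
GROUP BY user_id;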

Python client

You can interact with ClickHouse through many different clients: Python, Rust, Java, etc.

But I would like to show a couple of examples of how you can work with ClickHouse using Python.

All Python client documentation is available at link.

First, you need to install the package to work with ClickHouse; to do this, run the command:

pip install clickhouse-connect

After successful installation, let's create a table with the following code:

import clickhouse_connect

# Connect over the HTTP interface (port 8123) exposed in docker-compose.yaml
client = clickhouse_connect.get_client(
    host="localhost",
    username="click",
    password="click",
)

# command() is intended for DDL and other statements that return no data
client.command(
    """
    CREATE TABLE IF NOT EXISTS test_table
    (
        id UInt8,
        name String
    )
    ENGINE = MergeTree()
    ORDER BY id
    """
)

Now let's load some data into our test_table:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    username="click",
    password="click",
)

# A single multi-row INSERT: ClickHouse prefers batches over row-by-row inserts
client.command(
    """
    INSERT INTO test_table
    VALUES
    (1, 'name1'),
    (2, 'name2'),
    (3, 'name3'),
    (4, 'name4'),
    (5, 'name5'),
    (6, 'name6'),
    (7, 'name7'),
    (8, 'name8'),
    (9, 'name9'),
    (10, 'name10')
    """
)

Now we can read the data back from ClickHouse in a convenient format. The code below fetches the data as a pd.DataFrame:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    username="click",
    password="click",
)

# query_df() returns the result as a pandas DataFrame
df = client.query_df(
    """
    SELECT * FROM test_table
    """,
)

Important: for the query_df method to work, you need to install the pandas library: pip install pandas

We can also stream the data back as tuples:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    username="click",
    password="click",
)

# Stream rows one by one without loading the whole result into memory
with client.query_rows_stream("SELECT * FROM test_table") as stream:
    for row in stream:
        print(row)

chDB

In fact, I could write a separate article about chDB, but for now I’ll make do with links to materials from ClickHouse.

All information about chDB is on the ClickHouse website; you can find it at link.

I would say that chDB is an alternative to, and sometimes a complement to, DuckDB, because it positions itself similarly:

  • In-process OLAP Engine

  • Serverless

  • Easy to install

  • Has different APIs

  • And other points that are described in link.

If you want to learn more about DuckDB, then I have an article on this topic: Everything you need to know about DuckDB.

INSERT DATA

An important feature of ClickHouse is that data should be inserted in batches, and the larger the batch, the better. This is due to the specifics of how the database stores data.

If you insert in batches and avoid row-by-row inserts, your ClickHouse will be only too happy.

Why inserts should be batched is described in more detail in the documentation.
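If batching on the client side is inconvenient, recent ClickHouse versions can buffer and batch rows for you with asynchronous inserts. A minimal sketch, reusing the test_table from above:

-- Each INSERT normally creates a new data part, so many small inserts are expensive.
-- async_insert tells the server to buffer rows and flush them in batches.
INSERT INTO test_table
SETTINGS async_insert = 1, wait_for_async_insert = 1
VALUES (11, 'name11');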

Wide tables

It is also worth noting that JOINs in ClickHouse are not as efficient as in other OLAP DBMSs or in OLTP databases.

Therefore, you should avoid unnecessary JOINs when writing analytical queries.

ClickHouse relies on the denormalized-data approach; it does not implement Data Vault 1, Data Vault 2, or Anchor Modeling. In ClickHouse, your task is to create wide tables for analytics that cover many analytical queries.
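A hedged sketch of denormalization (all names are illustrative): instead of joining orders to users and products at query time, copy the needed attributes into one wide table at load time.

CREATE TABLE orders_wide
(
    order_id     UInt64,
    order_date   Date,
    amount       Decimal(18, 2),
    user_name    String,   -- copied from the users table at load time
    user_country String,   -- copied from the users table at load time
    product_name String    -- copied from the products table at load time
)
ENGINE = MergeTree()
ORDER BY (order_date, order_id);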

More information about Denormalizing Data is described in the documentation.

And after reading about Denormalizing Data, it is worth studying Using JOINs in ClickHouse.

Summary

ClickHouse is a very powerful tool; even one article is not enough to cover all its capabilities, pros, and cons. In this article I wanted to introduce you to it and tell you a little about a tool that often appears in data projects.

Don't be afraid of it. ClickHouse is a very popular tool, so finding information on your question will not be difficult.

It is also worth noting that ClickHouse was developed in Russia, so it has a large Russian-speaking community. You can join it on TG.

And most importantly: theory without practice is dead, and practice without theory is blind. So try ClickHouse, even in pet projects or within the framework of this article.


Also, if you need consulting/mentoring/mock interviews or have other data engineering questions, you can contact me. All contacts are listed at link.
