How to create and use dictionaries in ClickHouse

1 min


If you have opened this a door article, then for sure you have already dealt with ClickHouse and you can miss interesting details about its convenience and speed, and go straight to the point – in fact, how to create dictionaries and work with them in ClickHouse.

What are dictionaries in ClickHouse?

Dictionary is a display of data in the form keyvalue The advent of dictionaries has greatly simplified the use of third-party data sources in ClickHouse, automating all the necessary ETL processes to deliver data in a queryable form.
There are several points to highlight the advantages of using dictionaries in ClickHouse:

  • ClickHouse has support for various options for the location of dictionaries in memory.
  • Support TTL – ClickHouse automatically updates dictionaries and loads missing values.
  • ClickHouse provides several options for describing external dictionaries – XML ​​files and DDL queries.

Connecting dictionaries

You can connect your own dictionaries from various data sources: a local text / executable file, an HTTP (s) resource, another DBMS, etc.

The configuration of these dictionaries can be in one or several xml files, the path to which is specified in the parameter dictionaries_config in the ClickHouse configuration file.

Dictionaries can be loaded at server start or at first use, depending on the setting dictionaries_lazy_load

Also, updating dictionaries (except for loading on first use) does not block queries – during the update, queries use the old version of the dictionaries.

To view information about the dictionaries configured on the server, there is a table system.dictionaries, in it you can find:

  • dictionary status;
  • configuration parameters;
  • metrics such as the amount of RAM occupied by the dictionary or the number of requests to the dictionary since it was successfully loaded.

Dictionary configuration

At the moment there is a way to configure dictionaries through xml files and through DDL queries. You can use any method you like, but the easiest way to create and control dictionaries is using DDL queries.

General appearance of the xml dictionary configuration:


    
    Some comments

    
    /etc/metrika.xml

    
        
    

    ...

    
        
    

If you chose to create dictionaries through DDL queries, then do not set the dictionary configuration in the server configuration.

Example dictionary configuration:


    clients
    
        
            myHostName
            9000
            admin
            secret_password
            clients
            users
id<=10
3600 5400 user_id username string age Int8

Setting fields:

  • name – the name of the dictionary;
  • source – the source of the dictionary;
  • lifetime – the frequency of updating dictionaries;
  • layout – placement of the dictionary in memory. The speed of dictionary processing depends on this value;
  • structure – the structure of the dictionary. Key and attributes that can be obtained by key.

An example of creating a dictionary using a DDL query:

CREATE DICTIONARY dict_users_id (
    id UInt64,
    username String,
    email String,
    status UInt16,
    hash String
)
PRIMARY KEY id
SOURCE(MYSQL(
    port 3306
    user clickhouse
    password secret_password
    replica(host 'mysql1.fevlake.com' priority 1)
    db fevlake_dicts
    table users
))
LAYOUT(HASHED())
LIFETIME(MIN 3600 MAX 5400);

Sources of external dictionaries

External dictionaries can be connected through many different sources. The main ones are:

  • Local file
  • Executable file
  • HTTP (s)
  • DBMS

The most common ways to connect dictionaries are through a local file or a DBMS, so we will consider them below.

Local file

An example of connecting a dictionary via a local file is as follows:


    
      /opt/dictionaries/clients.csv
      CSV
    

Setting fields:

  • path Is the absolute path to the file.
  • format – file format. All ClickHouse formats are supported.

Or via DDL request:

SOURCE(FILE(path '/opt/dictionaries/clients.csv' format 'CSV'))
SETTINGS(format_csv_allow_single_quotes = 0)

DBMS

Let’s consider connecting a DBMS using a MySQL database as an example.

Setting example:


    
        3306
        clickhouse
        secret_password
        
            example01-1
            1
        
        
            example01-2
            1
        
        db_name
        table_name
id=10 SQL_QUERY

  • port – MySQL server port. Can be set separately for each replica inside the tag
  • user – MySQL username. Can be set separately for each replica inside the tag
  • password – MySQL user password. Can be set separately for each replica inside the tag
  • replica – replica configuration block. There can be several blocks.
  • db – the name of the database.
  • table – table name.
  • where – selection condition. The syntax is exactly the same as the syntax of the section WHERE in MySQL, for example, id >= 3 AND id < 10 (optional parameter).
  • invalidate_query - request to check the status of the dictionary (optional).

Or via DDL request:

SOURCE(MYSQL(
    port 3306
    user clickhouse
    password secret_password
    replica(host 'mysql1.fevlake.com' priority 1)
    db fevlake_dicts
    table users
))

Storing dictionaries in memory

There are many ways to store dictionaries in ClickHouse memory:

  • flat
  • hashed
  • sparse_hashed
  • cache
  • direct
  • range_hashed
  • complex_key_hashed
  • complex_key_cache
  • complex_key_direct
  • ip_trie

The most popular of them are only 3, since the processing speed of dictionaries is maximum, this is flat, hashed and complex_key_hashed... Let's take a look at examples of these storage methods.

Flat

Dictionaries are completely stored in RAM as flat arrays, with the amount of memory used being proportional to the size of the dictionary's largest key. The dictionary key must be of type UInt64 and must not be longer than 500,000, otherwise ClickHouse will throw an exception and not create a dictionary.

This storage method provides the best performance of any dictionary storage method available.

Configuration example:


    

or

LAYOUT(FLAT())

Hashed

The dictionary is completely stored in RAM in the form of hash tables and can contain an arbitrary number of elements with arbitrary identifiers. In practice, the number of keys can reach tens of millions of elements.

Configuration example:


    

or

LAYOUT(HASHED())

Сomplex_key_hashed

This type of placement is intended for use with composite keys. Is similar hashed way.

Configuration example:


    

or

LAYOUT(COMPLEX_KEY_HASHED())

Dictionary key and fields

Section describes the dictionary key and fields available for queries.

XML description:


    user_id
    
        username
        string
    
    
        age
        Int8
    

Setting fields:

  • - a column with a key;
  • - data column. Several attributes can be set.

Keys

ClickHouse supports the following types of keys:

  • Numeric key. UInt64... Described in the tag or by keyword PRIMARY KEY...
  • Composite key. A set of values ​​of different types. Described in the tag or by keyword PRIMARY KEY...

Numeric Key

A type: UInt64...

Configuration example:


    user_id

or

CREATE DICTIONARY (
    user_id UInt64,
    ...
)
PRIMARY KEY user_id
...

  • PRIMARY KEY - the name of the column with keys.

Composite key

The key can be a tuple (tuple) from fields of arbitrary types. In this case layout it should be complex_key_hashed or complex_key_cache...

The key structure is specified in the element ... Key fields are specified in the same format as dictionary attributes. Example:


    
        field1
        String
    
    
        field2
        UInt32
    
    ...

or

CREATE DICTIONARY ( field1 String, field2 String ... )
PRIMARY KEY field1, field2
...

Attributes


    ...
    
        Name
        ClickHouseDataType
        
        rand64()
        true
        true
        true
    

or

CREATE DICTIONARY somename (
    Name ClickHouseDataType DEFAULT '' EXPRESSION rand64() HIERARCHICAL INJECTIVE IS_OBJECT_ID
)

How dictionaries can be used in ClickHouse

One of the popular cases of using dictionaries in ClickHouse is aggregation of data by country based on IP (v4) addresses...

Let's imagine that we have a task: from the data of the column with ip String get a column with country String... To solve this problem, we will take quite popular databases GeoIP2 from MaxMind.

MaxMind provides with its .mmdb API bases for most popular programming languages.

ClickHouse does not have the ability to upload a format to a dictionary .mmdb, but we don't need it - MaxMind allows you to upload your databases in the form of several CSVs, which we will use.

In order to associate an IP with a country, we need to download the following files:

  • GeoIP2-Country-Blocks-IPv4.csv - this contains links between IP prefixes and country IDs;
  • GeoIP2-Country-Locations-en.csv - and here are the names of countries in English.

Next, let's create the appropriate dictionaries using DDL:

CREATE DICTIONARY dicts.geoip_country_blocks_ipv4 (
    network String DEFAULT '',
    geoname_id UInt64 DEFAULT 0,
    registered_country_geoname_id UInt64 DEFAULT 0,
    represented_country_geoname_id UInt64 DEFAULT 0,
    is_anonymous_proxy UInt8 DEFAULT 0,
    is_satellite_provider UInt8 DEFAULT 0
) 
PRIMARY KEY network
SOURCE(FILE(
    path '/var/lib/clickhouse/user_files/GeoIP2-Country-Blocks-IPv4.csv'
    format 'CSVWithNames'
))
LAYOUT(IP_TRIE())
LIFETIME(300);

In dictionary geoip_country_blocks_ipv4 we have to specify two main attributes:

  • network - IP prefix of the network, it will also be the dictionary key.
  • geoname_id - Country ID.

The rest of the attributes are in accordance with the header in CSV.

In order for ClickHouse to correctly match the network prefix and ID, we need to use the placement type ip_trie... To obtain values ​​from such a dictionary, it will be necessary to transmit the IP address in numerical form.

Now geoip_country_locations_en:

CREATE DICTIONARY dicts.geoip_country_locations_en (
    geoname_id UInt64 DEFAULT 0,
    locale_code String DEFAULT '',
    continent_code String DEFAULT '',
    continent_name String DEFAULT '',
    country_iso_code String DEFAULT '',
    country_name String DEFAULT '',
    is_in_european_union UInt8 DEFAULT 0
)
PRIMARY KEY geoname_id
SOURCE(FILE(
    path '/var/lib/clickhouse/user_files/GeoIP2-Country-Locations-en.csv'
    format 'CSVWithNames'
))
LAYOUT(HASHED())
LIFETIME(300);

We need to link the ID and the country name. In headlines GeoIP2-Country-Locations-en.csv the following attributes can be found:

  • geoname_id - Country ID, as in the previous dictionary, but now as a key.
  • country_name - the name of the country.

Specify optimized as the placement type hashed...

In each of the dictionaries, you must specify the paths to the corresponding CSV files.

Now having a table user_visits (user_ip String, user_id UUID), we can count the number of unique values ​​by country. One way to do this is to use the dictionary functions. dictGet*:

SELECT 
    dictGetString('dicts.geoip_city_locations_en', 'country_name', users_country_id) AS users_country, 
    uniqs
FROM (
    SELECT 
        dictGetUInt64('dicts.geoip_country_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNum(user_ip))) AS users_country_id, 
        uniq(user_id) AS uniqs
    FROM user_visits
    GROUP BY users_country_id
);

Let's analyze this request:

  1. convert string representation user_ip into numeric and wrapped in a tuple to match the composite key ip_trie-dictionaries: tuple(IPv4StringToNum(user_ip));
  2. use the resulting key to pick up the country ID as users_country_id: dictGetUInt64('geoip_country_blocks_ipv4', 'geoname_id', ...) as users_country_id;
  3. add the metric itself to the request: uniq(user_id) as uniq_users;
  4. we aggregate by country ID, which was taken from the dictionary: GROUP BY users_country_id;
  5. the result containing the IDs of the countries is matched with the names: dictGetString('geoip_city_locations_en', 'country_name', users_country_id) AS users_country...

Thus, it is possible to compare not only the names of countries. The same GeoIP2 databases contain a lot of other useful information, don't be afraid to try 🙂

Conclusion

This completes the initial acquaintance with dictionaries. I hope this information will expand your options for using ClickHouse and help you properly configure external data sources.


0 Comments

Leave a Reply