How to create and use dictionaries in ClickHouse
If you have opened this
a door article, then for sure you have already dealt with ClickHouse and you can miss interesting details about its convenience and speed, and go straight to the point – in fact, how to create dictionaries and work with them in ClickHouse.
What are dictionaries in ClickHouse?
Dictionary is a display of data in the form
value… The advent of dictionaries has greatly simplified the use of third-party data sources in ClickHouse, automating all the necessary ETL processes to deliver data in a queryable form.
There are several points to highlight the advantages of using dictionaries in ClickHouse:
- ClickHouse has support for various options for the location of dictionaries in memory.
TTL– ClickHouse automatically updates dictionaries and loads missing values.
- ClickHouse provides several options for describing external dictionaries – XML files and DDL queries.
You can connect your own dictionaries from various data sources: a local text / executable file, an HTTP (s) resource, another DBMS, etc.
The configuration of these dictionaries can be in one or several xml files, the path to which is specified in the parameter
dictionaries_config in the ClickHouse configuration file.
Dictionaries can be loaded at server start or at first use, depending on the setting
Also, updating dictionaries (except for loading on first use) does not block queries – during the update, queries use the old version of the dictionaries.
To view information about the dictionaries configured on the server, there is a table
system.dictionaries, in it you can find:
- dictionary status;
- configuration parameters;
- metrics such as the amount of RAM occupied by the dictionary or the number of requests to the dictionary since it was successfully loaded.
At the moment there is a way to configure dictionaries through xml files and through DDL queries. You can use any method you like, but the easiest way to create and control dictionaries is using DDL queries.
General appearance of the xml dictionary configuration:
Some comments /etc/metrika.xml ...
If you chose to create dictionaries through DDL queries, then do not set the dictionary configuration in the server configuration.
Example dictionary configuration:
myHostName 9000 admin secret_password clients3600 5400 user_id username string age Int8
name– the name of the dictionary;
source– the source of the dictionary;
lifetime– the frequency of updating dictionaries;
layout– placement of the dictionary in memory. The speed of dictionary processing depends on this value;
structure– the structure of the dictionary. Key and attributes that can be obtained by key.
An example of creating a dictionary using a DDL query:
CREATE DICTIONARY dict_users_id ( id UInt64, username String, email String, status UInt16, hash String ) PRIMARY KEY id SOURCE(MYSQL( port 3306 user clickhouse password secret_password replica(host 'mysql1.fevlake.com' priority 1) db fevlake_dicts table users )) LAYOUT(HASHED()) LIFETIME(MIN 3600 MAX 5400);
Sources of external dictionaries
External dictionaries can be connected through many different sources. The main ones are:
- Local file
- Executable file
- HTTP (s)
The most common ways to connect dictionaries are through a local file or a DBMS, so we will consider them below.
An example of connecting a dictionary via a local file is as follows:
pathIs the absolute path to the file.
format– file format. All ClickHouse formats are supported.
Or via DDL request:
SOURCE(FILE(path '/opt/dictionaries/clients.csv' format 'CSV')) SETTINGS(format_csv_allow_single_quotes = 0)
Let’s consider connecting a DBMS using a MySQL database as an example.
3306 clickhouse secret_password example01-1 1 example01-2 1 db_name
port– MySQL server port. Can be set separately for each replica inside the tag …
user– MySQL username. Can be set separately for each replica inside the tag …
password– MySQL user password. Can be set separately for each replica inside the tag …
replica– replica configuration block. There can be several blocks.
db– the name of the database.
table– table name.
where– selection condition. The syntax is exactly the same as the syntax of the section
WHEREin MySQL, for example,
id >= 3 AND id < 10(optional parameter).
invalidate_query- request to check the status of the dictionary (optional).
Or via DDL request:
SOURCE(MYSQL( port 3306 user clickhouse password secret_password replica(host 'mysql1.fevlake.com' priority 1) db fevlake_dicts table users ))
Storing dictionaries in memory
There are many ways to store dictionaries in ClickHouse memory:
The most popular of them are only 3, since the processing speed of dictionaries is maximum, this is
complex_key_hashed... Let's take a look at examples of these storage methods.
Dictionaries are completely stored in RAM as flat arrays, with the amount of memory used being proportional to the size of the dictionary's largest key. The dictionary key must be of type
UInt64 and must not be longer than 500,000, otherwise ClickHouse will throw an exception and not create a dictionary.
This storage method provides the best performance of any dictionary storage method available.
The dictionary is completely stored in RAM in the form of hash tables and can contain an arbitrary number of elements with arbitrary identifiers. In practice, the number of keys can reach tens of millions of elements.
This type of placement is intended for use with composite keys. Is similar
Dictionary key and fields
Section describes the dictionary key and fields available for queries.
user_id username string age Int8
- - a column with a key;
- - data column. Several attributes can be set.
ClickHouse supports the following types of keys:
- Numeric key.
UInt64... Described in the tag or by keyword
- Composite key. A set of values of different types. Described in the tag or by keyword
CREATE DICTIONARY ( user_id UInt64, ... ) PRIMARY KEY user_id ...
PRIMARY KEY- the name of the column with keys.
The key can be a tuple (
tuple) from fields of arbitrary types. In this case
layout it should be
The key structure is specified in the element ... Key fields are specified in the same format as dictionary attributes. Example:
field1 String field2 UInt32 ...
CREATE DICTIONARY ( field1 String, field2 String ... ) PRIMARY KEY field1, field2 ...
... Name ClickHouseDataType rand64() true true true
CREATE DICTIONARY somename ( Name ClickHouseDataType DEFAULT '' EXPRESSION rand64() HIERARCHICAL INJECTIVE IS_OBJECT_ID )
How dictionaries can be used in ClickHouse
One of the popular cases of using dictionaries in ClickHouse is aggregation of data by country based on IP (v4) addresses...
Let's imagine that we have a task: from the data of the column with
ip String get a column with
country String... To solve this problem, we will take quite popular databases GeoIP2 from MaxMind.
MaxMind provides with its .mmdb API bases for most popular programming languages.
ClickHouse does not have the ability to upload a format to a dictionary .mmdb, but we don't need it - MaxMind allows you to upload your databases in the form of several CSVs, which we will use.
In order to associate an IP with a country, we need to download the following files:
GeoIP2-Country-Blocks-IPv4.csv- this contains links between IP prefixes and country IDs;
GeoIP2-Country-Locations-en.csv- and here are the names of countries in English.
Next, let's create the appropriate dictionaries using DDL:
CREATE DICTIONARY dicts.geoip_country_blocks_ipv4 ( network String DEFAULT '', geoname_id UInt64 DEFAULT 0, registered_country_geoname_id UInt64 DEFAULT 0, represented_country_geoname_id UInt64 DEFAULT 0, is_anonymous_proxy UInt8 DEFAULT 0, is_satellite_provider UInt8 DEFAULT 0 ) PRIMARY KEY network SOURCE(FILE( path '/var/lib/clickhouse/user_files/GeoIP2-Country-Blocks-IPv4.csv' format 'CSVWithNames' )) LAYOUT(IP_TRIE()) LIFETIME(300);
geoip_country_blocks_ipv4 we have to specify two main attributes:
network- IP prefix of the network, it will also be the dictionary key.
geoname_id- Country ID.
The rest of the attributes are in accordance with the header in CSV.
In order for ClickHouse to correctly match the network prefix and ID, we need to use the placement type
ip_trie... To obtain values from such a dictionary, it will be necessary to transmit the IP address in numerical form.
CREATE DICTIONARY dicts.geoip_country_locations_en ( geoname_id UInt64 DEFAULT 0, locale_code String DEFAULT '', continent_code String DEFAULT '', continent_name String DEFAULT '', country_iso_code String DEFAULT '', country_name String DEFAULT '', is_in_european_union UInt8 DEFAULT 0 ) PRIMARY KEY geoname_id SOURCE(FILE( path '/var/lib/clickhouse/user_files/GeoIP2-Country-Locations-en.csv' format 'CSVWithNames' )) LAYOUT(HASHED()) LIFETIME(300);
We need to link the ID and the country name. In headlines
GeoIP2-Country-Locations-en.csv the following attributes can be found:
geoname_id- Country ID, as in the previous dictionary, but now as a key.
country_name- the name of the country.
Specify optimized as the placement type
In each of the dictionaries, you must specify the paths to the corresponding CSV files.
Now having a table
user_visits (user_ip String, user_id UUID), we can count the number of unique values by country. One way to do this is to use the dictionary functions.
SELECT dictGetString('dicts.geoip_city_locations_en', 'country_name', users_country_id) AS users_country, uniqs FROM ( SELECT dictGetUInt64('dicts.geoip_country_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNum(user_ip))) AS users_country_id, uniq(user_id) AS uniqs FROM user_visits GROUP BY users_country_id );
Let's analyze this request:
- convert string representation
user_ipinto numeric and wrapped in a tuple to match the composite key
- use the resulting key to pick up the country ID as
dictGetUInt64('geoip_country_blocks_ipv4', 'geoname_id', ...) as users_country_id;
- add the metric itself to the request:
uniq(user_id) as uniq_users;
- we aggregate by country ID, which was taken from the dictionary:
GROUP BY users_country_id;
- the result containing the IDs of the countries is matched with the names:
dictGetString('geoip_city_locations_en', 'country_name', users_country_id) AS users_country...
Thus, it is possible to compare not only the names of countries. The same GeoIP2 databases contain a lot of other useful information, don't be afraid to try 🙂
This completes the initial acquaintance with dictionaries. I hope this information will expand your options for using ClickHouse and help you properly configure external data sources.