How to create and use dictionaries in ClickHouse
If you have opened this a door article, then for sure you have already dealt with ClickHouse and you can miss interesting details about its convenience and speed, and go straight to the point – in fact, how to create dictionaries and work with them in ClickHouse.
What are dictionaries in ClickHouse?
Dictionary is a display of data in the form key
→ value
… The advent of dictionaries has greatly simplified the use of third-party data sources in ClickHouse, automating all the necessary ETL processes to deliver data in a queryable form.
There are several points to highlight the advantages of using dictionaries in ClickHouse:
- ClickHouse has support for various options for the location of dictionaries in memory.
- Support
TTL
– ClickHouse automatically updates dictionaries and loads missing values. - ClickHouse provides several options for describing external dictionaries – XML files and DDL queries.
Connecting dictionaries
You can connect your own dictionaries from various data sources: a local text / executable file, an HTTP (s) resource, another DBMS, etc.
The configuration of these dictionaries can be in one or several xml files, the path to which is specified in the parameter dictionaries_config
in the ClickHouse configuration file.
Dictionaries can be loaded at server start or at first use, depending on the setting dictionaries_lazy_load
…
Also, updating dictionaries (except for loading on first use) does not block queries – during the update, queries use the old version of the dictionaries.
To view information about the dictionaries configured on the server, there is a table system.dictionaries
, in it you can find:
- dictionary status;
- configuration parameters;
- metrics such as the amount of RAM occupied by the dictionary or the number of requests to the dictionary since it was successfully loaded.
Dictionary configuration
At the moment there is a way to configure dictionaries through xml files and through DDL queries. You can use any method you like, but the easiest way to create and control dictionaries is using DDL queries.
General appearance of the xml dictionary configuration:
Some comments
/etc/metrika.xml
...
If you chose to create dictionaries through DDL queries, then do not set the dictionary configuration in the server configuration.
Example dictionary configuration:
clients
myHostName
9000
admin
secret_password
clients
users
id<=10
3600
5400
user_id
username
string
age
Int8
Setting fields:
name
– the name of the dictionary;source
– the source of the dictionary;lifetime
– the frequency of updating dictionaries;layout
– placement of the dictionary in memory. The speed of dictionary processing depends on this value;structure
– the structure of the dictionary. Key and attributes that can be obtained by key.
An example of creating a dictionary using a DDL query:
CREATE DICTIONARY dict_users_id (
id UInt64,
username String,
email String,
status UInt16,
hash String
)
PRIMARY KEY id
SOURCE(MYSQL(
port 3306
user clickhouse
password secret_password
replica(host 'mysql1.fevlake.com' priority 1)
db fevlake_dicts
table users
))
LAYOUT(HASHED())
LIFETIME(MIN 3600 MAX 5400);
Sources of external dictionaries
External dictionaries can be connected through many different sources. The main ones are:
- Local file
- Executable file
- HTTP (s)
- DBMS
The most common ways to connect dictionaries are through a local file or a DBMS, so we will consider them below.
Local file
An example of connecting a dictionary via a local file is as follows:
/opt/dictionaries/clients.csv
CSV
Setting fields:
path
Is the absolute path to the file.format
– file format. All ClickHouse formats are supported.
Or via DDL request:
SOURCE(FILE(path '/opt/dictionaries/clients.csv' format 'CSV'))
SETTINGS(format_csv_allow_single_quotes = 0)
DBMS
Let’s consider connecting a DBMS using a MySQL database as an example.
Setting example:
3306
clickhouse
secret_password
example01-1
1
example01-2
1
db_name
table_name
id=10
SQL_QUERY
port
– MySQL server port. Can be set separately for each replica inside the tag
…user
– MySQL username. Can be set separately for each replica inside the tag
…password
– MySQL user password. Can be set separately for each replica inside the tag
…replica
– replica configuration block. There can be several blocks.db
– the name of the database.table
– table name.where
– selection condition. The syntax is exactly the same as the syntax of the sectionWHERE
in MySQL, for example,id >= 3 AND id < 10
(optional parameter).invalidate_query
- request to check the status of the dictionary (optional).
Or via DDL request:
SOURCE(MYSQL(
port 3306
user clickhouse
password secret_password
replica(host 'mysql1.fevlake.com' priority 1)
db fevlake_dicts
table users
))
Storing dictionaries in memory
There are many ways to store dictionaries in ClickHouse memory:
flat
hashed
sparse_hashed
cache
direct
range_hashed
complex_key_hashed
complex_key_cache
complex_key_direct
ip_trie
The most popular of them are only 3, since the processing speed of dictionaries is maximum, this is flat
, hashed
and complex_key_hashed
... Let's take a look at examples of these storage methods.
Flat
Dictionaries are completely stored in RAM as flat arrays, with the amount of memory used being proportional to the size of the dictionary's largest key. The dictionary key must be of type UInt64
and must not be longer than 500,000, otherwise ClickHouse will throw an exception and not create a dictionary.
This storage method provides the best performance of any dictionary storage method available.
Configuration example:
or
LAYOUT(FLAT())
Hashed
The dictionary is completely stored in RAM in the form of hash tables and can contain an arbitrary number of elements with arbitrary identifiers. In practice, the number of keys can reach tens of millions of elements.
Configuration example:
or
LAYOUT(HASHED())
Сomplex_key_hashed
This type of placement is intended for use with composite keys. Is similar hashed
way.
Configuration example:
or
LAYOUT(COMPLEX_KEY_HASHED())
Dictionary key and fields
Section
describes the dictionary key and fields available for queries.
XML description:
user_id
username
string
age
Int8
Setting fields:
- a column with a key;
- data column. Several attributes can be set.
Keys
ClickHouse supports the following types of keys:
- Numeric key.
UInt64
... Described in the tag
or by keywordPRIMARY KEY
... - Composite key. A set of values of different types. Described in the tag
or by keywordPRIMARY KEY
...
Numeric Key
A type: UInt64
...
Configuration example:
user_id
or
CREATE DICTIONARY (
user_id UInt64,
...
)
PRIMARY KEY user_id
...
PRIMARY KEY
- the name of the column with keys.
Composite key
The key can be a tuple (tuple
) from fields of arbitrary types. In this case layout
it should be complex_key_hashed
or complex_key_cache
...
The key structure is specified in the element
... Key fields are specified in the same format as dictionary attributes. Example:
field1
String
field2
UInt32
...
or
CREATE DICTIONARY ( field1 String, field2 String ... )
PRIMARY KEY field1, field2
...
Attributes
...
Name
ClickHouseDataType
rand64()
true
true
true
or
CREATE DICTIONARY somename (
Name ClickHouseDataType DEFAULT '' EXPRESSION rand64() HIERARCHICAL INJECTIVE IS_OBJECT_ID
)
How dictionaries can be used in ClickHouse
One of the popular cases of using dictionaries in ClickHouse is aggregation of data by country based on IP (v4) addresses...
Let's imagine that we have a task: from the data of the column with ip String
get a column with country String
... To solve this problem, we will take quite popular databases GeoIP2 from MaxMind.
MaxMind provides with its .mmdb API bases for most popular programming languages.
ClickHouse does not have the ability to upload a format to a dictionary .mmdb, but we don't need it - MaxMind allows you to upload your databases in the form of several CSVs, which we will use.
In order to associate an IP with a country, we need to download the following files:
GeoIP2-Country-Blocks-IPv4.csv
- this contains links between IP prefixes and country IDs;GeoIP2-Country-Locations-en.csv
- and here are the names of countries in English.
Next, let's create the appropriate dictionaries using DDL:
CREATE DICTIONARY dicts.geoip_country_blocks_ipv4 (
network String DEFAULT '',
geoname_id UInt64 DEFAULT 0,
registered_country_geoname_id UInt64 DEFAULT 0,
represented_country_geoname_id UInt64 DEFAULT 0,
is_anonymous_proxy UInt8 DEFAULT 0,
is_satellite_provider UInt8 DEFAULT 0
)
PRIMARY KEY network
SOURCE(FILE(
path '/var/lib/clickhouse/user_files/GeoIP2-Country-Blocks-IPv4.csv'
format 'CSVWithNames'
))
LAYOUT(IP_TRIE())
LIFETIME(300);
In dictionary geoip_country_blocks_ipv4
we have to specify two main attributes:
network
- IP prefix of the network, it will also be the dictionary key.geoname_id
- Country ID.
The rest of the attributes are in accordance with the header in CSV.
In order for ClickHouse to correctly match the network prefix and ID, we need to use the placement type ip_trie
... To obtain values from such a dictionary, it will be necessary to transmit the IP address in numerical form.
Now geoip_country_locations_en
:
CREATE DICTIONARY dicts.geoip_country_locations_en (
geoname_id UInt64 DEFAULT 0,
locale_code String DEFAULT '',
continent_code String DEFAULT '',
continent_name String DEFAULT '',
country_iso_code String DEFAULT '',
country_name String DEFAULT '',
is_in_european_union UInt8 DEFAULT 0
)
PRIMARY KEY geoname_id
SOURCE(FILE(
path '/var/lib/clickhouse/user_files/GeoIP2-Country-Locations-en.csv'
format 'CSVWithNames'
))
LAYOUT(HASHED())
LIFETIME(300);
We need to link the ID and the country name. In headlines GeoIP2-Country-Locations-en.csv
the following attributes can be found:
geoname_id
- Country ID, as in the previous dictionary, but now as a key.country_name
- the name of the country.
Specify optimized as the placement type hashed
...
In each of the dictionaries, you must specify the paths to the corresponding CSV files.
Now having a table user_visits (user_ip String, user_id UUID)
, we can count the number of unique values by country. One way to do this is to use the dictionary functions. dictGet*
:
SELECT
dictGetString('dicts.geoip_city_locations_en', 'country_name', users_country_id) AS users_country,
uniqs
FROM (
SELECT
dictGetUInt64('dicts.geoip_country_blocks_ipv4', 'geoname_id', tuple(IPv4StringToNum(user_ip))) AS users_country_id,
uniq(user_id) AS uniqs
FROM user_visits
GROUP BY users_country_id
);
Let's analyze this request:
- convert string representation
user_ip
into numeric and wrapped in a tuple to match the composite keyip_trie
-dictionaries:tuple(IPv4StringToNum(user_ip))
; - use the resulting key to pick up the country ID as
users_country_id
:dictGetUInt64('geoip_country_blocks_ipv4', 'geoname_id', ...) as users_country_id
; - add the metric itself to the request:
uniq(user_id) as uniq_users
; - we aggregate by country ID, which was taken from the dictionary:
GROUP BY users_country_id
; - the result containing the IDs of the countries is matched with the names:
dictGetString('geoip_city_locations_en', 'country_name', users_country_id) AS users_country
...
Thus, it is possible to compare not only the names of countries. The same GeoIP2 databases contain a lot of other useful information, don't be afraid to try 🙂
Conclusion
This completes the initial acquaintance with dictionaries. I hope this information will expand your options for using ClickHouse and help you properly configure external data sources.