Array of Clickhouse functions

When you analyze data, basic SQL functions are often not enough, especially when it comes to complex queries and processing large amounts of information. In such cases, array functions in ClickHouse come to the rescue. However, many users are unaware of their existence or do not use them to their full potential.

This article is a short guide to array functions in ClickHouse. We will look at the most useful and powerful tools, such as arrayJoin, arrayMap, arrayFilterand others. Let's look at how to use them to solve everyday data analytics problems using specific examples.

Why is this important? Because knowing how to work with arrays correctly allows you to shorten and simplify your code, making it more readable and maintainable. This is a key skill for anyone who wants to write optimal queries.

You may find array functions useful when standard SQL queries become complex and difficult to understand. For example, instead of using multiple subqueries and joins to track a sequence of user actions, you can use array functions. In addition, using functions allows you to filter elements within an array, eliminating the need to write complex conditions in subqueries. Array functions in ClickHouse will help you reduce the amount of code and simplify queries, replacing multiple subqueries with more elegant and readable solutions.

Creating a test table:

Before we look at the main functions for working with arrays in ClickHouse, let's create a test table. This table will allow you to practice using the described functions using an example. Let's create a table with user events – such as clicks, views, and purchases – and fill it with data. With this data, you will be able to experiment, applying various functions to arrays and checking their work in practice.

CREATE TABLE user_events (
user_id UInt32,
event_date Date,
event_type String,
item_id UInt32,
price Float32,
quantity UInt32
) ENGINE = MergeTree()
ORDER BY (user_id, event_date);

INSERT INTO user_events VALUES
(1, '2024-07-01', 'click', 1, 10.0, 1),
(1, '2024-07-01', 'click', 2, 15.0, 1),
(1, '2024-07-02', 'view', 3, 20.0, 1),
(1, '2024-07-03', 'purchase', 1, 10.0, 2),
(1, '2024-07-03', 'purchase', 3, 20.0, 1),
(2, '2024-07-01', 'click', 4, 25.0, 1),
(2, '2024-07-02', 'purchase', 2, 15.0, 2),
(2, '2024-07-02', 'purchase', 4, 25.0, 1),
(3, '2024-07-01', 'view', 5, 30.0, 1),
(3, '2024-07-03', 'click', 6, 35.0, 1)

groupArray, arrayStringConcat and groupUniqArray

groupArray: This function collects all the values ​​of a column into an array for each group created by GROUP BY. This allows you to group data by user and collect all events associated with each user in one array.

arrayStringConcat: The function concatenates the elements of an array into a single string, separating them with the specified separator. This is useful when you need to represent a sequence of actions or events as a single string, for example, for reporting or visualization.

groupUniqArray: The function collects unique elements into an array, removing duplicates. This is useful for analyzing the diversity of user actions, eliminating repetitions.

To combine the event types for each user into one row, first collect all the events into an array using groupArrayand then we'll combine them into a string with arrayStringConcat:

SELECT user_id, arrayStringConcat(groupArray(event_type), ', ') AS concatenated_events
FROM user_events
GROUP BY user_id;

user_id

concatenated_events

3

view, click

2

click, purchase, purchase

1

click, click, view, purchase, purchase

If you are only interested in the unique actions of each user, you can replace groupArray on groupUniqArray:

SELECT user_id, arrayStringConcat(groupUniqArray(event_type), ', ') AS concatenated_events
FROM user_events
GROUP BY user_id;

user_id

concatenated_events

3

view, click

2

purchase, click

1

purchase, click, view

These features are useful for creating reports that show the full range of user actions over a period of time. You can analyze which types of actions are most common for specific users. This helps you gain a deeper understanding of audience behavior.

arrayFilter

arrayFilter: This function allows you to filter the elements of an array based on a given condition. It returns a new array containing only those elements that match the condition defined using the lambda expression.

Function arrayFilter is useful when you want to extract from an array only those elements that meet certain criteria. In our example, this condition is the price being greater than 15.

Function Syntax: arrayFilter(x -> condition, array)

x: A variable representing the current element of the array. In this context, this can be any variable name you specify.

->: This operator separates the variable from the condition. It is a symbol that is used in lambda expressions to indicate that the variable x is used in the expression on the right.

condition: A condition that the elements of an array must meet. For example, x > 15 means that only those elements that are greater than 15 will be included in the new array.

array: The source array to be filtered.

SELECT user_id, event_type, arrayFilter(x -> x > 15, groupArray(price)) AS expensive_items
FROM user_events
GROUP BY user_id, event_type;

user_id

event_type

expensive_items

2

purchase

[25.0]

3

click

[35.0]

1

purchase

[20.0]

3

view

[30.0]

1

view

[20.0]

1

click

[]

2

click

[25.0]

This query is useful for analyzing purchases and identifying meaningful data such as the expensive items that users purchased.

arraySlice

arraySlice: This function extracts a subarray from the original array, starting at the specified index and with the specified length. It allows you to select only certain elements of the array, which is useful for analyzing the first or last elements.

For example, if you need to get only the first few elements of an array, this function allows you to do that.

Let's say we want to get the first two elements of the price array for each event type. We can use arraySlice to extract these values:


SELECT user_id, event_type, arraySlice(groupArray(price), 1, 2) AS first_two_prices
FROM user_events
GROUP BY user_id, event_type;

user_id

event_type

first_two_prices

2

purchase

[15.0, 25.0]

3

click

[35.0]

1

purchase

[10.0, 20.0]

3

view

[30.0]

1

view

[20.0]

1

click

[10.0, 15.0]

2

click

[25.0]

Usage arraySlice helps you focus on the first values ​​in your data, which is useful when analyzing trends or getting initial values ​​that may be most useful for your research.

arrayDistinct

arrayDistinct: This function removes duplicate values ​​from an array, returning an array containing only unique elements. Function arrayDistinct is useful when you need to eliminate duplicates in data. In cases where users may send the same events multiple times (e.g. multiple button clicks), arrayDistinct allows you to create a report without unnecessary repetitions.

Suppose we want to get unique prices of products purchased by users. We can use arrayDistinct for this

SELECT user_id, event_type, arrayDistinct(groupArray(price)) AS unique_prices
FROM user_events
GROUP BY user_id, event_type;

user_id

event_type

unique_prices

2

purchase

[15.0, 25.0]

3

click

[35.0]

1

purchase

[10.0, 20.0]

3

view

[30.0]

1

view

[20.0]

1

click

[10.0, 15.0]

2

click

[25.0]

arrayEnumerate

Let's add some data:

INSERT INTO user_events VALUES
(1, '2024-07-01', 'purchase', 1, 10.0, 1),
(1, '2024-07-02', 'purchase', 2, 15.0, 2),
(1, '2024-07-03', 'purchase', 3, 20.0, 1),
(2, '2024-07-01', 'purchase', 4, 25.0, 1),
(2, '2024-07-02', 'purchase', 5, 30.0, 1),
(2, '2024-07-03', 'purchase', 6, 35.0, 1),
(3, '2024-07-01', 'purchase', 7, 40.0, 1),
(3, '2024-07-02', 'purchase', 8, 45.0, 1),
(3, '2024-07-03', 'purchase', 9, 50.0, 1);

arrayEnumerate: This function adds each element of an array to its index (position) and returns an array of (index, value) pairs. This is useful for analyzing a sequence of elements or keeping track of their order.

Function arrayEnumerate helps when you need to analyze the order of elements in an array or create unique identifiers for elements in an array. This is useful, for example, for tracking the sequence of purchases or other events.

Let's say we want to analyze at what stage of the shopping process users most often buy products priced above 30. First, we use arrayEnumerateto determine the position of each purchase, and then analyze purchases by these positions.

SELECT
    user_id,
    arrayEnumerate(groupArray((event_date, item_id, price, quantity))) AS purchase_positions,
    groupArray((event_date, item_id, price, quantity)) AS purchases
FROM user_events
WHERE event_type="purchase"
GROUP BY user_id;

user_id

purchase_positions

purchases

3

[1,2,3]

[‘[2024-07-01, 7, 40.0, 1]', '[2024-07-02, 8, 45.0, 1]', '[2024-07-03, 9, 50.0, 1]']

2

[1,2,3,4,5]

[‘[2024-07-01, 4, 25.0, 1]', '[2024-07-02, 5, 30.0, 1]', '[2024-07-03, 6, 35.0, 1]', '[2024-07-02, 2, 15.0, 2]', '[2024-07-02, 4, 25.0, 1]']

1

[1,2,3,4,5]

[‘[2024-07-01, 1, 10.0, 1]', '[2024-07-02, 2, 15.0, 2]', '[2024-07-03, 3, 20.0, 1]', '[2024-07-03, 1, 10.0, 2]', '[2024-07-03, 3, 20.0, 1]']

Now we can use the obtained data to analyze purchases:

WITH purchase_data AS (
    SELECT
        user_id,
        arrayEnumerate(groupArray((event_date, item_id, price, quantity))) AS purchase_positions,
        groupArray((event_date, item_id, price, quantity)) AS purchases
    FROM user_events
    WHERE event_type="purchase"
    GROUP BY user_id
)
SELECT
position,
count() AS count_above_30
FROM purchase_data
ARRAY JOIN purchase_positions AS position, purchases AS purchase
WHERE purchase.purchases > 30
GROUP BY position
ORDER BY position;

user_id

count_above_30

1

1

2

1

3

1

5

1

Function arrayEnumerate and is convenient for tasks related to the analysis of positions or order of elements in arrays, which can be useful when working with time series or sequences of user actions.

countEqual

Function countEqual is used to count the number of occurrences of a specific value in an array. It returns a number indicating how many times the specified value appears in the array.

Let's say we want to know how many times each user performed the “purchase” action in our dataset:

SELECT
    user_id,
    countEqual(groupArray(event_type), 'purchase') AS purchase_count
FROM user_events
GROUP BY user_id;

user_id

purchase_count

1

2

2

2

3

0

arrayReduce and arrayMap

arrayMap

Function arrayMap: Allows you to apply a given expression or function to each element of an array. This is useful for transforming data in an array according to a specific rule.

Syntax: arrayMap(x -> expression, array)

x: A variable representing the current element of the array.

expression: An expression or function applied to each element of the array.

array: The source array whose elements will be transformed.

Let's say we have an array of purchase prices, and we want to double the prices of those purchases that cost more than 20:

SELECT
    user_id,
    arrayMap(x -> if(x > 20, x * 2, x), purchases) AS conditional_purchases
FROM user_purchases;

In this example arrayMap uses a conditional expression to change the elements of an array based on their value.

Also a function arrayMap can take multiple arrays as arguments. This allows you to perform more complex transformations using data from multiple sources.

Syntax: arrayMap((x, y) -> expression, array1, array2)

x And y: Variables for current array elements.

expression: An expression applied to array elements.

array1 And array2: Source Arrays

arrayReduce

Function arrayReduce: Applies an aggregate function to all elements of an array. This is useful for performing aggregation of data after it has been preprocessed.

Syntax: arrayReduce('aggregate_function', array)

aggregate_function: The name of an aggregate function, such as sum, avg, min, max, uniq, count and others.

array: The array to which the aggregate function is applied.

We want to know the number of unique items each user purchased for more than 30. We can do this in several steps:

Create a temporary table: Grouping data by user_id and create an array of purchases for each user.

WITH purchases AS (
    SELECT
        user_id,
        groupArray((item_id, price, quantity)) AS purchases
    FROM user_events
    WHERE event_type="purchase"
    GROUP BY user_id
)

Filtering and aggregation: We apply arrayFilter, arrayMap And arrayReduce to obtain the number of unique products with a purchase amount greater than 30.

SELECT
    user_id,
    arrayReduce(
        'uniq',
        arrayMap(p -> p.item_id,
            arrayFilter(p -> p.price * p.quantity > 30, purchases)
        )
    ) AS unique_items
FROM purchases

WITH purchases: Create a temporary table purchasesgrouping data by user_id and creating an array of purchases for each user.

arrayFilter: Filter purchases, leaving only those where price * quantity > 30.

arrayMap: Transform the filtered array by extracting item_id every purchase.

arrayReduce: Applying an aggregate function uniqto count the number of unique products.

The result of the query will be a table with the number of unique products that each user purchased for more than 30.

user_id

unique_items

3

3

2

1

1

0

arrayjoin

Function arrayJoin in ClickHouse allows you to “multiply” each row of a table into multiple rows based on an array. Not to be confused with the usual JOINwhich is used to combine data from different tables. Instead, arrayJoin expands a single row into multiple rows based on the contents of an array. This is especially useful when data is stored as arrays and you need to convert it into a tabular format for further analysis.

Difference from usual JOIN:

arrayJoin: Works with arrays within a single table, converting the array into multiple rows. Does not merge data from different tables.

Ordinary JOIN: Combines data from two or more tables based on a common key or condition.

Syntax:

SELECT
    arrayJoin(array)
FROM
    table_name;

Create a date range

In this example we will create a range of all dates from the minimum to the maximum date in the table price_changes. We use the function arrayJointo convert an array of dates into separate strings.

WITH
    -- Определяем минимальную и максимальную дату из таблицы price_changes
    bounds AS (
        SELECT min(change_date) AS start_date, max(change_date) AS end_date
        FROM price_changes
    )

    -- Создаем массив всех дат в диапазоне от start_date до end_date
    SELECT arrayJoin(
        arrayMap(
            x -> toDate(x),
            range(toUInt32(bounds.start_date), toUInt32(bounds.end_date) + 1)
        )
    ) AS date
    FROM bounds

WITH BOUNDS: Create a temporary table boundswhere we find the minimum and maximum date from the table price_changes.

range: Create a range of numbers from a minimum to a maximum date, converted to numbers.

arrayMap: Convert numbers to dates using a function toDate.

arrayjoin: We expand the array of dates into separate lines.

Function arrayJoin useful when it is necessary to convert an array into a set of rows for further analysis, to generate all possible values ​​in a given range, for example, to create reports by dates, also when data is stored as arrays and their conversion into a tabular format is required for analysis.

We've covered in detail the key functions for working with arrays in ClickHouse, such as arrayJoin, arrayMap, arrayReduceand others. These tools allow you to effectively manage and analyze data arrays, which can greatly simplify the tasks of processing and aggregating information. Using these tools in practice can make your queries more compact and your code more readable. I hope this guide will help you in your daily work.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *