4 Differences in SQL

Pavel Belyaev — data analyst team lead. Our company is responsible for developing and supporting data marts. Like many others, we faced the need to transfer the infrastructure from a foreign stack to a domestic one.

Our analytical database was built on Google BigQuery for several years. It contained hundreds of views in the Google dialect of SQL, and it was decided to rebase this entire technopark on Russian platforms. It is clear that ClickHouse and BigQuery are far from the same thing, so we had to make a lot of mistakes during the migration. In this article, I will show several differences in SQL of these DBMS. I hope that understanding them will help save time and nerves for those who are faced with a similar task.

JOIN with fuzzy conditions

Let's say we need to put together a showcase in which for each user there is an activity date, such as a payment, and some property that can change over time, such as the user's presence in a certain group.

We have two tables with raw data that need to be joined:

In the table user_group fields link_begin And link_end reflect, respectively, the beginning and end of the user's stay in the given group.

At the output, we should get a table with a group identifier attached to the transaction row if the user was a member of the group at the time of the transaction:

In BQ the problem is solved easily and elegantly:

SELECT t.user_id AS user_id,
    date_paid, amount, group_id
FROM
(
    SELECT user_id, date_paid, amount
    FROM dataset.transactions
) AS t
LEFT JOIN dataset.user_group AS ug 
    ON ug.user_id = t.user_id
    AND t.date_paid >= ug.link_begin
    AND t.date_paid <= ug.link_end

This trick won't work in ClickHouse: there are no joins by several fuzzy conditions. But you can use functions for working with arrays. Adaptation of the query is done in two steps:

  1. we join by equality, while grouping the fields that are not used in the join condition into an array using the function groupArray();

  2. in the external select we use a filter by array arrayFilter()to extract only the rows that meet the fuzzy conditions.

The query for ClickHouse will therefore look like this:

SELECT t.user_id AS user_id,
    date_paid, amount,
    arrayFilter(x-> x.2<=date_paid AND x.3>=date_paid, ug.params)[1].1 AS group_id
FROM
(
    SELECT user_id, date_paid, amount
    FROM dataset.transactions
) AS t
LEFT JOIN 
(
  SELECT user_id,
    groupArray(tuple(group_id, link_begin, link_end)) AS params
  FROM dataset.user_group 
  GROUP BY 1
) AS ug ON ug.user_id = t.user_id

Calculate columns at once

BQ specialists will be unaccustomed to the fact that fields calculated in a subquery in ClickHouse can be used in the same subquery. Let's consider a simple query:

SELECT a+1 AS a, a+2 AS b
FROM ( SELECT 1 AS a )

In BQ it will return the result:

Field b here it is calculated based on the value of the field a from the source. But ClickHouse will give us another option:

Here for calculation b the newly modified value of the original is used aThis feature requires a thorough review of the BQ showcases, because the queries will not return errors, but the calculations may “go”.

At the same time, this order of calculations can be considered as an opportunity: where BigQuery required an additional query, ClickHouse allows you to organize several levels of calculations in one subquery.

SELECT a+1 AS a, a+2 AS b, b*3 AS c
FROM ( SELECT 1 AS a )

Storing and displaying the date

When transferring data containing date fields from BigQuery to ClickHouse, it is important to keep in mind the following nuance. In BQ, dates are stored in the UTC time zone by default. That is, when you set a date, it is converted to UTC:

SELECT utc, STRING(utc) AS utc_str,
    msk, STRING(msk) AS msk_str,
    msk3, STRING(msk3) AS msk3_str
FROM
(
    SELECT TIMESTAMP("2024-05-21 00:00:00") AS utc,
        TIMESTAMP("2024-05-21 00:00:00", "Europe/Moscow") AS msk,
        TIMESTAMP("2024-05-21 00:00:00+03") AS msk3
)

If the timestamp is set in the Moscow time zone, three hours will be subtracted from the time value. It is also clear that when converting from a timestamp to a string, the significant information does not change. You can easily and without distortion convert the date to a string and extract, for example, the year and month:

SELECT LEFT(STRING(TIMESTAMP("2024-05-21 00:00:00", "Europe/Moscow")), 7) AS period

In ClickHouse, by default, the date will be stored in the time zone specified in the server settings. Specifying the time zone when setting the date and time value will not change a significant part of the data, but will add a time zone label that will be lost when converted to a string:

SELECT utc, CAST(utc AS String) AS utc_str,
    msk, CAST(msk AS String) AS msk_str
FROM
(
SELECT toDateTime('2024-05-21 00:00:00') AS utc ,
    toDateTime('2024-05-21 00:00:00', 'Europe/Moscow') AS msk 
)

I recommend not playing with time zones, but simply adding the required number of hours and be sure to check with the source of data that is streamed in ClickHouse.

Setting join_use_nulls

In BigQuery, if a join in the right table does not find a row that matches the join conditions, the resulting fields of that row will be filled with an empty space – NULL, regardless of the data type of those fields.

SELECT *
FROM
(
    SELECT 1 AS a, 'b' AS b
    UNION ALL
    SELECT 2 AS a, 'c' AS b
) AS t1
LEFT JOIN
(
    SELECT 1 AS a1, 'd' AS d, CURRENT_DATE() AS t
) AS t2 ON t1.a = t2.a1

In ClickHouse, things are not so clear: the result depends on the settings of the user on whose behalf the request is launched. We are talking about the settings join_use_nulls. If it is set to 1, the result will be the same as in BigQuery. But if join_use_nulls = 0then instead of NULL, 0 will be transferred to the result for numeric values, an empty string ” for the String type and 1970-01-01, that is, 0, for the date:

If you are dealing with large tables and complex queries, this feature can be a nuisance, so set this setting for all users right away. join_use_nulls = 1, and for “safety reasons” it can be added to the request itself:

SELECT *
FROM
(
    SELECT 1 AS a, 'b' AS b
    UNION ALL
    SELECT 2 AS a, 'c' AS b
) AS t1
LEFT JOIN
(
    SELECT 1 AS a1, 'd' AS d, today() AS t
) AS t2 ON t1.a = t2.a1
SETTINGS join_use_nulls = 1

Conclusion

So, BigQuery is not the same as ClickHouse. When moving from the first to the second, you can keep the following in mind:

  • JOIN with fuzzy conditions of the form t1.a > t2.b in ClickHouse can be implemented by first grouping the rows into an array using groupArray(), and then extracting the required lines using arrayFilter().

  • By default, in ClickHouse, calculated columns can be used in the same SELECT in which they are calculated, so pay close attention to field aliases!

  • BQ and CH store and display dates differently: BigQuery — in the UTC time zone, and ClickHouse — in the time zone specified in the server settings. To avoid confusion, do not use time zones in queries and be sure to compare dates in the showcases with the source.

  • When joining in ClickHouse, fields not found in the joined table may have a value other than NULL, depending on the settings join_use_nulls. Set it to 1 if you want behavior similar to BigQuery.

ENTER — DIY media for IT professionals. Share personal stories about solving various IT problems and get rewarded.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *