ClickHouse. DWH. Building a rabbit-breeding franchise network on a columnar Data Vault

Introduction

The modern approach to DWH is replete with design patterns. But alas, most people who are directly involved in developing and implementing these approaches ride firmly on rails and are afraid to take a step to the side. Not all of them – there are plenty of creative people in IT – but many.

In this article, in a semi-joking manner, I would like to reflect on the idea that a template is just a template, not a guide to action.

Let’s move on to programming terms. What is Data Vault? A tool, a technology, or a pattern? The correct answer, of course, is a pattern. After all, the authors of the term did not bring a finished product or implementation tools to the market – those appeared later – but presented the world with a set of rules and paradigms. They:

  • specified to what extent the data should be normalized before you can build your Data Vault model

  • defined strict restrictions on how the model is constructed

  • identified the main elements of the model and the possible connections between them

And the fact that most people build these models on relational databases with row-oriented storage is the choice of the specialist implementing the template, not a strict rule or restriction. Besides, there was no such abundance of other database types when the approach was born.

Business model

Have you ever dreamed of starting an exotic business? That is exactly what we will do now.

By the will of fate, you have inherited a successful worldwide franchise for breeding rabbits. Hundreds of thousands of farms of various sizes. Billions of animals. A huge referral network. Your grandfather was so advanced that he managed to start a chain reaction, and now the system grows on its own, attracting more and more new participants. But alas, your grandfather, besides his love for animals, was an avid front-end developer, and for him the database was merely a means of storing data about his network. Nothing more.

But you are another matter. Looking at all this wealth, you immediately realized that new horizons were opening up before you. For example, opening a marketplace for trading meat or live animals. You can integrate the sales market into your model and do much, much more. What is needed for this? That's right: analytics. Powerful and precise.

Therefore, we will try to start building a data system as a product on top of the IT infrastructure inherited from your grandfather – one that we ourselves can use for a new stage of growth of the entire company.

Task

What do you need to get started? Define the main goal. So: first of all, we want to start trading animals or their meat. The product is specific and can only be sold on the local city market – I don't want to haul it any further. And there is no need to: our network is so large that it is present almost everywhere. Why get tangled up in logistics, which is an extremely complex topic in itself.

Well, in order not to turn the article into a huge, boring read, we set ourselves a priority-zero task:

  • Build a scalable data model that can receive information from our processing system as quickly as possible and record it in the storage.

  • As a first step in analytics, be able to quickly (~10 seconds) understand which animals, and how many of them, are in each specific city.

  • Also quickly (~10 seconds) find the farms that attracted the most participants, or whose referrals are the most effective – that is, have the largest number of animals.

Model building

Since you love order and hate data redundancy, the choice fell on the Data Vault concept. The first thing to do, of course, is to sit down and draw a model that can always be extended and that lets us accomplish the task with a minimum of effort.

After a little thought, we came up with the following Hubs:

  • Animals. Let there be animals in general. What if we decide to breed muskrats? I've heard somewhere that their pelts are valuable. We'll add a type attribute in the future and enjoy the flexibility of the system.

  • Farms. Well, where would we be without them?

  • Cities. We decided to build a district-oriented business; regions and other higher-level structures can always be added later.

Now we need to connect these Hubs and add a minimal set of attributes. But before that, let's get answers to a few questions from the business (warning: what follows is a spherical horse in a vacuum):

  • Are rabbits identified in the system? – Of course, each farm sends us data about the birth, death, and sale of every animal. This is a condition of the franchise.

  • How are rabbits identified in the system? – Well, it is the farm number plus the number of the rabbit on that farm. We assign it to each animal when the user enters the data in their personal account.

  • Is there any information about family relationships? – Certainly. The mother is always recorded – mothers live separately until they give birth. As for the father – well, farmers don't monitor the sex life of their animals.

  • Does the referral subsystem also store information about who referred whom? – Well, it's not exactly a subsystem. But when we register a new agreement, the farm that attracted the new participant is always recorded.

After sketching for a while in the first tool that came to hand, we got something resembling a growth point for our Data Vault.

We highlight the following main aspects:

  • the recursive family relationships of rabbits and the referral dependencies of farms are modeled through Links

  • a couple of immutable attributes were added for animals and cities (a city name is conditionally immutable) so that a minimal pilot launch could be carried out

  • the rabbit's color was kept as a plain attribute for simplicity; in real conditions it would, of course, be a separate Hub, and maybe even several, since coloring is a whole science of its own. Fortunately, color is a weakly changing attribute: some individuals change it after their first molt, and not every farmer can determine the color accurately right away

  • the animal's status is its current, changeable state – sold / sold alive / alive / died a natural death – so we put it into a separate Satellite. We will not explain the difference between 'sold' and 'sold alive', so as not to look too cruel. Ideally, status would also be a separate Hub, but we are simplifying for the sake of the article

  • hash keys, load timestamps and the source are there solely so that we can scale and change the technical foundation. After all, the model is a universal entity and can be implemented anywhere – it could even move to a Data Lake if data growth becomes an avalanche. Your grandfather has so many interesting things in his database that it would take years to sort through them. There are even photographs from the 'Rabbit of the Year' competition.

Implementation of the model

Rolling up your sleeves, you began to think about what technologies to build your model on. By the will of fate, the choice fell on ClickHouse.

  • The columnar structure allows for lightning-fast analytics on huge amounts of data

  • Data can be loaded quickly in a variety of formats – from direct INSERTs to JSON

  • Quite convenient and fast mechanisms for updating data by partition; after all, data gets updated retroactively, and it would be nice to reload at least the last couple of months once a day (a sketch of such a refresh is shown right after this list). You can read more details here: Partitioning as a means of quickly updating data
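
To make that concrete, here is a minimal, hedged sketch of such a partition-level refresh, assuming the monthly partitioning by toYYYYMM(timestamp_) used in the tables below, a hypothetical staging table h_animals_stage holding the fresh extract, and 202401 as an example month.

-- Refresh one month of h_animals from a hypothetical staging table.
-- Variant 1: atomically swap the partition in from the staging table.
ALTER TABLE h_animals REPLACE PARTITION 202401 FROM h_animals_stage;
-- Variant 2: drop the stale month and re-insert it from the staging table.
ALTER TABLE h_animals DROP PARTITION 202401;
INSERT INTO h_animals
SELECT animal_num, animal_hsh, timestamp_, source_
FROM h_animals_stage
WHERE toYYYYMM(timestamp_) = 202401;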

So, all that remains is to get yourself a machine running Ubuntu, gain access to the processing database, and get started. No sooner said than done. After a couple of hours of running around the office and talking to the admins, you sit down at the terminal and install ClickHouse for your pilot. Done.

Let's get started. First, let's create our small model physically – in the form of real tables. It would be nice, of course, to move it into some modeling tool that can generate code for different databases, but at the initial development stage it is better to do everything by hand in order to catch the nuances that may surface when our brainchild goes into production.

SQL. Creating tables
-- drop table h_animals
-- Hub: animals, keyed by the hash of the business key (farm number + animal number)
CREATE TABLE h_animals
(
  animal_num String,
  animal_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table l_animal_tree
-- Link: recursive family relationship (animal -> mother)
CREATE TABLE l_animal_tree
(
  animal_hsh FixedString(64),
  animal_mother_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table s_animal_lifecycle
-- Satellite: changeable animal status
CREATE TABLE s_animal_lifecycle
(
  animal_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String,
  status String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table s_animal_attrs
-- Satellite: (almost) immutable animal attributes
CREATE TABLE s_animal_attrs
(
  animal_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String,
  sex String,
  color String,
  birthdate Date
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table h_farms
-- Hub: farms
CREATE TABLE h_farms
(
  farm_num String,
  farm_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table l_animal_farms
-- Link: animal -> farm
CREATE TABLE l_animal_farms
(
  animal_hsh FixedString(64),
  farm_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table l_farm_referals
-- Link: referral relationship (attracting farm -> referred farm)
CREATE TABLE l_farm_referals
(
  attract_farm_hsh FixedString(64),
  ref_farm_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table h_cities
-- Hub: cities
CREATE TABLE h_cities
(
  city_code String,
  city_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table l_city_farms
-- Link: city -> farm
CREATE TABLE l_city_farms
(
  city_hsh FixedString(64),
  farm_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);
-- drop table s_city_attrs
-- Satellite: city attributes (name)
CREATE TABLE s_city_attrs
(
  city_hsh FixedString(64),
  timestamp_ DateTime,
  source_ String,
  name_ String
) ENGINE=MergeTree()
PARTITION BY (toYYYYMM(timestamp_))
ORDER BY (timestamp_);

Well, then you fill the physical model with data. According to our scenario, you export the data from the processing system, convert it into any convenient format, and import it into ClickHouse as a one-off. ETL based on Airflow or any other technology can be implemented later, when the model has passed the initial tests and has been presented to the business for evaluation.
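
For illustration, here is a hedged sketch of what such a one-off load might look like for the farms Hub. The file name, its column layout, and the source_ label are assumptions, and the file() table function reads from the server's user_files directory.

-- One-off load of the farms Hub from a TSV dump (hypothetical file and column list).
-- The 64-character hash key expected by FixedString(64) is derived from the
-- business key with hex(SHA256(...)).
INSERT INTO h_farms
SELECT farm_num,
       hex(SHA256(farm_num)) AS farm_hsh,
       now()                 AS timestamp_,
       'initial_dump'        AS source_
FROM file('farms.tsv', 'TSV', 'farm_num String');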

This will take a little more time. But you love your job, so with the first rays of sunshine you happily run the first SELECT on the tables.

Wow, grandfather certainly bred a lot of the long-eared creatures.

SQL. Checking the data
select count(1) as cnt from h_farms;
select count(1) as cnt from h_animals;
select * from h_animals;

In fact, of course, I generated the data myself – exactly as much as the small SSD in my home machine allowed. The process turned out to be so entertaining that another article was born in parallel: a Golang TSV generator for uploading to ClickHouse. It will also be interesting for those who want to know where cities like Harsh Hog and Languid Slipper came from.

JOIN IT!

It was not for nothing that I gave this chapter such a strange title. I am sure that many who have dealt with Data Vault have been thinking since the very first lines: what a madman. The concept was, after all, created for the primary data layer, not for analytics, since reading from it is not very fast and requires a huge number of JOIN operations. And on top of that he crammed the storage into a columnar DBMS. What nonsense.

And here we return to the beginning of the article. Data Vault is a pattern. And nowhere does it say: use JOINs to select data from the storage.

Let's think for a moment in the context of a DWH. How often do you see NESTED LOOP joins and other access paths that reach the data indirectly through indexes? Rarely, I think – you are not some OLTP specialist (sarcasm). More often you work with large scans, for which the engine builds hash tables to perform the join. Even in a classic star schema, you are always forced to think about how not to join many large tables together: you work with many dimensions and a limited number of fact tables. Otherwise you can get into trouble.

Now let's turn everything upside down.

What remains is to understand the difference between the two schemes. The difference is that in the second case we read both tables independently – where we can also prune them by partitions – and then combine the results with GROUP BY.
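
To illustrate the idea before the full query, here is a minimal sketch on just two tables of the model – the city Satellite and the city–farm Link – counting farms per city without a single JOIN.

-- Instead of joining s_city_attrs to l_city_farms on city_hsh,
-- stack both tables by that shared hash and collapse the stack with GROUP BY.
select max(t.city_name) as city_name,  -- comes only from the Satellite rows
       sum(t.farm_cnt)  as farm_cnt    -- comes only from the Link rows
  from (
    select city_hsh, name_ as city_name, 0 as farm_cnt
      from s_city_attrs
     union all
    select city_hsh, null as city_name, 1 as farm_cnt
      from l_city_farms
  ) t
 group by t.city_hsh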

We also remember that we are working in a columnar database, so column-wise aggregations and groupings do not scare us at all. You can judge the speeds on hundreds of millions of records for yourself, keeping in mind that the queries were executed on a virtual machine with 4 GB of memory – I deliberately allocated as little as possible, just enough for everything not to fall over.

In the GROUP BY clause we use the linking fields, and with aggregate functions we pick up the values that exist in only one of the tables. After all, if the same hash is present in both h_cities and s_city_attrs, then by grouping on it and taking the maximum of the city name we get the city name for that specific hash. Developing this idea and working with subqueries, this is how you can count the number of rabbits in each city:

--select sum(rabbit_cnt) from (   -- uncomment the wrapper (and the closing line) to get the total across all cities
select max(t.city_name)  as city_name,
       sum(t.rabbit_cnt) as rabbit_cnt
  from (
    select ca.city_hsh,
           ca.name_ as city_name,
           null     as farm_hsh,
           0        as rabbit_cnt
      from s_city_attrs ca
     union all
    select max(t1.city_hsh)   as city_hsh,
           null               as city_name,
           t1.farm_hsh,
           sum(t1.rabbit_cnt) as rabbit_cnt
      from (
        select cf.city_hsh,
               cf.farm_hsh,
               null as animal_hsh,
               0    as rabbit_cnt
          from l_city_farms cf
         union all
        select null               as city_hsh,
               max(t2.farm_hsh)   as farm_hsh,
               t2.animal_hsh,
               sum(t2.rabbit_cnt) as rabbit_cnt
          from (
            select af.farm_hsh,
                   af.animal_hsh,
                   0 as rabbit_cnt
              from l_animal_farms af
             union all
            select null               as farm_hsh,
                   t3.animal_hsh,
                   sum(t3.rabbit_cnt) as rabbit_cnt
              from (
                select null         as farm_hsh,
                       a.animal_hsh,
                       0            as rabbit_cnt,
                       a.timestamp_ as actual_timestamp,
                       a.timestamp_
                  from h_animals a
                 union all
                select null          as farm_hsh,
                       al.animal_hsh,
                       1             as rabbit_cnt,
                       null          as actual_timestamp,
                       al.timestamp_
                  from s_animal_lifecycle al
                 where al.status = 'Жив'  -- 'Жив' = 'Alive' in the generated data
              ) t3
             group by t3.animal_hsh,
                      t3.timestamp_
            having max(t3.actual_timestamp) = t3.timestamp_
          ) t2
         group by t2.animal_hsh
      ) t1
     group by t1.farm_hsh
  ) t
 group by t.city_hsh
--)

Things to remember:

  • the grouping fields play the role of JOIN keys

  • the more tables we link, the more NULL columns there will be, since the total set of columns grows and not every table has all of them

  • to make intermediate connections to new Hubs, you have to write deeper and deeper subqueries, grouping them by the linking fields. In the example above this is the subquery over l_city_farms and l_animal_farms, which exists to pass the city hash for each farm one level up

There is also one point I did not dwell on. When we linked in the s_animal_lifecycle table, we used the HAVING clause and two timestamp fields to select the record that strictly corresponds to the Hub. In this way we did without Point-In-Time tables and relied on aggregation instead.
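
As a side note, the same "latest record per key" idea can be expressed in ClickHouse with argMax. The sketch below is not used in the query above, but it is handy when you simply need the most recent Satellite row per Hub key without a Point-In-Time table.

-- Most recent status per animal, picked with argMax instead of the HAVING trick.
select animal_hsh,
       argMax(status, timestamp_) as last_status  -- status from the row with the latest timestamp_
  from s_animal_lifecycle
 group by animal_hsh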

The question of finding the most effective referrers can, in principle, be answered with a query over the single table l_farm_referals. But in order to get the farm number, I had to link in h_farms.

SQL. The most effective farms at attracting new participants
select max(t.farm_num) as farm_num,
       sum(t.cnt)      as cnt
  from (
    select fr.attract_farm_hsh as farm_hsh,
           count(1)            as cnt,
           null                as farm_num
      from l_farm_referals fr
     group by fr.attract_farm_hsh
     union all
    select f.farm_hsh,
           0 as cnt,
           f.farm_num
      from h_farms f
  ) t
 group by t.farm_hsh
 order by cnt desc

Conclusion

Business analysts, who for years had to wait 10–20 minutes to get similar data out of the processing database, are satisfied and are ready to keep counting the rabbits in your DWH.

Of course, I understand that the approach is not new. It is often used even in row-oriented databases when a large amount of data still has to be read. But in columnar databases the approach is more effective thanks to the architecture of such DBMSs.

The advantages of the approach:

  • you can run primary analytics on the Data Vault itself – the speeds allow it – and build data marts as needed; in any case, they will be calculated faster than in a row-oriented database

  • facts for a single Hub and its associated Links and Satellites are obtained very quickly

  • we are almost independent of the number of rows in the database, since analytics runs over columns; the main thing is not to write queries that select a large number of columns

  • the approach matches how the DBMS organizes its storage – grouping and aggregation are exactly what columnar databases are built for

  • you can do without Point-in-Time tables, since we aggregate the data in subqueries anyway

The minuses:

  • unusual rules for writing queries

  • a large number of subqueries when you need to link in new Hubs

  • a large number of NULL columns in complex queries
