I once worked as a data engineer at a startup. It grew rapidly and at some point decided to acquire a large company. There were more than a hundred employees there, almost all of them, as it turned out, in India. While our developers were picking apart exotic code that couldn't be deciphered at first glance, the Indian monitoring department caught my attention.
There is plenty of software for monitoring networks and servers. The main thing is to be sure that if something goes wrong, an alert fires: you see a notification and have time to fix everything. They had their own idea of monitoring. Imagine several rooms with tables, chairs and huge plasma screens on the walls, and almost two dozen people sitting inside. Each monitor displays colorful graphs, with slips of paper carrying phone numbers taped next to them.
The job of those twenty people was to sit and watch the charts, and if a line rose above the threshold marked on the sticker next to it, call the number written there and sound the alarm. So they sat and watched, without a break.
The acquisition itself was inexpensive, but maintaining that infrastructure cost exorbitant money. The Indians used the expensive Vertica, where on top of paying for the hardware they also had to shell out for a license. We decided to try moving to ClickHouse, which is practically a free analogue of Vertica. Both products work on a similar principle: a columnar DBMS with sharding and data partitioning.
And that was another adventure.
ClickHouse's killer feature: saving money, of course
Just what we needed. ClickHouse can ingest millions of events per second and read just as quickly. If you organize an aggregation table correctly, a query that takes an hour in a classic row-oriented database (PostgreSQL, for example) finishes in a minute.
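A sketch of what "organizing the aggregation table correctly" can look like, with invented table and column names: raw events go into a MergeTree table, and a SummingMergeTree materialized view maintains a per-day rollup as data is inserted, so reports hit the small table instead of scanning raw events.

```sql
-- Hypothetical schema: raw ad events land in a MergeTree table.
CREATE TABLE events
(
    event_date Date,
    app_id     UInt32,
    clicks     UInt64
)
ENGINE = MergeTree
ORDER BY (app_id, event_date);

-- The materialized view keeps per-day totals up to date on every insert;
-- SummingMergeTree collapses rows with the same key during merges.
CREATE MATERIALIZED VIEW events_daily
ENGINE = SummingMergeTree
ORDER BY (app_id, event_date)
AS SELECT event_date, app_id, sum(clicks) AS clicks
FROM events
GROUP BY event_date, app_id;
```

Reports then query `events_daily` and touch orders of magnitude fewer rows than the raw table.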
I was finally sold on ClickHouse when I ran a compression test with the brand-new codecs: DoubleDelta alone squeezed our semi-random sequences five-fold! And ClickHouse is not fazed by the amount of data, as long as you choose the right sort key and the index granularity in the MergeTree settings.
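That experiment can be sketched roughly as follows; the table and column names are made up, and DoubleDelta suits slowly changing sequences such as timestamps and counters:

```sql
-- Per-column codecs plus the sparse index "level of detail".
CREATE TABLE metrics
(
    ts    DateTime CODEC(DoubleDelta, LZ4),
    value UInt64   CODEC(DoubleDelta, LZ4)
)
ENGINE = MergeTree
ORDER BY ts
SETTINGS index_granularity = 8192;

-- After loading data, compare the on-disk footprint per column:
SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed
FROM system.parts_columns
WHERE table = 'metrics' AND active
GROUP BY column;
```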
As a result, I saw data that occupied gigabytes on HDFS fit into 700 megabytes in ClickHouse, with query speeds an order of magnitude higher. Especially when you manage to rewrite a typical thousand-line hand-rolled SQL query into something clear and idiomatic, using the right built-in functions, arrays, aggregates and other suitable alternatives.
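A toy illustration of the kind of rewrite meant here, with an invented `events` table: instead of several self-joined subqueries each counting one event type, a single pass using `-If` combinators, `uniqExact` and a parametric array aggregate.

```sql
-- One scan replaces a pile of subqueries and self-joins.
SELECT
    app_id,
    countIf(event = 'click')   AS clicks,
    countIf(event = 'install') AS installs,
    uniqExact(device_id)       AS devices,
    groupArray(10)(event)      AS sample_events  -- keep at most 10 per group
FROM events
GROUP BY app_id;
```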
But at the time the tool had one big drawback: a high barrier to entry. That was the combined result of a community that was still small and mostly confined to the CIS, of ClickHouse's fundamental differences from "ordinary" DBMSs, and of vague documentation. I had to run my own experiments: load TSBS into it and play with functions, versions, engine settings and so on, right down to compilation flags. The driver existed only for show: it did not speak the native protocol, it was just a wrapper over an HTTP REST client.
ClickHouse was gradually improving, but there were many bugs, many critical features were missing, and the alternatives were more convenient to use and easier to maintain, though they all cost many times more. ClickHouse is open source and powerful, but you couldn't make sense of it without a stiff drink. It looked like a stereotypical Russian thing: imperfect, rough, cheap. But the compression and the speed made up for everything.
How do you build a backend that doesn't fall over, doesn't gobble up a ton of money, and can store reports and serve them up?
Given billions of rows of data, terabytes, kilometers of it.
The company was in the business of monetizing piracy. A person pirated an Android application and paid nothing for it, but saw ads. To bill the advertising clients and count how many ads each user clicked, the company started collecting statistics. Before my arrival they had written their own SDK and backend: PHP accepted a JSON event, parsed it and wrote it to MySQL. There were no connection pools; an event arrived, a connection was opened, the row was written.
When the load started to grow (several thousand events per second), this system stopped coping. The backend developer discovered Hadoop and HDFS and decided to apply them, assembling a cluster of several machines. The idea was simple: we just write JSON files, push them into Hive, and everything gets computed and just works.
When the guys started migrating billing to Hive, they discovered that the cluster bill had grown dramatically. Everything was stored as uncompressed JSON files. On top of that, HDFS and Hadoop were not geared towards real-time computation; every job had to be planned in advance. Jobs crunched all night, and in the morning the result was a huge heap of unstructured data lying around as plain text. And all of that for money! No speed, queries took ages, and the output was a dump. This suited no one. When I started figuring out how the project's architecture actually worked, it turned out that Spark was running in standalone mode on several nodes, which looked suspicious and peculiar. Having dug through the startup script, I realized that the current settings loaded the nodes to two hundred percent while the RDDs were still being read in a single thread.
After some investigation I found out that this was not due to any architectural constraint. The relatively large HDFS itself was correctly configured as a cluster, with redundancy; there were no complaints about that part of the infrastructure or the hardware.
However, once the jobs were parallelized correctly, the output turned out to be not quite right. The documentation in the code was, as is traditional in our field, outdated, so I had to concentrate hard to work through the existing Java source, understand the intended business logic and fix the bug. The result was a completely new Scala application that worked so much better that the infrastructure department could plan on saving thousands of dollars in backend support costs!
Finally, we started writing to ClickHouse. The server's source code is easy enough to understand for anyone who knows C++. It is also reasonably well covered by tests, and when it isn't, you can often find an existing test that implements a similar case. In the end we had our own fault-tolerant driver in Scala, working over TCP in the binary Native format, aware of the cluster configuration and able to write in multiple threads as efficiently as possible.
I decided to refine the backend further and took a risk: I removed HDFS. I set up a proper Kafka cluster, the backend started writing to it, a consumer read everything from Kafka and wrote it to ClickHouse. At the same time I didn't drop the HDFS cluster from the schema altogether, but turned it into a kind of backup: the data flowed into a trimmed-down HDFS too.
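We wrote our own consumer, but the same Kafka-to-ClickHouse leg of the pipeline can also be sketched entirely inside ClickHouse with its Kafka table engine; all names below (broker, topic, tables) are illustrative.

```sql
-- Durable storage for events coming off the queue.
CREATE TABLE events_store
(
    ts     DateTime,
    app_id UInt32,
    event  String
)
ENGINE = MergeTree
ORDER BY (app_id, ts);

-- The Kafka engine table subscribes to the topic and parses JSON rows.
CREATE TABLE events_queue
(
    ts     DateTime,
    app_id UInt32,
    event  String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list  = 'events',
         kafka_group_name  = 'clickhouse-consumer',
         kafka_format      = 'JSONEachRow';

-- The materialized view acts as the consumer, moving rows into storage.
CREATE MATERIALIZED VIEW events_writer TO events_store AS
SELECT ts, app_id, event FROM events_queue;
```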
At some point the task arose of moving all the other services and reports over to ClickHouse. Most were migrated while the new application was being built, but there was still a lot of work around everywhere this data was meant to be integrated later.
But the main problem was not ClickHouse itself, but the fact that no one could figure it out right away
While my own effort was enough to learn how to write to ClickHouse efficiently and build a production-ready service, with queries, and with fully exploiting the DWH in general, the situation was rather different.
Experience showed that tasks the analytics department and developers used to solve with a snap of the fingers in Hive or MySQL did not work in ClickHouse right away: here a special function has to be used, there a JOIN returns wrong results because of how the data is distributed across shards, and so on.
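The JOIN pitfall deserves a concrete sketch; `events_distributed` and `apps` are hypothetical tables. On a Distributed table, a plain JOIN runs independently on each shard against only that shard's slice of the right-hand table, so matches living on other shards are silently lost. `GLOBAL JOIN` first ships the whole right-hand side to every shard.

```sql
-- Without GLOBAL, each shard joins only against its local part of `apps`,
-- which quietly drops rows when the tables are sharded by different keys.
SELECT
    e.app_id,
    sum(e.clicks) AS clicks,
    any(a.title)  AS title
FROM events_distributed AS e
GLOBAL INNER JOIN apps AS a ON a.app_id = e.app_id
GROUP BY e.app_id;
```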
It was fine if the developer suspected something was wrong with the query and turned, if not to the documentation, then at least to the community chat. But it also happened that errors introduced during the migration stayed hidden. In those cases everything came to light only when it was too late: the data had already made it into reports, or had even surfaced in client accounts.
Of course, the same could be said of any relatively young technology. But in our case there were also genuine bugs, where according to the documentation and reviews a query should work, yet it caused a server crash or an incorrect result.
For example, one can recall the early UUID support, when a query like this:

```sql
SELECT * FROM db PREWHERE uuid != '00000000-0000-0000-0000-000000000000'
```

led to a segfault.
Or when you could accidentally lose data while deduplicating replicated partitions. The community has accumulated plenty of such unique cases! So sometimes the most useful first step is simply to search GitHub Issues for your problem by keyword.
And then it turned out that data from ClickHouse is too difficult to visualize
Hive had offered tools for writing queries and building visualizations. At first we used Tabix, a rather primitive client that connects directly to ClickHouse and renders the result in the browser as a table and simple graphs. But eventually we picked up a more interesting tool: Redash had just rolled out a beta version of its ClickHouse support.
In parallel, part of our data lived in Redshift. At some point the company proposed taking the data from ClickHouse and moving it into Redshift (which is, in essence, ordinary SQL, so you can use whatever visualization tools you like). But the idea died on its own once I calculated what a Redshift cluster able to sustain our ClickHouse data flow would cost. We were spending about a thousand dollars; with Redshift we would have paid all of thirty thousand. So we kept working with ClickHouse through Redash.
But everything fell apart when we decided to hook Tableau up to ClickHouse and blew 70 thousand dollars!
Tableau is a well-known visualization tool for databases. When we decided to try it, it had no official ClickHouse support, but it did support PostgreSQL. Some madman suggested that you could write custom scripts and forward the protocol through PostgreSQL: cross ClickHouse with PostgreSQL via a Python script and get visualization in Tableau! That was another failed idea. Data came through only from genuine PostgreSQL, so the whole thing was pointless.
But our analyst insisted. He liked Tableau: it's beautiful, intuitive, no SQL queries needed, everything has a button. So they bought it for us. We spent about 70 grand and began the ritual dancing around the purchase.
We even got Tableau's salespeople on the phone and tried to convince them to pair their product with the latest Russian technology. Even then it was clear that ClickHouse would slowly conquer the world. They just listened to us politely. Funnily enough, Tableau did recently ship some driver support for ClickHouse. It took less than two years!
We never managed to move the reports from Redash to Tableau; the money was wasted. Luckily, by that time some of the staff had mastered ClickHouse, since I had been gradually teaching everyone. Redash turned out to be convenient even for the Android developers, and it offered more options than the handsome Tableau.
The biggest problem with ClickHouse is that hardly anyone really knows it. There are many pitfalls that are not described anywhere. Good technology, poor documentation.
People need to use it but have no time to figure out how it works. Hence all sorts of side effects, like entire companies making money purely on ClickHouse consulting.
Things are much simpler now, but it is still not a product you can use straight out of the box. Once it becomes one, though, every Vertica and Redshift might as well wind down their business.