Today, data and everything related to it (ML, AI, DataMining, etc) is the most hype trend in the IT industry. Everyone – from retailers to Elon Musk’s companies – works (or tries to work) with data. At Leroy Merlin, this wave has not spared us – the data-driven approach to decision-making is one of the main in the company. Following it, we created our own data platform, which is currently used by about 2 thousand people, and about 1800 requests are processed per minute. In this article, we (the Data-team of Leroy Merlin Russia) will tell you how in 2 years we built a data platform in a company with a lot of offline processes, about its architecture and the experience that we got in the process of creation.
How it all began
In 2016, long before the platform appeared, a tool was needed to build reports, conduct analytics and draw charts. Then we decided to simply link all the main databases of the company with DB-links. The solution turned out not to be scalable, while the need for analytical reports grew, as did the requirements for them, while they influenced all business processes of the company.
In 2017, we decided to go to a large vendor: buy one or two iron cabinets from him, fill in our data in the same way and continue to build reports. But this turned out to be also problematic from the point of view of scalability, since the place is running out, and you need to go to the vendor again, buy a new cabinet and pay a lot of money for it. Before giving this money, I wanted to see what big data-driven companies are doing in the world, which not only make data-based decisions, but also make predictions. After looking at them, the company decided to create its own data platform.
First steps to your platform
To do this, it was first necessary to collect all the data in a single repository and put data engineers, data analysts and data scientists into it so that they could build their models in it, train them, draw storefronts, build and write services for work. with data. In short, benefit our users (= company employees).
We expected it to be very expensive. To avoid another purchase of new hardware, we needed to develop a strategy for our actions and understand how we will move on.
First of all, we have developed 3 basic principles of work on our future platform:
Expertise within the company. All the knowledge gained during the creation of the platform should be consolidated by the Leroy Merlin team;
Using opensource. Make the most of the developments of the community and develop them, and not spend money on proprietary products;
Cloud native & cloud agnostic. We must build a platform that can be deployed in the clouds, but not tied to a specific cloud solution.
Here you can debate and discuss theses from each point for a long time, but we are techies, and the business gave us the helm and allowed us to completely control the process of creating the platform. And it was clear to us, as technicians, that by adhering to these rules, we can create and develop the platform ourselves at the speed that we need:
We are not tied to vendors – if we need something, we can write it ourselves and contribute.
We are not tied to contractors – we ourselves know what, how and where to write.
We are not tied to infrastructure – there is no need to order equipment and wait six months for it to be delivered and installed.
Data platform for 2 years
So in 2019 we started working on creating a data platform.
To begin with, it was necessary to decide on the storage. If you are still at the stage of choosing a repository, we can advise you to watch the report from Maxim Statsenko from Ya.Go (An overview of big data storage technologies).
We didn’t have much choice here – Hadoop Stack (Kudu, Druid, Hive, etc), Clickhouse or Greenplum. The rest of the solutions (Vertica, Teradata, Exasol) did not fit because of their proprietary nature and requirements for the hardware infrastructure. In the end, we settled on GP – because of its ability to work with ANSI-sql and transparency for users – for them it’s just a big Postgres, in which all their reports built on previous solutions worked out of the box, and for CH and Hive I would have to rewrite everything.
At the first stage, we needed to integrate relational tools into the platform in order to enable users to build their reports not on the basis of sources, but on our own. We counted them and realized that our company has over 350 relational databases. The mechanism was set up on the source bases CDC and with the help of Debezium, we began to load this data into Kafka, which we raked with an ETL NiFi tool and loaded everything into the GreenPlum raw layer.
We ran our MVP on virtual machines in the cloud.
Since the MVP has shown good results, we found the movement vector. We had a feeling that we would spend a lot of money on clouds due to the fact that we have a huge amount of processed data, and it is growing every day. Therefore, we purchased iron servers and went to onprem.
What configurations were there?
There were 24 HP DL380 servers, we stuck 12 2 TB screws into each server, assembled a 220 TB cluster (it turned out to fit 110 TB of useful data into it), it had 800 CPU cores and more than 5 TB of RAM:
HPE DL380 Gen10 5115 Xeon-G (Intel (R) Xeon (R) Gold 5115 CPU @ 2.40GHz)
256GB Memory (HPE 32GB 2Rx4 PC4-2666V-R)
10TB RAID10 software array (HPE 1.8TB SAS 10K SFF SC 512e)
4x10GB bonded network interfaces
But, before launching, we decided to compare virtual machines in the cloud with what we can squeeze out on hardware.
About Object Storage
On the X axis – the number of threads (how many we ran the test for writing and reading). The Y-axis is the final speed. Maximum performance is achieved on 40 threads with objects larger than 32 MB.
For the GreenPlum tests, as a result, we got about 20% performance gain on hardware (with the same configurations). We also tested various GP versions (6.x versus 5.x), and tested various system configurations.
More information about the test methodology and the calculation of the final score can be found in the specification: http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v2.11.0.pdf
We made the following conclusions from the Greenplum tests:
The main impact on GP performance is the speed of the disk subsystem, primarily IOPS
When installing GP in the cloud, you need to lay 20% additional resources (CPU, memory)
On virtual machines, you need to set a smaller vm.overcommit_memory (90, instead of 95), because high CPU consumption on cloud infrastructure has a negative impact on the network between segment nodes (packets are lost, responses are slowed down), which can lead to segments falling out of the cluster
The performance of a cluster in the cloud with a large number of small virtual machines is higher than with a small number of large ones – it is better to take 18 virtual machines with 5 segments each, than 5 virtual computers with 18 segments.
Using Kafka, we measured the read / write speed and got 350 MB for writing and 400 MB for reading on ordinary HDD disks. Feature of kafka – sequential write and read from disk allows it to work very quickly on budget hardware. For most tasks, these speeds are enough for the eyes.
Prod on lamba – expectation and reality
After our tests, we happily took everything and ran to the iron cluster. We thought that everything would be quick and easy. But then the problems began.
It turned out that you can’t just go and move to iron. Problems began: disks were flying out of cool branded servers (although the vendor changed them quickly, performance and fault tolerance degraded anyway). We also experienced problems with specific network settings for GreenPlum. We encountered bugs in HP’s bios, the nodes rebooted randomly – we had to update the bios. We also noticed that HP turns on the green mode on the CPU by default, and under heavy load, our cluster throttled, and the performance was not maximum.
And in parallel with this, while the team was engaged in stabilizing sales, our backlog grew, and, instead of developing the product itself, we spent a lot of time on the hardware itself. At the same time, the business demanded an SLA and wanted their requests to be fulfilled always quickly.
Then we thought about how to ensure the continuity of the business, even if a meteorite falls on our data center.
The only solution in this case was to have dual ETL – double infrastructure, double data, spill twice. The question arose about buying another iron cabinet. And then we realized that we need to go back to the clouds.
It should be noted that while we were developing hardware, we also did not stand still on the issue of clouds, deploying a large number of virtual machines for our test and stage environments, for incubation environments.
We have deployed a bunch of managed services in the cloud. We did all this with the help of a Swiss devops knife – Terraform (IaC), Ansible, Jenkins. We also developed a systematic approach to working with infrastructure (iron / cloud / virtual) – we use ELK to collect logs, Consul as a discovery service, collect metrics in Prometheus, and display them in Grafana.
We started transforming our entire company into data-driven. At the moment, we are still at the very start, but already now we see that data-driven products bring considerable profit.
We built our platform entirely on open-source solutions
Cloud providers are helping to provide a quick start for new products. But it must be borne in mind that you cannot thoughtlessly use all cloud services, especially managed ones, especially specific ones and which are not available in other clouds, because there is a chance of getting a cloud lock. Simply put, you gotta be cloud-agnostic
Hybrid clouds are a modern trend that allows companies to choose a cloud based on tasks and criteria (price, availability, convenience, security). And the use of IaC (Infrastructure as Code) simplifies this task.
In the next article we will tell you more about the current architecture of our platform, its components and open source products that we use – experience with GreenPlum, Airflow, Apache Superset, Flink and many others.