How the data management platform in Yandex Go has evolved: a Yandex report

Since 2017, we have been building a data management platform and constantly adapting it to changing requirements and external factors. For us, it is infrastructure that lives on and is reused across the services that make up Yandex Go: Taxi, Food, Shop, and Delivery.

First, we'll define the scope of what we consider the platform to be and what we do. Next, I'll describe the architecture of YTsaurus (YT), a system designed for processing and storing data, and how its use developed on the Yandex Go side. I will look at YT from the point of view of what it offers storage and platform developers, analysts, and other users who actively work with data.

This post is based on the talk I gave at the SmartData conference.

DAMA-DMBOK and the data life cycle

In our work, we rely not only on our own experience or that of our colleagues, but also on established best practices. For example, DAMA-DMBOK is a book that clearly formulates sound data management principles.

Excerpt from DAMA-DMBOK on data management

One takeaway from the DAMA-DMBOK principles is that specialists involved in data management should focus on the entire life cycle of the product, that is, on everything that happens to the data.

We also take the goals of what we do from the DMBOK:

  • focus on the different needs of data users;

  • provide tools for data processing;

  • ensure data quality;

  • ensure data security.

An important aspect of the data life cycle is that everything that happens to data fits into a classic loop: planning → design → creation → storage, maintenance, and improvement of the experience of using and processing data → and planning again.

Data Lifecycle

We will not dwell on every stage of the data life cycle, only on the especially expensive ones: processing, storage, and use.

How the Yandex Go data storage system developed

In 2017, there was no Yandex Go yet, just Yandex Taxi. I was the sixth person on the warehouse analytics and development team. We already had dozens of sources and terabytes of data, and the volumes were growing exponentially. Our goal was to build replicas of the backend databases and data marts for the main entities, and to establish basic reporting and dashboards.

Yandex already had YT back then, and we built the simplest possible architecture on top of it: we loaded data from the sources into staging and created some basic data marts. Analysts could solve their problems in so-called sandboxes, isolated from the production process. For reports, there was an in-house solution integrated with YT.

We quickly outgrew these tools. Taxi was growing, the number of backend services was increasing, the subject area was developing, and we needed to quickly meet the needs of the growing analytics team.

The next challenge was to provide quick access to existing data marts, learn how to build more complex aggregates, and reduce data latency from the source. And since the service was growing rapidly, we could not pin the requirements down well and accurately in advance.

And then ClickHouse® appeared. It solved the problem of fast ad-hoc access and made it possible to compute complex aggregates for reporting.

We also realized that we could not go back to the source every time we needed history, so we created an operational data store layer, or ODS, and data marts began to be built on top of the ODS. The need for constant recalculation of history was covered by this layer of "raw" data.

We began to think about how to scale our solutions to a large product and a large team that kept growing. It was important to reduce development time to market (TTM) and to let some problems be solved through self-service.

The first solution was Greenplum®, an MPP DBMS where an ETL developer or analyst can simply write an SQL query instead of a MapReduce job and thereby speed up development. Greenplum® also solves the performance problem in certain scenarios, because the scalability of a MapReduce system is not needed always and everywhere. In addition, Greenplum® allowed us to start working with standard BI tools known on the market, such as Tableau and MS SSAS. At some point we no longer needed ClickHouse®, because the combination of Greenplum® and Tableau fully covered all the scenarios previously handled by the ClickHouse cluster.

Then, in addition to Yandex Taxi, we got Food, Shop, and Delivery. Again, it was important not to let time to market slip, and we needed to figure out how to adapt the architecture to new challenges. At this point we added a detail layer, built on hNhM, a paradigm that combines the properties of anchor modeling and Data Vault. More about hNhM can be found in the article by Zhenya Ermakov.

At the turn of 2019 and 2020, we managed to adapt Apache Spark™ to work with YTsaurus. At that time, YT offered only MapReduce and YQL (an SQL dialect on top of MapReduce) for batch computations, and we needed a faster compute engine.

By 2022, our huge central DWH had been split into several warehouses for individual business units. Each of them began to focus on the applied tasks of its own service, and, accordingly, the platform team, whose development I am involved in, was spun off.

In 2022, we had to quickly retire Tableau. We were moving to DataLens, and for it to work we needed the ability to build live reports. This scheme works poorly on top of Greenplum®, especially when the Greenplum cluster is as heavily loaded as ours. That's why ClickHouse® returned to our stack.

So, at the moment we have a Data Lake on YTsaurus and a Greenplum® data warehouse, plus ClickHouse®, DataLens, and OLAP cubes (MS SSAS), which continue to help the company solve its problems.

Timeline of Yandex Go development history

Looking at our history, you can see that the most stable part of the entire infrastructure is YTsaurus. We have been working with it since the very beginning of the project, and its use cases have changed only minimally. That's why I want to talk about it separately.

Data storage solution in Yandex Go: YTsaurus

Main components of YTsaurus:

  • Cypress, a distributed file system and metadata store.

  • Resource scheduler for MapReduce and other types of computation.

  • Various computing engines.

YTsaurus architecture diagram

I would like to emphasize that YT has a convenient interface where you can find out everything about processes and data: what fails and where, how many resources are spent, and so on. It serves both the cluster user and the administrator.

YTsaurus interface

Data Storage: Cypress

YT has static and dynamic tables. Static tables can simply be thought of as files that sit in a distributed file system and are used for "large" batch computations. At the same time, you can specify a table schema, including "complex" types (arrays, dictionaries, etc.), control the sorting of the data, and ensure uniqueness based on the sort key. We use static tables to store logs and business data that does not change too often.
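As a sketch, creating a schematized, sorted static table might look like this with the YTsaurus Python client (`yt.wrapper`); the path, column names, and `client` object are illustrative assumptions, not our production code:

```python
# Sketch of creating a schematized, sorted static table. Paths and
# column names are made up; client is assumed to be a
# yt.wrapper.YtClient("your-cluster") instance.

def make_orders_schema():
    # Sorted columns come first; the sort key is also what lets YT
    # enforce uniqueness if the schema requests it.
    return [
        {"name": "order_id", "type": "string", "sort_order": "ascending"},
        {"name": "created_at", "type": "timestamp"},
        {"name": "tags", "type_v3": {"type_name": "list", "item": "string"}},
    ]

def create_orders_table(client, path="//home/etl/ods/orders"):
    client.create("table", path, attributes={"schema": make_orders_schema()})
```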

Dynamic tables are key-value stores. We use dynamic tables in ETL processes to build ODS (Operational Data Store) with minimal latency from the source.
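A low-latency ODS write path on a dynamic table can be sketched with the client's `insert_rows`/`lookup_rows` calls; the table path, columns, and `client` object here are illustrative assumptions:

```python
# Sketch of key-value access to a YTsaurus dynamic table.
# insert_rows overwrites by key; lookup_rows reads by key with low
# latency. client is assumed to be a yt.wrapper.YtClient; the path
# and columns are made up.

def upsert_order_status(client, path, order_id, status):
    client.insert_rows(path, [{"order_id": order_id, "status": status}])

def read_order_status(client, path, order_id):
    rows = list(client.lookup_rows(path, [{"order_id": order_id}]))
    return rows[0]["status"] if rows else None
```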

Both types of tables can have row or column storage.

There are many options to control the size of tables:

  • Data can be compressed: there is a large selection of codecs, so you can choose what matters more, saving disk space or spending extra CPU.

  • You can control TTL (Time To Live), that is, indicate that data should be removed from the system at some point.

  • There are different storage redundancy settings: replication and erasure coding.

  • Partitioning can be emulated with folders and tables inside them.

We actively use various codecs for data compression and erasure coding to save disk space when storing rarely used historical business data and logs.
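Moving a rarely used table to "cold" settings could be sketched like this. The attribute names (`compression_codec`, `erasure_codec`, `expiration_time`) follow the YTsaurus documentation as I recall it; the values, path, and `client` object are illustrative, so verify against your cluster version:

```python
# Sketch: re-encoding a cold table to save disk space. Attribute
# names follow the YTsaurus docs; values and paths are made up.

COLD_ATTRS = {
    "compression_codec": "brotli_8",               # smaller size, more CPU
    "erasure_codec": "lrc_12_2_2",                 # ~1.5x overhead vs 3x replication
    "expiration_time": "2026-01-01T00:00:00.000Z", # TTL: node removed after this
}

def archive_table(client, path, attrs=COLD_ATTRS):
    for name, value in attrs.items():
        client.set(f"{path}/@{name}", value)
    # Existing chunks keep their old codecs until rewritten, so force
    # a merge to re-encode the data in place.
    client.run_merge(path, path, spec={"force_transform": True})
```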

In YT, within a single transaction we can, for example, spend a week or even longer computing some heavy, large table (or even several tables), and then swap it in unnoticed by the users. This is great: nobody sees the switch, everyone keeps working with the data, read requests don't fail, nothing crashes, everyone is happy. This is achieved through a well-developed locking system. For example, there is a snapshot lock, which guarantees that within your operation you will see the snapshot of the data as it was when the operation began, and you will not suffer if the data changes or disappears in the meantime.

We actively use this feature during large-scale recalculations of history. Thanks to this approach, historical recalculations do not cause pain for data users.
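The "compute long, publish atomically" pattern can be sketched as follows; the staging/live paths and the `client` object are illustrative assumptions:

```python
# Sketch: the heavy table is built at a staging path, then swapped in
# under a transaction. client is assumed to be a yt.wrapper.YtClient;
# paths are made up.

def publish_mart(client, staged_path, live_path):
    with client.Transaction():
        # Readers holding a snapshot lock keep seeing the old data;
        # new readers see the new table once the transaction commits.
        client.move(staged_path, live_path, force=True)
```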

In a large company, it is important to be able to manage everything possible within one system so that people cannot damage production processes. YT has hierarchical accounts with which you can manage disk storage and limit the number of objects. This is convenient and useful with a large number of users, and critical when there are thousands of them. Quotas are visible and managed through the interface, and they solve the problem of isolating users from each other, production processes from ad-hoc workloads, and so on.

Within our businesses, different accounts are used for the production processes of different owners. In some cases we allow quotas to be overcommitted in order to utilize resources better.

YT has a well-developed access control system. Each object can have its own access control list (ACL), which lists the users or groups allowed to access the data.

Next, we'll look at different computing engines.

Compute

How and why to use MapReduce nowadays

Let's start with MapReduce. Consider our scenario of delivering data to the operational layer, which involves an on-the-fly transformation. Suppose we need to recompute it entirely, and going to the source for that is very expensive. Then we strictly regulate what this transformation may contain and run it as a map operation in YT over the raw data to recalculate the history. It's convenient and cheap, because what you need here is not extreme processing speed but efficiency and reliability. This way we get a system in which you don't necessarily have to stop the current data delivery in order to recalculate the history.
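Such an ODS transformation can be sketched as a map operation; the mapper itself is pure Python, so the same code serves incremental loads and full history recalculation. The field names, paths, and `client` object are illustrative assumptions:

```python
# Sketch of an ODS-style map transformation over raw log data.
# Field names and paths are made up; client is assumed to be a
# yt.wrapper.YtClient.

def ods_mapper(row):
    # On-the-fly, strictly regulated transformation: rename, normalize,
    # fill defaults. No joins or aggregation, so a map operation suffices.
    yield {
        "order_id": row["id"],
        "status": (row.get("status") or "unknown").lower(),
        "utc_created_dttm": row["created_at"],
    }

def recalc_history(client, raw_path, ods_path):
    # One map operation over the raw layer rebuilds the ODS history
    # without touching the source system.
    client.run_map(ods_mapper, raw_path, ods_path)
```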

YT has an SQL-like language called YQL. It runs on top of MapReduce and, under certain conditions, can use an engine that performs the computation in memory. It can be extended with your own UDFs, for example written in C++ or Python. YQL has a very convenient interface with a history of all queries and the ability to share them between users.

We use YQL to process the largest logs and to compute data marts that do not require particularly low latency.
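For a feel of the syntax, a query of this kind might look as follows (shown as a Python string; the path and columns are made up). YQL addresses YT tables by their Cypress path in backticks:

```python
# Illustrative YQL for a simple daily aggregate over a raw log.
# The table path and column names are invented for this example.
DAILY_ORDERS_YQL = """
SELECT
    SUBSTRING(created_at, 0, 10) AS utc_order_date,
    COUNT(*) AS orders_cnt
FROM `//home/logs/raw/orders`
GROUP BY SUBSTRING(created_at, 0, 10)
ORDER BY utc_order_date;
"""
```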

SPYT powered by Apache Spark

Spark over YT

Spark appeared in Yandex at the turn of 2019–2020 on our team's initiative. In YT it can run as a Spark standalone cluster, where Spark manages its own resources and lives entirely inside YT. The standalone cluster supports both client mode and cluster mode. In addition, starting from SPYT 1.76.0, Spark applications can be launched directly through the YTsaurus scheduler, without starting a standalone cluster.

Spark over YT: Client Mode

We actively use Spark for processing increments, computing data marts with complex business logic, and other scenarios where speed matters and the data fits into the memory available to us.
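A SPYT job building a small daily mart could be sketched like this. The `spark.read.yt` / `write…yt` helpers come from the SPYT integration as described in its documentation; the paths, column names, and the way `spark` is obtained are illustrative assumptions:

```python
# Sketch of a SPYT job: read an ODS table from YT, aggregate, write a
# data mart back. Paths and columns are made up; spark is assumed to be
# a SparkSession with the SPYT extensions enabled.

def build_daily_mart(spark, src="//home/ods/orders", dst="//home/cdm/orders_daily"):
    df = spark.read.yt(src)
    daily = (df.groupBy("utc_order_date")
               .count()
               .withColumnRenamed("count", "orders_cnt"))
    daily.write.mode("overwrite").yt(dst)
    return dst
```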

BI scripts

BI over YT

CHYT, powered by ClickHouse®, is slightly slower than a regular local ClickHouse®, but for many reports its performance is more than enough. It is also fairer to compare CHYT's speed with ClickHouse® Cloud or ClickHouse® over S3, and CHYT is comparable to them. As a result, you can cover BI scenarios and do ad-hoc analytics without copying data to faster storage.

CHYT is actively used for reporting in Taxi, Food, Shop and Delivery. Only the most critical and popular reporting data is replicated to the local ClickHouse®.
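A CHYT query is ordinary ClickHouse® SQL, except that tables are addressed by their YT (Cypress) path; an illustrative query (shown as a Python string, with a made-up path and columns):

```python
# Illustrative CHYT query: ClickHouse® SQL over a table addressed by
# its Cypress path in double quotes. Path and columns are invented.
REPORT_QUERY = """
SELECT
    utc_order_date,
    sum(orders_cnt) AS orders_cnt
FROM "//home/cdm/orders_daily"
GROUP BY utc_order_date
ORDER BY utc_order_date
"""
```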

Scheduler and pools

All of the above works due to a well-developed scheduler and a system of computing pools.

The scheduler manages resources. Pools can be arranged into a hierarchy that guarantees isolation and reuse of resources within the tree, which is convenient because individual pools can be prioritized by their weight.

The picture above shows that within one business we create one large root compute pool, which is then divided among different users with different load patterns. Each large user can create additional compute pools with their own guarantees inside their pool to fine-tune their processes.
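A nested pool of this kind could be created roughly as follows. The `scheduler_pool` object type and the `name`/`parent_name`/`pool_tree`/`weight` attributes follow the YTsaurus documentation as I recall it; the tree and pool names, and the `client` object, are illustrative assumptions:

```python
# Sketch of creating a child compute pool under a business's root pool.
# Names and the pool tree are made up; client is assumed to be a
# yt.wrapper.YtClient.

def create_child_pool(client, name, parent, tree="physical", weight=1.0):
    client.create("scheduler_pool", attributes={
        "name": name,
        "parent_name": parent,  # inherits guarantees from the root pool
        "pool_tree": tree,
        "weight": weight,       # relative priority among sibling pools
    })

# e.g. create_child_pool(client, "etl-critical", "go-root", weight=4.0)
```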

We can summarize that YTsaurus is a platform within which many useful tools for various scenarios are tightly integrated. We plan to continue to actively use YT. I hope that soon we will be ready to talk about the use of YTsaurus in stream data processing scenarios and its integration with Apache Flink®.


When building a platform, that is, a complex product, it is important to rely on your own and your colleagues' experience, and not to forget about best practices.

It is important to start from the requirements: there is no point in building some mega-complex architecture right away; it can be developed gradually. Our example shows that this works, and there is no need to be afraid to experiment. Experiments are definitely the path to success 🙂

Thank you for your attention, and share your experience or ask questions in the comments.
