How Yandex created its data bus to transfer hundreds of gigabytes per second

Ten years ago, hundreds of Yandex servers ran Apache Kafka®, but we were not happy with everything about it. Our tasks required a single bus for transmitting all types of data, from billing records to application logs. Today the volumes have grown to tens of thousands of named message streams.

With so much data in Apache Kafka®, it became difficult to manage access rights, organize the distributed work of several teams, and much more. These growing pains, and the lack of a suitable solution in the public domain, led us to develop YDB Topics and release it as open source as part of the YDB data platform. In this post I will talk about the prerequisites for creating the product, our data transfer architecture, the challenges we ran into, and the opportunities that appeared with YDB Topics.

Why do we need our own bus?

Here are just a few of the challenges that have arisen when using Apache Kafka®:

  1. No quotas. A team could not be allocated a small slice of a cluster. Only two options were possible: a dedicated large cluster, which was expensive, or a cluster shared by several teams, which Kafka's capabilities did not support well. The clusters themselves were not very stable: we sometimes observed the loss of parts of the transmitted data.

  2. No self-service access to shared clusters. Users could not be given direct access to shared clusters, so a whole team of system administrators had to work through the access-rights requests that users filed in Yandex Tracker.

These inconveniences were what first prompted us to think about our own data transmission bus. By that time, the YDB platform already made it possible to store data reliably and implemented mechanisms such as distributed consensus and failover. So we decided to reuse these ready-made technologies and build our own bus on top of YDB – and that is how YDB Topics appeared. We hoped that making the solution publicly available would encourage the community to collaborate on developing the data ecosystem, so we open sourced it to help meet the needs of large companies and hyperscalers. Below I will show which tasks it may be relevant for.

Data transmission architecture in Yandex

The architecture of data transfer in Yandex is built around three typical scenarios:

  1. Data transfer between applications located in several data centers (DCs). The applications generate the traffic and coordinate the load themselves. They know that there are several data centers, that a data center can fail, and they take this into account in their work. During operation, our control module monitors the load and can ask a client to rebalance its flow between DCs.

  2. Transfer of billing data. Such data must not be lost, duplicated, or reordered (for example, in banking systems the sequence of deposits and debits is important). Users write billing data to the bus, and we replicate it across three availability zones within a single cross-data-center cluster. Such a system can withstand the simultaneous shutdown of an entire data center plus a server rack in another DC, and guarantees high availability for both reads and writes (see the write sketch after this list).

  3. Transfer of application logs and other data in real time. This is almost the same as application data transfer, but the real-time delivery guarantees are weaker.
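
For the billing scenario, what matters most on the write path is ordering and deduplication. Below is a minimal sketch of writing such events with the ydb Python SDK's topic API; the endpoint, database, topic path and producer id are hypothetical, and the exact SDK call names may differ from this shape, so treat it as an illustration rather than a reference.

```python
import ydb

# Hypothetical connection parameters: replace with your cluster's endpoint and database.
driver = ydb.Driver(endpoint="grpc://ydb.example.net:2136", database="/Root/billing")
driver.wait(timeout=5)

# A stable producer_id is assumed here to let the server deduplicate retries and
# preserve per-producer ordering, which the billing scenario relies on (check the SDK docs).
with driver.topic_client.writer("/Root/billing/events", producer_id="billing-writer-1") as writer:
    for event in (b'{"account": 1, "op": "deposit", "amount": 100}',
                  b'{"account": 1, "op": "debit", "amount": 40}'):
        writer.write(ydb.TopicWriterMessage(event))

driver.stop()
```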

The scenario where users know which data center they are working with is called “federation”: an association of individual data centers. Federation is much cheaper than cross-DC: storage uses erasure coding, similar to Reed–Solomon codes, rather than full data replication. And since there is no need to guarantee ordering when reading across data centers, duplicates are allowed and read availability is lower, which applications are prepared for from the start. When one data center goes down, its traffic is distributed among the others.

What are Reed–Solomon codes? If you have four blocks of data, you can add two checksum blocks to them and survive the failure of any two nodes in this system. Unlike replication in cross-DC mode, where we stored three copies of the data, with erasure coding we store data with a redundancy of only 1.5x, which is very significant at our data volumes. Such a system guarantees resiliency: even if two of the six nodes fail, it continues to work, and users do not notice anything. Depending on the configuration, we either provide full exactly-once guarantees with ordering, high availability and transactional processing, or weaker guarantees that will still deliver your data but may produce rare duplicates and occasionally slightly reduced read availability. In either case, the SLA of both systems is high within their respective failure models.
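
A quick sketch of the arithmetic behind these numbers: with a 4+2 scheme (block-4-2), 4 data blocks plus 2 parity blocks give a 6/4 = 1.5x storage overhead and tolerate the loss of any 2 blocks, versus 3x overhead for triple replication. The helper below is purely illustrative and is not the actual encoding used in YDB BlobStorage.

```python
def storage_overhead(data_blocks: int, parity_blocks: int) -> tuple[float, int]:
    """Return (overhead factor, number of block losses survived) for an erasure scheme."""
    total = data_blocks + parity_blocks
    # A maximum-distance-separable code (such as Reed-Solomon) survives losing
    # as many blocks as there are parity blocks.
    return total / data_blocks, parity_blocks

# block-4-2 erasure coding, as used in the federation installations
print(storage_overhead(4, 2))   # (1.5, 2)

# triple replication (mirror-3dc): 1 data "block" plus 2 extra copies
print(storage_overhead(1, 2))   # (3.0, 2)
```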

Why the YDB Topics solution is useful here

The YDB Topics data bus can transmit heterogeneous information with strong durability guarantees. You can deploy it anywhere: on Kubernetes® clusters, on virtual machines, or in a Docker container on a developer's machine. YDB Topics is part of the open source YDB data platform — a disaster-tolerant database that scales to thousands of nodes.

Here's what the data bus looks like in numbers:

This huge flow of data can be managed by just one on-duty SRE specialist. I'll tell you a little more.

Architecture of YDB Topics, Kafka and Pulsar

Apache Kafka® and Apache Pulsar™ are the data buses most actively used in the community, so it is worth comparing their architecture with that of YDB Topics at a high level.

Apache Kafka® and Apache Pulsar™ have management modules that tell clients which server to work with. This role is often played by a system like Apache ZooKeeper™, which is responsible for leader election; incoming data is then transferred to the storage nodes using internal replication mechanisms.

Apache Kafka® works exactly like this: ZooKeeper™ elects a leader and the data is replicated.

Apache Pulsar™ takes a different approach: data storage is implemented by a separate service, Apache BookKeeper. Pulsar™ hands data to BookKeeper, asks it to store the data durably, and uses ZooKeeper™ servers running alongside to choose which server will be the leader and to provide fault tolerance.

YDB Topics is similar to these systems in some ways and different in others. We also store data not locally but in a separate system, YDB BlobStorage. This component accepts data streams and places them durably on storage nodes, taking into account the availability of data centers, servers, and disks, about which it has complete information. BlobStorage handles node failures and keeps the data constantly available. Depending on the settings, it either stores full copies of the data in three DCs or, using erasure coding, reduces the footprint and physically stores only 1.5x the data volume.

Property                              | YDB Topics                          | Kafka®              | Pulsar™
Storage method                        | Dedicated, YDB BlobStorage          | Local data storage  | Dedicated, Apache BookKeeper
Replication method and storage ratio  | Block-4-2, 1.5x or Mirror-3dc, 3x   | Replication, 3x     | Replication, 3x

Features of data storage in YDB Topics:

  • The minimum allowed storage time in YDB Topics installations in Yandex is 18 hours.

  • Data from the most critical services is stored for 36+ hours.

  • The main limiting factor is disk capacity. We switched from hard drives to NVMe and gained in the number of I/O operations, which allowed us to reduce the number of servers from a thousand to several hundred, but disk capacity remained the constraint.

  • Total storage volume: federation – more than 20 PB, cross-DC – more than 1 PB.

A standard cluster in YDB Topics consists of several hundred heterogeneous hosts.

This is a typical configuration: some servers are more powerful, others weaker. Since we keep system software and logs separate from the data, we can determine the number of available cores and the amount of free space on each host and rebalance the system accordingly: more powerful servers take on more of the computing load, and servers with more disks store more data.
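
To illustrate the idea (this is not the actual YDB Topics balancer, just a sketch of proportional weighting), load and data can be split across heterogeneous hosts in proportion to their free resources:

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cores: int           # available CPU cores
    free_disk_tb: float  # free disk space, TB

def share_of_load(hosts: list[Host]) -> dict[str, float]:
    """Give each host a fraction of compute load proportional to its cores."""
    total = sum(h.cores for h in hosts)
    return {h.name: h.cores / total for h in hosts}

def share_of_data(hosts: list[Host]) -> dict[str, float]:
    """Give each host a fraction of stored data proportional to its free disk."""
    total = sum(h.free_disk_tb for h in hosts)
    return {h.name: h.free_disk_tb / total for h in hosts}

hosts = [Host("strong", cores=64, free_disk_tb=20.0),
         Host("weak", cores=16, free_disk_tb=40.0)]
print(share_of_load(hosts))   # the stronger host takes 80% of the compute load
print(share_of_data(hosts))   # the host with more free disk stores ~67% of the data
```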

In general, the largest clusters of the YDB platform have more than 10 thousand nodes. Everything can be monitored by a single on-duty SRE specialist mainly because all operations on YDB platform entities are handed over to the users themselves. In the case of topics, users manage everything they need for their work: they create and delete topics, change settings and manage ACLs, receive new capacity and return old capacity. Moreover, they can do this at any time, even if the YDB Topics team is rolling out a new release to the same cluster at that moment. In the other services of the YDB platform, all such operations are also delegated to users. This is what lets us focus on other tasks.

Protocols, security and monitoring

The open source solution supports its own YDB Topics protocol as well as the Apache Kafka® protocol, and the Yandex Cloud version additionally supports the Amazon Kinesis protocol.
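
Because the Kafka® wire protocol is supported, existing Kafka clients can talk to YDB Topics without code changes. Below is a minimal sketch using the kafka-python library; the bootstrap address, topic name and consumer group are hypothetical and depend on how your installation exposes the Kafka-protocol endpoint.

```python
from kafka import KafkaProducer, KafkaConsumer

BOOTSTRAP = "ydb-topics.example.net:9092"  # hypothetical Kafka-protocol endpoint
TOPIC = "app-logs"                          # hypothetical topic name

# Write a message; acks="all" waits for the durable write acknowledgment.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, acks="all")
producer.send(TOPIC, key=b"service-1", value=b"started")
producer.flush()

# Read it back with a regular Kafka consumer.
consumer = KafkaConsumer(TOPIC, bootstrap_servers=BOOTSTRAP,
                         group_id="example-group",
                         auto_offset_reset="earliest")
for record in consumer:
    print(record.key, record.value)
    break
```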

All data within Yandex is transmitted in encrypted form. We conduct regular security audits and check the quality of the code. We monitor the state of the system and manage it using the web console and through SQL queries.

What's next

Our plans:

  1. Optimizing speed and network traffic.

  2. Data schematization. Historically, Yandex transmitted and processed binary data sets, but as the company grew, it became more difficult to understand what data was processed where. This is why we are moving towards data cataloging.

  3. Increased support for the Apache Kafka® API. Data can already be written and read using the Apache Kafka® protocol; a lot of work has gone into integration with Kafka Connect, where we support running connectors in Kafka Connect standalone mode. We are actively working on deeper integration and also plan to support working through KSQL.

We are open to suggestions! If you have ideas on how to improve YDB Topics, come and share them: create feature requests in Issues on GitHub or write to us in the community chat.
