How we implemented CockroachDB on DBaaS in a classic DBMS company
Hello! My name is Polina Kudryavtseva, and I'm a DBA engineer at Avito. My team and I develop and support the PostgreSQL and CockroachDB database management systems (DBMS) on the DBaaS platform. While PostgreSQL is more or less familiar to everyone, the newer CockroachDB technology tends to raise a lot of questions. Today I'll tell you how this “cockroach” crawled into Avito and settled in.
What's inside the article:
World of databases in Avito before CockroachDB
How CockroachDB minimizes latency
What CockroachDB looks like on the DBaaS platform in Avito
What are we planning to do next?
The text is based on my talk at NEmitap Database #1 – CockroachDB on the DBaaS platform.
World of databases in Avito before CockroachDB
For a long time, only classical database management systems (DBMS) lived in Avito. Among them:
the document-oriented DBMS MongoDB for storing kilometer-long JSON documents;
the key-value DBMS Redis – first aid for working with caches;
the columnar analytical DBMS ClickHouse for OLAP workloads;
the Elasticsearch search engine;
Ceph object storage, used, among other things, for backups;
the message brokers Kafka and RedPanda, indispensable for implementing queues;
the well-known PostgreSQL.
Some of these DBMSs are fully supported on the platform; others are operated almost entirely by hand with minimal automation. All of them are maintained by different teams, and each technology has its own nuances of setup, support, life cycle, and debugging. These DBMSs solve different problems and are as far from each other as stars in space. Still, they coexisted for a long time and made the lives of product developers better and more fun. What was missing?
How we came to CockroachDB
For a service to meet a 99.99% availability target, its DBMS must be able to survive peak loads of up to 10x normal and scale quickly into new data centers.
This calls for special tooling, and at that moment there was none on the platform. The selection criteria were: automatic scaling, fault tolerance, and replication between regions.
What else is important for this tool:
support for data consistency and integrity, parallel access, atomicity, error recovery, and the ACID properties;
PostgreSQL-compatible protocol, since many developers in Avito already have experience using PostgreSQL to write services;
the presence of a Go client, because the main language for writing services in Avito is Go.
All these criteria converged on CockroachDB – it is a distributed open source DBMS written in Go and supporting SQL. Below are its main advantages.
Serializable isolation level. It guarantees that the result of transactions executed in parallel is the same as if they had been executed one after another. This ensures data correctness and prevents the anomalies allowed by weaker isolation levels.
Suitable for services with high requirements for fault tolerance. CockroachDB continues to work even if nodes or data centers fail due to its data replication features. Database updates also occur without downtime.
Horizontal scaling out of the box. It is enough to add new nodes to the cluster, and the system itself will redistribute not only the load but also the data across the new topology. No third-party solutions are needed. In addition, CockroachDB integrates easily with container orchestrators and schedulers such as Kubernetes (k8s).
MVCC support. CockroachDB provides the ability to access data in parallel without breaking data consistency by storing multiple versions of rows. Row versions that are not used by any transaction are subsequently garbage collected. In this case, the disk space they occupy is physically freed.
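The MVCC behavior described above can be illustrated with a toy versioned store: writes append new row versions instead of overwriting, reads at a timestamp see a consistent snapshot, and garbage collection drops versions no transaction can see any more. This is a sketch of the idea, not CockroachDB's actual storage engine; all names are illustrative.

```go
package main

import "fmt"

// version is one MVCC entry: a value and the timestamp it was written at.
type version struct {
	value string
	ts    int64
}

// store keeps every version of each row, oldest first.
type store struct {
	rows map[string][]version
}

func newStore() *store { return &store{rows: map[string][]version{}} }

// put appends a new version instead of overwriting the old one,
// so concurrent readers keep seeing a consistent snapshot.
func (s *store) put(key, value string, ts int64) {
	s.rows[key] = append(s.rows[key], version{value, ts})
}

// get returns the newest version visible at the given read timestamp.
func (s *store) get(key string, ts int64) (string, bool) {
	vs := s.rows[key]
	for i := len(vs) - 1; i >= 0; i-- {
		if vs[i].ts <= ts {
			return vs[i].value, true
		}
	}
	return "", false
}

// gc drops versions no active transaction can see: everything older
// than the newest version at or below the GC threshold.
func (s *store) gc(threshold int64) {
	for key, vs := range s.rows {
		keep := 0
		for i, v := range vs {
			if v.ts <= threshold {
				keep = i // newest version still visible at the threshold
			}
		}
		s.rows[key] = vs[keep:]
	}
}

func main() {
	s := newStore()
	s.put("item", "v1", 10)
	s.put("item", "v2", 20)
	old, _ := s.get("item", 15) // a snapshot read at ts=15 sees v1
	cur, _ := s.get("item", 25) // a newer read sees v2
	fmt.Println(old, cur)
	s.gc(20) // no transaction reads below ts=20 any more
	fmt.Println(len(s.rows["item"]))
}
```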
How data is distributed in CockroachDB
When choosing a DBMS for a specific task, it is important to understand how data is distributed. Let's say there is a monolithic table with a primary key and one column. At a low level, CockroachDB divides the table into approximately equal parts – they are called ranges.
Each range is copied as many times as specified in the replication factor in the DBMS settings. The replication factor cannot be lower than three, that is, each range is represented in at least three copies (one master and two replicas).
Depending on the setup, copies of the range are distributed evenly across the cluster nodes in accordance with the locality settings.
Each range has a master or leader that handles read and write requests. The node containing the leader of a particular range is called the leaseholder node of this range.
All data changes in a specific range are propagated from the leaseholder node to its replicas. By default, reading and writing are performed only on the leaseholder node.
To serve requests, the cluster needs to know where the range leaders are. The distribution of ranges across nodes is stored in the global key spaces meta1 and meta2.
The meta1 key can be considered as a top-level search engine for ranges (range of ranges), and meta2 contains pointers on which node the required key is located. In the range meta1 there is a pointer to the desired meta2, and meta2 stores a pointer to the leaseholder node of the range containing the required key. So meta1 and meta2 can be thought of as a two-level index used to locate the master of the range being searched for.
For example, suppose a user requests data for the key key3. In the worst case, the system first consults meta1 to find out where the relevant meta2 is located. It then follows that pointer to learn which node is the leaseholder for key3, and finally contacts that leaseholder node to fetch the data for the desired range.
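The two-level lookup can be modeled as a tiny in-memory structure: meta1 tells you which meta2 range covers a key, and meta2 tells you which node holds the lease for the data range. The names, key boundaries, and node labels below are all illustrative, not real CockroachDB internals.

```go
package main

import "fmt"

type entry struct {
	endKey string // keys strictly below endKey fall into this entry
	target string // the meta2 range (level 1) or leaseholder node (level 2)
}

// cluster is a toy model of the two-level meta index.
type cluster struct {
	meta1 []entry
	meta2 map[string][]entry
	data  map[string]map[string]string // node -> key -> value
}

// lookup walks meta1, then meta2, to find the leaseholder node for key.
func (c *cluster) lookup(key string) (meta2Range, node string) {
	for _, e := range c.meta1 {
		if key < e.endKey {
			meta2Range = e.target
			break
		}
	}
	for _, e := range c.meta2[meta2Range] {
		if key < e.endKey {
			node = e.target
			break
		}
	}
	return
}

func main() {
	c := &cluster{
		meta1: []entry{{"m", "meta2-a"}, {"\xff", "meta2-b"}},
		meta2: map[string][]entry{
			"meta2-a": {{"g", "node1"}, {"m", "node2"}},
			"meta2-b": {{"t", "node3"}, {"\xff", "node1"}},
		},
		data: map[string]map[string]string{
			"node2": {"key3": "value3"},
		},
	}
	// key3 < "m", so meta1 points at meta2-a; there, key3 maps to node2.
	m2, node := c.lookup("key3")
	fmt.Println(m2, node, c.data[node]["key3"])
}
```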
How a query is performed in CockroachDB
When a database client—a developer or a service—sends a request to the DBMS, it can be executed on any node in the cluster. In the example below, the request was sent to the second node. First, the meta1 range is read, where it is indicated that meta2 for the desired key is located on the first node. The request is redirected to the first node, and then meta2 is read. From meta2 it is known that the range holder (leaseholder) with the required key is located on the third node. The request is redirected there. You can also work with data there.
In the image above, the database consists of three nodes: Node 1, Node 2, Node 3. Each of them stores data (keys and values), and the metadata is split into two parts (meta1 and meta2) that describe how keys are distributed across nodes. In this worst case, every hop the request must pass through sits on a different node. In practice, CockroachDB processes requests faster, minimizing latency and avoiding unnecessary network round trips.
CockroachDB uses the Raft protocol as its distributed consensus mechanism. Each range forms a separate Raft group, so consensus for each range is reached independently of the others. When a client sends a request to any node in the cluster, it is redirected to the leaseholder of the range affected by the request, and from there to the Raft leader. The Raft leader records the changes in the Raft log, which is then replicated to all replicas of the range. Once a majority of replicas acknowledge the changes, the Raft leader reports the result to the leaseholder, which returns it to the node executing the request, and from there it goes back to the client.
The leaseholder controls access to the range to ensure transactional integrity and isolation. The Raft leader is responsible for replication and data safety, recording changes in the Raft log and distributing them to the range's replicas. Because the algorithm requires a majority of replicas to acknowledge every write, it can lead to high write latency.
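The majority rule described above boils down to simple arithmetic: an entry commits once a strict majority of the range's replicas have appended it. A minimal sketch of that quorum math (illustrative, not the actual Raft implementation):

```go
package main

import "fmt"

// quorum returns the number of acknowledgments Raft needs to commit
// an entry in a group of n replicas: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// committed reports whether an entry with the given number of
// acknowledgments (including the leader's own) is safe to apply.
func committed(acks, replicas int) bool { return acks >= quorum(replicas) }

func main() {
	fmt.Println(quorum(3), quorum(5))             // 2 3
	fmt.Println(committed(2, 3), committed(2, 5)) // true false
}
```

With the default replication factor of three, two acknowledgments are enough, which is why one slow or failed replica does not block writes.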
How CockroachDB minimizes latency
For optimization reasons, CockroachDB tries to make the leaseholder of a range coincide with that range's Raft leader: keeping both roles on the same node reduces the number of network round trips between nodes and minimizes latency.
Executing a query in CockroachDB begins with checking caches, not with a lookup through the system ranges. CockroachDB caches whatever it can to keep network round trips to a minimum:
CockroachDB first caches the last read data from meta2. This allows the system to immediately know where the leaseholder of a specific key is located and send a request directly to it, bypassing intermediate steps.
If the key is not cached, CockroachDB tries to access the meta2 location cache. The system skips meta1 and immediately goes to meta2 to find the necessary data and determine the location of the leaseholder.
If the meta2 location is not cached either, any CockroachDB node always knows where the leaseholder of the meta1 range is located: this information is stored in the meta0 range, which is available to all nodes and always contains up-to-date meta1 location data.
Caching helps CockroachDB avoid unnecessary network trips, so query delays are the exception, not the rule. The system does not go through all the points on different nodes, but immediately sends the request to the desired leaseholder using cached data.
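The fallback order from the three bullets above can be sketched as a tiny decision function: a warm key cache means one hop, a warm meta2 cache skips meta1, and cold caches mean the full worst-case walk. All names here are illustrative.

```go
package main

import "fmt"

// lookupPath returns the hops needed to reach a key's leaseholder,
// given which caches are warm on the node that received the query.
func lookupPath(keyCached, meta2Cached bool) []string {
	switch {
	case keyCached:
		// The meta2 entry for this key is cached: go straight to the leaseholder.
		return []string{"leaseholder"}
	case meta2Cached:
		// meta2's location is cached: skip meta1 entirely.
		return []string{"meta2", "leaseholder"}
	default:
		// Cold caches: the full worst-case walk.
		return []string{"meta1", "meta2", "leaseholder"}
	}
}

func main() {
	fmt.Println(lookupPath(true, true))   // warm key cache: 1 hop
	fmt.Println(lookupPath(false, true))  // warm meta2 cache: 2 hops
	fmt.Println(lookupPath(false, false)) // cold caches: 3 hops
}
```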
What CockroachDB looks like on the DBaaS platform in Avito
Cluster topology
By default, a CockroachDB cluster on the DBaaS platform consists of three nodes – one in each data center (DC). Such a cluster survives the unavailability of one node and one data center.
But what is the point of bringing a distributed DBMS onto the platform without the ability to scale? Let's see what happens if we increase the number of nodes to six, keeping the default settings:
A cluster of six nodes with default settings also painlessly survives the inaccessibility of only one node. If two cluster nodes fail, there is a risk that the failed nodes will contain the majority of replicas of some range. For example, the failure of node1 and node2 will lead to the unavailability of the pink range, the failure of node4 and node5 will lead to the unavailability of the blue range, and so on.
This configuration no longer allows one to survive the fall of an entire data center. Each DC contains the majority of the replicas of at least one range: DC1 contains the majority of the purple range replicas, DC2 – the majority of the blue range replicas, DC3 – the majority of the yellow range replicas, so if the DC fails, inaccessible ranges will appear.
Living like this is no fun. One way to fix this annoying behavior is to increase the replication factor, for example to five:
A cluster of six nodes with a replication factor of five survives the unavailability of any two nodes and even of an entire DC. However, a higher replication factor buys storage reliability at the cost of more total data stored and more write load on the nodes.
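The trade-off above can be quantified with a couple of lines: with replication factor rf, a range survives (rf-1)/2 failed replicas, while the stored volume grows linearly with rf. A minimal sketch (not CockroachDB code):

```go
package main

import "fmt"

// tolerated returns how many replicas of a range may fail while the
// remaining ones still form a Raft majority.
func tolerated(rf int) int { return (rf - 1) / 2 }

// storageCost returns the total bytes stored for logicalBytes of data
// at the given replication factor.
func storageCost(logicalBytes int64, rf int) int64 { return logicalBytes * int64(rf) }

func main() {
	for _, rf := range []int{3, 5} {
		fmt.Printf("rf=%d survives %d failed replicas, stores %d bytes per logical byte\n",
			rf, tolerated(rf), storageCost(1, rf))
	}
}
```

So moving from rf=3 to rf=5 doubles the tolerated replica failures (1 → 2) but also raises the storage and replication-write cost by two thirds.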
When starting a cluster, you can specify locality – an option that determines how CockroachDB should optimize the distribution of data across nodes.
You can specify --locality=region=msk,zone=<datacenter>,
and then CockroachDB will distribute the data across nodes so that range replicas are balanced across the data centers. In this case, the failure of an entire data center will not make the cluster unavailable, but the unavailability of two nodes in different data centers can still make part of the data unavailable.
This is exactly the option we settled on: with both the three-node and the six-node configuration, a CockroachDB cluster on the DBaaS platform survives the loss of a data center.
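The effect of locality-aware placement can be shown with a toy simulation: spreading a range's three replicas one per data center keeps a majority alive when any single DC goes down, while piling two replicas into one DC does not. All names are illustrative, not CockroachDB APIs.

```go
package main

import "fmt"

// placeReplicas assigns rf replicas of a range to zones (data centers)
// round-robin, the way a locality-aware placement avoids putting two
// replicas of the same range into one DC when enough DCs exist.
func placeReplicas(zones []string, rf int) []string {
	out := make([]string, 0, rf)
	for i := 0; i < rf; i++ {
		out = append(out, zones[i%len(zones)])
	}
	return out
}

// survivesZoneLoss reports whether losing one zone still leaves a
// majority of the range's replicas alive.
func survivesZoneLoss(replicas []string, down string) bool {
	alive := 0
	for _, z := range replicas {
		if z != down {
			alive++
		}
	}
	return alive > len(replicas)/2
}

func main() {
	zones := []string{"dc1", "dc2", "dc3"}
	good := placeReplicas(zones, 3) // one replica per DC
	fmt.Println(good, survivesZoneLoss(good, "dc1"))

	bad := []string{"dc1", "dc1", "dc2"} // two replicas crowded into dc1
	fmt.Println(bad, survivesZoneLoss(bad, "dc1"))
}
```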
Creating a database
A database is born on the DBaaS platform in Avito as follows: each data center runs an autonomous, independent k8s cluster with a controller that regularly polls DBaaS for new storage metadata.
If new storage metadata has appeared, the controller creates the necessary manifests and applies them in the corresponding Kubernetes cluster, bringing up CockroachDB.
A database must not only be created but also maintained. To do this without manual work by developers, Avito uses an agent – a component designed to service and support the CockroachDB cluster. It runs as a sidecar container next to each node of the database.
The agent has several functions:
creating and managing the role model – provisioning service and personal accounts with the appropriate database privileges according to their access levels;
creating a database with the platform configuration;
regularly launching backups – the agent is responsible for automatic, scheduled database backups;
checking node readiness – the agent regularly sends heartbeats to monitor the health of each CockroachDB node.
Role model
A role model in a DBaaS platform is the roles and groups that define user access rights in the database.
A role is a user with the LOGIN attribute who can connect to the database. Roles are issued to end users: administrators, developers, or services.
A group is a user with the NOLOGIN attribute, through which a role's access to objects is configured via privilege inheritance in CockroachDB.
Here are the groups that exist on the DBaaS platform:
Full Access (FA). The user can perform DML (Data Manipulation Language) and DDL (Data Definition Language) operations – create objects and change data.
Read Write (RW). The user can only perform DML operations and use sequences.
Read Only (RO). The user can only read table data and current sequence values.
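The FA/RW/RO scheme can be sketched as a simple mapping from group to table privileges. The privilege sets below illustrate the scheme from the article; they are not Avito's actual implementation, and the exact privilege names are assumptions.

```go
package main

import "fmt"

// privilegesFor maps an access group to the table privileges it is
// granted, mirroring the FA/RW/RO groups of the platform's role model.
func privilegesFor(group string) []string {
	switch group {
	case "fa":
		// DDL + DML: create and drop objects, change data.
		return []string{"CREATE", "DROP", "SELECT", "INSERT", "UPDATE", "DELETE"}
	case "rw":
		// DML only: change data, but not the schema.
		return []string{"SELECT", "INSERT", "UPDATE", "DELETE"}
	case "ro":
		// Reads only.
		return []string{"SELECT"}
	}
	return nil
}

func main() {
	for _, g := range []string{"fa", "rw", "ro"} {
		fmt.Println(g, privilegesFor(g))
	}
}
```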
There are two types of user for service accounts: Deploy user and Production user.
Deploy user:
belongs to the FA (Full Access) group;
performs DML and DDL operations;
used to create a data schema in the CockroachDB database.
Production user:
belongs to the RW (Read Write) group;
performs only DML operations;
used for normal operation of the service.
Each user is assigned the group that was requested and approved for it.
The FA group includes two users, deploy_01 and deploy_02; they are needed for rotation. If a service uses the deploy_01 account and its password needs to be changed, the service switches to deploy_02 while the password for deploy_01 is updated. The RW group likewise includes two rotation users.
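The rotation scheme above can be sketched as a two-slot switch: the service always uses the active account, and rotating flips to the standby account and frees the old one for a password change. The account names follow the article; everything else is an illustrative toy model.

```go
package main

import "fmt"

// rotationPair holds the two accounts of one group: while one account's
// password is being rotated, the service authenticates with the other,
// so the password change causes no downtime.
type rotationPair struct {
	accounts [2]string
	active   int
}

// activeAccount is the credential the service should currently use.
func (p *rotationPair) activeAccount() string { return p.accounts[p.active] }

// rotate switches the service to the standby account and returns the
// account whose password can now be changed safely.
func (p *rotationPair) rotate() (toRotate string) {
	toRotate = p.accounts[p.active]
	p.active = 1 - p.active
	return
}

func main() {
	p := &rotationPair{accounts: [2]string{"deploy_01", "deploy_02"}}
	fmt.Println(p.activeAccount()) // deploy_01
	stale := p.rotate()            // the service moves to deploy_02
	fmt.Println(p.activeAccount(), "rotating password of", stale)
}
```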
How we plan to further develop CockroachDB in Avito
The Avito DBA team is already developing and supporting a crdb library for working with the database from the service side, as well as a migration mechanism and sampling. Backups run regularly, and alerts and metrics are configured for round-the-clock, uninterrupted cluster operation.
Key metrics we track:
Unavailable ranges. A range becomes unavailable when more than half of its replicas are down; immediate intervention is required to prevent data loss;
Under-replicated ranges. The number of replicas of some range is below the replication factor, meaning CockroachDB has not been able to replicate the data back up to the required count;
Node availability. A node can fail for a variety of reasons, including hardware failures, network problems, or clock discrepancies;
Time synchronization. Clock drift between nodes can cause failures in a distributed system. The clocks on the Kubernetes nodes may diverge, and if the offset exceeds the threshold set in the configuration, CockroachDB can no longer determine which transaction came earlier and which later, so the cluster becomes unavailable. This happens when system clocks fall out of sync due to network delays or hardware problems.
Future plans for the development of CockroachDB on the DBaaS platform include the development of a mechanism for sending a data stream to the DWH (Data Warehouse). This tool is necessary to analyze data from different sources, store its history and provide data in a form convenient for business.
To build this tool, we will use CockroachDB's built-in changefeeds functionality, which tracks DML events. DDL operations will be detected by continuously comparing the data schema with the last recorded state. All captured events will be converted to CDC (Change Data Capture) format and sent to Kafka.
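Since changefeeds cover DML, the DDL side comes down to diffing schema snapshots. Here is a sketch of that idea under stated assumptions: a schema is modeled as a map from table name to a column fingerprint, and the diff emits synthetic create/alter/drop events. This illustrates the approach, not Avito's actual implementation.

```go
package main

import "fmt"

// schema maps table name -> a fingerprint of its column definitions.
type schema map[string]string

// diffSchemas compares the last recorded schema with the current one
// and emits a synthetic DDL event for every difference.
func diffSchemas(prev, cur schema) []string {
	var events []string
	for table, def := range cur {
		old, ok := prev[table]
		switch {
		case !ok:
			events = append(events, "CREATE "+table) // table appeared
		case old != def:
			events = append(events, "ALTER "+table) // columns changed
		}
	}
	for table := range prev {
		if _, ok := cur[table]; !ok {
			events = append(events, "DROP "+table) // table disappeared
		}
	}
	return events
}

func main() {
	prev := schema{"users": "id INT, name STRING"}
	cur := schema{
		"users":  "id INT, name STRING, email STRING", // column added
		"orders": "id INT",                            // new table
	}
	fmt.Println(diffSchemas(prev, cur))
}
```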
It is important to remember that CockroachDB is not a replacement for other database management systems. Each technology is designed to solve specific problems, so before designing a service, determine which system best fits your requirements.
Have you had any experience with CockroachDB? What qualities and parameters of a DBMS are important to you? Tell us about it in the comments to the article! I'm waiting for questions and clarifications there. Thank you for your time for this article!