Tarantool architecture analysis

Even if you've never heard of Tarantool, you've probably used it: seen ad banners targeted with profiles stored in Tarantool, ordered food whose delivery is coordinated by Tarantool, or logged into online banking and viewed a spending history served by Tarantool. The solution is actively used across many industries and scenarios, and the list of its successful applications keeps growing.

But this was not always the case: over 15 years, Tarantool has come a long way, with both successes and pitfalls.


My name is Sergei Ostanevich. I am the development manager of the Tarantool platform. In this article I will talk about what the Tarantool architecture is today, how it has changed, and what advantages Tarantool now provides for architects and developers.

The beginning of the journey: a platform for developers

Tarantool was created in 2008 as an internal project for the MoiMir@Mail.ru social network. By that time, conventional disk-based databases could no longer cope with the flow of data (user sessions, profiles, and more) coming from all Mail.ru services. So we decided to build our own key-value storage system.

The product was developed and used internally for about two years; in 2010 we decided to release Tarantool as open source.

By the time Tarantool went beyond the scope of an internal solution (still at the very dawn of the project), we had already implemented several critical features.

  1. Replication support to preserve data integrity even in the event of disasters. Tarantool could stream data from one server to replica instances on other servers, so the data stayed available through virtually any failure, although switching over to a replica was still a somewhat involved procedure. For multi-threaded processing we adopted a “lightweight” ideology: the fiber paradigm, built on cooperative multitasking. Programs must be written so that they explicitly yield control to neighboring fibers (see the sketch after this list).
  2. Transition to LuaJIT as the language for stored procedures. We had been using the plain Lua interpreter to run client stored procedures. In parallel, we studied LuaJIT, which is not just a Lua interpreter but a just-in-time compilation environment; in some respects its generated code even rivaled the output of C compilers. As a result, we switched to LuaJIT.
  3. The decision to ship Tarantool as a LuaJIT runtime. After the switch, we decided to expose Tarantool itself as a full LuaJIT interpreter, hoping this would make it possible to implement virtually any service or solution on top of Tarantool.
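
Here is a minimal sketch of the fiber paradigm, assuming Tarantool's standard fiber module (the worker function and its output are purely illustrative):

    local fiber = require('fiber')

    local function worker(name)
        for i = 1, 3 do
            print(name, 'step', i)
            -- Explicitly hand control over to neighboring fibers;
            -- without such yields, a busy fiber would monopolize
            -- the thread.
            fiber.yield()
        end
    end

    fiber.create(worker, 'fiber-1')
    fiber.create(worker, 'fiber-2')

Because scheduling is cooperative, the two workers interleave only at the explicit yield points, which is what makes fibers so much cheaper than preemptive threads.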

For scaling, Tarantool initially supported master-replica replication, with an arbitrary number of replicas.
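
Configuring such a topology takes only a few box.cfg options; here is a sketch (the URI and credentials are illustrative). On a replica:

    -- Point the replica at the master and make it read-only.
    box.cfg{
        listen = 3302,
        replication = {'replicator:password@master_host:3301'},
        read_only = true,  -- all writes must go through the master
    }

Adding more replicas is simply a matter of starting more instances with the same replication source.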

Over time, we implemented a multiplexed write-ahead log (WAL) that could store records from several nodes at once. This is how we got master-master replication. Along with its advantages, this implementation brought a number of difficulties: if two writes land on the same key on two nodes simultaneously, a conflict inevitably occurs at the replication level, and both masters can grind to a halt. Such conflicts can be mitigated with various techniques, but in our case the search for such solutions was unjustified: we had hoped master-master replication would give us horizontal write scaling, but in practice it cannot. The reason is that the masters replicating writes “to each other” create significant additional write load.

As a result, we abandoned master-master replication and wrote a separate database sharding module, VShard, to provide horizontal scaling. It splits the database into a predetermined number of buckets, and the buckets are distributed across replica sets (each a set of three replicas with one elected master). This implementation guarantees fault tolerance and data availability even if one of the replica sets fails.
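
A hypothetical write routed through VShard might look like this (the key, function name, and payload are invented; the sketch assumes a configured router and a 'put' function registered on the storage nodes):

    local vshard = require('vshard')

    -- Map the application key to one of the fixed set of buckets.
    local bucket_id = vshard.router.bucket_id_strcrc32('customer:42')

    -- Route the write to the master of the replica set that owns
    -- the bucket.
    local ok, err = vshard.router.callrw(bucket_id, 'put',
        {bucket_id, 'customer:42', {name = 'Alice'}})

The router keeps a bucket-to-replica-set map, so the application never needs to know which physical node owns the data.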

As a result, having gone down this long path of improvements, we hoped that:

  • our implementation could be used to solve any business problem;
  • an ecosystem of components could easily grow on top of the platform, helping build new solutions;
  • Open Source would expand the audience and the market.

However, after analyzing the real state of affairs, we came to the understanding that:

  • because customer requests were similar, the solutions we developed were largely alike, yet each was customized for a specific client, which produced a “zoo” of solutions;
  • these custom solutions did not interact well with each other;
  • working with the platform required a staff of engineers with programming skills;
  • external contributions to the open-source version of Tarantool were close to zero.

As a result, we came up with the idea of creating a ready-made solution that would offer the necessary capabilities without requiring development from scratch: a simple setup before use would be enough. This is how Tarantool Data Grid was born.

Tarantool Data Grid

Tarantool Data Grid is a universal solution for working with many data sources and scenarios: ETL, MDM, CDC. It can aggregate data from disparate systems, transform it, and feed it directly into applications.

In Tarantool Data Grid we have managed to implement many technologies and solutions, including:

  • a data model;
  • a GraphQL-based query language;
  • the corresponding APIs;
  • an additional layer of roles and users;
  • a task scheduler, and more.

This implementation effectively made Tarantool Data Grid a “Swiss Army knife” among tools, capable of solving many problems in different scenarios.
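
For illustration, a query against a Data Grid data model might look like the following GraphQL sketch (the types and field names are invented, not part of the product):

    # Fetch a customer together with related orders in one request.
    query {
      customer(id: "42") {
        name
        orders {
          id
          amount
        }
      }
    }

The appeal of a GraphQL-style interface is that the client names exactly the fields it needs, and traversal of related entities is handled by the data model rather than by hand-written join code.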

At the same time, the significant expansion of the product’s functionality made its support and monitoring much harder: in such a large conglomerate of technologies, it was very easy to lose sight of errors, bugs, misconfigurations, and other nuances, and keeping performance at a consistent level became difficult. Moreover, we saw that practically no clients used all of Tarantool Data Grid’s functionality: out of the large stack of capabilities, users typically relied on only a small part, and chose Tarantool Data Grid because it offered all the necessary functions “in one box.”

Improving Tarantool UX

Simultaneously with the development of Tarantool Data Grid, we came to understand the need for a GUI (graphical user interface). We called the result Cartridge.

We also developed a technique called roles for writing distributed applications that run on a cluster. Each role is responsible for a specific set of tasks; each application instance can implement one or more roles, and a cluster can run several roles of the same type.

The algorithm for working with roles is simple:

  • a small fragment of Lua code describes the role;
  • the configuration specifies which node or set of nodes the role applies to;
  • the settings are distributed throughout the cluster;
  • the role is launched only on those nodes where the settings enable it.
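
A sketch of what such a role module might look like, following the Cartridge role callback convention (the cache logic itself is invented):

    local role = {}

    role.role_name = 'app.roles.cache'

    function role.init(opts)
        -- Runs on every node where the role is enabled.
        if opts.is_master then
            -- Storage owned by this role (illustrative).
            box.schema.space.create('cache', {if_not_exists = true})
        end
    end

    function role.stop()
        -- Cleanup when the role is disabled on this node.
    end

    function role.validate_config(conf_new, conf_old)
        return true
    end

    function role.apply_config(conf, opts)
        -- React to cluster-wide settings delivered to this node.
    end

    return role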

But Tarantool still had some pitfalls. The problem was that the UI, the roles, and the rest of this machinery ran on the same nodes that executed application code and stored data, that is, on the cluster already busy with client work. This led to “storms”: situations in which, against a backdrop of poor connectivity, shards within the cluster began a seemingly endless reconfiguration with constant attempts to switch masters. The shards stopped seeing each other and stopped responding to coordinators and neighbors; the cluster was occupied exclusively with reconfiguration, and the process could only be stopped by shutting the cluster down. This was all the more painful because the clusters suffering from such storms were the ones serving user load.

As a result, we decided to replace Cartridge, with its problems, with the new Cluster Manager: a centralized solution for managing large clusters through a GUI. We made it a stand-alone application: it monitors the cluster and reacts to its changes, but does not affect the cluster itself.

In addition, we added support for a declarative definition of the cluster config: the entire cluster is described in one place, listing which nodes run where. The role-based approach to describing applications remained unchanged.
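
As a rough illustration, such a declarative description might look like the following YAML sketch (the group, replica set, and instance names, as well as the URIs, are invented):

    groups:
      storages:
        replicasets:
          storage-a:
            instances:
              storage-a-1:
                iproto:
                  listen:
                    - uri: '10.0.0.1:3301'
              storage-a-2:
                iproto:
                  listen:
                    - uri: '10.0.0.2:3301'

The whole topology lives in one document, so adding a node becomes an edit to the config rather than a manual procedure on each machine.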

One of the main advantages of Cluster Manager is its centralized configuration storage. With this approach, all changes are stored on a separate small cluster of several nodes, while the remaining nodes subscribe to changes and receive only the parts addressed to them. This eliminates the need for every node to parse the entire config, removing a significant share of problems and unnecessary load.

Further development of the platform

As part of developing the Tarantool platform, we did not limit ourselves to creating new products or interfaces. We also systematically made significant improvements at the technology level, expanding the range of tasks the platform could handle. One of these innovations was synchronous replication.

The main disadvantage of asynchronous replication is that the client can receive a response as soon as the write lands on the master, before it reaches the replica. This carries risks:

  • there is no guarantee that the data reaches the replica;
  • in the event of a failure on the master, some of the data may be lost (if it has not had time to reach the replica);
  • even with a healthy master, reading from a replica can return stale data (if the writes have not yet reached the replica).

Synchronous replication eliminates these risks: the master first receives confirmation that the data has successfully reached the replicas, and only then responds to the client. Thus, even when reading from a replica, the client can be sure it is getting up-to-date, consistent data.
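
In Tarantool, synchronous replication is enabled per space; here is a minimal sketch, assuming a replica set is already configured (the space name is illustrative):

    -- A write to a synchronous space returns to the client only
    -- after a quorum of replicas has confirmed it.
    box.cfg{
        replication_synchro_quorum = 2,   -- master plus at least one replica
        replication_synchro_timeout = 5,  -- seconds to wait for the quorum
    }

    box.schema.space.create('accounts', {is_sync = true, if_not_exists = true})

Non-critical spaces can stay asynchronous, so the latency cost of waiting for the quorum is paid only where the stronger guarantee is actually needed.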

On top of synchronous replication, we implemented the Raft consensus algorithm for distributed systems. It lets the participants jointly decide whether a change in the state of the database has taken place (whether a write succeeded or not). To reach consensus, Raft first elects a leader, which:

  • manages the distributed log;
  • accepts requests from clients;
  • replicates requests to other servers in the cluster.

If a leader fails, a new leader is elected in the cluster.

Thanks to this algorithm, Raft provides strict consistency guarantees. At the same time, it remains possible to read “past the master”: for example, the master can be dedicated to writes while reads are served from replicas. To keep such reads consistent, we also implemented linearizable reads.
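
A sketch of how these pieces are switched on, reusing the hypothetical accounts space from the previous sketch (the option values are illustrative):

    -- Let this node take part in Raft leader election.
    box.cfg{
        election_mode = 'candidate',
        replication_synchro_quorum = 2,
        memtx_use_mvcc_engine = true,  -- needed for linearizable transactions
    }

    -- A linearizable read: the transaction waits until the node has
    -- confirmed it sees the latest committed state before returning.
    box.begin{txn_isolation = 'linearizable'}
    local value = box.space.accounts:get{42}
    box.commit()

With election_mode set on all nodes, a failed leader is replaced automatically, and linearizable transactions make reads from replicas safe.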

Tarantool Clusters Federation

One of the fundamental tasks in the development of the platform and its capabilities was and remains to ensure fault tolerance and disaster resistance of solutions.

Synchronous replication requires that each replica set have at least three nodes: this is necessary so that if one of the nodes goes offline, the other two can elect a leader and continue to work stably.

However, most of our clients use two data centers, which makes meeting these requirements somewhat difficult:

  1. Splitting three nodes across two data centers is a nontrivial exercise.
  2. If the data center hosting two of the nodes goes down, the remaining node cannot elect a leader, and the system suffers significant damage.

To ensure resilience in this situation, we developed Tarantool Clusters Federation, a tool that increases the disaster resilience of an infrastructure by switching between independent Tarantool clusters in an existing architecture without downtime.

One of the key elements of Tarantool Clusters Federation is the coordinator. It continuously monitors the state of two clusters (one in active mode, the other in passive mode) with asynchronous replication configured between them, and in the event of a failure in the active data center it automatically switches the system to the backup cluster.

Notably, Tarantool Clusters Federation has already had its baptism by fire in real-world conditions and performed successfully.

The same coordinator-driven switching pattern can be used in other scenarios. For example, you can temporarily disable replication and test a new version of the backend and application on the passive cluster. The active cluster is not affected, which eliminates any risk to the production load, and updates can be rolled out and verified safely and without downtime.

From platform to box

Having brought the Tarantool platform to a high level of maturity, we decided, without losing focus on the platform itself, to concentrate on individual packaged products that solve specific user problems.

This is how a whole pool of new tools appeared in the Tarantool portfolio.

Tarantool DB

Tarantool DB is a solution for easily creating and operating a data store, accessible via the Redis and IPROTO protocols.

One of the key advantages of Tarantool DB is its stronger reliability guarantees: the solution provides synchronous replication at the cluster level, support for Tarantool Clusters Federation, a write-ahead log, and other features.

Moreover, Tarantool DB implements features and functions that position it as an alternative to Redis (the product delivery includes a drop-in replacement mechanism for migrating from Redis to Tarantool).

Notably, Tarantool DB is a classic boxed solution: you can buy it, deploy it, and start using it immediately, without any work at the code level.

Tarantool CDC (Change Data Capture)

Tarantool Change Data Capture is a replicator that pulls data from the primary DBMS without adding load to it.

With Tarantool CDC there is no need for licenses for additional products, and you can reduce maintenance costs by collecting and processing data online from most popular relational DBMSs.

Using our replicator, you can organize data processing using message brokers, build a cache or data mart.

Tarantool CDC provides high performance and real-time data delivery, without the need to stop the source database.

Moreover, Tarantool CDC is suitable for data processing systems handling over 100 thousand TPS.

Tarantool Graph DB

Tarantool Graph DB is a graph-vector database that can be used to model complex data structures.

Graph DB can thus be used in scenarios that require a wide range of operations on graph data, operational and security features, processing of vector data attached to graph vertices, and more.

Tarantool Queue Enterprise

Tarantool Queue Enterprise is a distributed in-memory message queuing system that lets you create queues with different architectures for different use cases, depending on the business task.

Tarantool Queue Enterprise supports two queue options:

  1. SQ is a sharded queue with support for delayed messages and configurable priorities. Suitable for order processing, routing and load balancing, content and task management, and other scenarios.
  2. MQ is a broker with strict message ordering. Suitable for high-load systems: Big Data processing, stream processing, and real-time workloads.

Conclusions and plans for the future

Our experience shows that the development of a large platform is always a long and difficult path, on which there are not only the right decisions, but also mistakes. At the same time, the success of the development of such a product always depends on the ability to quickly adapt to business needs, emerging challenges, and make bold decisions.

At the same time, it is not necessary to focus the resources of the entire team on a single project: doing so often constrains the team and cuts it off from new impulses for development.

Realizing this, alongside continuing to support and develop the Tarantool platform, we began building packaged products. With this shift we were able to flatten the learning curve of Tarantool, making the tools more intuitive and transparent for users without compromising the capabilities of the products.
