AERODISK vAIR architecture, or the peculiarities of national cluster building

Hello, Habr readers! We continue introducing you to the Russian hyperconverged system AERODISK vAIR. This article focuses on the system's architecture. In the previous article we examined our ARDFS file system; in this one we will walk through all the main software components that make up vAIR and their tasks.

We will describe the architecture from the bottom up: from storage to management.

File system: ARDFS + Raft Cluster Driver

The foundation of vAIR is the distributed file system ARDFS, which combines the local disks of all cluster nodes into a single logical pool. On top of this pool, virtual disks are built from 4 MB virtual blocks using one fault-tolerance scheme or another (replication factor or erasure coding). A more detailed description of how ARDFS works is given in the previous article.
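
The capacity cost of the two fault-tolerance schemes can be illustrated with a small sketch (this is not vAIR code; the function names and the 4+2 erasure-coding layout are just illustrative examples):

```python
# Illustrative sketch: what fraction of raw capacity remains usable
# under the two fault-tolerance schemes ARDFS applies to its 4 MB blocks.

def usable_fraction_replication(replication_factor: int) -> float:
    """With RF=N, every 4 MB virtual block is stored N times."""
    return 1.0 / replication_factor

def usable_fraction_erasure(data_chunks: int, parity_chunks: int) -> float:
    """With EC k+m, each block is split into k data and m parity chunks."""
    return data_chunks / (data_chunks + parity_chunks)

# RF=2 keeps 50% of raw capacity; EC 4+2 keeps about 67%.
print(usable_fraction_replication(2))
print(usable_fraction_erasure(4, 2))
```

This is the usual trade-off: erasure coding wastes less space for comparable protection, at the cost of extra CPU work (which is why MasterIO, described below, cares about processor resources).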

Raft Cluster Driver is an internal ARDFS service that solves the problem of distributed and reliable storage of file system metadata.

ARDFS metadata falls into two broad classes:

  • notifications – information about operations on storage objects and about the objects themselves;
  • service information – lock state and the storage configuration of the nodes.

The RCD service distributes this data. It automatically assigns one node the leader role; the leader's task is to collect metadata and disseminate it across the nodes, and it is the single source of truth for this information. In addition, the leader runs a heartbeat, i.e. checks the availability of all storage nodes (this has nothing to do with the availability of virtual machines; RCD is purely a storage service).

If for any reason the leader becomes unavailable to one of the ordinary nodes for more than one second, that node initiates a re-election by polling the other ordinary nodes about the leader's availability. If a quorum is gathered, a new leader is elected. When the former leader "wakes up", it automatically becomes an ordinary node again, because the new leader sends it the corresponding command.
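
The failover logic above can be sketched roughly as follows. This is a deliberately simplified illustration, not vAIR code: the class, the method names and the quorum handling are invented, and real Raft-style election details (terms, log replication) are omitted:

```python
# Rough sketch of the RCD failover behavior described above (illustrative).
import time

HEARTBEAT_TIMEOUT = 1.0  # an ordinary node tolerates 1 s of leader silence

class Node:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers            # the other ordinary nodes
        self.role = "follower"
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        """Called whenever a heartbeat from the current leader arrives."""
        self.last_heartbeat = time.monotonic()
        if self.role == "leader":
            # a former leader that hears the new leader steps down
            self.role = "follower"

    def check_leader(self, votes_for_reelection):
        """Called periodically; triggers re-election on leader silence."""
        if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            quorum = len(self.peers) // 2 + 1   # majority of ordinary nodes
            if votes_for_reelection >= quorum:
                self.role = "leader"
        return self.role
```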

The logic of RCD is not new; many commercial and free third-party solutions follow it. But those solutions (like the existing open-source file systems) did not suit us: they are quite heavyweight and very hard to optimize for our simple tasks, so we just wrote our own RCD service.
It may seem that the leader is a bottleneck that could slow down work in large clusters of hundreds of nodes, but this is not so. The described process happens almost instantly and "weighs" very little, since we wrote it ourselves and included only the most necessary functions. Moreover, it happens completely automatically, leaving only messages in the logs.

MasterIO – the multi-threaded I/O management service

Once an ARDFS pool with virtual disks is created, it can be used for I/O. At this point a question specific to hyperconverged systems arises: how much system resources (CPU/RAM) can we dedicate to I/O?

In classical storage systems this question is not so acute, because storage's only task is to store data (most of a storage system's resources can safely be given to I/O), whereas a hyperconverged system must, in addition to storage, also run virtual machines. Accordingly, a hyperconverged system needs CPU and RAM primarily for the virtual machines. So what about I/O?

To solve this problem, vAIR uses the MasterIO I/O management service. Its task is simple – "take everything and divide it up": it reserves a given amount of system resources for input/output and, based on that, starts a corresponding number of I/O threads.

Initially, we wanted a "very smart" mechanism for allocating I/O resources: if there is no storage load, system resources can be used by the virtual machines, and when load appears, those resources are "softly" taken back from the VMs within predefined limits. This attempt ended in partial failure. Tests showed that if the load grows gradually, everything is fine: resources (marked as reclaimable) are gradually withdrawn from the VMs in favor of I/O. But sharp bursts of storage load provoke a not-so-"soft" withdrawal of resources from the virtual machines; queues build up on the processors and, as a result, nobody wins: the VMs hang and there are no IOPS either.

Perhaps we will return to this problem in the future, but for now we have implemented I/O resource allocation the good old-fashioned way: by hand.

Based on the sizing data, the administrator pre-allocates a certain number of CPU cores and amount of RAM to the MasterIO service. These resources are allocated exclusively, i.e. they cannot be used for VM needs until the admin allows it. Resources are allocated evenly: the same amount of system resources is taken from each cluster node. MasterIO is interested primarily in processor resources (RAM matters less), especially if erasure coding is used.
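
The even, exclusive reservation can be sketched like this (a hypothetical illustration: the data structure and function name are invented, not the actual vAIR implementation):

```python
# Illustrative sketch of MasterIO's manual resource reservation:
# the same amount of CPU/RAM is removed from every node's VM pool.

def reserve_for_masterio(nodes, cores_per_node, ram_gb_per_node):
    """nodes: list of dicts with 'name', 'free_cores', 'free_ram_gb'."""
    # first verify every node can afford the reservation
    for node in nodes:
        if (node["free_cores"] < cores_per_node
                or node["free_ram_gb"] < ram_gb_per_node):
            raise ValueError(f"node {node['name']}: not enough free resources")
    # then apply it evenly; these resources become unusable by VMs
    for node in nodes:
        node["free_cores"] -= cores_per_node
        node["free_ram_gb"] -= ram_gb_per_node
        node["masterio_cores"] = cores_per_node
        node["masterio_ram_gb"] = ram_gb_per_node
    return nodes
```

The all-or-nothing check mirrors the "evenly from each node" rule in the text: a reservation that only some nodes can satisfy is rejected outright.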

If a sizing mistake was made and we gave MasterIO too many resources, the situation is easily fixed by returning those resources to the VM resource pool. If the resources are idle, they return to the VM pool almost immediately; if they are in use, you will have to wait a while until MasterIO releases them softly.

The reverse situation is trickier. If we need to increase the number of cores for MasterIO, and they are busy with VMs, we have to "negotiate" with the VMs, that is, reclaim the cores manually, since in automatic mode, under a sharp burst of load, this operation is fraught with VM freezes and other capricious behavior.

Accordingly, a lot of attention needs to be paid to sizing the I/O performance of hyperconverged systems (not only ours). A little later, in one of the next articles, we promise to consider this issue in more detail.


Aist hypervisor

The Aist hypervisor ("aist" is Russian for "stork") is responsible for running virtual machines in vAIR. It is based on the time-tested KVM hypervisor. Quite a lot has already been written about how KVM works, so there is no particular need to repeat it here; we will just note that all the standard KVM functions are preserved in Aist and work fine.

Here, therefore, we describe the main differences from standard KVM that we implemented in Aist. Aist is part of the system (a pre-installed hypervisor) and is controlled from the common vAIR console via the web GUI (Russian and English versions) and SSH (obviously, English only).

In addition, hypervisor configurations are stored in the distributed ConfigDB database (more on it a little later), which also serves as a single point of control. That is, you can connect to any node in the cluster and manage everything, with no need for a separate management server.

An important addition to the standard KVM functionality is the HA module we developed. It is the simplest implementation of a high-availability virtual machine cluster, automatically restarting a virtual machine on another cluster node if its node fails.
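
The core idea of such an HA module can be sketched as follows. This is a naive illustration, not the actual vAIR logic: the placement policy here (pick the least-loaded surviving node) and all names are assumptions:

```python
# Simplified sketch of HA failover: VMs from a failed node are
# restarted on surviving nodes, here placed least-loaded-first.

def failover(vms, nodes, failed_node):
    """vms: {vm_name: host}; returns the new placement after failed_node dies."""
    survivors = [n for n in nodes if n != failed_node]
    # current VM count per surviving node
    load = {n: sum(1 for h in vms.values() if h == n) for n in survivors}
    placement = {}
    for vm, host in sorted(vms.items()):
        if host == failed_node:
            target = min(survivors, key=load.get)  # least-loaded survivor
            load[target] += 1
            placement[vm] = target                 # VM restarts here
        else:
            placement[vm] = host                   # untouched VMs stay put
    return placement
```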

Another useful feature is mass deployment of virtual machines (relevant for VDI environments), which automates VM deployment, distributing the VMs across nodes depending on their load.

VM distribution across nodes is the basis for automatic load balancing (à la DRS). This feature is not yet available in the current release, but we are actively working on it and it will definitely appear in one of the next updates.

The VMware ESXi hypervisor is optionally supported; currently this is implemented over the iSCSI protocol, with NFS support also planned for the future.

Virtual switches

For the software implementation of switches, a separate component is provided – Fractal. As in our other components, we go from simple to complex, so the first version implements simple switching, while routing and firewalling are left to third-party devices. The principle of operation is standard: the server's physical interface is connected by a bridge to a Fractal object – a port group – and the port group, in turn, to the desired virtual machines in the cluster. VLANs are supported, and VxLAN support will be added in one of the next releases. All created switches are distributed by default, i.e. spread across all cluster nodes, so which switch a VM connects to does not depend on the node the VM runs on – it is solely the administrator's decision.
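
The objects described above can be modeled in a few lines. This is purely an illustrative model, not Fractal's API; the class, its fields and the sample names are invented:

```python
# Illustrative model of Fractal's wiring: a physical NIC is bridged
# into a port group, and VMs anywhere in the cluster attach to it.

class PortGroup:
    def __init__(self, name, vlan_id=None):
        self.name = name
        self.vlan_id = vlan_id   # optional VLAN tag for the group
        self.uplinks = []        # physical interfaces bridged in
        self.vms = []            # attached virtual machines

    def add_uplink(self, phys_iface):
        self.uplinks.append(phys_iface)

    def attach_vm(self, vm_name):
        # the VM may run on any node: the switch is distributed
        self.vms.append(vm_name)

pg = PortGroup("pg-prod", vlan_id=100)
pg.add_uplink("eth0")          # bridge the server's physical interface
pg.attach_vm("vm-web-01")      # administrator's decision, not the node's
```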

Monitoring and statistics

The component responsible for monitoring and statistics (working title: Monica) is, in fact, a reworked clone from our ENGINE storage system. It proved itself well in its time, so we decided to use it in vAIR with slight modifications. Like all other components, Monica runs and stores data on all cluster nodes at once.

Monica's responsibilities can be outlined as follows:

Data collection:

  • from hardware sensors (whatever the hardware can report over IPMI);
  • from vAIR logical objects (ARDFS, Aist, Fractal, MasterIO and others).

Storing the collected data in a distributed database.

Interpreting the data in the form of:

  • logs;
  • alerts;
  • graphs.

External interaction with third-party systems via SMTP (sending email alerts) and SNMP (integration with third-party monitoring systems).
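
The SMTP side of that interaction is standard; a minimal sketch using Python's stdlib could look like this (the server address, sender and recipient are made-up examples, and this is not Monica's actual code):

```python
# Hedged sketch of sending an email alert over SMTP (illustrative).
import smtplib
from email.message import EmailMessage

def build_alert(severity, text, sender, recipient):
    """Assemble an alert email; kept separate so it can be tested offline."""
    msg = EmailMessage()
    msg["Subject"] = f"[vAIR {severity}] storage alert"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(text)
    return msg

def send_alert(msg, smtp_host="mail.example.com"):
    """Hand the message to the configured SMTP relay (hypothetical host)."""
    with smtplib.SMTP(smtp_host) as s:
        s.send_message(msg)

alert = build_alert("WARN", "node-3 heartbeat lost",
                    "vair@example.com", "ops@example.com")
```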

Distributed configuration database

In the previous sections it was mentioned that much of the data is stored on all cluster nodes at once. To organize this storage method, a special distributed database, ConfigDB, is provided. As the name implies, it stores the configurations of all cluster objects: the hypervisor, virtual machines, the HA module, switches and the file system (not to be confused with the FS metadata database – that is a separate database), as well as statistics. This data is stored synchronously on all nodes, and its consistency is a prerequisite for stable vAIR operation.

An important point: although ConfigDB is vital for vAIR operation, its failure, while it will stop the cluster, does not affect the consistency of the data stored in ARDFS, which in our opinion is a plus for the reliability of the solution as a whole.

ConfigDB is also a single point of management: you can connect to any cluster node by IP address and fully manage all the nodes of the cluster, which is quite convenient.

In addition, for access by external systems, ConfigDB provides a RESTful API through which you can set up integration with third-party systems. For example, we recently did pilot integrations with several Russian solutions in the fields of VDI and information security. When the projects are completed, we will be happy to publish the technical details here.
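
A client of such an API might look like the sketch below. Note that the endpoint path, port and authentication scheme here are invented for illustration; consult the vAIR documentation for the real API:

```python
# Hypothetical RESTful API client sketch (endpoint and auth are made up).
import json
import urllib.request

def build_request(node_ip, token):
    """Any cluster node works, since ConfigDB is a single point of management."""
    url = f"https://{node_ip}/api/v1/vms"      # invented endpoint path
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"})

def list_vms(node_ip, token):
    """Fetch and decode the VM list from the cluster."""
    with urllib.request.urlopen(build_request(node_ip, token)) as resp:
        return json.load(resp)
```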

The whole picture

As a result, we have two versions of the system architecture.

In the first – the main scenario – our KVM-based Aist hypervisor and Fractal software switches are used.

Scenario 1. Native

In the second, optional scenario – when you want to use the ESXi hypervisor – the scheme is somewhat more complicated. To use ESXi, it must be installed in the standard way on the cluster's local drives. Then, on each ESXi node, the vAIR MasterVM virtual machine is installed, containing a special vAIR distribution built to run as a VMware virtual machine.

ESXi passes all free local disks directly (via passthrough) to MasterVM. Inside MasterVM, these disks are formatted with ARDFS in the usual way and served back out (that is, back to ESXi) over the iSCSI protocol (with NFS to come in the future) via interfaces dedicated to ESXi. Accordingly, the virtual machines and software networking in this case are provided by ESXi.

Scenario 2. ESXi

So, we have covered all the main components of the vAIR architecture and their tasks. In the next article we will talk about the already implemented functionality and our plans for the near future.

We look forward to your comments and suggestions.
