An example of the search for a “dream storage” for a large analytical platform

To launch and operate highly loaded IT solutions with petabytes of data under management, you need a mature solution that lets you manage resources flexibly. One of the critical aspects of such a solution is the separation of Compute & Storage – splitting infrastructure resources into those used for computation and those used for storage. If this separation is not implemented in a large project, the infrastructure risks becoming a “suitcase without a handle”: resource efficiency will be low, while the complexity of managing resources and environments will be high. Using the example of the SberData team and their corporate analytical platform, I will describe when the separation of Compute & Storage is required and how to implement it as natively as possible.

The article is based on the talk at the VK Data Meetup “How to separate Compute & Storage in Hadoop and not drown in an avalanche of migrations.”

Let's dive into the context: an enterprise analytics platform

The SberData team is developing an internal corporate analytical platform (CAP), which covers the data storage and analytics needs of Sber’s business blocks and tribes.

Currently, the CAP platform has the following parameters:

  • stores 175 petabytes of data;

  • includes 86 different platform services – including centralized security services (authentication, authorization, audit), logging, monitoring, as well as services for storing, processing and distributing data;

  • keeps data downloaded from 780 sources up to date;

  • downloads up to 18 terabytes of data daily;

  • generates and delivers more than 1 million events per second.

At the end of 2023, the number of users of the analytical platform exceeded 22,000, and 187 full-fledged applied analytical solutions were implemented on its basis, some of which process petabytes of data. All users are exclusively internal – 93% of Sber tribes trust the platform and use it on a regular basis.

The platform offers a diverse technology stack for building applied solutions, covering different tasks and incoming requirements. For data storage, it features its own builds based on the Hadoop, Greenplum, Postgres, and ClickHouse technology stacks. In this article I will focus on our experience with Hadoop, the big bumps along the way, and where we ultimately ended up.

Starting Point: Monolithic Hadoop

We started like everyone else – with a monolithic Hadoop installation. The tool is popular and well studied. We understood from the outset that it had a fundamental limitation: a single Hadoop cluster can address only a limited number of objects – roughly 400 million (more on this below).

In theory, this should not have become a critical barrier: with a block size of 256 MB, one Hadoop cluster can hold up to 95 PB of data, so two clusters would be enough for all of our 175 PB. The important caveat is that this is all “in theory.”

Life turned out to be more prosaic, and there were more restrictions. We ran into the fact that replication in monolithic Hadoop “eats” the usable capacity three times over (space has to be reserved for replicas, and erasure coding had not yet been implemented), and even advanced dynamic management of resource queues does not prevent competition for resources. In addition, a huge 95 PB cluster (almost 1,500 nodes) is difficult to maintain, and the likelihood of data loss grows in proportion to the number of nodes. The unevenness of resource consumption over the course of the day is not even worth mentioning.

The number of open questions and the dissatisfaction of our consumers kept growing, and we had to decide how to develop the product further with this in mind. After a series of meetings and discussions of a new architecture for storing and processing data in Hadoop, we decided to abandon the single large communal Hadoop cluster in favor of separate small clusters serving the needs of individual business units and even tribes.

This architecture has several advantages:

  • you can flexibly manage resources and order dedicated infrastructure;

  • competition for resources is reduced: the cluster does not cover the entire company, but, for example, only a separate business block or tribe;

  • native isolation of working environments is ensured;

  • fault tolerance and storage reliability are increased: data can be distributed across different clusters.

However, this implementation was not without drawbacks: in practice we ran into a number of problems and limitations. Together with decentralization we got:

  • data duplication;

  • the number of federations grows quadratically with the number of clusters, since every pair of clusters that exchanges data needs its own link (see the sketch after this list);

  • increased cost of support;

  • increased load on transmission channels caused by sending data between clusters;

  • an increase in the number of equipment racks;

  • growing equipment imbalance;

  • reduced equipment utilization;

  • less uniform resource consumption.
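
To make the federation point concrete: under the simplifying assumption that every pair of clusters exchanging data needs its own federation link, the number of links grows roughly as n(n-1)/2. The small Python sketch below only illustrates that arithmetic; the real topology of the platform is, of course, more selective.

```python
# Back-of-the-envelope: growth of pairwise federation links with the
# number of clusters, assuming every pair that exchanges data needs
# its own link (a simplifying assumption for illustration only).

def federation_links(clusters: int) -> int:
    """Number of distinct cluster pairs: n * (n - 1) / 2."""
    return clusters * (clusters - 1) // 2

for n in (5, 20, 60):
    print(f"{n:>3} clusters -> up to {federation_links(n)} federation links")
#   5 clusters -> up to 10 federation links
#  20 clusters -> up to 190 federation links
#  60 clusters -> up to 1770 federation links
```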

At first glance, distributed architecture has more problems than monolithic architecture. But its advantages largely outweighed the difficulties that arose. Moreover, we understood how some of them could be solved.

Interim problem solving

In the process of implementing decentralized storage, it became obvious that some of the problems mentioned could not be solved within this concept at all. But we were not ready to radically change the concept and give up its advantages, so at the first stage we started with “intermediate” measures:

  • We solved the problems of data duplication and a large number of federations by introducing full virtualization between Hadoop clusters.

  • The introduction of virtualization allowed consumers to stop storing data “for later,” which led to a 4-fold reduction in the load on channels.

  • We reduced the high cost of support by partially automating the support teams’ routine activities.

However, such optimization did not solve the problems associated with equipment imbalance and inefficient resource consumption.

Thus, we needed a solution that would allow us to:

  • maintain the benefits of decentralization;

  • increase equipment utilization;

  • reduce the cost of data storage;

  • expand data access interfaces;

  • minimize work regarding integration into the platform and migration from Hadoop.

From analysis to selection of the final concept

Having all the cards in hand, we began to study current trends and existing technological solutions that could help in our case. We came to the conclusion that Compute and Storage need to be separated: this solves the remaining problems of decentralization and does not require a radical overhaul of the platform, including the technology stack in use.

Thus, a new conceptual solution was found, and we began to implement it. Here I will dwell in detail only on the Storage side of the Compute and Storage separation. If the material gets enough of a response, I will describe in a separate article how we organized the Compute side.

Where did we start?

The data storage solution in our platform has evolved quite dynamically, in accordance with the changing requirements of users.

In 2020, with the arrival of the first users, a request appeared for more flexible allocation of resources on Hadoop clusters. The main goal was to be able to order only the resources that are needed, in the volume required for a specific task. Internal research showed that this could be achieved relatively painlessly with just two solutions – Ceph and Apache Ozone. However, Ceph did not fit because of the difficulty of integrating it into the existing platform, and Apache Ozone at that time was simply too “raw.” As a result, we decided to build our own Hadoop distribution.

By 2021, the corporate analytical platform had grown significantly, and so had the amount of data. At the same time, the imbalance between storage and compute resource consumption had become more acute. Dragging such “baggage” along was unjustified, so we decided to once again consider replacing classic Hadoop. This time the requirements for the desired solution were stricter: in addition to easy integration into the Hadoop ecosystem and high performance, it was important that the solution be free of Hadoop’s fundamental scaling limitations.

Several open-source solutions were considered, including Ceph, MinIO and Apache Ozone. Ceph and MinIO did not fit due to poor HDFS API support. Apache Ozone, however, was a pleasant surprise: by 2021 it had received a stable, fully functional version ready for use in an industrial environment. As a result, we ran a proof of technology that tested the concept of separating the storage and compute layers, and after receiving positive results, we began development to integrate Apache Ozone into the existing corporate analytical platform.

In 2022, a pilot solution was successfully tested in an industrial environment. In 2023, we brought Apache Ozone into production. That same year, the first customers started building their data products with Apache Ozone as their primary storage, and teams began migrating data from Hadoop to Apache Ozone.

Why Apache Ozone

Resolved HDFS Issues

Apache Ozone is an object store designed for analytics platforms that process large volumes of data. It originated in the Hadoop ecosystem and was conceived as an evolution of HDFS: the main driver for its development was the desire to overcome the architectural limitations inherent in HDFS while taking modern technological trends into account. It succeeded.

One of the main disadvantages of HDFS that prevents building very large clusters is limited scalability. With 128 MB blocks, an HDFS cluster can hold about 400 million objects: the NameNode keeps a metadata record for every file and block, and the size of that record is fixed and does not depend on the block size. As a result, whenever files are smaller than the block size, Hadoop runs into scalability problems even sooner – the object limit is exhausted long before the disks are full.
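
The scale of this limitation is easy to estimate from the numbers above. The sketch below (Python) simply multiplies the object limit by the block size; assuming every object occupies a full block is the best case, so real clusters with many small files hit the wall even earlier.

```python
# Rough upper bound on HDFS capacity: NameNode object limit x block size.
# The ~400 million figure is the limit quoted above; the best case assumes
# every object is a full block, so small files exhaust the limit sooner.

OBJECT_LIMIT = 400_000_000  # ~400 million files/blocks per NameNode

def max_capacity_pb(block_size_mb: int) -> float:
    """Best-case cluster capacity in PB for a given block size in MB."""
    return OBJECT_LIMIT * block_size_mb / 1024 ** 3  # MB -> PB (binary units)

print(f"128 MB blocks: ~{max_capacity_pb(128):.0f} PB per cluster")  # ~48 PB
print(f"256 MB blocks: ~{max_capacity_pb(256):.0f} PB per cluster")  # ~95 PB
```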

The Apache Ozone developers took a comprehensive approach to overcoming this problem:

  • In Ozone, data is written into 4 MB chunks, which are packed into 256 MB blocks. The blocks, in turn, are wrapped in 5 GB containers.

  • Metadata storage and management are split across different services: the DataNodes store metadata about chunks and blocks, the Storage Container Manager stores only metadata about containers, and the Ozone Manager manages the namespace. Each service thus handles only a small slice of the metadata.

  • The management services can work with metadata efficiently not only in RAM but also on disk. This architecture allows a very large number of objects to be stored in Ozone: for example, with about 3 TB of disk under the management services – Ozone Manager and Storage Container Manager – roughly 10 billion objects can be written to Ozone, which is enough to build large storage systems and solve the scaling problem (a back-of-the-envelope estimate follows this list).
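
To show how this hierarchy and the on-disk metadata change the arithmetic, here is a rough estimate in Python. The chunk/block/container sizes and the “about 3 TB of metadata disk ≈ about 10 billion objects” figure are taken from the text above; the per-object metadata size is an assumed value used only to make the calculation explicit.

```python
# Rough illustration of Ozone's storage hierarchy and metadata capacity.
# Chunk, block and container sizes come from the text; the per-object
# metadata size is an assumption used purely for illustration.

CHUNK_MB, BLOCK_MB, CONTAINER_GB = 4, 256, 5

chunks_per_block = BLOCK_MB // CHUNK_MB                  # 64 chunks per block
blocks_per_container = CONTAINER_GB * 1024 // BLOCK_MB   # 20 blocks per container
print(f"{chunks_per_block} chunks per block, {blocks_per_container} blocks per container")

# With metadata on disk rather than only in RAM, capacity scales with the
# size of the metadata disks. Assuming ~300 bytes of metadata per object:
META_DISK_TB = 3
BYTES_PER_OBJECT = 300  # assumed figure for illustration
objects = META_DISK_TB * 10**12 // BYTES_PER_OBJECT
print(f"~{objects / 1e9:.0f} billion objects fit in {META_DISK_TB} TB of metadata disk")
```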

The second disadvantage of HDFS is availability and fault tolerance. A classic Hadoop cluster has two NameNodes: the primary one handles all requests, and the standby takes over if the primary fails. Losing just two servers – and with them both NameNodes – makes the whole cluster unavailable.

Apache Ozone does not have this problem: the Storage Container Manager and Ozone Manager services keep their state consistent across replicas using the Raft consensus algorithm (the Apache Ratis implementation). This allows the number of replicas of each management service to be increased as needed.

At the same time, when working with Apache Ozone, one must keep in mind that there must be at least three master nodes. As the number of service replicas grows, the number of master nodes must be increased as well, but it should always remain odd (the quorum sketch below shows why).
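
The odd-number requirement follows directly from how Raft majorities work: availability depends on a quorum of replicas, so adding a fourth replica costs hardware without tolerating any extra failures. The sketch below is generic quorum arithmetic, not Ozone-specific code.

```python
# Why master node counts should be odd: Raft-based services stay available
# only while a majority (quorum) of replicas is alive, so even counts add
# cost without adding failure tolerance.

def raft_tolerance(replicas: int) -> tuple[int, int]:
    """Return (quorum size, number of replica failures tolerated)."""
    quorum = replicas // 2 + 1
    return quorum, replicas - quorum

for n in (3, 4, 5, 7):
    quorum, tolerated = raft_tolerance(n)
    print(f"{n} replicas: quorum {quorum}, tolerates {tolerated} failure(s)")
# 3 replicas: quorum 2, tolerates 1 failure
# 4 replicas: quorum 3, tolerates 1 failure  <- no gain over 3
# 5 replicas: quorum 3, tolerates 2 failures
```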

Easy integration with the Hadoop ecosystem

It was important for us that Apache Ozone could natively fit into the infrastructure built around Hadoop, including the centralized authentication, authorization, and administration services. There were no problems with this – the competencies in Hadoop and the administration tools used in the ecosystem turned out to be applicable to Apache Ozone as well. This native support for Hadoop ecosystem solutions made it possible to integrate Apache Ozone seamlessly into the existing infrastructure.

In the context of integration, it was also important that the solution natively support the HDFS API – the SberData tribe had written many applied solutions for the existing corporate analytical platform specifically for Hadoop, and it was important to keep working with them without the unnecessary complications of migrating those applications to the new storage.

Apache Ozone has such support. Moreover, if standard frameworks such as Spark or MapReduce are used, practically nothing needs to change in the application code: it usually comes down to minimal modifications that put the Ozone filesystem classes on the classpath and point jobs at the new storage (a rough example follows below). For the corporate analytical platform, this compatibility provided native federation not only with the ecosystem and infrastructure, but also with other Hadoop clusters.
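
As a rough sketch of what such a minimal change can look like for a Spark job, the PySpark snippet below switches from HDFS paths to Ozone paths through the Hadoop-compatible ofs filesystem. The service id, volume, bucket and paths are illustrative; in a real deployment they come from the cluster’s Ozone configuration, and the Ozone filesystem jar has to be available to Spark.

```python
# Sketch: pointing an existing Spark job at Apache Ozone via the
# Hadoop-compatible "ofs" filesystem instead of HDFS.
# Service id, volume, bucket and paths below are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-ozone-sketch")
    # Ozone's Hadoop-compatible filesystem class; the ozone-filesystem jar
    # must be on the Spark classpath (e.g. via --jars).
    .config("spark.hadoop.fs.ofs.impl",
            "org.apache.hadoop.fs.ozone.RootedOzoneFileSystem")
    .getOrCreate()
)

# Before: df = spark.read.parquet("hdfs://namenode:8020/data/events")
# After:  the same code – only the path scheme and prefix change.
df = spark.read.parquet("ofs://ozone-service-id/volume1/bucket1/data/events")

df.groupBy("event_type").count().write.mode("overwrite").parquet(
    "ofs://ozone-service-id/volume1/bucket1/reports/event_counts"
)
```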

Apache Ozone Performance

Given the large volume of data and the significant load on Hadoop, one of the key criteria for choosing a solution was performance. We tested it at the proof-of-technology stage together with the separation of the storage and compute layers. Several dozen nodes of an industrial configuration were allocated in the test environment, and the following were deployed and tested on them in turn:

  • a Hadoop cluster on all nodes, with storage and compute colocated on the same nodes;

  • an Ozone cluster on part of the nodes, with Spark over OpenShift on the rest.

We compared the performance of reading different volumes of data, taking an existing Hadoop cluster as the reference. An integral score summed over all tests and scenarios showed a performance drop of about 5% for the disaggregated Apache Ozone setup, which was quite acceptable.
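
For reference, a heavily simplified version of such a read comparison could look like the PySpark sketch below: it times the same full scan against an HDFS-backed and an Ozone-backed copy of a dataset. The paths are illustrative, and a single timed run is, of course, a simplification of the multi-scenario methodology described above.

```python
# Simplified read-performance comparison: time the same full scan on an
# HDFS-backed and an Ozone-backed copy of a dataset (paths are illustrative;
# the Ozone filesystem must be configured as in the earlier sketch).
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-read-benchmark").getOrCreate()

def timed_scan(path: str) -> float:
    """Read the dataset fully and return the elapsed wall-clock seconds."""
    start = time.monotonic()
    spark.read.parquet(path).count()  # count() forces a full read
    return time.monotonic() - start

hdfs_seconds = timed_scan("hdfs://namenode:8020/bench/events")
ozone_seconds = timed_scan("ofs://ozone-service-id/volume1/bucket1/bench/events")

print(f"HDFS:  {hdfs_seconds:.1f} s")
print(f"Ozone: {ozone_seconds:.1f} s "
      f"({(ozone_seconds / hdfs_seconds - 1) * 100:+.1f}% vs HDFS)")
```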

Gains from the transition to Apache Ozone

The transition to Apache Ozone with the separation of Compute and Storage made storage resource management for the analytical platform more flexible, with the ability to obtain the necessary resources in the required volume. The resulting implementation has already delivered a tangible combined effect:

  • reduction in TCO for storage, as it became possible to use high-density servers;

  • the number of racks in data centers was reduced fourfold;

  • the number of clusters decreased – from more than 60 to less than 20;

  • the cost of supporting the solution has decreased;

  • the efficiency of resource consumption increased from 40% to 80%.

Further migration of platform tools to Apache Ozone will only enhance this effect.

Afterword

The separation of Compute and Storage often turns into a complex quest, but if implemented correctly, the effect of switching to such an architecture more than justifies all the effort and potential difficulties.

In most situations, as in the case of the SberData analytical platform, the need to separate Compute and Storage is evolutionary: with the growth of data volume, number of sources, complexity and intensity of calculations, maintaining a monolithic storage multiplies the complexity of management and provokes related problems.

Separately, it is worth noting that in this particular case the transition from Hadoop to Apache Ozone did not solve every problem – the current version of Ozone (1.3) still has nuances around performance and fault tolerance. But these can be addressed both on the side of the open-source community (the developers promise improvements already in version 1.4) and on the side of our own implementation, so that the storage keeps up with the continuously growing demands of large systems.
