Data Ocean Nova: a modern lakehouse data platform

Hello. My name is Evgeniy Vilkov. I have been working with management and data integration systems since 2002, and specifically with data analysis and processing systems since 2007. The technologies I have dealt with throughout my professional journey have developed rapidly: starting with solutions based on the traditional DBMS stack (Oracle, MS SQL Server, Postgres), gradually evolving into the now-classic (and in some cases already discontinued) MPP systems such as Teradata, GreenPlum, Netezza, Vertica, IQ, HANA, Exadata, and ClickHouse, then various solutions based on the Hadoop ecosystem, and on to cloud services and platforms.

The world is changing: technologies, design approaches, and the requirements placed on the analytical data landscape are all changing with it.

Fig. Data Lakehouse (generated with Gemini AI)

I'm sure many who are already familiar with the terms Data Mesh and Data Lakehouse are wondering what the analytics market has to offer within these design methodologies and architectural approaches. I want to talk about Data Ocean Nova, a data analytics platform of which I am the owner and technology ideologist.

Before diving into the architecture, I will offer a series of statements verified by many years of practical project experience in Big Data / MPP / DWH. These statements formed the basis of the technical requirements for our Lakehouse system.

Statement 1. Several years ago, it became obvious that all analytical data warehouse tasks could be solved within the Hadoop ecosystem. The important condition is the correct architectural approach: choosing or preparing the right distribution and making the right choice of engines and frameworks. I won't dwell on this in detail, as I have already published material on this topic in the community: "How to build a modern analytical data warehouse based on Cloudera Hadoop".

Statement 2, which follows from the first. The cost of a system built with this approach is lower than one based on the classic MPP stack (and even more so than the heterogeneous "classic MPP + Hadoop extension" approach). Moreover, the performance of such a system is higher for the same capital costs.

Let me immediately (perhaps naively) try to head off a holy war: there are no good or bad systems. There are systems that solve particular data-processing tasks with varying degrees of efficiency. Some are more effective, some less so, or require greater investment in hardware and additional design solutions. In this context, we are talking about the average requirements for what a modern big data analytical system should be able to do:

  • Ingest data efficiently in batch and online modes;

  • Efficiently store data;

  • Solve data transformation tasks of any complexity without restrictions (building analytical layers and data marts, providing data services);

  • Provide highly concurrent, managed user access directly or via third-party analytics tools;

  • Support analytical sandboxes;

  • Scale effectively.

Statement 3. The Hadoop architecture has inherent flaws. Some can be worked around with additional design solutions; others are part of its DNA.

Below I describe what cannot be solved by design.

Combined storage and compute. Practice has shown that it is very difficult to pick a configuration that stays balanced over a long service life: either HDFS fills up while RAM and CPU sit idle, or vice versa. Any scaling of a symmetric MPP system means overpaying for unused capacity, and scaling itself is a long and expensive process. The "quickly add capacity for a limited time" scenario is simply not feasible unless you happen to have spare hardware matching the configuration.

Load sharing and resource isolation between different consumer groups. It is difficult to implement scenarios for sharing resources between different engines, frameworks, tasks, and users. You have to configure it at the operating-system level (cgroups), in YARN, and at the level of the engine or framework itself (admission control). This greatly complicates event-based distribution of resources, makes dynamic upscale/downscale impossible, and rules out quickly creating a resource-isolated area to implement the Data Mesh concept.

HDFS sensitivity to online data-loading operations. In practice this can be resolved, but the specific settings negatively affect batch-processing analytical tasks. Either the task has to be moved to a separate storage format, as described in "Using Kudu to solve real-time problems in a Hadoop environment", or a separate HDFS cluster has to be allocated for HBase.

Redundant storage for fault tolerance, which costs ⅔ of the disk space under the default replication factor of 3 (in practice, erasure coding is almost never used in HDFS).
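
To make the overhead concrete, here is a quick back-of-the-envelope comparison (the raw capacity figure is purely illustrative) of usable space under 3x replication versus a typical Reed-Solomon erasure-coding scheme:

```python
# Usable capacity under HDFS 3x replication vs. RS(6,3) erasure coding.
raw_tb = 1000                    # raw cluster capacity in TB (illustrative)

usable_replicated = raw_tb / 3   # RF=3 stores each block three times: ~333 TB
usable_ec = raw_tb / 1.5         # RS(6,3) stores 9 blocks per 6 data blocks: ~667 TB

print(f"replication:    {usable_replicated:.0f} TB usable")
print(f"erasure coding: {usable_ec:.0f} TB usable")
```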

Hadoop is not a cloud system, at least not in its pure form. HDFS on IaaS is too slow due to cloud characteristics (virtual network drives) and too costly (reserving disks for redundant replicated storage, or fully reserving servers), while HDFS as PaaS makes no sense at all, either for the cloud operator or for the consumer. Although Cloudera offers a solution for public cloud providers, it loses competitively both to other market players like Snowflake and Databricks and to the cloud providers' own in-house services.

Architecture

When developing the architecture, our team faced the task of preserving the advantages described in statements 1 and 2 while bypassing the disadvantages of statement 3, thereby obtaining the most efficient system for working with data. Let's look at each component of the Data Ocean Nova architecture separately.

Fig. Architecture of Data Ocean Nova.

Data storage

S3 storage (AWS API specification) is used instead of HDFS; this is a mandatory prerequisite for installation and operation. In a cloud environment, it is recommended to use the cloud provider's S3 service rather than building one on top of IaaS, except in special cases. In an on-premise installation, you can use any S3-compatible (AWS API) solution, whether a ready-made appliance or recommended hardware with installed software. The main thing to remember is that the hardware and workload requirements of the S3 storage must match your goals.

This is especially true for a high-load analytical system with high access concurrency and real-time data loading, as opposed to storing backups or "cold" archives.

A MinIO-based service has proven itself very well in this scenario and can be delivered as part of the overall solution.
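
Because the storage speaks the standard AWS API, any S3 client can address it unchanged. A minimal sketch with boto3; the endpoint URL, credentials, and bucket layout are hypothetical:

```python
# Listing lakehouse objects on an S3-compatible endpoint (e.g. MinIO).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.nova.local:9000",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

resp = s3.list_objects_v2(Bucket="lakehouse", Prefix="warehouse/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```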

The recommended (but not required) file format for storing data in object storage is Parquet, which has built-in optimization techniques (storage indexes and page indexes). These are supported by the main engines and frameworks included in Nova, which allows maximum performance.

We use Iceberg as the table format on top of the file storage.

What the Iceberg table format provides (a short PySpark sketch follows the list):

  • Support for ACID requirements out of the box, without resorting to additional design solutions;

  • Evolution

    • Schema evolution, up to changing the partitioning key on the fly, without recreating the table or reloading data;

    • The evolution of the data itself, tied to its structure;

    • Hidden partitioning, without the need to create surrogate computed partition-key fields (and without requiring explicit partition-key expressions in queries for partition pruning);

  • Time travel to the desired data version;

  • Very high rates of online data integration in conjunction with S3 (an example use case: Change Data Capture -> Kafka -> application -> S3 Iceberg);

  • Support for SQL UPDATE/MERGE/DELETE operations in our processing engines.
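
A minimal PySpark sketch of these features, assuming the Iceberg Spark runtime is on the session classpath; the catalog, table, and timestamp are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is available to the session.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hive")
    .getOrCreate()
)

# Hidden partitioning: the spec is an expression over an existing column,
# so no surrogate partition field appears in the schema.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(ts))
""")

# ACID upsert: MERGE works out of the box.
spark.sql("""
    MERGE INTO demo.db.events t
    USING demo.db.events_updates s ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel to an earlier state of the table.
spark.sql(
    "SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-01-01 00:00:00'"
).show()
```

The partition spec lives in table metadata, so queries filtering on `ts` get partition pruning without any explicit partition column.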

Resource management, fault tolerance, orchestration

Data Ocean Nova is an application that runs in a container orchestration environment.

Any suitable software can serve as the containerization subsystem, whether vanilla Kubernetes, which can be supplied as part of the overall solution, or Russian analogues included in the register of domestic software.

The containerization environment is responsible for sharing and isolating resources between Nova components and ensuring their fault tolerance.

Metadata and information security

The following services handle metadata and are responsible for information security, LDAP integration, and logging of all events:

  • Hive Metastore as the technical metadata catalog;

  • Ranger as a role-based access management service;

  • OpenSearch as a single repository for all infrastructure and service logs.

All system components, engines and frameworks work in a common role model, controlled from a single service and interface.

Compute

For processing engines, the choice fell on the following:

Impala. One of the highest-performance MPP engines on the market, capable of a wide range of data processing and access tasks. Practice has shown that Impala is faster not only than its peers from the Hadoop ecosystem, but also than any classic MPP DBMS such as Teradata or GreenPlum.

It is recommended as the main ANSI SQL engine for SQL data transformation (ETL), analytical access for applications and BI, ad-hoc data analysis, and data engineers' self-service tasks.
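
As an illustration of that ETL-over-SQL workflow, a hedged sketch using the impyla client; the host, port, and table names are hypothetical:

```python
# Running a SQL transformation step on Impala from Python via impyla.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

# A typical ETL step: rebuild an analytical data mart from a staging table.
cur.execute("""
    INSERT OVERWRITE analytics.daily_sales
    SELECT store_id, CAST(ts AS DATE) AS sale_date, SUM(amount) AS total
    FROM staging.sales
    GROUP BY store_id, CAST(ts AS DATE)
""")

cur.execute("SELECT COUNT(*) FROM analytics.daily_sales")
print(cur.fetchone())

cur.close()
conn.close()
```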

The reasons for Impala's efficiency are simple and obvious: highly selective reads thanks to the indexes built into the Parquet format (storage and page indexes); active data filtering at the read stage (bloom filters, dynamic filters); and managed intra-node parallelism (several processing threads per node for one operator). Impala performs especially well under highly concurrent processing, when the system receives a large number of queries simultaneously, as it has advanced mechanisms for managing such a load (admission control).

None of the classic massively parallel processing systems offers such a list of capabilities, and so far none of the modern engines that do offer them uses allocated compute power as effectively.

Trino. This popular MPP engine is largely similar to Impala in capabilities, but still loses to it on highly concurrent (tens and hundreds of simultaneous queries) access and processing tasks. Practice has shown that for Trino to match Impala's performance, it needs more compute resources, and the higher the concurrent load, the wider the gap. Alas, Trino is a Java application running inside the JVM, with the corresponding overhead, unlike Impala's C++ core with its LLVM code-generation approach.

However, Trino is part of Data Ocean Nova, since it is indispensable for heterogeneous access tasks, when part of the data lives within the perimeter of Nova's Lakehouse and part may live in a third-party system (Oracle, Mongo, Postgres, and so on).

Trino can, of course, serve as an alternative for all the SQL tasks where Impala is recommended, bearing in mind the higher resource overhead and weaker resource management mechanisms.
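
A sketch of such a heterogeneous query through Trino's Python client; the catalog names, host, and schemas are hypothetical:

```python
# Joining lakehouse data with a table in an external Postgres via Trino.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com", port=8080, user="analyst"
)
cur = conn.cursor()

cur.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM iceberg.sales.orders AS o
    JOIN postgresql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY c.region
""")
for region, total in cur.fetchall():
    print(region, total)
```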

Spark. A case where comments are unnecessary: the de facto standard for data engineering tasks, including analysis beyond SQL. If you have a reason to use Spark SQL, for example for remote client connections, Spark Thrift Server covers that need. Spark Connect handles connections from outside the system via the DataFrame API. YuniKorn is used as the resource scheduler for Spark jobs.
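
A minimal sketch of an external client attaching via Spark Connect (available in PySpark 3.4+); the endpoint and table name are hypothetical:

```python
# Connecting to a remote Spark Connect endpoint with the DataFrame API.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-connect.nova.local:15002")  # hypothetical endpoint
    .getOrCreate()
)

df = spark.read.table("demo.db.events")
df.groupBy("payload").count().show()
```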

None of the above engines and frameworks is required to be installed. Which engines to deploy and what resources to allocate to them is up to you.

Client connections and analysis tools

It is good practice to let users work with the system out of the box, without installing additional client software. The following thin-client tools are available:

  • Hue, an SQL client supporting our processing engines (Impala, Trino, Spark) and their native integrations. It can work in notebook mode, supports multi-tenancy for engines, and has built-in mechanisms for data import/export and reporting;

  • JupyterHub, which has become the standard data engineer's workplace in its segment.

Any external tools not included in Nova can connect via JDBC or ODBC. Software familiar to users, such as DBeaver, FineBI, Apache Superset, PowerBI, SAS, Tableau, Excel, Qlik, and so on, also works with our Lakehouse solution.

Additional product services

In addition to the functionality based on open-source components, Nova contains internal product services developed by our team to solve various problems within a holistic solution. Examples:

  • A service that collects all infrastructure, engine, and framework logs, parses them into a standard form, and stores them in a single index store;

  • A service implementing a warehouse data dictionary with detailed information about objects: visualizing the dictionary in the cluster manager, generating service maintenance procedures, a data heat map, an automatic recommendation system for optimizing objects, and so on;

  • An AI-based optimization recommendation service that predicts session parameters for a query to increase system throughput;

  • A service for managing maintenance of the object file subsystem, with built-in service jobs for periodic upkeep: the small-file problem, Iceberg table maintenance, and so on (a sketch of such maintenance calls follows).
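
For reference, this is the kind of housekeeping such jobs automate, expressed here with standard Iceberg Spark maintenance procedures; the catalog and table names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Compact small files produced by streaming/online ingestion.
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.events')")

# Expire old snapshots so their storage can be reclaimed.
spark.sql(
    "CALL demo.system.expire_snapshots(table => 'db.events', retain_last => 10)"
)

# Drop orphan files left behind by failed writes.
spark.sql("CALL demo.system.remove_orphan_files(table => 'db.events')")
```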

Cluster Manager

Cluster Manager is a web GUI application for managing and monitoring the system. Its purpose is to let the system administrator perform actions in a graphical environment without resorting to the CLI. In addition, it provides functionality for analyzing system load, checking the health of Nova services, and visually assessing the distribution of services and load across Kubernetes machines; it also lets you change resource-group settings, track user queries and assess their toxicity, and create and manage routine maintenance processes for the Nova environment, among much else.

Fig. Cluster Manager screens

Benefits of the architectural approach used in Data Ocean Nova

Independent and dynamic scaling of storage and compute. Architecturally, there is no sizing trap of guessing disk capacity versus computing power: whichever limit you hit, you scale that resource. Don't be scared by the stereotype "but there's no data locality!". If you use the right processing engines that read and write selectively, if you have properly planned your network infrastructure, and if your storage does not shuffle data around the cluster to maintain a replication factor, this risk is mitigated as much as possible. In addition, our engines make excellent use of local caching and do not re-read unchanged blocks from S3.

You can add computing power simply by adding it to Kubernetes, whether as hardware capacity or virtual infrastructure. Imagine needing to quickly boost system capacity for a short time to train a model or run functional testing of a release (real use cases from our clients): this makes the task far easier.

Environment isolation. You can have separate computing clusters with their own (as needed) isolated resources for different tasks, or one per business domain if you intend to follow the Data Mesh concept.

Example of resource sharing:

  • Cluster for scheduled (routine) processes

  • Cluster for connecting external analytical systems

  • User cluster for business domain 1

  • User cluster for business domain 2

A single Data Ocean Nova instance has one metadata layer and one role model across different domains and compute areas. You can have a common data layer for all domains (for example, a single analytical core) plus private areas in each domain for publishing data products.

By different computing clusters we mean here not just different sets of engines and services, but the multi-tenancy principle: you can have, for example, more than one isolated Impala or Trino cluster, each with its own number of nodes and its own per-node characteristics.

Resources are not nailed down to tenants or services. You can redistribute them, including dynamically according to given rules. Computing clusters can be created event-driven, including via the infrastructure-as-code principle. Creating an environment for a business domain takes minutes, and the domain pays only for its own computing power (in a cloud installation, literally, with correctly configured billing).
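
As an illustration of the idea only (not Nova's actual API), carving out an isolated resource area programmatically might look like this with the Kubernetes Python client; the namespace and quota values are hypothetical:

```python
# Creating an isolated, quota-bounded namespace for a new business domain.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

ns = "domain-marketing"
v1.create_namespace(client.V1Namespace(metadata=client.V1ObjectMeta(name=ns)))

# Cap the domain's compute so tenants cannot starve each other.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="compute-quota", namespace=ns),
    spec=client.V1ResourceQuotaSpec(
        hard={"requests.cpu": "64", "requests.memory": "256Gi"}
    ),
)
v1.create_namespaced_resource_quota(ns, quota)
```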

Fig. Example of a Lakehouse architecture with domain separation

Real multi-tenancy use cases:

Case 1. The client deployed two Impala tenants of different computing power on Kubernetes workers hosted on hardware in one data center, and a third Impala tenant with different node characteristics on k8s machines running on the virtual infrastructure of a private cloud in another data center.

Case 2. The client uses a cloud installation of Nova: three independent Impala tenants, each with its own number of nodes and node characteristics:

  • Impala 1. Serves data loading and all scheduled ETL calculations.

  • Impala 2. Dedicated tenant for ad-hoc analysis of a dedicated business unit.

  • Impala 3. Dedicated tenant for ad-hoc analysis of another dedicated business unit with the ability to dynamically scale.

Plus two resource-isolated areas of Spark computing:

  • Data loading processes;

  • Processes of routine maintenance of the Lakehouse environment.

Cloud installation

Data Ocean Nova can be installed both on-premise and in a private or public cloud infrastructure. It is also possible to use the operator's cloud services in place of services shipped with the product; mainly this applies to technical services like monitoring and logging.

As an example, I want to share a client installation with maximum use of SaaS from the operator.

The Nova installation from the Yandex.Cloud operator uses:

All other internal Nova services and components work in the managed k8s environment from the cloud provider.

Fig. One of the Nova client installations in the Yandex cloud environment

On the one hand, this approach requires dividing administration responsibilities between the client and the cloud operator; on the other, it provides maximum flexibility, including consuming and paying for resources strictly as needed and isolating service functions from the system's core functions.

Clearly, a hybrid installation (part cloud, part on-prem) is purely a matter of preference; the architecture fully allows it.

When you should consider the Data Ocean Nova Lakehouse solution

The scope of our system is determined not only by the architecture but also confirmed by successful customer implementations.

Nova is worth considering if you:

  • You are interested in a universal Lakehouse analytical data platform capable of solving the entire range of tasks (one box instead of three or even four systems):

    • Real-time ODS, including in Upsert mode;

    • High-performance batch processing and data transformation (the classic analytical warehouse scenario: building analytical layers and data marts);

    • User access for solving any tasks over structured and unstructured data;

    • A back-end computing environment for analytical applications or ML platforms.

  • You want maximum system performance relative to capital costs: cheaper and faster than Hadoop, Teradata, or GreenPlum.

  • You plan to follow the Data Mesh design approach, but in a single infrastructure environment.

  • You cannot predict at the start the sizing of the system that will correspond to the development of your business, and you need flexibility and room for error in scaling.

  • You expect to use a multi-tenancy approach, with several isolated computing clusters working on shared data, possibly even running different versions of services simultaneously.

  • You are building a warehouse that requires heterogeneous access.

  • You are cloud-focused, but the data capabilities of available cloud services are insufficient, or absent altogether in the case of a private cloud.

At the moment, Data Ocean Nova is the only ready-made enterprise-level solution on the Russian market that implements the Lakehouse concept. Our clients' solutions built on Nova have been in commercial operation for at least a year.

What makes Nova unique on the market?

To evaluate the solution correctly, along with the amount of labor that makes a product a product if you were to develop it yourself, remember that:

  1. There is no open source code or build system for a complete Lakehouse-class solution from which an internal team could independently build a similar system;

  2. Spark running on Kubernetes on top of S3 or HDFS is not a Lakehouse, nor a product;

  3. Running pre-built images of, say, Trino does not make a Lakehouse platform or a product.

Fig. Product team composition

In order for Nova to reasonably be called a finished product, our team:

  • Modifies the source code of open-source components so they work together consistently within the overall solution;

  • Fixes previously unknown bugs in component source code found by customers and our development team;

  • Backports fixes for known errors in the component source code;

  • Creates new core functionality in engines and libraries beyond what open source offers (new engine and framework functions, engine optimizations, new engine capabilities for working with S3 and the table format that are unavailable in open source);

  • Prepares the distribution for installation;

  • Backports new functionality of individual components from upstream versions of open source;

  • Adapts all components to work in the Kubernetes environment (including modifying application source code);

  • Promptly releases new versions when an opensource component is released;

  • Supports enterprise-level information security requirements for Kubernetes (for example, Kubernetes CIS or other standards);

  • Independently or together with partners, ensures compatibility with base Russian software from the Ministry of Digital Development registry: OSes (Alt, Astra, RedOS), Kubernetes distributions (Orion Nova, Flant, Alt), PostgreSQL (PostgresPro, Pangolin), S3 appliances;

  • Provides full technical support.

Our team has many years of hands-on experience building high-load data processing systems with the technologies used in Data Ocean Nova. We are part of the GlowByte group of companies, one of the most experienced Big Data teams, whose vision rests on an enormous track record of successful projects in the Russian and foreign markets rather than on analyzing source code in Git and reading publicly available publications. We invest our knowledge and experience into the final product and provide a high level of service, which creates unique added value.

This publication has turned out quite voluminous, yet many topics remain uncovered. In the near future, we plan to continue sharing material on the following topics:

  • Results of comparative tests of Data Ocean Nova against other systems, and a comparison of modern processing engines.

  • How to properly test massively parallel processing analytical systems.

  • How to estimate the cost of system ownership. What criteria need to be taken into account.

*****

The Data Ocean Nova lakehouse data platform is developed by Data Science (part of the GlowByte group of companies).
