Open Source Data Lake Design

Data lakes (data lakes) have actually become the standard for businesses and corporations that try to use all the information they have. Open source components are often an attractive option when developing large data lakes. We will look at the general architectural patterns required to create a data lake for cloud or hybrid solutions, and also highlight a number of critical details to watch out for when implementing key components.

Data flow design

A typical logical data lake flow includes the following functional blocks:

  • Data sources;
  • Receiving data;
  • Storage node;
  • Data processing and enrichment;
  • Data analysis.

In this context, data sources are typically streams or collections of raw event data (e.g. logs, clicks, IoT telemetry, transactions).
The key feature of such sources is that the raw data is stored in its original form. The noise in this data usually consists of duplicate or incomplete records with redundant or erroneous fields.

During the ingestion phase, raw data comes from one or more data sources. The receiving mechanism is most often implemented in the form of one or more message queues with a simple component aimed at primary cleaning and saving data. In order to build an efficient, scalable, and consistent data lake, it is recommended to distinguish between simple data cleaning and more complex data enrichment tasks. One good rule of thumb is that cleanup tasks require data from a single source within a sliding window.

Hidden text

By data enrichment we mean the saturation of data with additional information from other sources (these can be some kind of reference books, information from third-party services, etc.). Data enrichment is a common process when all possible aspects should be taken into account for further analysis.

As an example, duplicate deletion, which is limited to only comparing event keys received within 60 seconds of each other from the same source, would be a typical cleanup task. On the other hand, the task of merging data from multiple data sources over a relatively long period of time (for example, over the last 24 hours) is more likely to correspond to the enrichment phase.

After the data is received and cleaned up, it is stored in the distributed file system (for increased fault tolerance). Data is often written in tabular format. When new information is written to the storage node, the data catalog containing schema and metadata can be updated using an offline crawler. The launch of the crawler is usually triggered by event, for example, when a new object arrives in the storage. Repositories are usually integrated with their catalogs. They unload the underlying schema so that the data can be accessed.

Next, the data goes into a special area dedicated to the “gold data”. From this point on, the data is ready for enrichment by other processes.

Hidden text

Data is called gold because it remains raw and semi-structured, and it is the primary source of business knowledge.

During the enrichment process, the data is additionally changed and cleaned in accordance with the business logic. As a result, they are stored in a structured format in a data warehouse or database that is used to quickly retrieve information, analytics, or a training model.

Finally, the use of data is analytics and research. This is where the extracted information is converted into business ideas through visualizations, dashboards, and reports. Also, this data is a source for forecasting using machine learning, the result of which helps to make better decisions.

Platform components

Cloud data lake infrastructure requires a robust, and in the case of hybrid cloud systems, a unified abstraction layer that can help deploy, coordinate, and run computational tasks without the constraints of API providers.

Kubernetes Is a great tool for this job. It allows you to efficiently deploy, organize and run various services and computational tasks of the data lake in a reliable and cost-effective manner. It offers a unified API that will work both on-premises and in any public or private cloud.

The platform can be roughly divided into several layers. The base layer is where we deploy Kubernetes or its equivalent. The base layer can also be used to handle computational tasks outside the domain of the data lake. When using cloud providers, it would be promising to use the already established practices of cloud providers (logging and auditing, designing minimal access, vulnerability scanning and reporting, network architecture, IAM architecture, etc.) This will achieve the required level of security and compliance with other requirements …

There are two additional levels above the base level – the data lake itself and the value output level. These two layers are responsible for the basis of the business logic as well as the data processing processes. While there are many technologies for these two levels, Kubernetes will again prove to be a good option due to its flexibility to support various computational tasks.

The data lake layer includes all the necessary services for receiving (Kafka, Kafka Connect), filtration, enrichment and processing (Flink and Spark), workflow management (Airflow). In addition, it includes data warehouses and distributed file systems (HDFS), as well as databases RDBMS and NoSQL

The topmost level is getting data values. Essentially, this is the level of consumption. It includes components such as visualization tools for understanding business intelligence, data mining tools (Jupyter Notebooks). Another important process that takes place at this level is machine learning using a training sample from a data lake.

It is important to note that an integral part of every data lake is the implementation of common DevOps practices: infrastructure as code, observability, auditing, and security. They play an important role in solving day-to-day problems and must be applied at every single level to ensure standardization, security and ease of use.

Hidden text

When choosing open source solutions at any stage of data cloud design, a good indicator is the widespread use of this solution in the industry, detailed documentation and support opensource-the community.

Interaction of platform components

Cluster Kafka will receive unfiltered and unprocessed messages and will function as a receiving node in the data lake. Kafka provides high message throughput in a reliable way. A cluster usually contains several sections for raw data, processed (for streaming), and undelivered or malformed data.

Flink receives a message from a node with raw data from Kafka, filters the data and pre-enriches it if necessary. The data is then passed back to Kafka (in a separate section for filtered and enriched data). In the event of a failure, or when the business logic changes, these messages can be called again, because that they are saved in Kafka… This is a common solution for streaming processes. Meanwhile, Flink writes all malformed messages to another section for further analysis.

Using Kafka Connect we get the ability to save data to the required data storage backends (like the golden zone in HDFS). Kafka Connect easily scales and will help you quickly increase the number of parallel processes, increasing the throughput under heavy load:

When recording from Kafka Connect in HDFS it is recommended to perform content splitting for efficiency of data handling (the less data to scan, the fewer requests and responses). After the data has been written to HDFS, serverless functionality (like OpenWhisk or Knative) will periodically update the metadata and schema settings store. As a result, it becomes possible to access the updated schema via SQL-like interface (e.g. Hive or Presto).

For subsequent data-flows and management ETL-process can be used Apache Airflow… It allows users to run multistage pipline data processing using Python and objects of type Directed Acyclic Graph (DAG). The user can define dependencies, program complex processes, and track tasks through a graphical interface. Apache Airflow can also serve to process all external data. For example, to receive data through an external API and storing them in persistent storage.

Spark governed by Apache Airflow through a special plugin, it can periodically enrich raw filtered data in accordance with business objectives, and prepare data for research by data scientists and business analysts. Data Scientists Can Use JupyterHub to manage multiple Jupyter Notebook… Therefore, it is worth using Spark for setting up multi-user interfaces for working with data, collecting and analyzing them.

For machine learning, you can use frameworks like Kubeflowusing scalability Kubernetes… The resulting training models can be returned to the system.

If we put the puzzle together, we get something like this:

Operational excellence

We have already said that the principles DevOps and DevSecOps are essential components of any data lake and should never be overlooked. With a lot of power comes a lot of responsibility, especially when all the structured and unstructured data about your business is in one place.

The basic principles will be as follows:

  1. Restrict user access;
  2. Monitoring;
  3. Data encryption;
  4. Serverless solutions;
  5. Using CI / CD processes.

Principles DevOps and DevSecOps are essential components of any data lake and should never be overlooked. With a lot of power comes a lot of responsibility, especially when all the structured and unstructured data about your business is in one place.

One of the recommended methods is to allow access only to certain services by distributing the appropriate rights, and to deny direct user access so that users cannot change data (this also applies to commands). Full monitoring by logging actions is also important to protect data.

Data encryption is another mechanism for protecting data. Stored data can be encrypted using a key management system (KMS). This will encrypt your storage system and current state. In turn, encryption during transmission can be performed using certificates for all interfaces and endpoints of services like Kafka and ElasticSearch

And in the case of search engines that may not comply with the security policy, it is better to give preference serverless-solutions. It is also necessary to abandon manual deployments, situational changes in any component of the data lake; every change must come from version control and go through a series CI-tests before deployment to the product data lake (smoke-testing, regression, etc.).


We have covered the basic design principles of an open source data lake architecture. As is often the case, the choice of approach is not always obvious and may be dictated by different business, budget and time requirements. But leveraging cloud technology to create data lakes, whether it’s a hybrid or all-cloud solution, is a growing trend in the industry. This is due to the sheer number of benefits this approach offers. It has a high level of flexibility and does not restrict development. It is important to understand that a flexible work model brings significant economic benefits, allowing you to combine, scale and improve the applied processes.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *