Data Warehouse, Data Lake, Data Lakehouse, Data Fabric, Data Mesh: what they are and how the concepts differ

Data has become an important resource for business, but to make sound management decisions based on it, it must be stored and processed correctly.

Data warehouses provide infrastructure and tools for collecting, storing, and processing large amounts of information for further business analytics.

The era of modern data warehouses began with the advent of relational databases (hereinafter, DBs). As business analytics developed, the evolution of databases gave rise to the concept of the Data Warehouse (enterprise data warehouse, DWH).

The continued growth of data volumes, the introduction of the term “big data” and the diversity of processing requirements have led to the evolution of data architectures.

Data architecture includes models, rules, and standards that determine how to collect, store, host, integrate, and use data to solve business problems.

New concepts emerged:

  • Data Lake

  • Data Lakehouse

  • Data Fabric

  • Data Mesh

In this article we will look at the stages of the evolution of data architecture: how the concepts differ, what their advantages and disadvantages are, and what tasks in working with data they are suitable for.

Contents of the article:

  1. Chronology of database and data warehouse development

  2. What is a Data Lake and how is it different from a Data Warehouse?

  3. Data Lakehouse: combining the benefits of Data Lake and DWH

  4. Data Fabric – ML-enhanced storage

  5. Decentralization of data in Data Mesh

Chronology of database and data warehouse development

To better understand how and why the technologies we use today appeared, let’s dive into the history of the first databases, DBMSs and data warehouses.


1960–1970

The growth of the banking and financial sector in the 1960s created a need for a reliable model for storing and processing data.

It is generally accepted that the modern history of data processing began with the creation of the organization CODASYL.

CODASYL (Conference on Data Systems Languages)

In 1959, the organization created the COBOL programming language, and in 1969 it published a language specification for the DBTG network data model, or CODASYL Data Model.

The prototype for these first database standards was the development of the Integrated Data Store (IDS) network database in 1963 by General Electric's computer division.

At the same time, in the 1960s, the concept of the information economy and prototypes of BI appeared – systems for making decisions based on raw data.

In 1966, for the Apollo space program, IBM developed the first industrial DBMS – the hierarchical IMS (Information Management System), whose task was to process product specifications for the Saturn V launch vehicle and the Apollo spacecraft.


1970–1980

Previously developed data models were difficult to maintain, so the computer scientist Edgar Frank “Ted” Codd began working on a theory of data storage to simplify the process.

In 1970, he published A Relational Model of Data for Large Shared Data Banks.

Edgar Frank “Ted” Codd

Codd's work became the basis of the relational data model and inspired his colleagues at IBM to create SEQUEL (Structured English Query Language), a query language that later became SQL.

In 1975, the ANSI/SPARC three-level architecture for building DBMSs was proposed; its purpose is to separate the user's view of a database from its physical representation.

A year later, the ER (Entity-Relationship) data model was developed, which identifies key entities and the relationships that can be established between them.

In 1979, the first version of the SQL-based relational DBMS (RDBMS) Oracle Database was released, and in 1980 the dBASE DBMS followed, becoming the most popular of all the products that existed at that time.


1980–1990

Further development of the relational model and the appearance of several competing DBMS products on the market created the need to develop DBMS standards.

In 1985, Edgar Codd published his famous 12 rules that every relational database management system must adhere to.

In 1986, the American National Standards Institute introduced the first version of the SQL language standard, and in 1987 the first international SQL standard, ISO SQL-87, was adopted.

The University of California at Berkeley launched the non-commercial POSTGRES project to create a DBMS of the same name for research and production tasks: analyzing financial data, monitoring jet engine operation, storing data from medical institutions, geographic information systems, and others.

In parallel, IBM developed the DB2 family of data management products (the IBM Database 2 RDBMS), which offered a convenient interface for loading data and built-in tools for generating reports.

Sybase began bundling its database servers with intelligent workstations from Sun Microsystems and Apollo Computer.

The number of PCs and local networks kept growing, and methods were needed for combining, loading, and retrieving data in DBMSs and for harmonizing data storage formats.

The “snowflake” schema

During this period, the concept of the data warehouse was born: a separate database isolated from production systems, whose use reduces the load on operational databases.
In data modeling for such warehouses, the star and snowflake schemas came into use.

The term OLTP (Online Transaction Processing) appeared, referring to the real-time processing of transactions (sequences of operations on a database).

At the same time, Howard Dresner, later a Gartner analyst, coined the term Business Intelligence as “concepts and techniques for improving business decision making using systems based on business data.”


1990–2000

In 1991, the American scientist Bill Inmon published the book “Building the Data Warehouse” and introduced the concept of the DWH as a centralized repository of all company data, whose development begins with the creation of a normalized data model.

The counterpart to Inmon's theory is Ralph Kimball, who uses a bottom-up approach in which individual data marts are created, each aimed at solving specific business problems and linked through conformed dimensions into a conceptual data warehouse. Kimball's idea is to denormalize the storage structure and simplify analysis for business users. Both Inmon's and Kimball's concepts are widely used today, often in parallel.

In 1993, the by now famous Edgar Codd proposed 12 rules for analytical processing, giving rise to OLAP (OnLine Analytical Processing) – a technology for interactive analytical processing of data in real time.

In 1990–1991, Microsoft shipped the first releases of its relational SQL Server, which quickly took a leading position in the DBMS market.

PostgreSQL

In 1994, Postgres95 appeared, based on the non-commercial Postgres DBMS with the addition of an SQL interpreter, and began its journey as an open-source project.

In 1996, the PostgreSQL object-relational DBMS was created on its basis.


2000–2010

In the 2000s, the concept of “Big Data” appeared to denote structured and unstructured data of huge volumes.

The creation of highly scalable Internet applications, such as search engines and email services, began.
As part of these projects, non-relational DBMSs (NoSQL) began to be created and used to cope with the rapid growth in data volume and load.

To overcome the limits of vertical scaling (adding resources – memory and disks – to a single server as data volumes grow) and to enable parallel computing, Google proposed the MapReduce distributed computing model.

In 2006, building on this model, Doug Cutting created a foundational big data technology – Hadoop with its distributed file system (HDFS).

Data Vault

At the same time, in 2000, Dan Linstedt presented the Data Vault design model for corporate data warehouses, and in 2009 Lars Rönnbäck and Olle Regardt presented Anchor Modeling.


2010–2020

The Hadoop ecosystem of projects that simplify working with MapReduce grew to include the Apache Spark framework (with its PySpark Python API), the Apache Hive DBMS with its SQL-like query language, and Apache Pig.

It became possible to process not only structured but also unstructured data, and the possibilities for scaling storage grew.

Architectures also evolved: Data Lake and Data Fabric appeared, more flexible than the classic DWH.

At the same time, the Greenplum DBMS was designed on the basis of PostgreSQL using a massively parallel processing (MPP) architecture.
Yandex developed the first prototypes of the columnar analytical DBMS ClickHouse.
The Apache ecosystem kept growing: Apache Airflow, Kafka, and Storm appeared.

In 2011–2014, several proprietary cloud distributed warehouses for analytics appeared, in which data is split into fragments and distributed across nodes (servers).
Google BigQuery, Amazon Redshift, and Snowflake made it possible to grow data volumes and compute requirements without having to scale the infrastructure up or out yourself.

During the same period, a trend emerged toward moving data infrastructure to the cloud and toward Data Science and machine learning.


The present day

Today, due to the rapid growth of business, new needs have arisen to simplify the storage, management and analysis of data.

Solutions were needed that would allow ML and AI to be applied to business analytics and data management, while making data equally accessible to data scientists and business analysts.

In this regard, new hybrid storage concepts have been developed: Data Lakehouse, which combines the advantages of a data lake and data warehouse, and decentralized Data Mesh.


So, in 1980–1990, DWHs became a response to growing data analytics needs and overcame the limitations of ordinary databases designed for storing and accessing specific information. A DWH can:

  • Store a wider range of data types for different subsystems, including BI

  • Retain historical data and aggregated values

  • Track changes in data

Over time, the growth in the volume of generated information and the search for more flexible solutions and scalable technologies led data architectures to evolve from Data Lake and Data Fabric to the flexible, hybrid Data Lakehouse and Data Mesh.

Let's look at each in more detail.


What is a Data Lake and how is it different from a Data Warehouse?

Some of the main processes in data warehouse management are:

  • ETL – Extract, Transform, Load

  • ELT – Extract, Load, Transform

These processes involve taking data from disparate internal and external sources, optimizing it, transforming it, cleaning it, and moving it into a repository.

ETL

When implementing an ETL process, the data warehouse architecture consists of three components (a short sketch follows the list):

  • Data sources containing structured data in the form of tables

  • A staging area (temporary storage) in which data transformation occurs

  • The data recipient: a DWH or a database
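
To make this flow concrete, here is a minimal ETL sketch in Python; the source data, staging logic, and target table are invented for illustration and are not taken from the article.

```python
# A minimal, illustrative ETL sketch: extract from a source, transform in a
# staging step, load into a warehouse table. The source data, column names,
# and target table are assumptions for the example.
import sqlite3

import pandas as pd


def extract() -> pd.DataFrame:
    # In a real pipeline this would read from CRM/ERP systems or operational DBs.
    return pd.DataFrame(
        {"order_id": [1, 2, 3], "amount": ["10.5", "N/A", "7.2"], "country": ["us", "de", "us"]}
    )


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Staging area: fix types, drop bad rows, normalize values.
    staged = raw.copy()
    staged["amount"] = pd.to_numeric(staged["amount"], errors="coerce")
    staged = staged.dropna(subset=["amount"])
    staged["country"] = staged["country"].str.upper()
    return staged


def load(clean: pd.DataFrame, conn: sqlite3.Connection) -> None:
    # Data recipient: a warehouse table (SQLite stands in for the DWH here).
    clean.to_sql("fact_orders", conn, if_exists="append", index=False)


if __name__ == "__main__":
    with sqlite3.connect("warehouse.db") as conn:
        load(transform(extract()), conn)
```

In an ELT pipeline, by contrast, the raw extract would be loaded into storage first and transformed there only as needed.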

A DWH is a centralized repository for storing and subsequently analyzing structured data.

An enterprise data warehouse allows you to update, normalize, and enrich data and to combine data from various information systems, tables, and other external sources into a single structure.

As a result, no further data preparation is required, making it much easier for analysts and business users to access this data.

In enterprise storage, archival and historical data over different periods is stored in a format suitable for analyzing trends over time.

The emergence of the Big Data concept, the growth in the volume of information and the need to process not only structured but also unstructured data also influenced the methods of data processing, and the ELT approach began to develop.

Unlike ETL, in the ELT process data is first extracted in “raw” form, then loaded into a Data Lake, and only then converted into the required format inside the storage.

A Data Lake is a storage facility that receives continuous streams of structured, semi-structured, and unstructured “big data”.

Unstructured data is data that lacks format and order: text documents, emails, images, videos, social media posts, sensor readings, log files, GPS data, etc.

This approach allows you to use data not only for business analytics, but also for artificial intelligence or machine learning tools.

The lake allows you to analyze a wider range of data in new ways to gain previously unavailable business insights.

Compared to a DWH, the Data Lake architecture provides more flexible scalability, high performance, and low support costs when working with large volumes of data (over 150 GB per day).

Data Lake stack

A Data Lake is built on distributed file systems (DFS), the Hadoop and Spark processing frameworks, the Apache Parquet or Avro storage formats, object storage (Apache Ozone), NoSQL databases, and other open-source tools.
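
As a hedged sketch of the ELT pattern on such a stack (the paths, field names, and lake layout are assumptions, not from the article), raw JSON events can be loaded into the lake unchanged and stored as Parquet for later transformation:

```python
# A minimal PySpark sketch of loading raw data into a data lake (the "EL"
# steps of ELT): ingest JSON as-is and store it as partitioned Parquet files.
# Paths and fields are illustrative assumptions; in production the lake would
# typically live in object storage (S3, HDFS, etc.).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("raw-ingest").getOrCreate()

# Extract + Load: read raw, semi-structured events and keep them unchanged.
raw = spark.read.json("./lake/incoming/events/*.json")
(
    raw.withColumn("ingest_date", F.current_date())
       .write.mode("append")
       .partitionBy("ingest_date")
       .parquet("./lake/raw/events")
)

# Transform later, only when a concrete analytical need appears.
events = spark.read.parquet("./lake/raw/events")
daily = events.groupBy("ingest_date").count()
daily.show()
```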

Some large companies with distributed and branched business structures use both DWH and Data Lake, depending on their data management tasks.

A comparison of the two approaches:

Data types

  • Data Warehouse: structured, analytics-ready data

  • Data Lake: data in raw, semi-structured, or unstructured form, in any format

Data relevance

  • Data Warehouse: only the data necessary for specific business tasks

  • Data Lake: all company data, some of which may never be useful

Goals

  • Data Warehouse: visualization, reporting, BI

  • Data Lake: predictive analytics, machine learning, AI, BI, big data analytics

Processing

  • Data Warehouse: ETL – data is extracted from the source, cleaned, structured, and ready for analysis at the final stage

  • Data Lake: ELT – data is extracted from the source, stored in the data lake, and then transformed as needed

If the company does not have plans for Big Data, Natural Language Processing (NLP), ML or AI analytics projects, implementing a Data Lake may not be practical.

To work with data in the lake, you must be able to interact with its components. This requires data science, data development, business analytics, and data engineering skills; otherwise, huge volumes of data risk turning into unmanageable and inaccessible flows of information.

In addition, a Data Lake lacks some features necessary for operational analytics: transaction support, schema enforcement, and operation auditing.

All these factors force us to seek a compromise – a data architecture that provides the ability to store different types of data and at the same time easily access it for reporting.


Data Lakehouse: combining the benefits of Data Lake and DWH

The Lakehouse story begins with the Medallion multi-layer architecture proposed by Databricks, which describes the step-by-step transformation of data from its raw form into structured data for business intelligence.

A Lakehouse combines the flexibility of data lakes with the clear structure of classic warehouses. Table formats such as Apache CarbonData, Apache Iceberg, Delta Lake, and Apache Hudi are deployed on top of the Data Lake, providing an additional layer for managing metadata and implementing transactions.
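
As a rough illustration of this extra layer (a sketch under assumptions: it uses the open-source deltalake (delta-rs) Python package, and the paths and columns are invented), raw “bronze” data and a cleaned “silver” table can be written as versioned, transactionally committed tables on top of ordinary Parquet files:

```python
# A minimal sketch of a lakehouse table layer, assuming the open-source
# `deltalake` (delta-rs) and pandas packages are installed.
# Paths and columns are illustrative, not taken from the article.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

raw_events = pd.DataFrame(
    {
        "user_id": [1, 2, 2],
        "amount": [10.0, None, 25.5],
        "ts": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-02"]),
    }
)

# Bronze layer: append raw data as-is; each write is an ACID-committed version.
write_deltalake("./lake/bronze/events", raw_events, mode="append")

# Silver layer: cleaned, deduplicated data ready for analytics.
bronze = DeltaTable("./lake/bronze/events").to_pandas()
silver = bronze.dropna(subset=["amount"]).drop_duplicates()
write_deltalake("./lake/silver/events", silver, mode="overwrite")

# The table format keeps metadata and a transaction log alongside the Parquet
# files, so the change history can be inspected and older versions queried.
print(DeltaTable("./lake/silver/events").history())
```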

This concept allows use only one repository, where not only all types of data are located: structured, semi-structured and unstructured – but also all queries and reports are executed.

Lakehouse storage offers greater capabilities for Big Data analytics while remaining manageable (a query sketch follows the list):

  • Supports ACID transactions and SQL queries

  • Supports ML and Python/R libraries

  • Gives access to different types of data, including images, video, audio, and real-time data ingestion

  • Allows you to use BI tools without additional data processing
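
For example, to illustrate the SQL point above, the sketch below runs an analytical query directly over Parquet files in the lake with DuckDB; the file path and column names are hypothetical assumptions:

```python
# A minimal sketch: running SQL directly over files in the lake with DuckDB.
# The file path and column names are illustrative assumptions.
import duckdb

con = duckdb.connect()  # in-memory analytical engine
result = con.execute(
    """
    SELECT region, sum(amount) AS total_amount
    FROM read_parquet('./lake/sales/*.parquet')
    GROUP BY region
    ORDER BY total_amount DESC
    """
).fetchdf()
print(result)
```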

The Data Lakehouse concept is relatively new and its potential and capabilities are still being explored. This is a complex system that must be created from scratch to suit the needs of a specific business.


Data Fabric – ML-enhanced storage

The Data Fabric concept developed in parallel with the Lakehouse architecture and was designed to overcome the limitations associated with the time and complexity of searching, integrating and analyzing large volumes of disparate and multi-format data.

Data Fabric is an additional data virtualization layer: an ecosystem consisting of real-time data processing technologies, APIs, MDM (Master Data Management), semantic networks (Knowledge Graphs), artificial intelligence and machine learning (ML) tools, and Data Governance techniques.

The storage is a single information space with end-to-end integration built on a microservice architecture, where machine learning is used to build and optimize data management algorithms.

To obtain information about business processes, metadata is actively used – “data about data”, a set of descriptions and properties of data.
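
Purely as an illustration (the field names are invented and do not follow any particular metadata standard), a record in such a metadata layer might describe a dataset's origin, schema, owner, lineage, and quality so that ML-driven tooling can discover and link it:

```python
# A hypothetical metadata ("data about data") record for one dataset in a
# data fabric catalog. The structure and fields are illustrative, not a standard.
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    source_system: str            # where the data physically lives
    schema: dict                  # column name -> type
    owner: str                    # responsible team or domain
    upstream: list = field(default_factory=list)  # lineage: datasets it is built from
    quality_score: float = 0.0    # e.g. filled in by automated profiling


sales_orders = DatasetMetadata(
    name="sales_orders",
    source_system="erp_postgres",
    schema={"order_id": "bigint", "customer_id": "bigint", "amount": "numeric"},
    owner="sales-analytics",
    upstream=["erp.orders_raw"],
    quality_score=0.97,
)
```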

Typically, the Data Fabric approach is implemented by deploying ready-made commercial data platforms, for example Arenadata EDP (Enterprise Data Platform).

Data Fabric makes it easy to access and manage data from a variety of sources, regardless of location or format.

Data Fabric


Decentralization of data in Data Mesh

The Data Mesh organizational framework differs from the architectures listed above: the concept focuses more on data governance and improving data quality than on the technical details of building a warehouse.

The ideas behind the Data Warehouse and Data Lake are based on data centralization: aggregation in a central repository that is, as a rule, owned and controlled by data engineers and the IT department.

This makes it difficult to democratize analytical data – to make it accessible to all business users. In addition, the warehouse team is usually small, so as the volume, number of sources, and types of data grow, processing becomes more complex and time-consuming, which affects quality.

Data Mesh is designed to solve these problems: it is based on the opposite idea of data decentralization – distributing data to, and having it owned by, the company's functional divisions (domains), such as sales, production, and finance.
Each domain has its own IT team that works only with the data it owns.

The data infrastructure is built in such a way that the storage is a self-serve platform where each domain team specialist can easily work with data independently.

The end result of each domain's work is called a data product; it has its own standards, metrics, and a clear business goal, while remaining accessible and understandable to other domains.
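
As an illustration only (the descriptor format and fields are assumptions, not part of any Data Mesh standard), a domain team might publish its data product together with a small machine-readable contract like this:

```python
# A hypothetical descriptor that a domain team could publish alongside its
# data product. The structure is illustrative, not a Data Mesh standard.
from dataclasses import dataclass


@dataclass
class DataProduct:
    name: str
    domain: str                 # owning functional division
    description: str            # clear business goal
    output_port: str            # where consumers read it from
    schema_version: str
    freshness_sla_hours: int    # quality commitment to other domains
    contact: str


orders_product = DataProduct(
    name="orders_daily",
    domain="sales",
    description="Daily aggregated orders for finance and marketing analytics",
    output_port="warehouse.sales.orders_daily",  # assumed location
    schema_version="1.2.0",
    freshness_sla_hours=24,
    contact="sales-data-team@example.com",
)
```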

With this approach, data, teams, and infrastructure can scale freely, while the company maintains a standardized and unified approach to working with data.

Data Mesh

The Data Mesh concept is only suitable for large corporations that work with really large volumes of data, and involves a global change in the organizational structure, changes in business processes, and the emergence of new functional roles.


So, we looked at the main milestones in the development of data warehouses and the evolution of architectures that changed in accordance with the growing requirements of the business environment.

For regular reporting and business analytics purposes, a classic data warehouse (DWH) is ideal.
It helps combine data from information systems, databases, and other sources – CRM, ERP, accounting systems, point-of-sale systems – and bring it into a single format. Thanks to its structured, optimized data, a DWH allows you to quickly access large volumes of information without significantly affecting the performance of information systems and operational databases.

A Data Lake is a method of storing and processing “big data”.
Collecting structured and unstructured data expands the possibilities for advanced analytics and the use of machine learning tools. The data lake becomes a particularly valuable tool in large enterprises with a distributed structure, as it allows you to benefit from non-obvious sources of information.

The concepts of Data Lakehouse, Data Fabric and Data Mesh are the next levels of working with data, which cover not only the technology stack, but also the organizational structure of the company. The implementation of such architectures implies a complete revision of business processes in the company, but with proper implementation it increases the efficiency of all departments.

The final choice of technologies for working with data depends on their current and potential volume, the characteristics of the existing infrastructure, the scale of the team, and the company’s priority business goals.

In any case, the best solution would be to start with a pre-project survey – an assessment of business processes, data sources, and company infrastructure.
