What are they and when should you use each?

There is more and more data. It is important to be able to effectively store and process them to solve complex business problems. One of the first steps towards a successful strategy is choosing technology for storing, searching, analyzing and reporting data. How to choose between Database, Data Warehouse and Data Lake? Let's look at the key differences and when to use each.

What is Data Warehouse, Database and Data Lake

DWH (Data Warehouse) is a repository with all its components; data is collected into it from various sources. DWH also includes BI tools that generate final reports. This is an end-to-end story: from loading data to displaying specific information based on it.

Database is an element of DWH, its purpose is to store data in an orderly manner. They come in different types: SQL and NoSQL, relational and non-relational, but the essence is simple – storage and access to certain information.

There is an approach to combining data from different sources called ETL – Extract, Transformation, Load. In it, data is extracted, transformed and loaded into the target source, the Database. There is another approach – ELT (Extract, Load, Transformation). Unlike ETL, here the data is first extracted, then loaded into the so-called Data Lake, and then converted into the required format.

Data Lake is an approach to architecture in which an intermediate storage is created where raw data from all sources is dumped. Inside this storage they are processed, aggregated, decomposed, composited, normalized and sent to the database. Data is collected from all sources and then stored in the Data Lake. Next, they are converted and sent to the main storage, where all this is stored in a beautiful form, without unnecessary information. At the same time, given that a copy of the data is saved, you can always return to it. If the transformation turns out bad, there is data from the source that you can go to and reassemble again.

Whether to use Data Lake or not depends on Jobs To Be Done. Two main tasks that Data Lake solves: storage of historical data from primary sources, and subsequent analysis and data processing, which is used in Big Data and ML.

If there are no such tasks, then with a high degree of probability Data Lake is not needed. He will only waste extra power; he will have to pay for servers. It makes more sense to simply implement ETL and pour the data directly into the database.

Comparing Data Lake, DWH and DB

Data Lake vs. DWH

Data Lake and DWH are used to aggregate multiple data sources, but they have different purposes and design.

A Data Lake collects data from various sources, whether they are structured or not. This allows the data to be used for artificial intelligence and machine learning.

DWH is a centralized location for analyzing structured data for specific business intelligence purposes.

Data Lakes are schemaless and more flexible for storing relational data from business applications, as well as non-relational logs from servers and, for example, social networks. DWHs, on the other hand, rely on schema and only accept relational data.

It's good when an organization uses both tools. This rarely happens due to the high cost.

DWH vs. Database

DWHs and databases store structured data but were created to accommodate differences in scale and number of sources.

A database is primarily used in a monolithic environment where data is generated by a single application. DWH is designed to support large volumes of data from all departments of an organization.

Both support query languages and reporting capabilities. Mainly used by business analysts.

Features of the problems to be solved using the example of agrotech

Data related to crop yields, weather conditions, pesticides and more require DWH. Thanks to this method of storing data, engineers and business analysts can identify and correct deficiencies in the ecosystem such as soil quality problems and unnecessary use of pesticides.

Data Lake is suitable if complex data analysis is required, since the lake is designed for efficient storage of Big Data. Data Lake is also needed when ML works with this data. For example, use satellite images and weather data to predict farm yields, taking into account such parameters as the condition of crops and irrigation needs. This helps farmers use the knowledge they gain to make decisions.

What problem did we solve with Data Lake?

Smartup engineers participated in the development of Data Lake on the “Shop Advizor” project. The idea was this: information was collected from user phones via bluetooth and wi-fi. Based on it, the user’s profile was determined and targeted advertising was shown to him. All this raw data was put into the Data Lake, processed and sent further in the form of reports. The data was analyzed and the user and his interest were extracted from it. This is how the information entered the database.

Conclusion – how to choose a method for storing data?

There are 3 things to consider:

1. Important factors are the number of data sources and the format in which they come. Data Lake accepts unstructured data. DWH – structured only. Databases work best when there is a single source of structured data.

2. Data Lake provides flexibility for storing raw data. DB and DWH require ETL processes in which raw data is converted into a predefined structure.

3. The method of its placement depends on who will use the data. If the primary use case is business intelligence, DWH is suitable, but with higher data storage costs. If a business analyst is proficient in SQL and needs to create a trend report for only one part of the business, a relational database is the best choice. However, this does not mean that the company uses only one method of storing data; all three can be combined depending on business tasks.

What are they and when should you use each?