Are You in a Warehouse or a Lake? What Data Specialists Do and How to Become a Data Engineer

This article is based on a webinar about the data engineer profession. Slurm hosted the webinar for the data professionals community “Where’s the date, Zin?”.

At the meeting, experts described the DWH and Data Lake approaches, and then talked about professional roles directly related to processes in such storages: Data engineer, ETL engineer, DWH engineer, data scientist, ML engineer, data analyst.

While the webinar speakers shared their opinions and cases, the presenter drew a diagram. It lets you quickly see what the above-mentioned specialists do, what they do not do, and what skills they have.

In this article, we first describe the patterns for building distributed storages in order to understand what processes the data goes through. Then we talk about the tasks of data specialists and the skills required for each position.

Hidden in the picture are 5 tools for working with data. Did you recognize them?

DWH vs Data Lake

A Data Warehouse (DWH) is a repository that collects data from different source systems. With a DWH, you can analyze historical and current data. The idea of such a repository appeared when it became clear that running analytical queries against a production database, for example the backend database of a website, is slow, expensive, and risky. For analysis, data must be moved from the production environment to the analytical one.

This problem used to be solved by exporting data to Excel so that specially trained people could work with the files. Now, with big data, changes in the production environment can be delivered to the analytical one quickly. DWH is one of the patterns used for this. It is an OLAP (Online Analytical Processing) storage, as opposed to OLTP (Online Transaction Processing), that is, one or more additional databases. To work with a DWH, you need to set up the process of copying data from the production environment to the analytical one.

The DWH pattern assumes that the data is structured: it is transformed into a predefined structure before it is loaded. This data transfer process is called ETL (extract > transform > load).
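To make the order of the ETL steps concrete, here is a minimal sketch in Python. It is only an illustration: the `orders` table, the sample rows, and the use of in-memory SQLite as a stand-in for the DWH are assumptions, not something described in the webinar.

```python
import sqlite3

# Hypothetical rows as they might arrive from a production system (all strings).
SOURCE_ROWS = [
    {"id": "1", "email": " Alice@Example.com ", "amount": "10.50"},
    {"id": "2", "email": "bob@example.com", "amount": "7"},
]


def extract():
    """Extract: read raw rows from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: cast types and normalize values BEFORE loading."""
    return [
        (int(row["id"]), row["email"].strip().lower(), float(row["amount"]))
        for row in rows
    ]


def load(conn, rows):
    """Load: write already-structured rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite plays the role of the analytical storage in this sketch.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
result = conn.execute("SELECT id, email, amount FROM orders ORDER BY id").fetchall()
```

The point of the sketch is the order of the calls: the data is already clean and typed by the time it reaches `load`.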

DWH has one limitation. Since it is a structured data store, the data needs to be prepared before being loaded into the DWH. Preparation can take quite a long time, so new portions of data are often loaded periodically. Today it is sometimes possible to process and load data into the DWH directly (say, from Kafka straight into ClickHouse), but maintaining such a system in production can be quite difficult.
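Periodic loading is often built around a "watermark": each run picks up only the rows changed since the previous run. The sketch below shows one possible way to do this. The `updated_at` field, the sample rows, and the `next_batch` helper are hypothetical and serve only as an illustration.

```python
def next_batch(source_rows, watermark):
    """Return rows changed after the watermark, plus the new watermark.

    ISO 8601 timestamps in the same format compare correctly as strings.
    """
    batch = [row for row in source_rows if row["updated_at"] > watermark]
    new_watermark = max((row["updated_at"] for row in batch), default=watermark)
    return batch, new_watermark


# Hypothetical source table with a last-modified timestamp on every row.
SOURCE_ROWS = [
    {"id": 1, "updated_at": "2022-08-01T10:00:00"},
    {"id": 2, "updated_at": "2022-08-02T09:30:00"},
    {"id": 3, "updated_at": "2022-08-03T18:45:00"},
]

# First scheduled run: only rows newer than the stored watermark are taken.
first_batch, watermark = next_batch(SOURCE_ROWS, "2022-08-01T23:59:59")

# Second run with no new changes: the batch is empty, the watermark stays put.
second_batch, watermark = next_batch(SOURCE_ROWS, watermark)
```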

DWH tools:

Data Lake

It is not always possible to structure data before sending it to the storage, since not all sources can guarantee that data arrives according to a fixed schema. For example, JSON documents collected over REST and passed through a data bus are semi-structured data, especially when the documents have different levels of nesting and different sets of attributes.
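The following sketch shows why such documents are hard to fit into a fixed schema. Two hypothetical events have different nesting and different attributes; flattening them gives a different set of columns for each document. The event contents and the `flatten` helper are assumptions made for the example.

```python
import json


def flatten(doc, prefix=""):
    """Flatten nested dicts into a single level with dotted keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical events: same source, different nesting and attribute sets.
RAW_EVENTS = [
    '{"user": {"id": 1, "geo": {"city": "Berlin"}}, "event": "click"}',
    '{"user": {"id": 2}, "event": "view", "duration_ms": 1200}',
]

flat_events = [flatten(json.loads(line)) for line in RAW_EVENTS]

# The union of all keys is the schema a structured store would have to support.
all_columns = sorted({key for event in flat_events for key in event})
```

Here `all_columns` contains four columns, but neither document fills all of them, which is exactly the situation where loading the data raw and structuring it later becomes attractive.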

A Data Lake is a storage for semi-structured data obtained from various sources. Data is loaded into the Data Lake in “raw” form, and the transformation happens after it lands in the storage. This is the ELT process (extract > load > transform). With the Data Lake pattern, you can start writing data right away and transform it for further analysis later.

Such a storage can contain both unstructured (or semi-structured) data, such as pictures and videos, and more structured data, such as Parquet files. In addition, a Data Lake can serve as long-term storage for historical data, with the ability to “raise archives” for a certain past period if necessary. Storing such “archives” entirely in a DWH would be suboptimal and expensive.
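For contrast with ETL, here is a minimal ELT sketch. The payloads are stored exactly as received, and parsing happens only when a specific question is asked. The `raw_events` table, the sample events, and in-memory SQLite as a stand-in for the lake are assumptions made for the illustration.

```python
import json
import sqlite3

# Hypothetical raw events; note that their shapes differ.
RAW_EVENTS = [
    '{"user_id": 1, "event": "click", "meta": {"page": "/home"}}',
    '{"user_id": 2, "event": "view"}',
    '{"user_id": 1, "event": "click"}',
]

conn = sqlite3.connect(":memory:")

# Extract + Load: store payloads untouched, with no schema imposed on them.
conn.execute("CREATE TABLE raw_events (payload TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?)", [(e,) for e in RAW_EVENTS])
conn.commit()


def clicks_per_user(conn):
    """Transform: parse the raw payloads only when the analysis needs them."""
    counts = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events"):
        event = json.loads(payload)
        if event.get("event") == "click":
            counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1
    return counts


clicks = clicks_per_user(conn)
```

Because the raw payloads are kept, a different transformation can be written later without going back to the source systems.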

Data Lake tools:

Data Warehouse (DWH) | Data Lake
Structured data | Semi-structured data
ETL (extract > transform > load) | ELT (extract > load > transform)

The difference between DWH and Data Lake

There is also a modern combined version of the data platform: Data Lakehouse. In this case, the data lives in semi-structured storage, on top of which a vendor solution is added that provides more structure, more metadata, or processing closer to real time.

Tools:

Next, we consider the professional roles associated with processes in distributed data storages. We list the tasks and required competencies for an ETL engineer, DWH engineer, data scientist, ML engineer, data analyst, and data engineer, and describe their career tracks.

ETL engineer

The task of an ETL engineer is to connect different data sources (APIs, queue servers, databases, files of different types) to the storage and transform the received data. The transformation includes structuring the data for placement in a common repository, removing duplicates, and enriching the data from additional sources.
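Two of the transformations mentioned above, removing duplicates and enriching from an additional source, can be sketched in a few lines of Python. The order and customer records and both helper functions are hypothetical and only show the idea.

```python
def deduplicate(records, key):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def enrich(records, lookup, key):
    """Attach extra attributes from a lookup table; unknown keys stay as-is."""
    return [{**record, **lookup.get(record[key], {})} for record in records]


# Hypothetical input: the same order arrived twice from the source.
orders = [
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 1, "customer_id": "c1"},
    {"order_id": 2, "customer_id": "c2"},
]

# Hypothetical additional source with customer attributes.
customers = {"c1": {"country": "DE"}}

clean_orders = enrich(deduplicate(orders, "order_id"), customers, "customer_id")
```

In a real pipeline the same steps would usually be expressed in the ETL tool or in SQL; the sketch only shows what "deduplicate" and "enrich" mean for the data.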

What an ETL engineer needs to know and be able to do:

  • the processes and features of the tools used;

  • regular expressions;

  • SQL;

  • visual programming (NiFi);

  • Git;

  • programming (Python/Java/Groovy).

People most often come to this position from system administration, back-end development, and analytics (business intelligence, data analytics).

What an ETL engineer does not do:

  • does not design DWH (this is done by a DWH engineer);

  • does not build forecasts and does not do Data Science (tasks of a data scientist);

  • does not develop stateful and streaming analytics (data engineer tasks);

  • does not deploy platforms (this is done by a DataOps engineer, a Platform engineer);

  • does not write a REST API to collect data from REST clients (backend developer tasks).

DWH engineer

The task of a DWH engineer is to design a robust repository structure that can grow with the business. The DWH engineer follows the Single source of truth (SSOT) principle to keep the data up to date. They also configure the storages, including indexes, and set up DWH monitoring and logging, either independently or with the help of DevOps engineers. In addition, they generate business keys for different objects and build Customer360, a presentation of all customer data on one screen.
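As a rough illustration of how a business key ties records together for a Customer360 view, consider the sketch below. Using a normalized email as the business key is an assumption made for the example, as are the `crm` and `billing` systems and their fields; in a real warehouse the choice of key is a design decision.

```python
def business_key(record):
    """Derive a business key from a normalized email (assumption of this sketch)."""
    return record["email"].strip().lower()


def customer_360(sources):
    """Merge records from several systems into one profile per business key."""
    view = {}
    for system_name, records in sources.items():
        for record in records:
            key = business_key(record)
            profile = view.setdefault(key, {"business_key": key})
            for field, value in record.items():
                if field != "email":
                    # Prefix each attribute with its source system to avoid clashes.
                    profile[f"{system_name}.{field}"] = value
    return view


# Hypothetical records about the same customer in two different systems.
sources = {
    "crm": [{"email": "Alice@Example.com ", "name": "Alice"}],
    "billing": [{"email": "alice@example.com", "plan": "pro"}],
}

profiles = customer_360(sources)
```

Both records end up in a single profile because they resolve to the same business key, even though the email is written differently in each system.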

What a DWH engineer needs to know and be able to do:

  • the features of the tools;

  • SQL (including the use of dbt);

  • tuning *nix systems;

  • understanding of data architecture;

  • Git;

  • Spark.

People most often come to this profession from database administration (DBA), database development, and data analytics.

What a DWH engineer does not do:

Data scientist

A data scientist trains models. Their task is to build a model (a statistical representation) from the available data that helps the business make decisions. The model can predict customer churn or staff workload, segment the audience, or power an advanced recommendation system for users. And this is only a small part of the possible tasks.

The data scientist also finds non-obvious patterns. For example, a user can be classified based on their digital footprint. In addition, data scientists use Computer Vision. This technology opens up many opportunities: for example, you can evaluate the behavior of customers in a store or conduct primary medical diagnostics using MRI images.

What a data scientist needs to know and be able to do:

  • have a strong mathematical background;

  • machine learning;

  • Deep Learning and neural networks;

  • generative models;

  • NLP (Natural Language Processing – working with texts, natural language);

  • SQL;

  • Git;

  • programming language (Python, R, MATLAB).

Most often, data scientists come to the profession from mathematics, data engineering, or data analytics.

What a data scientist does not do:

  • does not do anything unrelated to their own tasks;

  • does not write product code.

ML engineer

An ML engineer helps a data scientist prepare data for training models and supports the model life cycle (maintains models in production). The ML engineer also provides tools that help a team of data scientists work effectively.

What an ML engineer needs to know and be able to do:

  • programming (Python/Scala/Java; C++/Go; SQL; Git);

  • ML libraries;

  • web and backend skills (must be ready to collect data);

  • streaming;

  • Spark/Databricks;

  • distributed storage (Kafka);

  • Docker;

  • K8s.

Most often, backend developers, DWH and DB developers, DWH engineers, and data scientists come to the role of an ML engineer.

Data analyst

Analysts know the subject area and product well, communicate with business customers. They visualize data, look for insights, and generate ideas.

What a data analyst needs to know and be able to do:

  • knowledge of the subject area and product;

  • BI tools (Power BI, Tableau, Superset);

  • SQL;

  • Excel;

  • soft skills.

Most people move into data analysis from business analytics, data science, back-end development, system administration, and project management.

Data engineer

We conclude the review of professions with a description of the data engineer. We left it for the end because the data engineer is currently a position with vaguely defined responsibilities, which may include the functions of all the roles listed above. The list of tasks for a data engineer can vary greatly depending on current business priorities.

What can a Data Engineer do:

  • automate the work of ETL and DWH engineers, write tools that generate code;

  • write pipelines that extract data from production sources (one database or several) and transfer it to the DWH;

  • implement Data Lake;

  • set up streaming, stateful processing.
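The last item, stateful stream processing, means that the result for each event depends on what has been seen before. Below is a minimal sketch using a Python generator; the `user_id` and `amount` fields and the `running_totals` function are hypothetical, and a real system would use a streaming platform rather than an in-memory list.

```python
from collections import defaultdict


def running_totals(events):
    """Yield (user_id, total_so_far) for each event, keeping state per user."""
    totals = defaultdict(float)  # the "state" that survives between events
    for event in events:
        totals[event["user_id"]] += event["amount"]
        yield event["user_id"], totals[event["user_id"]]


# Hypothetical stream of purchase events arriving one by one.
stream = iter([
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u2", "amount": 5.0},
    {"user_id": "u1", "amount": 2.5},
])

updates = list(running_totals(stream))
```

The third update for `u1` reflects both of that user's events, which is what distinguishes stateful processing from handling each event in isolation.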

What a Data Engineer needs to know and be able to do:

  • programming (Python/Scala/Java; SQL);

  • streaming platforms and distributed systems, storages;

  • Spark/Databricks/Hadoop;

  • be able to do what is described in the roles above.

People come to the Data Engineer profession from back-end development, DWH development, database development, database administration (DBA), and DevOps.

Summary

By itself, accumulated data is useless. The data engineer makes it available for analysis in the right form, in the right place, and with the right access rights for other users, so that data scientists and data analysts can gain insights and create new business opportunities.

The functions of a Data engineer may include a different set of tasks depending on the processes in the company. We described some of these tasks in the article as part of other roles.

A few words about career development: a Data Engineer can develop in the direction of management and at the next stage lead a team or several, or can delve into technology and move to the role of a Data Architect.

Where’s the date, Zin?

Slurm created a secret-non-secret community of data specialists, “Where’s the date, Zin?”. Every week we meet online to discuss data engineering and working with data, ask questions to data masters, and chat with each other. And also to spend time in a useful and fun way.

Today, August 24, at 19:00, the second meeting of our community will be held on the topic “Data Modeling in DataLake: from the Tower of Babel to Data McDonald’s”.

Let’s discuss:

  • What is data modeling?

  • Typical Data Lakehouse

  • Logical modeling: EDW, Dimensional Modeling, Data Vault

