Roadmap for Data Engineer in 2023

How I would study Data Engineering in 2023 (if I could start over)

  Big Data Landscape - https://mattturck.com/data2021/

Starting a career in Data Engineering can be overwhelming due to the sheer number of tools and technologies available on the market.

Questions often come up: “Should I learn Databricks or Snowflake first? Should I focus on Airflow or Hadoop?”

In this article, I’ll walk you through everything from basic to advanced: all the resources and skills you’ll need to become a Data Engineering professional.

I divided the steps into 3 categories:

  1. For people who are completely new to the field and want to retrain in Data Engineering from other areas.

  2. For people who are familiar with some of the basics and want to know how to go further.

  3. For people who have some experience and want to develop in their career.

Step 1: Exploring the Unknown
Do you want to retrain in Data Engineering but feel overwhelmed by the number of tools and technologies available on the market? You are not alone. Many people find themselves in the same situation, whether they come from non-tech jobs, are students or newcomers, or already work in a different tech area and want to change careers.

If you fall into one of these categories, the first thing you need to do is master the basics of computer science.

If you are completely new to this field, you need to understand the basic concepts and terms used in computer science before you start learning Data Engineering.

An excellent resource for this is the lecture series available on YouTube provided by Harvard CS50:

https://youtu.be/4MIBGO9YnCg

By studying this video series, you will gain a basic understanding of computer science without needing a degree or spending months learning the fundamentals.

Once you have mastered the basics of computer science, you can move on to the next step: learning the skills required for Data Engineering.

There are three fundamental skills required for Data Engineering:

Programming Languages: As a data engineer, you will write a lot of code for transformations, deployment scripts, validation, and testing, so you need to be proficient in at least one programming language. Three popular options are Java, Scala, and Python. If you are a beginner, Python is a great option, as it is easy to learn and understand.
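
To give you a feel for the kind of code this means in practice, here is a minimal Python sketch of a transformation and validation script. The file names and column names are made up for the example; real pipelines would read from and write to your own sources.

```python
# A minimal sketch of a transformation/validation script a data engineer
# might write. File names and column names are hypothetical.
import csv

def transform(row: dict) -> dict:
    """Normalize one raw record: trim whitespace, cast the amount to float."""
    return {
        "order_id": row["order_id"].strip(),
        "amount": float(row["amount"]),
        "country": row["country"].strip().upper(),
    }

def is_valid(row: dict) -> bool:
    """Basic validation: every order needs an id and a non-negative amount."""
    return bool(row["order_id"]) and row["amount"] >= 0

with open("orders_raw.csv", newline="") as src, \
     open("orders_clean.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["order_id", "amount", "country"])
    writer.writeheader()
    for raw in reader:
        row = transform(raw)
        if is_valid(row):
            writer.writerow(row)
```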

SQL: Structured Query Language is king in the data industry. Whether you work as a data analyst, data engineer, or data scientist, you are bound to use SQL frequently. SQL is a way to communicate with a relational database, and it’s important to understand how to use it to select, insert, update, and delete data.
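
If SQL is new to you, a quick way to get hands-on without installing a database server is Python's built-in sqlite3 module. The sketch below shows the four core operations against an in-memory database; the "users" table and its columns are invented for the example.

```python
# Select, insert, update, and delete against an in-memory SQLite database.
# No setup needed; the table and data are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add new rows
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Alice", "Berlin"))
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Bob", "Paris"))

# SELECT: read rows back
for row in conn.execute("SELECT id, name, city FROM users ORDER BY id"):
    print(row)

# UPDATE: change existing rows
conn.execute("UPDATE users SET city = ? WHERE name = ?", ("London", "Bob"))

# DELETE: remove rows
conn.execute("DELETE FROM users WHERE name = ?", ("Alice",))

conn.commit()
conn.close()
```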

Linux Commands: Most data engineering tasks are performed on remote machines or cloud platforms, and these machines tend to run Linux. It is important to know how to work with these machines and to understand basic Linux commands.

Step 2: Building a Solid Foundation
At this stage, your goal should be to learn the minimum skill set required for Data Engineering and how to start a career as a data engineer.

You don’t have to spend time learning every skill or tool available on the market; you just need to focus on what you need to perform the basic functions of your profession.

In this step, we will focus on building a solid foundation for Data Engineering.

The first fundamental skill to focus on is understanding how data warehouses work.

This skill has two parts:

  • Learning the basics of how data warehouses work

  • Learning tools like Snowflake or BigQuery.

Data warehousing fundamentals typically include understanding OLTP systems, dimension tables, extracting, transforming, and loading (ETL) data, and modeling data in terms of facts and dimensions.
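
To make facts and dimensions concrete, here is a toy star schema sketched with Python's built-in sqlite3 module standing in for a real warehouse. The table and column names are invented for the example.

```python
# A toy star schema: a fact table of sales referencing a customer dimension,
# plus a typical analytical query that joins and aggregates them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name         TEXT,
    country      TEXT
);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    amount       REAL,
    sale_date    TEXT
);
INSERT INTO dim_customer VALUES (1, 'Alice', 'DE'), (2, 'Bob', 'FR');
INSERT INTO fact_sales VALUES
    (10, 1, 120.0, '2023-01-05'),
    (11, 1,  80.0, '2023-01-06'),
    (12, 2, 200.0, '2023-01-06');
""")

# Join the fact table to a dimension and aggregate: revenue per country.
query = """
SELECT c.country, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_customer c ON c.customer_key = f.customer_key
GROUP BY c.country
"""
for country, revenue in conn.execute(query):
    print(country, revenue)
```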

If you prefer to learn from books, you can read “The Data Warehouse Toolkit” by Ralph Kimball.

This book is one of the best books written about data warehouses.

After learning the basics of the data warehouse, you can apply what you’ve learned to a specific tool.

There are many data warehouse platforms available on the market, such as Snowflake, BigQuery, and Redshift.

I recommend exploring Snowflake, as demand for it is growing day by day.

In addition to understanding data storage, you also need to understand data processing frameworks.

There are two main data processing frameworks:
– Batch processing: processing data in batches on a schedule, for example processing the previous day’s or month’s data once or twice a day.
– Real-time processing: processing data as it arrives, in real time.

For batch processing, most companies use Apache Spark, an open-source data processing framework.

You can start by learning the basics of Apache Spark and then get familiar with a tool that runs on Apache Spark like Databricks, AWS EMR, GCP Data Proc, or any other tool you find on the market.

My advice is to practice with Apache Spark and Databricks, using PySpark (Python) as the language.
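
As a small taste of what a Spark batch job looks like, here is a minimal PySpark sketch that reads raw CSV files, aggregates them, and writes Parquet. The paths and column names are hypothetical; the same code runs locally or on Databricks, EMR, or Dataproc with little change.

```python
# A minimal PySpark batch job: read raw CSV, aggregate, write Parquet.
# Input/output paths and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_orders_batch").getOrCreate()

orders = (
    spark.read
    .option("header", True)
    .csv("/data/raw/orders/2023-01-01/")   # hypothetical input path
)

daily_revenue = (
    orders
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("country")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.write.mode("overwrite").parquet("/data/curated/daily_revenue/")

spark.stop()
```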

For real-time data processing, we have frameworks and tools such as Apache Kafka, Apache Flink and Apache Storm. You can choose one and study it.
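
If you go the Kafka route, a first exercise is simply producing and consuming messages. Below is a small sketch using the third-party kafka-python client; the broker address ("localhost:9092") and topic name ("events") are assumptions for a local test setup, and Flink or Spark Structured Streaming would be used for heavier stream processing.

```python
# A tiny Kafka sketch with the kafka-python client (one option among several).
# Broker address and topic name are assumptions for a local setup.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: serialize events as JSON and send them to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"order_id": 42, "amount": 99.9})
producer.flush()

# Consumer side: read events as they arrive and process them one by one.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # real code would transform/load the event here
```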

We learn by breaking big problems down into smaller pieces.

First, focus on learning the fundamentals, and then learn one of the highly sought-after tools on the market so you can apply that fundamental knowledge.

The third skill you need to master as a data engineer is learning about cloud platforms. There are three main choices available here – Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP).

Top cloud platforms

I started my career with AWS, but you can choose any cloud platform because once you learn one, it will be easier to master others. The fundamental concepts of cloud platforms are similar, with slight differences in user interface, cost, and other factors.
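
To see what "working with a cloud platform" looks like from code, here is a small sketch using boto3, the AWS SDK for Python, to upload a file to S3. The bucket name and paths are placeholders, and credentials are assumed to be configured via the AWS CLI or environment variables.

```python
# Upload a local extract to Amazon S3 so downstream jobs can read it.
# Bucket name and object keys are placeholders for the example.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    Filename="orders_clean.csv",        # local file from an earlier step
    Bucket="my-data-lake-bucket",       # placeholder bucket name
    Key="raw/orders/2023-01-01/orders_clean.csv",
)

# List what landed under that prefix to confirm the upload.
response = s3.list_objects_v2(Bucket="my-data-lake-bucket", Prefix="raw/orders/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```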

In Data Engineering, you need to create data pipelines to process your data. Data pipelines, also known as ETL pipelines, extract data from source systems (often relational databases), apply transformations and business logic, and load the data into a target location. To manage these operations, you need to learn a workflow management tool.

One popular choice is Apache Airflow.

Airflow is an open source workflow management tool that allows you to create, schedule and track data pipelines. It is widely used in industry and has a large user community. By learning Airflow, you will be able to create data pipelines and automate the ETL process, which will make your job as a data engineer much easier.
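
Below is a minimal sketch of an Airflow DAG with three tasks chained together to mirror the extract, transform, and load steps. The DAG and task names are made up, and the callables only print; in a real pipeline they would call your own ETL code.

```python
# A minimal Airflow DAG: a daily ETL with three chained Python tasks.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def transform():
    print("apply business logic")

def load():
    print("write results to the warehouse")

with DAG(
    dag_id="daily_orders_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Define the order of execution: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```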

Step 3: Modern Data Stack and Advanced Skills
As a data engineer, there are many different tools and approaches available in the market.

It is important to stay informed and keep learning. In addition, you also need to learn how to design an entire data infrastructure, manage and scale systems, and master advanced skills.

In this step, we will focus on learning the advanced skills required for Data Engineering.

The first thing I recommend is to research the Modern Data Stack (MDS).

Modern stack for work

This is a set of tools that you can read up on to understand each one’s main purpose.

One of the tools I highly recommend looking into is dbt (Data Build Tool), as many companies use it and it is becoming more and more popular in the market.

However, it’s important not to get too attached to any particular tool; just understand the basic purpose of each one.

Another important aspect is an understanding of security, networking, deployment, and other related topics.

You can also learn about Docker or Kubernetes, which are useful when deploying data pipelines in a production environment.

I also recommend reading this book:

In addition, reading customer success stories on platforms such as AWS and GCP can help you better understand how to use these tools in real world scenarios.
