Knowledge of Scala is desirable, Spark is mandatory. What beginner and experienced data engineers need to be able to do: research from Yandex Practicum

Data engineering is a field that requires mastering many tools, so one of the first questions is which of them to learn first. Product researcher Rusana Talibova, together with the Yandex Practicum team, studied about a thousand hh.ru vacancies for beginner and experienced data engineers, conducted a series of interviews with hiring managers and specialists of different grades, and compiled a list of the most in-demand skills in the field. Here is a look at where and how to level up in order to enter the profession and grow in it.

Demand for data engineers is growing every year: according to Indeed, the number of vacancies for them has increased by 400% over five years. Companies are collecting more data than ever before, so they need specialists who can manage this ever-growing volume: develop warehouses and data marts, design pipelines, and collect, clean, and structure data.

Interest in the profession has understandably grown in the IT community. Today it is most often developers and data analysts who become data engineers. If you are considering this path, the findings of the study will be useful to you.

The main skills of a beginner data engineer

The main requirement for a junior (a specialist without work experience) is to know Python and SQL. Those who switch into the field from other specialties usually already know both tools or one of them, plus something extra: for example, how to work with dashboards.

Python and SQL are the first things we introduce to students who come to Yandex Practicum to learn the data engineering profession from scratch. For those with more experience, we help systematize and deepen their knowledge.
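To make that bar concrete, here is a minimal sketch of the kind of task a junior is expected to handle: querying a table with SQL from Python and lightly post-processing the result. The orders table and its columns are hypothetical, and the standard-library sqlite3 module stands in for whatever DBMS a company actually runs.

```python
# A minimal junior-level sketch: SQL does the aggregation,
# Python drives the query and formats the output.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Moscow", 120.0), (2, "Kazan", 80.5), (3, "Moscow", 40.0)],
)

# Aggregate revenue per city in SQL, then post-process in Python.
rows = conn.execute(
    "SELECT city, SUM(amount) AS revenue FROM orders GROUP BY city"
).fetchall()
for city, revenue in rows:
    print(f"{city}: {revenue:.2f}")
```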

In third place comes English (found in 24% of vacancies for candidates without experience). In large companies with international teams, written and spoken English is mandatory.

Figure 1. Skills that are most often mentioned in hh.ru vacancies for data engineers without work experience.

Depending on the company and experience, the minimum skill set of an experienced junior or middle data engineer includes the following tools (a minimal Airflow sketch follows the list):

— Apache Hadoop

— Apache Spark

— Apache Airflow

— Greenplum

— HDFS

— ETL/ELT tools in general

— DWH
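For a sense of how these tools fit together, here is a minimal sketch of an Airflow pipeline: three Python tasks chained into a daily extract-transform-load run. It assumes Apache Airflow 2.x; the dag_id, task names, and task bodies are placeholders rather than a production pipeline.

```python
# A minimal Airflow DAG sketch: extract -> transform -> load,
# scheduled daily. Task bodies are placeholders for real
# warehouse logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")

def transform():
    print("clean and structure the data")

def load():
    print("write the result to the warehouse / data mart")

with DAG(
    dag_id="daily_etl_sketch",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3  # enforce strict ordering between the steps
```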

What is added for juniors+ and middles

Growth in grade is often not tied to adding more technologies to the stack. A series of interviews with data engineers and an analysis of how job postings differ by position revealed no clear differences in the tools or range of tasks between juniors, middles, and seniors: they are universal for all data engineers. As a specialist takes on increasingly complex and multi-stage tasks, new tools may appear in their stack, but this depends heavily on the field they work in. Companies whose main focus is not digital products, and which have no particular IT specifics, tend to adopt new tools later.

Figure 2. Skills that are most often mentioned in hh.ru vacancies for data engineers with 1–3 years of experience.

Among the new in-demand skills for juniors+ and middles are the Scala programming language and the Spark framework, both needed for working with big data, of which many companies now have plenty. The ability to work with Scala significantly increases your chances of getting hired: it does not yet appear often in job descriptions, but it is already well known in the industry. Even if the language is not mentioned in the candidate requirements, a question about it can come up during an interview, and knowing it becomes a significant advantage in the employer's eyes. Spark, meanwhile, is often called essential even for entry-level data engineers.
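As an illustration of what vacancies mean by "Spark", here is a minimal PySpark sketch: a hypothetical events dataset is filtered and aggregated per user using the distributed DataFrame API. The dataset and column names are invented; notably, the same pipeline reads almost line for line the same in Scala, which is one reason knowing either language makes the other easier to pick up.

```python
# A minimal PySpark sketch: filter events and count purchases
# per user with the DataFrame API. Assumes pyspark is installed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events_sketch").getOrCreate()

events = spark.createDataFrame(
    [("u1", "click"), ("u1", "purchase"), ("u2", "click")],
    ["user_id", "event_type"],
)

purchases_per_user = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id")
    .agg(F.count("*").alias("purchases"))
)
purchases_per_user.show()

spark.stop()
```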

Key skills of a middle-level data engineer

Two more skills are added to the list for experienced middles (a MongoDB sketch follows the list):

  • Automation and Python
  • MongoDB as a new data source.
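As an illustration, here is a minimal sketch of reading MongoDB as a data source with the pymongo driver. The connection string, database, collection, and fields are hypothetical; a real pipeline would also handle incremental extraction and error handling.

```python
# A minimal sketch of MongoDB as a source: pull only the documents
# and fields the pipeline needs. Assumes the pymongo package and a
# reachable MongoDB instance.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Filter on the server side and project away everything else.
for doc in orders.find({"status": "paid"}, {"_id": 0, "city": 1, "amount": 1}):
    print(doc)

client.close()
```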

Figure 3. Skills that are most often mentioned in hh.ru vacancies for middle data engineers.

Key skills of a senior data engineer

Figure 4. Skills that are most often mentioned in hh.ru vacancies for senior data engineers.

What you need to know to grow in grade

You can learn additional tools on your own or in specialized courses. For example, at Yandex Practicum, the data engineering programs are supplemented with sections on tools that are gaining popularity, and those who have already completed their training get access to the new materials.

In data engineering, as in other IT areas, specialists grow along two tracks:

1. Technical. A specialist learns to perform more complex tasks with the existing tools in a minimal stack (for example, a deeper dive into Python, Hadoop, Spark, and so on). In addition, you can start learning a skill that gives you a market advantage, or a technology currently gaining ground in the field (for example, Apache Kafka and other tools needed for streaming).

2. Soft skills. This track involves developing organizational abilities and, as a separate stream for senior data engineers, management.

Data engineering skill sets include many tools, and it has become too difficult to confidently master every technology you might encounter. It is useful to know what stack is used by the companies where you would like to work, and what tasks you will perform most often.

Today, data engineers regularly deal with streaming and batch data processing. These are two large areas of working with data, and each includes its own set of tools.

Batch data processing is where data engineering began. In this model, changes accumulate at the source and are then sent to the analytical system all at once (in a batch), for example, once an hour or once a day (see the sketch after the list below). Here is what you need to know to handle tasks in this classic approach:

  • SQL and the classic relational DBMSs: PostgreSQL, Oracle, MySQL, and so on. The most popular databases can be tracked in public rankings. In addition, when working with databases, it is important to distinguish transactional workloads (OLTP: intensive point modifications of individual records) from analytical ones (OLAP: processing and analysis of large sets of records).
  • There are also NoSQL and NewSQL, types of DBMS that differ from the classic ones (document-oriented, graph, key-value, and so on).
  • Data storage and processing tools with support for parallel processing. Hadoop and various MPP DBMSs are used here, as well as Spark, an open-source framework for distributed processing of unstructured and semi-structured data.
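Here is a minimal sketch of that classic batch pattern, with the standard-library sqlite3 module standing in for both the transactional (OLTP) source and the analytical (OLAP) store; the tables, watermark column, and schedule are all simplified assumptions.

```python
# A batch-pipeline sketch: changes accumulate in the OLTP source,
# then everything newer than the watermark is shipped to the OLAP
# store in a single batch per run.
import sqlite3

oltp = sqlite3.connect(":memory:")   # stand-in for the source DBMS
olap = sqlite3.connect(":memory:")   # stand-in for the warehouse

oltp.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2024-01-01"),
    (2, 250.0, "2024-01-02"),
])
olap.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")

def run_batch(watermark: str) -> str:
    """Copy all rows changed since the watermark, in one batch."""
    rows = oltp.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (watermark,),
    ).fetchall()
    olap.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return max((r[2] for r in rows), default=watermark)

new_watermark = run_batch("2023-12-31")  # e.g. scheduled hourly or daily
print(olap.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
```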

The second direction is streaming (the real-time streaming pipeline), which lets you process and analyze data streams as they arrive. Stream processing is a current trend in analytics, and its principles and mindset are entirely different. Technologies: NiFi, Kafka Streams, Spark Streaming, Flink.
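For a taste of the streaming mindset, here is a minimal consumer sketch. It uses the kafka-python package (chosen here for brevity; the tools above are the heavier industrial options), a broker assumed to run on localhost:9092, and a hypothetical orders topic: each event is handled the moment it arrives rather than once an hour.

```python
# A minimal streaming sketch with kafka-python: react to each
# event as it arrives instead of accumulating a batch.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                             # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:  # blocks, processing each event in real time
    order = message.value
    print(f"order {order.get('id')} for {order.get('amount')}")
```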

In addition to the above, a data engineer, like anyone who works with data, needs to be confident with the Linux command line.

A data engineer will inevitably deal with cloud architecture. Infrastructure work differs from company to company: some use solutions supplied by cloud vendors, for example dataproc/data vault in Yandex Cloud; others deploy open-source software or software from third-party vendors in the cloud or in their own data centers (the Arenadata distribution, Postgres Pro, and others). Some international companies still use Western software vendors and cloud providers (Databricks, Cloudera, Snowflake; AWS, GCP). Which solution you end up with depends heavily on the company, but the transition between most vendors is fairly straightforward: the underlying principles, and often the technologies themselves (the same Postgres behind Greenplum, Vertica, and Redshift/Aurora, the same Hadoop), are shared by many of them.

Soft skills are needed not only by the team lead and manager

Developing soft skills is not just a way to grow into a team lead, manager, or process organizer. Soft skills are also needed for growth when a specialist chooses to go deeper into the technical side.

Even a junior's skill set includes a quality like the ability to communicate. After all, without the skill of requesting and receiving feedback, a specialist cannot settle into the profession and grow in grade. Some companies prefer to organize work so that a specialist does not simply close tasks in Jira but solves business problems. They must understand their role in the company and know where the team is heading and what product it is building. The ability to do your job with the company's business processes in mind can be a critical requirement for an employee of any grade, from junior to team lead.
