The most sought-after skills in the data engineer profession

According to 2019 statistics, data engineering is currently the profession with the fastest-growing demand. A data engineer plays a critical role in an organization: creating and maintaining the pipelines and databases used to process, transform and store data. Which skills do people in this profession need most? Is the list different from what is required of data scientists? This article answers both questions.

I analyzed job listings for data engineer positions as they stood in January 2020 to understand which technology skills are most in demand. Then I compared the results with the statistics for data scientist listings, and some interesting differences emerged.

Without a long introduction, here are the ten technologies most often mentioned in job listings:

Technology mentions in data engineer job listings, 2020

Let's take a closer look.

Responsibilities of a data engineer

Today, the work that data engineers perform is of great importance to organizations: they are responsible for storing information and getting it into a form that other employees can work with. Data engineers build pipelines that ingest data, as streams or in batches, from multiple sources. The pipelines then perform extract, transform and load operations (in other words, ETL processes), making the data more suitable for further use. After that, the data is handed to analysts and data scientists for deeper processing. Finally, the data ends its journey in dashboards, reports, and machine learning models.
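
To make the extract, transform and load steps concrete, here is a minimal sketch in Python using only the standard library. The input data, the `users` table and the function names are hypothetical and chosen purely for illustration; a real pipeline would read from files, APIs or streams and write to a production database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input, made up for this example.
RAW_CSV = """user_id,signup_date,country
1,2020-01-03,us
2,2020-01-04,DE
3,,us
"""

def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without a signup date and upper-case country codes."""
    return [
        (int(r["user_id"]), r["signup_date"], r["country"].upper())
        for r in rows
        if r["signup_date"]
    ]

def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
load(transform(extract(RAW_CSV)), conn)
rows_loaded = conn.execute(
    "SELECT user_id, country FROM users ORDER BY user_id"
).fetchall()
print(rows_loaded)
```

Keeping each stage as a separate function makes it easy to test the transformation logic on its own, without touching the source or the destination.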

I looked for information that would show which technologies are currently most in demand in data engineering work.

Methods

I collected information from three job search sites (SimplyHired, Indeed and Monster) and looked at which keywords appeared alongside "data engineer" in the texts of listings for US residents. For this task, I used two Python libraries: Requests and Beautiful Soup. The keywords included both those from my earlier analysis of data scientist listings and those I selected manually while reading job postings for data engineers. LinkedIn was not among the sources, since I was banned there after my last attempt to collect data.

For each keyword, I calculated the percentage of listings on each site that mentioned it, and then averaged the three per-site percentages.
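
The calculation described above can be sketched roughly as follows. The listing texts here are made up for illustration, and the whole-word, case-insensitive matching rule is my assumption; the article does not say exactly how keywords were matched. Scraping itself is left out so the example runs without network access.

```python
import re

# Hypothetical sample data: a few invented listing texts per site.
listings_by_site = {
    "SimplyHired": ["Python and SQL required", "Experience with Spark and AWS", "SQL, Java"],
    "Indeed": ["SQL and Python", "Kafka and Python"],
    "Monster": ["Python, Spark", "Hadoop", "SQL", "Python and SQL"],
}
keywords = ["Python", "SQL", "Spark"]

def percent_mentioning(keyword, texts):
    """Percentage of texts that mention the keyword as a whole word."""
    # Lookarounds keep "SQL" from matching inside "NoSQL" (assumed matching rule).
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
    hits = sum(1 for text in texts if pattern.search(text))
    return 100 * hits / len(texts)

def average_across_sites(keyword, listings):
    """Unweighted mean of the per-site percentages."""
    per_site = [percent_mentioning(keyword, texts) for texts in listings.values()]
    return sum(per_site) / len(per_site)

for kw in keywords:
    print(f"{kw}: {average_across_sites(kw, listings_by_site):.1f}%")
```

Note that averaging the per-site percentages gives each site equal weight regardless of how many listings it has, which matches the procedure described in the text.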

Results

Below are the thirty data engineering technical terms with the highest scores across all three job sites.

And here are the same numbers in table form:

Let’s go in order.

Results Overview

Both SQL and Python appear in more than two-thirds of the listings reviewed. These are the two technologies that make sense to learn first. Python is a very popular programming language used for working with data, building websites and writing scripts. SQL stands for Structured Query Language; it is a standard implemented by a family of languages and is used to retrieve data from relational databases. It has been around for a long time and has proven to be highly stable.
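
As a small illustration of how the two fit together, here is a sketch that runs a SQL query from Python using the standard-library `sqlite3` module. The `listings` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE listings (site TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?)",
    [
        ("Indeed", "Data Engineer"),
        ("Indeed", "Data Scientist"),
        ("Monster", "Data Engineer"),
    ],
)

# Count listings per site for one job title; "?" is a bound parameter.
query = """
    SELECT site, COUNT(*) AS n
    FROM listings
    WHERE title = ?
    GROUP BY site
    ORDER BY site
"""
counts = conn.execute(query, ("Data Engineer",)).fetchall()
print(counts)
```

Passing the title as a bound parameter rather than formatting it into the query string avoids SQL injection and quoting problems.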

Spark is mentioned in about half of the listings. Apache Spark is "a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing." It is especially popular with those who work with large datasets.

AWS appears in approximately 45% of job postings. This is Amazon's cloud computing platform; it holds the largest market share among all cloud platforms.

Next come Java and Hadoop, at a little over 40% each. Java is a widely used, battle-tested language that ranked tenth among the most dreaded languages in the 2019 Stack Overflow Developer Survey. In contrast, Python was the second most loved language. Oracle controls the Java language, and everything you need to know about it can be understood from this screenshot of the official page from January 2020.

Like a time machine

Apache Hadoop uses the MapReduce programming model on clusters of servers to process big data. This model is now increasingly being abandoned.
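
For readers unfamiliar with MapReduce, here is a toy single-process sketch of the idea in plain Python: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step aggregates each group. This is not Hadoop's API; the function names and sample documents are made up, and a real cluster would distribute each phase across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

documents = ["big data big clusters", "data pipelines"]  # hypothetical input
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```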

Next we see Hive, Scala, Kafka and NoSQL: each of these technologies is mentioned in about a quarter of the listings. Apache Hive is data warehouse software that "facilitates reading, writing, and managing large datasets residing in distributed storage using SQL." Scala is a programming language widely used in big data work. In particular, Spark is written in Scala. In the ranking of dreaded languages mentioned earlier, Scala is in eleventh place. Apache Kafka is a distributed platform for processing streams of messages. It is very popular as a way to stream data.

NoSQL databases position themselves in contrast to SQL. They differ in that they are non-relational, unstructured, and horizontally scalable. NoSQL has gained some popularity, but the hype around this approach, up to predictions that it would replace SQL as the dominant storage paradigm, seems to be over.

Comparison with terms in data scientist job listings

Here are the thirty technology terms most commonly used by employers in data science listings. I compiled this list the same way as described above for data engineering.

Technology mentions in data scientist job listings, 2020

In total, there were 28% more data scientist listings than data engineer listings (12,013 versus 9,396). Let's see which technologies are less common in listings for data scientists than for data engineers.

More popular in data engineering

The chart below shows keywords whose average difference is greater than 10 or less than -10 percentage points.

The biggest differences in the frequency of keywords between data engineer and data scientist

AWS shows the largest increase: it appears 25 percentage points more often in data engineering listings than in data science listings (approximately 45% versus 20% of all listings). That is a noticeable difference!
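
Because the differences here are in percentage points rather than relative percent, a short worked example may help. The sketch below uses only the approximate shares quoted in this article (the R figures appear further down); the variable names and the dictionary layout are my own.

```python
# Approximate shares, in % of listings, as quoted in this article.
shares = {
    "AWS": {"data_engineer": 45, "data_scientist": 20},
    "R": {"data_engineer": 17, "data_scientist": 56},
}

THRESHOLD = 10  # percentage points, the cutoff used for the chart

def point_difference(share):
    """Difference in percentage points: data engineer minus data scientist."""
    return share["data_engineer"] - share["data_scientist"]

# Keep only keywords whose absolute difference exceeds the threshold.
notable = {
    kw: point_difference(s)
    for kw, s in shares.items()
    if abs(point_difference(s)) > THRESHOLD
}

# The relative change is a different quantity from the point difference.
aws_ratio = shares["AWS"]["data_engineer"] / shares["AWS"]["data_scientist"]

print(notable)
print(aws_ratio)
```

So a gap of 25 percentage points for AWS corresponds to it being mentioned a little more than twice as often in data engineer listings.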

Here is the same data in a slightly different presentation: in this chart, the results for the same keyword in data engineer and data scientist listings are shown side by side.

The biggest differences in the frequency of keywords between data engineer and data scientist

The next largest jump is for Spark: a data engineer often has to work with big data. Kafka also rose by 20 percentage points, which is almost four times its share in data scientist listings. Moving data is one of the key responsibilities of a data engineer. Finally, Java, NoSQL, Redshift, SQL and Hadoop were each mentioned about 15 percentage points more often in data engineering listings.

Less popular in data engineering

Now let's see which technologies are less popular in data engineer listings.

The sharpest drop compared with data science is for R: there it appeared in about 56% of listings, here in only 17%. Impressive. R is a programming language popular with scientists and statisticians, and it ranked eighth among the most dreaded languages.

SAS also appears significantly less often in data engineer listings: the difference is 14 percentage points. SAS is a proprietary language designed for working with statistics and data. An interesting point: judging by the results of my research on data scientist listings, it has lost a lot of ground lately, more than any other technology.

In demand in both data engineering and data science

It should be noted that eight of the top ten positions are the same in both sets. SQL, Python, Spark, AWS, Java, Hadoop, Hive, and Scala are in the top ten for both data engineering and data science. In the chart below, you can see the fifteen technologies most popular with employers of data engineers, with their share in data scientist listings shown next to them.

Recommendations

If you want to get into data engineering, I would advise you to master the following technologies, listed in approximate order of priority.

Learn SQL. I lean toward PostgreSQL in particular, because it is open source, very popular in the community, and still growing. You can learn how to use the language from my book Memorable SQL; its pilot version is available here.

Learn Python, though not necessarily at the most advanced level. My book Memorable Python is aimed at beginners. It can be bought on Amazon, as an electronic or physical copy, or downloaded as a PDF or EPUB on this website.

Once you are familiar with Python, move on to pandas, the Python library used to clean and process data. If you are aiming to work at a company that requires the ability to write Python (and most of them do), you can be sure that knowledge of pandas will be assumed by default. I'm finishing an introductory pandas tutorial now; you can subscribe so you don't miss its release.

Master AWS. If you want to become a data engineer, you need a cloud platform in your toolkit, and AWS is the most popular of them. The Linux Academy courses helped me a lot when I was studying data engineering on Google Cloud; I expect they have good materials on AWS too.

If you have already mastered this entire list and want to stand out to employers as a data engineer, I suggest adding Apache Spark for working with big data. Although my research on data science listings showed declining interest in it, it still appears in almost every second listing for data engineers.

In conclusion

I hope this review of the most popular technologies for data engineers was useful to you. If you're curious about how things stand for data analysts, read my other article. Happy engineering!
