What to expect and how best to prepare (part 2)

In the previous article, we talked about the role of the Data Engineer at X5, what tasks they solve and what technology stack they work with. We reviewed the structure of the interview and the main areas in which we evaluate candidates, and discussed in detail our baseline requirements for Python proficiency.

In this article, we will go through the requirements for the key skills of a Data Engineer at X5: distributed systems and computing with Hadoop/Spark, as well as SQL and data schema design.

It is important to note that here our expectations are not limited to basic knowledge and the ability to solve typical problems. We want to see a deeper level of understanding and command of these technologies, and the ability to solve complex or non-standard problems.

In our product teams, many people can work with Hadoop, Spark and relational databases at a basic level: Data Scientists, Data Analysts and Data Quality specialists. We put a lot of effort into training employees to make this possible: we send them to external training providers, invite outside experts to give lectures, organize master classes, and develop and teach dedicated programs within our internal digital academy.

Thus, the Data Engineer at X5 is not the person onto whom every task is dumped whenever data is needed, but the one who is called in to solve the more complex and non-trivial technical problems, where basic knowledge of the relevant technologies is no longer enough. For this reason, at the interview we look at how deeply the candidate understands what is happening under the hood of a particular technology. Without this knowledge, they are unlikely to cope successfully with our tasks.

We try to build the discussion around open-ended questions and, as the conversation develops, ask follow-up questions on narrower topics. If with every step the discussion touches on more and more complex topics, that means only one thing: you are doing great, and each new complication earns you additional points.

Let me give you a few examples of such questions.

Where problems come from and how to solve them

A rookie Data Analyst colleague of yours is about to build a new data mart and has asked you to review his Spark script. In the code there is a fragment where the data is partitioned by a field of type double with a range of possible values from zero to one billion. What do you say to that?

Here we want to hear not a short answer like “You can’t do this” or “Looks fine, let’s run it”, but a more detailed answer with reasoning along these lines:

  • Partitioning on this field will create a potentially huge number of small files in HDFS.

  • Having hundreds of millions of small files can lead to problems for the entire Hadoop cluster.

  • The cluster-wide issue comes down to the NameNode, which can fail under this load.

  • The NameNode can fail because meta-information about the file system is kept in its RAM, and with hundreds of millions of extra files that RAM may simply run out.

If you are then asked to develop the topic further, for example towards how to ensure high availability of the NameNode and get rid of this single point of failure, it means you are doing great and it makes sense to move on to more complex topics.
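
To make the small-files problem concrete, here is a minimal PySpark sketch; df stands for the analyst’s DataFrame and price for the double-typed field from the example, so both names are assumptions rather than anything from the original script.

from pyspark.sql import functions as F

# Problematic: partitionBy on a continuous double creates one HDFS directory per
# distinct value, i.e. potentially hundreds of millions of tiny files and just as
# many extra entries in the NameNode's RAM.
df.write.partitionBy("price").parquet("/data/mart/bad_layout")

# Safer: derive a low-cardinality partition key first (here roughly 100 buckets
# for the 0..1e9 range) and partition by that instead.
df_bucketed = df.withColumn("price_bucket",
                            (F.col("price") / 10_000_000).cast("int"))
df_bucketed.write.partitionBy("price_bucket").parquet("/data/mart/better_layout")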

Your rookie Data Analyst colleague has come to you for help again. This time the problem is that an algorithm he developed works fine on a sample of a large dataset, but on the full, huge dataset many problems appear. What common problems come to mind?

This is a deliberately general question, in the course of which we would like the candidate, among other things, to touch on data skew and how to deal with it. Skew provides an excellent context for a conversation about finding bottlenecks, optimizing computations, the different join strategies in Spark and when each of them is appropriate, and so on.
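
To illustrate where that conversation usually goes, here is a hedged sketch of two common remedies, assuming a large fact DataFrame sales with a skewed store_code key, a much smaller dimension DataFrame stores, and an existing SparkSession spark (all of these names are hypothetical).

from pyspark.sql import functions as F

# 1. If one side of the join is small, broadcast it so the big side is not shuffled at all.
joined = sales.join(F.broadcast(stores), "store_code")

# 2. If both sides are large and a few keys dominate, salt the key so rows with the
#    "hot" values are spread across several partitions instead of piling up in one.
n_salts = 16
sales_salted = sales.withColumn("salt", (F.rand() * n_salts).cast("int"))
salts = spark.range(n_salts).select(F.col("id").cast("int").alias("salt"))
stores_salted = stores.crossJoin(salts)
joined_salted = sales_salted.join(stores_salted, ["store_code", "salt"])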

Advice: remember and tell us about the problems and mistakes you consider the most difficult you have encountered, how you solved them, and what you learned. If you know about the small-files problem in Hadoop not from theory, but because you once brought a cluster down and then thoroughly figured out why you should not do that, this will earn you extra points in our eyes.

Understanding how technology works under the hood

You need to count the number of unique values in a certain numeric field of type Long (8 bytes). What query would you write for this? Tell us how Spark will execute this query, in as much detail as you can.

This question, for all its apparent simplicity, has several nested levels of complexity and lets us place the candidate fairly accurately on a scale from “can write a query” to “has knowledge of rare depth”. Let me remind you once again that every new round of complication, when you are asked to go into more detail about particular aspects, means that you are on the right track and everything is going well so far.
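
As a rough illustration of the starting point of such a conversation, here is a small PySpark sketch; the events table and the user_id column are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The straightforward query; .explain() prints the physical plan, which is where the
# discussion of partial and final aggregation stages and the shuffle between them starts.
spark.sql("select count(distinct user_id) from events").explain()

# When an exact number is not required, a HyperLogLog-based estimate is much cheaper,
# because it does not have to keep every distinct 8-byte value around.
spark.sql("select approx_count_distinct(user_id) from events").explain()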

Advice: recall cases from your own practice where you had to optimize computations and dig into exactly how Spark executes queries. We will ask you about such cases, and the conversation can then be built around them, which will be much more comfortable for you than answering general questions.

SQL and data schema design

We usually ask the candidate to design a small schema and write an SQL query that solves a specific problem. The goal is deliberately stated in very general terms, and the applicant is expected to start by clarifying the requirements and constraints before proposing a solution or writing any code.

You are asked to design a data schema and write queries to feed a dashboard showing RTO (retail turnover) and margin for the stores of the new retail chain “Desyatochka”, by month, broken down by city and product category.

Here we expect that the candidate will first clarify the problem statement, for example, like this:

  • How many products, stores and categories? – 100K products, 20K stores, 300 categories.

  • How many sales per month? – A couple billion records every month.

  • For how long do we keep the data? – For several years; new sales data keeps accumulating.

  • How often is the data updated? – Once a month.

  • What kinds of queries should the solution handle? – Usage scenarios vary: for example, see how turnover and margin have changed month by month for certain categories over the past three years, or get the top 10 cities by turnover across all categories for the last month.

Advice: if you find it hard to come up with a solution that satisfies all of the given conditions, pick the part of the requirements you can address and focus on that. Be explicit about which constraints your option satisfies and which it does not – this is a perfectly normal scenario.

One possible solution based on these inputs is given below.

General approach to the solution

There are a lot of sales records per month (a couple of billion), and we want to store and accumulate their history. Hive is well suited for this, but it will not cope with fast analytics for a dashboard, so we can implement the solution in two stages:

  1. We keep a raw history of sales in Hive.

  2. Raw sales are aggregated into the form we need (month, city, category) and loaded into a faster database for analytics (OLAP). ClickHouse works well as such a database: it returns results for analytical queries quickly.

Data schema

The approach has been outlined; now let’s sketch a possible schema for storing the data, starting with Hive.

sales:
    sku_code: int
    store_code: int
    selling_price: float
    cost_price: float
    sale_date: date

stores:
    store_code: int
    store_city: string

skus:
    sku_code: int
    sku_category: string

Now let’s move on to ClickHouse.
 
sales_by_city_category:
    sku_category: String
    month: Date
    store_city: String
    rto: Float64
    margin: Float64
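
For completeness, one possible way to create this table via the clickhouse-driver Python package; the host name, the partitioning by month and the sort key are assumptions rather than part of the original task.

from clickhouse_driver import Client

client = Client(host="clickhouse-host")  # hypothetical host

client.execute("""
    create table if not exists sales_by_city_category (
        sku_category String,
        month        Date,
        store_city   String,
        rto          Float64,
        margin       Float64
    )
    engine = MergeTree()
    partition by toYYYYMM(month)
    order by (month, store_city, sku_category)
""")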

Queries

Query to aggregate the raw data for loading into ClickHouse.

select sum(selling_price - cost_price) as margin,
       sum(selling_price) as rto,
       store_city,
       sku_category,
       month(sale_date)
from sales join stores on sales.store_code = stores.store_code
join skus on sales.sku_code = skus.sku_code
group by store_city, sku_category, month(sale_date)

Query to load the new data for the latest month.

select sum(selling_price - cost_price) as margin,
       sum(selling_price) as rto,
       store_city,
       sku_category,
       month(sale_date)
from sales join stores on sales.store_code = stores.store_code
join skus on sales.sku_code = skus.sku_code
where year(sales.sale_date) = year(current_date)
  and month(sales.sale_date) = month(current_date)
group by store_city, sku_category, month(sale_date)
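
How exactly the aggregate gets from Hive into ClickHouse is left open in the task. Below is a hedged PySpark sketch of one option: run the monthly aggregation through Spark SQL and append the result over JDBC. The JDBC URL and driver class depend on the ClickHouse JDBC driver you deploy and are assumptions here; the month is also truncated to the first day of the month so that it matches the Date column in ClickHouse.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Run the monthly aggregation over the Hive tables.
monthly_agg = spark.sql("""
    select sum(selling_price - cost_price) as margin,
           sum(selling_price) as rto,
           store_city,
           sku_category,
           trunc(sale_date, 'MM') as month
    from sales
    join stores on sales.store_code = stores.store_code
    join skus on sales.sku_code = skus.sku_code
    where year(sale_date) = year(current_date)
      and month(sale_date) = month(current_date)
    group by store_city, sku_category, trunc(sale_date, 'MM')
""")

# Append the result to the ClickHouse table over JDBC.
(monthly_agg.write
    .format("jdbc")
    .option("url", "jdbc:clickhouse://clickhouse-host:8123/default")   # hypothetical host
    .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")          # depends on driver version
    .option("dbtable", "sales_by_city_category")
    .mode("append")
    .save())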

Example dashboard query: totals across all categories, by city, for the current month.

select sum(rto) as rto,
       sum(margin) as margin,
       store_city
from sales_by_city_category
where month = date_trunc('month', now())
group by store_city

Example dashboard query: totals across all cities, by category, for the current month.

select sum(rto) as rto,
       sum(margin) as margin,
       sku_category
from sales_by_city_category
where month = date_trunc('month', now())
group by sku_category

What else do we ask in the interview?

If you have experience with Docker, K8s, Airflow, NoSQL, Kafka or GitLab CI/CD, we would be interested in discussing it. As a rule, there are no standard questions here: we start from what the candidate tells us about their experience and dig into the points that interest us.

We hope this information gives you a good idea of how the interview goes, what to expect, and how best to prepare.

We are looking for a Data Engineer with knowledge of Python, Hadoop and Spark. See our vacancy for details.
