Data Science Interview: What They May Ask and Where to Find the Answers

I once received an offer from Deliveroo, where I was supposed to become a Data Science Manager. While I was preparing to take up my duties, the offer was withdrawn. At the time I had no financial cushion to fall back on in case of prolonged unemployment. I will share everything that ultimately helped me land two Data Scientist offers at once from Facebook. I hope it helps anyone who finds themselves in the difficult situation I was in a few months ago.

1. Organization is the key to everything

I interviewed at Google (and DeepMind), Uber, Facebook, and Amazon for anything even loosely connected to the Data Scientist role. Most of the time I was asked questions from areas such as:

  • Software development
  • Applied statistics
  • Machine learning
  • Data processing, management and visualization

Nobody expects you to be an expert in all of these areas, but you must understand them well enough to convince the interviewer of your competence and your fit for the position. How deeply you need to understand each topic depends on the job itself, but since this is a very competitive field, any knowledge will come in handy.

I recommend using Notion to organize your interview preparation. The tool is versatile, and it supports techniques such as spaced repetition and active recall. These help reinforce what you learn and surface the key questions that come up over and over again in Data Scientist interviews. Ali Abdaal has an excellent guide to taking notes with Notion that will help you get the most out of your preparation.

I kept reviewing my notes in Notion, especially just before each interview. This made me confident in my abilities and sure that the key topics and terms were in my working memory, so I didn't have to waste precious time mumbling "ummm" after a question.

2. Software development

You won't always be asked about the time complexity of an algorithm, but every Data Scientist job requires coding. Data Science, as you know, is not one profession but many; the field attracts talent from a wide range of backgrounds, including software development. You will therefore be competing with programmers who understand the nuances of writing efficient code. I recommend spending 1-2 hours a day before the interview mastering and/or strengthening your knowledge of the following topics:

  • Arrays.
  • Hash tables.
  • Linked Lists.
  • Two-pointer technique.
  • String Algorithms (employers LOVE this topic).
  • Binary search.
  • Divide and conquer algorithms.
  • Sorting algorithms.
  • Dynamic programming.
  • Recursion.
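To make one item from the list above concrete, here is a minimal sketch of the two-pointer technique on the classic "pair with a target sum" problem (the function name and inputs are made up for illustration). The two pointers replace a naive O(n²) double loop with a single O(n) pass over a sorted array.

```python
# Two-pointer technique: find a pair in a SORTED array that sums to target.
def pair_with_sum(sorted_nums, target):
    lo, hi = 0, len(sorted_nums) - 1
    while lo < hi:
        s = sorted_nums[lo] + sorted_nums[hi]
        if s == target:
            return sorted_nums[lo], sorted_nums[hi]
        if s < target:
            lo += 1   # need a larger sum: move the left pointer right
        else:
            hi -= 1   # need a smaller sum: move the right pointer left
    return None  # no such pair exists

print(pair_with_sum([1, 2, 4, 7, 11, 15], 15))  # (4, 11)
```

Being able to explain *why* sortedness lets you discard one candidate per step is exactly the kind of reasoning interviewers probe for.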

Don't study algorithms by rote. That approach fails because the interviewer may ask about the nuances of some algorithm and you will get lost. Instead, master the foundations on which each algorithm is built. Learn about time and space complexity and understand why they matter for writing quality code.

Interviewers do have a lot to ask about algorithms, so it's worth learning the basic building blocks and common problem patterns to make it easier to respond in interviews later.

Try to answer every practice question yourself, even if it takes a long time. Then look at the model solution and try to determine the optimal strategy. Then compare it with your answer and try to understand why it is so. Ask yourself questions like "why is the worst-case time complexity of Quicksort O(n²) while its average case is O(n log n)?" or "why do two pointers and one for loop make more sense than three for loops?"

3. Applied statistics

Applied statistics plays an important role in Data Science; how important depends on the position you are applying for. Where is applied statistics actually used? Wherever data needs to be organized, interpreted, and mined for information.

To prepare for interviews, I advise you to study the following topics carefully:

  • Descriptive statistics (what distribution does my data follow; what are its modes, expected value, and variance).
  • Probability theory (given that my data follows a binomial distribution, what is the probability of seeing 5 conversions in 10 click events).
  • Hypothesis testing (the basis of any question on A/B testing; t-tests, ANOVA, chi-square tests, etc.).
  • Regression (is the relationship between variables linear; what are the potential sources of bias and error in the data).
  • Bayesian inference (what are its advantages/disadvantages compared to frequentist methods).
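The binomial question in the list above can be answered with a few lines of standard-library Python (the conversion probability p = 0.3 is an assumption made up for illustration):

```python
import math

# P(exactly k successes in n trials) for a binomial distribution:
# C(n, k) * p^k * (1-p)^(n-k)
def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of seeing exactly 5 conversions in 10 clicks at p = 0.3
prob = binom_pmf(5, 10, 0.3)
print(round(prob, 4))  # ≈ 0.1029
```

Deriving the formula from first principles (count the arrangements, multiply the independent probabilities) is a common interview follow-up, so don't just memorize the scipy call.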

If you think this is a huge amount of material to study, you're right. I was amazed at how much can be asked in an interview and how much is available online to help you prepare.

It's best not to learn this by rote; you need to solve as many problems as you can. Glassdoor is a great repository of applied statistics questions that actually come up in interviews. The most difficult interview I had was at G-Research, but I really enjoyed preparing for it, and Glassdoor helped me gauge how far I had progressed in mastering this topic.

4. Machine learning

Now we come to the most important thing: machine learning. The topic is so vast that you can easily get lost in it.

Below is a far-from-exhaustive set of topics, grouped by area, that will give you a very solid foundation to get started with machine learning.

Metrics – classification
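As a quick self-check for this topic, it helps to be able to compute the basic classification metrics by hand rather than just name them. A minimal sketch (the labels and predictions are made up):

```python
# Confusion-matrix counts and the metrics derived from them.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of all positive predictions, how many were right
recall    = tp / (tp + fn)  # of all actual positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Interviewers often follow up with "when would you prefer precision over recall?", so understand the trade-off, not just the formulas.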

Metrics – Regression
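Likewise for regression: a minimal sketch of MAE, RMSE, and R² computed by hand (the numbers are made up):

```python
import math

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]
n = len(y_true)

mae  = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n  # mean absolute error
mse  = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = math.sqrt(mse)  # root mean squared error, in the units of the target

# R^2: fraction of variance explained relative to predicting the mean
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
print(mae, rmse, r2)
```

Knowing why RMSE penalizes large errors more than MAE is a classic follow-up question.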

Bias-variance trade-off, over-/under-fitting

Model selection

Sampling

Hypothesis testing

This topic is more closely related to applied statistics, but it is extremely important, especially for A/B testing.
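As a hedged A/B-testing sketch (all the counts below are made up), here is a two-proportion z-test comparing the conversion rates of a control and a variant, using only the standard library:

```python
import math

# Two-proportion z-test: is the variant's conversion rate different
# from the control's? Uses the pooled-proportion standard error.
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 10% vs 12.5% conversion on 2000 users each (hypothetical numbers)
z, p = two_proportion_z(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(round(z, 3), round(p, 4))
```

Be ready to explain the assumptions (independent samples, large enough n for the normal approximation) rather than just the mechanics.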

Regression models

There is a wealth of information available about linear regression, but you should familiarize yourself with other regression models as well.

Clustering algorithms

Classification models

That's a lot, but it looks less scary once you understand applied statistics. I recommend learning the nuances of at least three different classification/regression/clustering methods, because the interviewer can (and does) ask: "What other methods could we use, and what are their advantages/disadvantages?" This is just a small slice of the knowledge, but if you know these important examples, interviews will go much more smoothly.

5. Data processing and visualization

“Tell us about the stages of data processing and cleaning before applying machine learning algorithms.”

Suppose we are given a specific dataset. The first step is to show that you can perform exploratory data analysis (EDA). Pandas, used correctly, is the most powerful tool in the data-analysis toolbox. The best way to learn how to process data with Pandas is to download many, many datasets and work with them.
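A minimal first-pass EDA sketch on a toy DataFrame (in practice you would start from something like `pd.read_csv(...)`; the columns and values here are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [34, 28, None, 45, 52],
    "city":   ["London", "Leeds", "London", None, "York"],
    "income": [48000, 39000, 51000, 62000, 58000],
})

print(df.shape)         # (rows, columns)
print(df.dtypes)        # type of each column
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```

Four lines like these answer most of the opening questions an interviewer asks about an unfamiliar dataset.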

In one interview, I had to load a dataset, clean it, visualize it, perform feature selection, and then build and evaluate a model, all within one hour. It was genuinely intense, but because I had practiced that whole workflow for a few weeks beforehand, I knew what to do even when I lost the thread.

Data organization

There are three certainties in life: death, taxes, and being asked to merge datasets. Pandas is almost perfect for the job, so please practice, practice, practice.
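A minimal merge sketch (the table names and columns are made up): attach each order to its customer via a shared key.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bob", "Cat"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                       "amount": [20, 35, 15, 50]})

# how="left" keeps every order, even when the customer is unknown (id 4),
# whereas how="inner" would silently drop that row.
merged = orders.merge(customers, on="customer_id", how="left")
print(merged)
```

Knowing what each `how` value (`left`, `right`, `inner`, `outer`) does to row counts is a frequent interview probe.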

Data profiling

This task involves understanding the "meta" characteristics of the dataset, such as its shape and the description of its numeric, categorical, and temporal features. You should always try to answer questions like "how many observations do I have?", "what does the distribution of each feature look like?", "what do these features mean?". Early profiling of this kind can help you discard irrelevant features from the start, such as categorical features with thousands of levels (names, unique identifiers), and reduce the amount of work for you and your computer down the road (work smart, not hard).
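A small profiling sketch of the "thousands of levels" check just described (the column names and data are made up): flag categorical columns where every row is unique, since identifiers like these are rarely useful features as-is.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4", "u5"],  # unique per row
    "plan":    ["free", "pro", "free", "free", "pro"],
})

# number of distinct levels per categorical (object-dtype) column
levels = df.select_dtypes(include="object").nunique()
print(levels)

# columns whose cardinality equals the row count are candidates to drop
high_cardinality = levels[levels == len(df)].index.tolist()
print(high_cardinality)
```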

Data visualization

Here you ask yourself: "What does the distribution of my features look like?" A quick tip: if you didn't cover box plots in the applied statistics part of your preparation, now is the time, because you need to learn to identify outliers visually. Histograms and kernel density plots are extremely useful tools for inspecting the distribution of each feature.

Then you might ask "what do the relationships between my features look like?", in which case Python's seaborn package offers cool and powerful tools like pairplot and a nice heatmap for correlation plots.
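What the heatmap visualizes is just a correlation matrix, which you can compute directly with pandas. A minimal sketch (the data is made up; `x2` is a noisy copy of `x1`, `x3` runs the other way):

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.0],
    "x3": [5, 3, 4, 1, 2],
})

corr = df.corr()  # pairwise Pearson correlations between features
print(corr)
# To visualize it: import seaborn as sns; sns.heatmap(corr, annot=True)
```

Being able to read the matrix itself (strong positive for x1/x2, negative for x1/x3) matters more in an interview than producing the plot.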

Handling null values, syntax errors, and duplicate rows / columns

Missing values are inevitable. They arise from many different factors, each of which introduces bias in its own way, and you need to learn the best ways to deal with them.
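A sketch of three common strategies on a toy DataFrame (the data is made up); each strategy trades away something different, which is exactly what interviewers ask about:

```python
import pandas as pd

df = pd.DataFrame({"age": [34.0, None, 45.0, None, 52.0],
                   "city": ["London", "Leeds", None, "York", "York"]})

dropped = df.dropna()                               # discard incomplete rows
age_imputed = df["age"].fillna(df["age"].median())  # impute with the median
city_imputed = df["city"].fillna("unknown")         # explicit "unknown" level
print(dropped.shape, int(age_imputed.isna().sum()), int(city_imputed.isna().sum()))
```

Dropping rows shrinks (and possibly biases) the sample; median imputation shrinks variance; an "unknown" level preserves the fact that the value was missing.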

Syntax errors usually occur when a dataset contains information entered manually, for example through a form. This can lead us to the erroneous conclusion that a categorical feature has far more levels than it actually does, because "Hot", "hOt", and "hot\n" are treated as distinct levels.
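Normalizing case and whitespace fixes exactly that. A minimal sketch with made-up values:

```python
import pandas as pd

# Five raw strings that should collapse to just two levels: hot and cold.
s = pd.Series(["Hot", "hOt", "hot\n", "Cold", " cold "])
clean = s.str.strip().str.lower()  # drop surrounding whitespace, lowercase
print(clean.unique())
```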

Finally, no one needs duplicate columns, and duplicate rows can distort your analysis, so deal with both early on.
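A small deduplication sketch (toy data): `drop_duplicates` handles repeated rows, and transposing lets the same check find repeated columns.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 2, 3],
                   "b": [9, 8, 8, 7],
                   "a_copy": [1, 2, 2, 3]})  # exact duplicate of "a"

df = df.drop_duplicates()           # remove duplicate rows
df = df.loc[:, ~df.T.duplicated()]  # remove duplicate columns
print(df.shape)
```

The transpose trick works because `duplicated()` operates on rows, and a DataFrame's columns become rows under `.T`.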

Standardization or normalization

Depending on the dataset you are working with and the machine learning method you choose, it might be helpful to standardize or normalize the data so that the differing scales of different variables do not hurt your model's performance.
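A by-hand sketch of the two transformations on a toy column (scikit-learn's `StandardScaler` and `MinMaxScaler` do the same on whole feature matrices):

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# z-score standardization: mean 0, unit standard deviation
standardized = (x - x.mean()) / x.std()
# min-max normalization: rescale into the range [0, 1]
normalized = (x - x.min()) / (x.max() - x.min())

print(standardized.round(3).tolist())
print(normalized.tolist())
```

Distance-based methods (k-NN, k-means) and gradient-based training are the usual cases where unscaled features cause trouble; tree-based models generally don't care.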

In the end, it wasn't so much a "memorize everything" mindset that helped me as the practice itself. I failed many interviews before I realized that none of the above are esoteric concepts that only a select few can master. These are simply the tools Data Scientists use to build great models and extract important insights from data.
