3 traps that beginner Data Scientists fall into

This is what can happen if you are not good at math.


Hello! I am Petr Lukyanchenko, author and head of the “Mathematics for Data Science” online courses at OTUS. In class we like to illustrate everything with real cases, so here, too, I will introduce each problem that beginners run into with an example.

Story No. 1. Back when I was a team lead in the analytics department at Lamoda, I was shown a calculation made by an intern. He took data on how much time users spend moving the mouse in the online store and the number of products they buy, and built a relationship between the two in which the correlation reached almost 0.95. Simply put, he “proved” that the more a person moves the mouse, the more they buy. Delighted with this discovery, the team immediately proposed modifying the store’s website to make users spend more time moving the mouse, expecting sales to grow as a result.

What actually happened, and what should we trust: the numbers, or common sense, which says that an error has clearly crept in somewhere?

Erroneous hypotheses

In our story, the intern prepared the data incorrectly because he did not understand what kind of dependence he should be assuming. This is the most common and most dangerous mistake that newcomers to data analysis make.

In all of our classes we keep repeating two things:

  1. Any analysis should begin with a hypothesis.
  2. The hypothesis may turn out to be wrong. Making a mistake is not a problem; what matters is noticing it in time, correcting it, and continuing the analysis.

Formulating hypotheses that are then tested on data is what beginners, interns and junior Data Science specialists struggle with most. As a rule, they know statistics quite well but lack experience, so they often blindly believe that a good metric value means their result is valid. Because of this, newcomers are often driven by the desire to obtain a high correlation value. But a high correlation by itself is no guarantee of a real dependence!

Spurious correlations (and regressions) are usually quite amusing. You can take any two parameters, and if each of them has a trend component, the estimated correlation will come out close to one, even though the parameters themselves may have no relationship at all.

For example, someone studies glaciers in Greenland and decides to check how the amount of precipitation in Thailand during the monsoon season affects the rate of ice melting. Over a given period both variables increase, that is, each has a trend component: in Thailand precipitation grows as the hot period begins, and at the same time the glaciers melt faster. If we compute the correlation “head-on”, it will come out close to one, suggesting a direct relationship between the quantities. That is why, before the analysis, you must first work with the data: remove the trend component, i.e. detrend the series, and obtain the daily increments. It is these Δx values that are then used to compute the correlation. This is a very simple step, yet it significantly improves the quality of the analysis.
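Here is a minimal sketch of that effect on synthetic data (the series and numbers are invented purely for illustration): two unrelated series that share only an upward trend look almost perfectly correlated, and the illusion disappears once we correlate the daily increments instead.

```python
# Two unrelated series with a shared upward trend: the raw correlation is
# misleadingly high, the correlation of daily increments (Δx) is near zero.
import numpy as np

rng = np.random.default_rng(42)
n = 365  # one year of daily observations

trend = np.linspace(0, 10, n)
ice_melt_rate = trend + rng.normal(0, 0.5, n)       # "glacier melt" with a trend
thailand_rain = 2 * trend + rng.normal(0, 1.0, n)   # "monsoon rainfall" with a trend

# Correlation "head-on": driven almost entirely by the shared trend.
raw_corr = np.corrcoef(ice_melt_rate, thailand_rain)[0, 1]

# Detrend by taking daily increments and correlate those instead.
detrended_corr = np.corrcoef(np.diff(ice_melt_rate), np.diff(thailand_rain))[0, 1]

print(f"correlation of raw series:       {raw_corr:.2f}")        # close to 1
print(f"correlation of daily increments: {detrended_corr:.2f}")  # close to 0
```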

Story No. 2. At the height of spring, a pharmaceutical company decides to start forecasting the sales rate of allergy medications. The analyst takes some model, fits it to the data, and at first everything is great: the predicted numbers are confirmed by actual demand. But starting in September the numbers diverge: the model promises sales growth, while in reality demand has stalled along with the end of the flowering season and of solar activity. What could have gone wrong?

Beginners often choose models with exponential growth, which at first give results close to reality but at some point become unusable. In this case, the dependence should have been built with the seasonal component in mind: either correct the model once the season ends or, better, work from the start with an oscillatory function such as a sinusoid.

The most common reason a model that worked at first becomes useless is precisely a poor choice of the calibration period, when external factors are not taken into account.
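A sketch of this idea on made-up numbers: calibrate an exponential model and a yearly sinusoid on the growth phase of a synthetic demand series only, then compare how the two extrapolate into September. The series, dates and coefficients are all invented for the illustration; the point is only that the exponential keeps promising growth while the seasonal model comes back down.

```python
# Fit two models on the spring "growth phase" only and compare their
# September forecasts. Synthetic data, purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
t_all = np.arange(365.0)
# Yearly demand: grows during the flowering season, falls off by autumn.
demand = 100 + 80 * np.sin(2 * np.pi * (t_all - 80) / 365) + rng.normal(0, 5, 365)

# Calibration window: only the growth phase (roughly March-May), as a beginner might pick.
mask = (t_all >= 60) & (t_all < 150)
t_fit, y_fit = t_all[mask], demand[mask]

# Model 1: exponential growth, fitted as a straight line in log-space.
b_exp, log_a = np.polyfit(t_fit, np.log(y_fit), 1)

def exp_forecast(t):
    return np.exp(log_a + b_exp * t)

# Model 2: oscillation with a fixed yearly period; linear in its coefficients,
# so it can be fitted by ordinary least squares.
w = 2 * np.pi / 365
X = np.column_stack([np.ones_like(t_fit), np.sin(w * t_fit), np.cos(w * t_fit)])
coef, *_ = np.linalg.lstsq(X, y_fit, rcond=None)

def seasonal_forecast(t):
    return coef[0] + coef[1] * np.sin(w * t) + coef[2] * np.cos(w * t)

# Compare the forecasts in September (around day 250), when real demand has already dropped.
t_sep = np.arange(240, 270)
print(f"actual September demand, mean:    {demand[240:270].mean():.1f}")
print(f"exponential model forecast, mean: {exp_forecast(t_sep).mean():.1f}")
print(f"seasonal model forecast, mean:    {seasonal_forecast(t_sep).mean():.1f}")
```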

Loading data into the model as into a black box

Over the years of Data Science’s rapid development, impressive libraries of models and data-processing methods have accumulated. That is great: they can be used to solve routine problems, and many specialists rely on them, experienced ones as well as beginners. The danger is to take a ready-made model, simply feed the data into it and get some predicted value at the output. An experienced specialist always uses mathematical tools to test the method and adapt it to the task at hand.

At first it is hard for beginners to recover the empirical distribution hidden in the available data. And even if a novice manages to pick a suitable method from a library, or a senior colleague helps set up the model, another danger lies in wait: at any moment the behaviour of the data, or the internal process behind the time series, may change. That means the model has to be recalibrated quickly, because its accuracy has dropped and, with it, the value of the whole forecast. To catch this and adjust the model in time, you need to command statistical methods and understand the principle by which the model works.
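One way to “catch this” in practice is a simple monitoring rule. The sketch below uses an invented series, an invented linear model and an arbitrary threshold; it only illustrates the idea of tracking the rolling forecast error and refitting once it degrades.

```python
# Monitor the rolling forecast error and recalibrate the model when the
# behaviour of the data changes. All numbers here are illustrative.
import numpy as np

def fit_linear(t, y):
    # Fit y ≈ a*t + b and return a forecasting function.
    a, b = np.polyfit(t, y, 1)
    return lambda t_new: a * t_new + b

rng = np.random.default_rng(1)
t = np.arange(200.0)
# A series whose behaviour changes at t = 100: growth stops.
y = np.where(t < 100, t, 100.0) + rng.normal(0, 3, t.size)

window = 20
model = fit_linear(t[:50], y[:50])
baseline_mae = np.abs(model(t[:50]) - y[:50]).mean()

for end in range(50 + window, t.size + 1, window):
    t_chunk, y_chunk = t[end - window:end], y[end - window:end]
    mae = np.abs(model(t_chunk) - y_chunk).mean()
    if mae > 3 * baseline_mae:  # arbitrary alert threshold for the sketch
        print(f"day {end}: error {mae:.1f} vs baseline {baseline_mae:.1f} -> recalibrating")
        model = fit_linear(t_chunk, y_chunk)
        baseline_mae = np.abs(model(t_chunk) - y_chunk).mean()
    else:
        print(f"day {end}: error {mae:.1f}, model still fine")
```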

Even if a method is already programmed in Python and sits somewhere “in the box”, you should derive it by hand at least once to understand how it works. Then, if you run into this method on a project and need to adapt it, you will already know which steps to change and where.
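For instance, this is what “deriving it by hand at least once” might look like for ordinary least squares: write out the normal equations yourself and check the result against a library solver. The data here are random and serve only to confirm that the two answers match.

```python
# "Open the box" at least once: linear regression via the normal equations,
# checked against a ready-made least-squares solver.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(0, 0.1, 200)

# By hand: add an intercept column and solve (X^T X) w = X^T y.
X1 = np.column_stack([np.ones(len(X)), X])
w_hand = np.linalg.solve(X1.T @ X1, X1.T @ y)

# The same regression through a library routine.
w_lib, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("by hand:", np.round(w_hand, 3))   # ≈ [4.0, 2.0, -1.0, 0.5]
print("library:", np.round(w_lib, 3))
```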

Story No. 3. Imagine you have a data matrix of 10,000 rows by 10,000 columns. Even if each elementary multiplication takes only ~30 nanoseconds, a naive algorithm that works through all the row-column combinations (on the order of 10¹² operations) will grind through the data for more than an hour! And what if the matrix is a billion by a billion? Or you need to run many such algorithms?

Raw Matrices

It often happens that newcomers do not process or prepare matrices before the analysis. As a result, the work eats up extra time and effort. To simplify and speed up work with matrices, specialists use tools from linear algebra. It works like this: the data matrix is projected onto a low-rank subspace, which temporarily reduces its dimension.
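A minimal sketch of such a projection via truncated SVD is below. The matrix is synthetic and the target rank of 50 is picked arbitrarily; on real data you would choose the rank by looking at the singular-value spectrum (the share of retained “energy” printed at the end).

```python
# Project a data matrix onto a low-rank subspace with truncated SVD,
# shrinking each object from 1,000 features to 50.
import numpy as np

rng = np.random.default_rng(3)
n_rows, n_cols, rank = 2000, 1000, 50

# Synthetic "almost low-rank" data: a rank-50 signal plus a little noise.
A = rng.normal(size=(n_rows, rank)) @ rng.normal(size=(rank, n_cols))
A += 0.01 * rng.normal(size=(n_rows, n_cols))

# Full SVD, then keep only the leading singular directions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_reduced = U[:, :rank] * s[:rank]   # every object is now a 50-dimensional vector

print("original shape:", A.shape)           # (2000, 1000)
print("reduced shape: ", A_reduced.shape)   # (2000, 50)
print("share of energy kept:", round(float((s[:rank] ** 2).sum() / (s ** 2).sum()), 4))
```

Any downstream algorithm then works with the 50-dimensional representation instead of the full matrix, which is where the saving in time and memory comes from.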

You can learn how to do all of this in our “Mathematics for Data Science” online courses. The Basic level assumes only the school curriculum and focuses on the mathematical foundations. The Advanced level is for those who studied higher mathematics at some point, even long ago, or who already have Data Science experience; there we go through data-analysis methods for different kinds of tasks. At the end of the course, students complete a project: they implement one of the methods by hand to understand how it is built and modify one of its parts. An entrance test will help you determine your level.

The theory and practical skills you will master in class are needed first of all by Middle-level specialists, but they will also be useful at the start of a career. We surveyed our partner employers in Data Science and found that more than half of them are ready to hire an intern who knows the math, even if he cannot yet work with Python libraries.

And if you already work in Data Science, or are just taking a look at the field, I invite you to subscribe to my Telegram channel Data street, where I share my experience and collect useful materials from the world of mathematics, data analysis and machine learning. I will also be glad to see you at the OTUS courses!

You can learn more about the courses, and take the entrance test to check your knowledge, via the links below:
