Checklist before calibrating a machine learning model

In theory, a model often looks simple and neat, but being handed a real data set and asked to fit a model to it can leave you stumped. Here are 7 useful tips from Petr Lukyanchenko, ex-Team Lead Analytics at Lamoda and head of the online course “Mathematics for Data Science. Advanced level”.


Hello! This is Petr Lukyanchenko (Petr Pavlovich). This checklist collects what I have learned over years of bumps and mistakes.

1. Statement of the problem

Always double-check the problem you want to solve. What are you going to do: classify something, or compute a numeric value? A clear understanding of the task determines your next steps.

2. Data (Garbage In = Garbage Out)

Always make sure there are no duplicates in the data. The phrase “Garbage In = Garbage Out” means that if the data is collected carelessly, the result will be just as careless. By the way, that is why there is a separate profession of Data Engineer: specialists who, often with heroic labor, clean up simply disgusting data. They know how to identify outliers and anomalies in it and how to remove or correct them, so that analysts can later work with high-quality data sets.
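As a rough illustration of these two checks, here is a minimal sketch in plain Python. The `(order_id, amount)` records are made up for the example, and the 1.5 × IQR rule is only one common convention for flagging outliers, not necessarily what a data engineer would use on your data.

```python
from statistics import quantiles

def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical (order_id, amount) records: one exact duplicate, one suspicious amount.
orders = [
    (1, 120.0), (2, 95.0), (2, 95.0), (3, 110.0),
    (4, 105.0), (5, 9000.0), (6, 100.0), (7, 115.0),
]

clean = drop_duplicates(orders)
amounts = [amount for _, amount in clean]
outliers = iqr_outliers(amounts)

print(len(orders), "rows before,", len(clean), "rows after deduplication")
print("flagged as outliers:", outliers)
```

Whether a flagged value should be removed, corrected, or kept is a domain decision; the code only points at candidates.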

3. Subject area

Always know the domain in which you are building your regression. This helps you check hypotheses for realism. With that understanding, you avoid wasting effort on silly regressions along the lines of “How the speed of melting glaciers affects the growth of the rabbit population in Australia.”

4. Model logic

You cannot work without logic. It is very important to understand the logic of the model and whether the relationship it describes makes sense. Without that, the result may even score well and still be impossible to interpret. So if there seems to be no logic, it is better not to fit the regression at all: the result will be nonsense that leads to new erroneous decisions.

5. Metrics on the test set are more important than metrics on training

When we train a regression, we optimize a training metric: MSE or an alternative to it. Once we have fitted many regressions, we can compare them with each other, and here a different metric, R-squared, is used.

The training metric and the evaluation (testing) metric are two different metrics. A model that has fitted the training data well will not necessarily perform well on the test data. Each of these metrics must be chosen carefully and correctly.
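The difference between the two metrics can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data is synthetic (a line `y = 2x + 1` plus hand-picked noise), the train/test split is a simple even/odd split, and `fit_line`, `mse` and `r_squared` are helper names introduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (minimizes training MSE)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(ys, preds):
    """Mean squared error: the quantity minimized during training."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def r_squared(ys, preds):
    """Coefficient of determination: used here to evaluate on held-out data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic data (an assumption for the example): y = 2x + 1 plus fixed noise.
xs = list(range(10))
noise = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.1, -0.5]
ys = [2 * x + 1 + e for x, e in zip(xs, noise)]

# Even indices go to training, odd indices are held out for testing.
x_train, y_train = xs[0::2], ys[0::2]
x_test, y_test = xs[1::2], ys[1::2]

slope, intercept = fit_line(x_train, y_train)

def predict(x):
    return slope * x + intercept

train_mse = mse(y_train, [predict(x) for x in x_train])
test_mse = mse(y_test, [predict(x) for x in x_test])
test_r2 = r_squared(y_test, [predict(x) for x in x_test])

print("train MSE:", train_mse)
print("test MSE: ", test_mse)
print("test R^2: ", test_r2)
```

On this data the test MSE comes out noticeably higher than the training MSE, even though the model is the right one. The number to report and compare between models is the one computed on the held-out points.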

6. The simpler the regression, the better it will work

And the more complex the regression, the more likely it is that something will go wrong.
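One way to see this is to compare a straight line with a polynomial that passes exactly through every training point. The sketch below is an illustration, not a general proof: the data is synthetic (`y = 2x + 1` plus hand-picked noise), and the "complex" model is a degree-5 Lagrange interpolating polynomial chosen here purely as an extreme case.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def lagrange_predict(xs, ys, x):
    """Evaluate the polynomial of degree len(xs) - 1 that passes through every (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Synthetic data (an assumption for the example): y = 2x + 1 plus fixed noise.
x_train = [0, 1, 2, 3, 4, 5]
y_train = [2 * x + 1 + e for x, e in zip(x_train, [0.4, -0.3, 0.5, -0.4, 0.2, -0.5])]
x_test = [0.5, 1.5, 2.5, 3.5, 4.5]
y_test = [2 * x + 1 + e for x, e in zip(x_test, [0.1, -0.2, 0.1, 0.2, -0.1])]

slope, intercept = fit_line(x_train, y_train)

# Simple model: a straight line.
line_train_mse = mse(y_train, [slope * x + intercept for x in x_train])
line_test_mse = mse(y_test, [slope * x + intercept for x in x_test])

# Complex model: a polynomial that reproduces the training data exactly.
poly_train_mse = mse(y_train, [lagrange_predict(x_train, y_train, x) for x in x_train])
poly_test_mse = mse(y_test, [lagrange_predict(x_train, y_train, x) for x in x_test])

print("line: train MSE", line_train_mse, "test MSE", line_test_mse)
print("poly: train MSE", poly_train_mse, "test MSE", poly_test_mse)
```

The polynomial has zero error on the training points because it memorizes the noise, and it pays for that on the points in between, where the straight line does better.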

7. Better a good regression now than a perfect one an hour later

If you’ve come up with a good regression solution, it’s best to stop there. Don’t try to build something perfect and super precise. Sometimes trying to improve a model actually makes it worse. Yes, you want 100% accurate predictions, but in real life there is no 100% quality. Even the best quality metrics on Kaggle are 96-98%.

Model calibration today still involves a lot of manual intellectual labor that requires certain skills from a specialist. Yes, we all strive for auto-ML, i.e. automatic selection of the best model in Python. But so far this is an unattainable state, and it is impossible to choose the right model without understanding the underlying mathematics. Imagine that you get a time series similar to the chart below, and you are asked “Please predict …”.

On such a data set, you can build a large number of different regressions, and each will give its own forecast. How to choose the best forecast, how to identify outliers in data, and many other practical things are covered in the advanced course “Mathematics for Data Science”.
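To make "choosing the best forecast" concrete, here is a minimal sketch of one common approach: hold out the last few points of the series, let each candidate forecast them, and compare the errors. The series, the three candidate forecasters and the choice of mean absolute error are all assumptions made for this example; the course may well use other methods.

```python
def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon

def mean_forecast(history, horizon):
    """Repeat the average of everything seen so far."""
    return [sum(history) / len(history)] * horizon

def linear_trend_forecast(history, horizon):
    """Fit a least-squares line over time steps 0..n-1 and extend it forward."""
    n = len(history)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(history) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return [intercept + slope * (n + h) for h in range(horizon)]

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical time series with a clear upward trend.
series = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24]
holdout = 3
history, actual = series[:-holdout], series[-holdout:]

candidates = {
    "naive": naive_forecast,
    "mean": mean_forecast,
    "linear_trend": linear_trend_forecast,
}

# Score every candidate on the held-out tail and keep the one with the lowest error.
scores = {name: mae(actual, forecast(history, holdout)) for name, forecast in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {score:.3f}")
print("best on the hold-out:", best)
```

On a trending series like this one the linear trend wins; on a flat or noisy series the ranking can easily be different, which is why the comparison has to be run on your own data.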

So if you already work in Data Science or are just about to move into the field, but know mathematics at the level of “passed something at the institute”, the course will give you all the missing skills.

More useful information can be found in the author’s Telegram channel.

