Will tomorrow be the same as yesterday?

Splitting data into training and test sets without taking time into account is a common mistake that everyone already knows about. But even a proper, time-aware split surfaces non-obvious problems, which we will uncover and solve here.

Let's look at a small example. We need to forecast next month's revenue for each store; the features are revenue aggregates over past months, and a year of data is available. We will deliberately choose a standard model: gradient boosting over decision trees. For clarity, we will measure quality with the mean absolute percentage error, MAPE.

Let's split the data into training (train) and test (test) sets randomly, train the model, and evaluate the result. For simplicity, we omit cross-validation and a separate validation set.
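Here is a minimal, self-contained sketch of this baseline. The synthetic panel, column names, and lag features below are my own assumptions for illustration; they only mimic the setup described above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Toy monthly panel: one row per (store, month), 12 months of data.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    [(s, t) for s in range(100) for t in range(1, 13)],
    columns=["store_id", "month"],
)
# Revenue with an upward trend plus store-level noise.
df["revenue"] = 100 + 5 * df["month"] + rng.normal(0, 10, len(df))

# Features: revenue aggregates from past months.
df = df.sort_values(["store_id", "month"])
grp = df.groupby("store_id")["revenue"]
df["rev_lag_1"] = grp.shift(1)
df["rev_mean_3"] = grp.transform(lambda s: s.shift(1).rolling(3).mean())
df = df.dropna()

features = ["rev_lag_1", "rev_mean_3"]

# Random split that ignores time -- the approach under discussion.
X_tr, X_te, y_tr, y_te = train_test_split(
    df[features], df["revenue"], test_size=0.25, random_state=42
)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("MAPE, random split:",
      mean_absolute_percentage_error(y_te, model.predict(X_te)))
```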

The random split gives a 12% error on test. That might be enough, but in production the model will be trained on the past and asked to predict future revenue, and a random split does not account for this.

Comparison of splits for one store

Now let's split by time: the test set contains the last 3 months. It is always worth understanding how the model will be used and choosing a validation strategy that replicates production.
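Continuing the sketch above (same assumed toy data), the time-based split only changes how the rows are assigned to train and test:

```python
# Last 3 of the 12 months go to test, everything before them to train.
train = df[df["month"] <= 9]
test = df[df["month"] > 9]

model = GradientBoostingRegressor().fit(train[features], train["revenue"])
pred = model.predict(test[features])
print("MAPE, time split:",
      mean_absolute_percentage_error(test["revenue"], pred))

# Error by month shows how the model degrades over the horizon.
ape = np.abs(pred - test["revenue"]) / test["revenue"]
print(ape.groupby(test["month"]).mean())
```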

We train a new model and MAPE jumps from 12% to 24%! Moreover, the error grows every month.

The further into the future, the more the model is wrong.

In reports or demos, 12% looks better than 24%, but in production the model will run with a 24% error. If you evaluate a model incorrectly, the A/B test or acceptance results may be an unpleasant surprise.

What happened?

We have run into a dataset shift. The world changes over time, and these changes are reflected in the data, but a model trained on the past knows nothing about them. Having learned its patterns on train, the model makes more and more mistakes on test.

The shift can be mistaken for overfitting. In our example, the error on test is 24% while on train it is 6%. Fighting overfitting will improve the metrics, but will not solve the problem.

What does this mean for us:

  • Large forecast error. 24% may be acceptable, but we can do better.

  • Rapid model degradation. Without regular retraining, the 19% error in the first month will quickly turn into 30%.

  • The model is hard to improve. Tuning hyperparameters and reworking the data barely moves the quality metrics.

Let's detect the shift

There are three causes of shift, and they can all occur simultaneously:

  • Covariate shift: the distribution of the features differs between train and test.

  • Prior probability shift, or label shift: the distribution of the target variable differs.

  • Concept drift: The relationship between the features and the target variable has changed.

There are several methods for detecting shift; let's go from simple to complex:

  • Inspect how the target variable and the features change over time with your own eyes. Sometimes obvious anomalies and trends are visible at a glance.

  • Better yet, compare the distributions of the features and the target variable in train and test with statistical tests. This detects shift in individual features, but does not account for their interaction.

  • Build a classifier that distinguishes train from test. Combine the train and test samples, keep the features from the original dataset, and use membership in train or test as the target. The better this classifier performs, the stronger the shift. Feature interactions are now taken into account, and feature importances point to the shifted features (see the sketch after this list).
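A sketch of the last two checks on the toy data from above (the `train`/`test` frames and the `features` list are the assumed ones from the earlier sketches):

```python
from scipy.stats import ks_2samp
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Per-feature check: Kolmogorov-Smirnov test on train vs test.
for col in features:
    _, p_value = ks_2samp(train[col], test[col])
    print(f"{col}: KS p-value = {p_value:.4f}")  # small p-value -> distributions differ

# Adversarial validation: classify whether a row comes from test.
X_all = pd.concat([train[features], test[features]])
y_all = np.r_[np.zeros(len(train)), np.ones(len(test))]
clf = GradientBoostingClassifier()
auc = cross_val_score(clf, X_all, y_all, cv=5, scoring="roc_auc").mean()
print("Domain classifier ROC AUC:", auc)  # ~0.5: no shift, close to 1: strong shift

# Feature importances of the fitted classifier point to the most shifted features.
clf.fit(X_all, y_all)
print(dict(zip(features, clf.feature_importances_)))
```

If the AUC is noticeably above 0.5, the two samples are distinguishable and the shift is worth investigating.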

Now let's figure out what is happening here. There is no need to dive straight into the prepared dataset: the overall picture can often be grasped quickly from existing reports and dashboards, which is especially helpful when your data is not ready yet. Let's open a report with our company's indicators and look at revenue.

Average revenue by store from the company report

Thanks to everyone's efforts, store revenue has been growing and keeps growing: there is a trend. Let's look at the target variable in our data.

Train and test differ in the values of the target variable

The same picture. That is logical: it is the same indicator as in the previous chart. Moreover, our features are computed from revenue in previous months, so the trend is in them too. The cause of the shift is a trend that the model cannot take into account. Let's see how it affects the model:

  • The distribution of the features computed from the revenue history changes. The decision tree splits learned on train start to perform worse.

  • The distribution of the target variable changes. A decision tree cannot extrapolate: the model has never seen such large revenue values and cannot predict them.

  • In the past the trend was absent or different, so we do not have enough data with the dependencies we need.

In real tasks everything is more complicated: new categories of objects appear, various sales and marketing campaigns run, or the methodology for collecting data and calculating metrics changes.

A few examples from real projects.

While working on a project in the oil industry, we noticed that after a certain point there was much more data and the quality dropped. It turned out that the field had changed owners: they began drilling more actively and collecting the indicators differently.

Another case came up while working on a dynamic pricing system. Right before launch, a trend appeared in the product metrics, just as in our example. The models we had built could not cope with it; detrending and other techniques helped.

Let's eliminate the shift

In short, we need to make train and test similar. A few simple methods:

  • Removing the most shifted features. It can help, but as a rule the shift is not confined to a few specific features, and we also lose valuable information.

  • Oversampling or undersampling. Resample the data in train so that the train and test distributions become similar. This helps with label shift (a sketch is given after this list).

  • Transforming the features and the target variable: switching to relative values, scaling the data, merging or splitting categories, detrending, and so on.

  • Deleting data. For example, if there are anomalies in test or train and you are sure the model should not have to handle them, you can remove them.

You can also use techniques from transfer learning or collect more data.
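As one illustration of resampling against label shift, here is a rough sketch that reweights train rows so the target distribution matches test. The binning scheme is my own assumption; the article itself goes on to fix the problem with detrending instead.

```python
# Bin the target, compare bin frequencies in train and test,
# and resample train with weights proportional to the ratio.
edges = np.quantile(
    np.concatenate([train["revenue"], test["revenue"]]), np.linspace(0, 1, 11)
)
train_bin = np.digitize(train["revenue"], edges[1:-1])
test_bin = np.digitize(test["revenue"], edges[1:-1])

test_freq = pd.Series(test_bin).value_counts(normalize=True)
train_freq = pd.Series(train_bin).value_counts(normalize=True)
weights = pd.Series(train_bin).map(test_freq / train_freq).fillna(0.0)

# Resampled train now mimics the test target distribution
# (the weights could also be passed to the model as sample_weight).
train_resampled = train.sample(
    n=len(train), replace=True, weights=weights.values, random_state=0
)
```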

Now let's remove the trend. From each revenue value we subtract the previous month's value and recalculate the features. The model will forecast not the revenue itself, but the change in revenue relative to the previous month.

To assess quality, we must perform the reverse transformation: add the previous month's revenue to the model's forecast. Now the error is 7% and does not grow over time! We not only reduced MAPE from 24% to 7%, but also made the model more reliable: if the trend continues, the model will keep working correctly.
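A sketch of the detrending step on the same toy data. In a real project the features would also be recomputed relative to the previous month; here I only switch the target to keep the example short.

```python
# New target: change in revenue relative to the previous month.
df = df.sort_values(["store_id", "month"]).copy()
df["rev_prev"] = df.groupby("store_id")["revenue"].shift(1)
df["rev_diff"] = df["revenue"] - df["rev_prev"]
df = df.dropna(subset=["rev_prev", "rev_diff"])

train = df[df["month"] <= 9]
test = df[df["month"] > 9]

model = GradientBoostingRegressor().fit(train[features], train["rev_diff"])

# Reverse transformation: add back last month's revenue before scoring.
pred_revenue = test["rev_prev"] + model.predict(test[features])
print("MAPE, detrended:",
      mean_absolute_percentage_error(test["revenue"], pred_revenue))
```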

After removing the trend, the error not only became smaller, but also stopped growing over time

In practice it is not necessary to remove the shift completely. Once the quality is satisfactory, we stop. If in doubt, check whether the difference in errors is statistically significant.

Bottom line

When splitting a dataset by time, you may run into shift. Detecting and eliminating it is not easy, but the result is a reliable and accurate model. The effort invested will definitely pay off in production.

What we learned:

  1. It is important to choose the right validation strategy. The measured quality may differ dramatically from the real one, and otherwise you will only find out in production.

  2. It is always worth checking how the data changes over time, at least for the main indicators and the target variable.

  3. Removing shift improves the quality of the model and makes it more reliable.

Tell us in the comments about your cases of shift and about incorrect or unusual validation methods. I will be glad to hear any additions and clarifications.
