Time series forecasting using the Skforecast library

There are a huge number of open-source libraries for building machine learning models in Python. The most popular are scikit-learn, XGBoost, LightGBM, CatBoost, and PyTorch. Each of them can be used to build a regression model for time series forecasting, but this requires transforming the data and creating new features (feature engineering).

In addition, time series require their own approaches to evaluating machine learning models, since standard cross-validation is not suitable for temporal data. In this article we will look at the practical nuances of forecasting using the skforecast library.

Features of validation and testing of models on time series

Hold-out testing of machine learning models means splitting the data into training, validation, and test sets (train/val/test split).

• The training set is used to train the models
• The validation set is needed to select the model's hyperparameters
• The test set is used for the final evaluation of the model with the optimal hyperparameters

With this approach, the model must not see either the validation or the test data. After the hyperparameters have been selected, the test and validation data can be merged into the training set, increasing the training sample. Normally, the original data is randomly shuffled before splitting to ensure unbiasedness.

This approach is not applicable to time series: shuffling would destroy the temporal structure of the series. For time series, the train/val/test split is used without shuffling.
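For illustration, here is a minimal chronological split on toy data; scikit-learn's train_test_split keeps the order as long as shuffling is disabled:

import numpy as np
from sklearn.model_selection import train_test_split

series = np.arange(100)  # toy time-ordered data
# shuffle=False keeps the temporal order intact
train_part, test_part = train_test_split(series, test_size=0.2, shuffle=False)
print(train_part[-1], test_part[0])  # 79 80 - the split respects time order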

Features of cross-validation of models on time series

Cross-validation is another method for validating ML models. Usually, when people talk about cross-validation, they mean k-fold and its modifications. Unlike hold-out testing, the data is divided into k equal parts, called folds. The procedure then runs k iterations: in each one, a fold is selected, the model is trained on the remaining k-1 folds, and tested on the selected one. The final score is calculated by averaging the results of the iterations, or on a held-out test set.
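For contrast, here is plain (shuffled) k-fold in scikit-learn on toy data, the mechanics the next section adapts for time series:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X, y = rng.random((100, 3)), rng.random(100)
# each of the 5 iterations trains on 4 folds and scores on the held-out fold
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring='neg_mean_absolute_error')
print(scores.mean())  # final score: the average over the k iterations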

As in the case of hold-out testing, conventional cross-validation on time series destroys their internal structure. Therefore, the following scheme is used:



[Figure: time series cross-validation scheme. Source]

The data is divided into k folds, but the training folds must always come before the validation fold. As testing progresses, more and more training data becomes available.
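scikit-learn ships this expanding-window scheme as TimeSeriesSplit; a minimal sketch on a toy series:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(24)  # toy monthly series
# training indices always precede validation indices; the train window grows
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(y):
    print(f"train: 0..{train_idx[-1]}  val: {val_idx[0]}..{val_idx[-1]}")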

There are also modifications of this method; a brief overview is available here: link

Metrics

In addition to the standard MSE and MAE metrics, time series forecasting uses MAPE and WAPE and their modifications:

$$\text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|, \qquad \text{WAPE} = \frac{\sum_{t=1}^{n}\left|A_t - F_t\right|}{\sum_{t=1}^{n}\left|A_t\right|}$$

where A (actual) are the real values and F (forecasted) are the predicted values.
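For reference, both metrics are a few lines of NumPy (scikit-learn also provides mean_absolute_percentage_error, which is used later in this article):

import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, returned as a fraction (not %)."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs((actual - forecast) / actual))

def wape(actual, forecast):
    """Weighted absolute percentage error; more stable when actuals approach zero."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.sum(np.abs(actual - forecast)) / np.sum(np.abs(actual))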

Practice

There are many libraries for time series forecasting. One of them is skforecast (Documentation).

It provides a simple, sklearn-like interface.
The advantages of this library:
1) Ease of use
2) Support for most classic ML models
3) A large number of helper functions (grid_search, cross_validation)
4) Automatic creation of lag features for the model (see the sketch below)
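To give an idea of what point 4 means in practice, the sketch below hand-rolls the same idea: the series is turned into a table of its own lagged values that any regressor can consume (an illustration only, not skforecast's actual internals):

import pandas as pd

def make_lag_features(series: pd.Series, lags: int) -> pd.DataFrame:
    """Design matrix where column lag_k holds the value from k steps back."""
    frame = pd.DataFrame({f'lag_{k}': series.shift(k) for k in range(1, lags + 1)})
    frame['y'] = series
    return frame.dropna()  # the first `lags` rows lack a full history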

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from catboost import CatBoostRegressor

# Load and process the data: it consists of dates and one variable, columns datetime and y
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
    'data/h2o.csv'
)
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime']) # Load the data
data.datetime = pd.to_datetime(data.datetime) # Convert the date column to pandas datetime
data = data.set_index('datetime').asfreq('MS').y # Make the date the index with a monthly frequency ('MS' - month start)
# Define the training data
# Usually the forecast has to start from a specific date agreed with the customer
# Here we use ordinary hold-out validation
val_start = pd.to_datetime('2004-01-01')
test_start = pd.to_datetime('2006-01-01')
train = data[data.index < val_start]
val = data[(data.index >= val_start) & (data.index < test_start)]
test = data[(data.index >= test_start)]
# Plot the data
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.plot(train, label="train")
ax.plot(val, label="val")
ax.plot(test, label="test")
ax.set_xlabel('Time')
ax.set_ylabel('Target variable')
plt.legend()
plt.show()

# Define the simplest model
linear_forecaster = ForecasterAutoreg(
    regressor=LinearRegression(),
    lags=12
)

# Train the model
linear_forecaster.fit(train)

# Build the forecast
predictions = linear_forecaster.predict(len(val))

# Print the metrics
print(f"MAPE = {mean_absolute_percentage_error(val, predictions)}")
print(f"MAE = {mean_absolute_error(val, predictions)}")
print(f"MSE = {mean_squared_error(val, predictions)}")

Result:

MAPE = 0.05773219888680609 
MAE = 0.05362243227390879 
MSE = 0.00446558085422573 
# Plot the results
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.plot(train, label="train")
ax.plot(val, label="val")
ax.plot(predictions, label="predicted")
ax.set_xlabel('Time')
ax.set_ylabel('Target variable')
plt.legend()
plt.show()

# Another handy function is feature importance
linear_forecaster.get_feature_importances()

# A big plus of this library: it works with most models that support the sklearn interface, e.g. CatBoostRegressor
catboost_forecaster = ForecasterAutoreg(
    regressor=CatBoostRegressor(random_seed=123, verbose=False),
    lags=12
)
catboost_forecaster.fit(train)
predictions = catboost_forecaster.predict(len(val))

print(f"MAPE = {mean_absolute_percentage_error(val, predictions)}")
print(f"MAE = {mean_absolute_error(val, predictions)}")
print(f"MSE = {mean_squared_error(val, predictions)}")

Result:

MAPE = 0.06647131412453779
MAE = 0.06321521958327374
MSE = 0.005279756555116755
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.plot(train, label="train")
ax.plot(val, label="val")
ax.plot(predictions, label="predicted")
ax.set_xlabel('Time')
ax.set_ylabel('Target variable')
plt.legend()
plt.show()

As we can see, on validation CatBoost performed even worse than linear regression! Most likely this is because we used the default model parameters. Fortunately, skforecast supports grid search, i.e. automatic selection of model parameters.

# Create the model

catboost_forecaster = ForecasterAutoreg(
    regressor=CatBoostRegressor(random_seed=123, verbose=False),
    lags=1 # This parameter will be tuned below, so any value will do here
)

lags_grid = [6, 12, [1, 2, 3, 6, 12]] # define the grid of lags

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15]
}
# All the parameters can be found in the documentation; the main ones are:
# forecaster - the model
# y - the data we want to train on
# param_grid, lags_grid - the parameters we want to tune
# steps - the forecast horizon to validate on, in our case len(val)
# initial_train_size - the size of the training data, in our case len(train)

# The return value is a DataFrame with the parameters and the test results
results_grid = grid_search_forecaster(
                   forecaster         = catboost_forecaster,
                   y                  = data.loc[:test_start],
                   param_grid         = param_grid,
                   lags_grid          = lags_grid,
                   steps              = len(val),
                   refit              = False,
                   metric             = "mean_absolute_percentage_error",
                   initial_train_size = len(train),
                   fixed_train_size   = False,
                   return_best        = True,
                   n_jobs             = "auto",
                   verbose            = False,
                   show_progress      = False
               )
results_grid = results_grid.reset_index()

Result:

Number of models compared: 27.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [ 1  2  3  6 12] 
  Parameters: {'max_depth': 5, 'n_estimators': 50}
  Backtesting metric: 0.05251455987629346
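Note that backtesting_forecaster, imported at the top, is what grid_search_forecaster runs under the hood; it can also be called directly to score a single configuration over the validation window. A sketch, reusing the variables defined above (in this version of the API it returns the metric value and the backtest predictions):

metric_value, backtest_predictions = backtesting_forecaster(
    forecaster         = catboost_forecaster,
    y                  = data.loc[:test_start],
    steps              = len(val),
    metric             = "mean_absolute_percentage_error",
    initial_train_size = len(train),
    refit              = False,
    verbose            = False
)
print(f"Backtesting MAPE = {metric_value}")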

As you can see, in this case CatBoost performed better on validation, with MAPE = 0.0525. An sklearn Pipeline can also serve as the model. The code stays exactly the same; all that changes is the regressor parameter of ForecasterAutoreg and param_grid.

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', CatBoostRegressor(random_seed=123, verbose=False))
])

catboost_forecaster = ForecasterAutoreg(
    regressor=pipe,
    lags=1 # This parameter will be tuned, so any value will do here
)

param_grid = {
    'model__n_estimators': [30, 50, 100, 200],
    'model__max_depth': [3, 4, 5, 10]
}
# Now let's retrain linear regression and CatBoost on all the training data and test both models

linear_forecaster = ForecasterAutoreg(
    regressor=LinearRegression(),
    lags=12
)
linear_forecaster.fit(pd.concat([train, val]))
lr_predictions = linear_forecaster.predict(len(test))

print(f"MAPE = {mean_absolute_percentage_error(test, lr_predictions)}")
print(f"MAE = {mean_absolute_error(test, lr_predictions)}")
print(f"MSE = {mean_squared_error(test, lr_predictions)}")

Result:

MAPE = 0.0733522395192984
MAE = 0.06219065243583432
MSE = 0.005735397594558176
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()), 
    ('model', CatBoostRegressor(**{'max_depth': 5, 'n_estimators': 50}, random_seed=123, verbose=False))
])
catboost_forecaster = ForecasterAutoreg(
    regressor=pipe,
    lags=[1, 2, 3, 6, 12] 
)

catboost_forecaster.fit(pd.concat([train, val]))
cb_predictions  = catboost_forecaster.predict(len(test))

print(f"MAPE = {mean_absolute_percentage_error(test, cb_predictions)}")
print(f"MAE = {mean_absolute_error(test, cb_predictions)}")
print(f"MSE = {mean_squared_error(test, cb_predictions)}")

Result:

MAPE = 0.05855572171752432
MAE = 0.04682786970015694
MSE = 0.003394050288769456
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
ax.plot(test, label="test", c="black")
ax.plot(lr_predictions, label="lin_reg", ls="--")
ax.plot(cb_predictions, label="Catboost", ls="--")

ax.set_xlabel('Time')
ax.set_ylabel('Target variable')
ax.set_title('Model comparison')
plt.legend()
plt.show()

Conclusion

The skforecast library provides convenient tools for working with time series and simplifies the process of creating and tuning models. It makes it possible to forecast temporal data accurately and reliably in a wide variety of applications.

This library has many more advantages not mentioned in the article:
1) Using the exog parameter, you can pass external features to the model, such as the season, the day of the week, and so on. This is very useful in many cases (see the sketch below).
2) It supports forecasting multiple series at once.
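As a quick sketch of point 1 (the month feature here is made up for illustration; note that exog values must also be known for the forecast horizon):

# Build a calendar feature aligned with the series
exog = pd.Series(data.index.month, index=data.index, name='month')

exog_forecaster = ForecasterAutoreg(regressor=LinearRegression(), lags=12)
exog_forecaster.fit(y=pd.concat([train, val]), exog=exog.loc[:val.index[-1]])
# Supply the exog values covering the prediction steps
predictions = exog_forecaster.predict(steps=len(test), exog=exog.loc[test.index])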

On this positive note, I would like to wrap up and invite everyone to comment on this article. We hope the material was useful to you. Thank you for your attention, and come again.
