My name is Yulia, and I am a developer on the ETNA team. I'll tell you how we launched an open-source tool for analyzing and forecasting business processes, how it works, and how to use it.
At Tinkoff, we often solve forecasting problems: we want to know the number of calls to the service line, or how much cash customers will withdraw from an ATM next week. Data scientists and analysts who face forecasting tasks have to juggle a variety of different tools, which is inconvenient and time-consuming. To make things easier, we developed our own framework.
How we created the library
The essence of our work is to predict future events using historical data. For example, in the ATM problem above, we take the historical turnover of each ATM and forecast what the turnover will be in the future. Based on these forecasts, possible cash collection options are analyzed and the optimal date is selected. The data we forecast is called a time series.
A time series is any sequence of time-stamped observations. For example, the number of cups of coffee sold in a mall each day is a time series. Forecasting it helps optimize processes and allocate resources efficiently. As already mentioned, we started by predicting the cash turnover of ATMs, then moved on to predicting the number of meetings held by representatives and the number of calls to the call center.
These processes were handled by different teams using a whole list of tools. The tools were good and helped solve the problems, but each business line created its own modifications that could not be reproduced elsewhere. There was also no single center of expertise, so sharing experience was difficult.
We did some research to understand what a solution that suits everyone should look like. We talked with analysts and found out what their forecasting tasks look like and which stages of working with data cause the greatest difficulty.
It turned out that the problems begin even before the models are trained. Real data cannot always be used for training in its raw form: it contains gaps, missing dates, anomalies, and other troubles. For many processes, it is necessary to predict not one or five series, but thousands. Attempting to process the data manually results in hundreds of lines of code, loops, and complex constructs.
And when, it would seem, all the difficulties were over, a number of questions arose. How do we correctly measure the quality of a model? How do we build a validation process? How do we quickly and painlessly compare several models? How do we use additional data? How do we generate features for training the model?
In order to find answers to these questions, we created the ETNA library.
How ETNA works
What does the ETNA forecasting and experimentation process look like? Forecasting can now be divided into several important steps.
Data preparation and validation. To work with data, the ETNA library provides the TSDataset class. It brings all series to a single format, restores missing timestamps, and joins the series being forecast with additional data.
import pandas as pd
from etna.datasets import TSDataset

df_flat = pd.read_csv("data/example_dataset.csv")
# convert the data to the ETNA format
df = TSDataset.to_dataset(df_flat)
ts = TSDataset(df=df, freq="D")
TSDataset ensures that both historical data and data about the future are handled correctly.
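To make "restoring missing timestamps" concrete, here is a minimal, dependency-free sketch of the idea for a daily series. It is not how TSDataset is implemented; the function name restore_daily_index and the coffee sales data are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta


def restore_daily_index(observations):
    """Return (day, value) pairs for every day between the first and the
    last observation; days that are absent from the input get None."""
    if not observations:
        return []
    first, last = min(observations), max(observations)
    n_days = (last - first).days + 1
    days = [first + timedelta(days=i) for i in range(n_days)]
    return [(day, observations.get(day)) for day in days]


# Hypothetical data: cups of coffee sold per day, with 3 January missing.
sales = {
    date(2022, 1, 1): 120,
    date(2022, 1, 2): 135,
    date(2022, 1, 4): 128,
}
full = restore_daily_index(sales)
for day, value in full:
    print(day.isoformat(), value)
```

Once the index is regular, the gaps are explicit and can be filled in a later step instead of silently shifting the series.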
Preliminary data analysis (EDA). In order for users to understand the structure and features of the predicted series, we have added EDA methods. They allow building statistics on data, assessing autocorrelation, and detecting outliers.
from etna.analysis import sample_acf_plot

sample_acf_plot(ts=ts)
from etna.analysis import get_anomalies_density, plot_anomalies

anomalies = get_anomalies_density(
    ts=ts, window_size=45, n_neighbors=25, distance_coef=1.9
)
plot_anomalies(ts=ts, anomaly_dict=anomalies)
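For readers who have not met autocorrelation before, the quantity behind an ACF plot can be computed by hand. The sketch below uses the textbook sample autocorrelation formula in plain Python; it is an illustration of the concept, not ETNA's plotting code, and the helper name sample_acf and the toy weekly series are assumptions made for this example.

```python
def sample_acf(values, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    acf = []
    for lag in range(max_lag + 1):
        # Covariance between the series and itself shifted by `lag` steps.
        cov = sum(centered[t] * centered[t + lag] for t in range(n - lag))
        acf.append(cov / variance)
    return acf


# Toy series with a strict weekly pattern, repeated for four weeks.
series = [10, 12, 15, 11, 9, 20, 25] * 4
acf = sample_acf(series, max_lag=14)
print(round(acf[7], 2))   # correlation with the same weekday one week earlier
```

A pronounced peak at lag 7, as in this toy series, is the kind of signal that suggests weekly seasonality and motivates seasonal features or models.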
Building a forecasting pipeline. Based on the EDA results, you can decide which features to extract from the data and how to preprocess the series for further work, for example by subtracting the trend or taking the logarithm.
from etna.transforms import (
    LinearTrendTransform,
    DensityOutliersTransform,
    TimeSeriesImputerTransform,
)

transforms = [
    # remove outliers from the data
    DensityOutliersTransform(
        in_column="target", window_size=45, n_neighbors=25, distance_coef=1.9
    ),
    # fill the resulting gaps
    TimeSeriesImputerTransform(
        in_column="target", strategy="running_mean"
    ),
    # subtract the trend
    LinearTrendTransform(in_column="target"),
]
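To show what "subtracting the trend" means, here is a small plain-Python sketch that fits a straight line by ordinary least squares and removes it from the series. It only illustrates the general technique; it is not the code of LinearTrendTransform, and the helper names fit_linear_trend and subtract_trend are hypothetical.

```python
def fit_linear_trend(values):
    """Least-squares fit of y = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return intercept, slope


def subtract_trend(values):
    """Return the series with the fitted linear trend removed."""
    intercept, slope = fit_linear_trend(values)
    return [y - (intercept + slope * t) for t, y in enumerate(values)]


# Toy series that is a pure linear trend: 5, 7, 9, ...
series = [5.0 + 2.0 * t for t in range(10)]
intercept, slope = fit_linear_trend(series)
residuals = subtract_trend(series)
print(intercept, slope)
```

The model is then trained on the residuals, and the fitted trend has to be added back to the forecast to return to the original scale.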
All models in ETNA have a single interface, so regardless of the previous steps, you can use any of the models presented.
from etna.models import SeasonalMovingAverageModel
from etna.pipeline import Pipeline

model = SeasonalMovingAverageModel(seasonality=7, window=5)
pipeline = Pipeline(
    model=model, transforms=transforms, horizon=14
)
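The idea behind a seasonal moving average is simple enough to sketch in a few lines: each forecast is the mean of the values observed at the same position in the last few seasons. The code below is a simplified illustration under that assumption; ETNA's SeasonalMovingAverageModel may differ in details, and the function name here is hypothetical.

```python
def seasonal_moving_average_forecast(history, horizon, seasonality=7, window=5):
    """Forecast each step as the mean of the values 1, 2, ..., `window`
    seasons back; earlier forecasts are reused for later steps."""
    if len(history) < seasonality * window:
        raise ValueError("history must cover at least seasonality * window points")
    values = list(history)
    for _ in range(horizon):
        # values[-seasonality * k] is the observation k seasons before
        # the point that is about to be forecast.
        lags = [values[-seasonality * k] for k in range(1, window + 1)]
        values.append(sum(lags) / window)
    return values[len(history):]


# Toy series: the same weekly pattern repeated for five weeks.
history = [1, 2, 3, 4, 5, 6, 7] * 5
forecast = seasonal_moving_average_forecast(history, horizon=14)
print(forecast)
```

With seasonality=7 and window=5, a forecast for a Monday is the average of the previous five Mondays, which is why this simple model is often a reasonable baseline for data with a weekly cycle.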
Forecasting and validation. To test how well the presented pipeline will perform for these series, you can run backtesting.
from etna.metrics import MAE, SMAPE, MSE, MAPE
from etna.analysis import plot_backtest

METRICS = [MAE(), MSE(), MAPE(), SMAPE()]
metrics, forecasts, info = pipeline.backtest(
    ts=ts,
    metrics=METRICS,
    n_folds=5,
)
plot_backtest(forecast_df=forecasts, ts=ts, history_len=50)
# metrics holds the forecasting metrics for every
# validation fold of every segment
metrics.head(7)
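Backtesting for time series differs from ordinary cross-validation because the test data must always lie after the training data. The sketch below illustrates one common scheme, an expanding training window with consecutive test folds at the end of the series, scored with MAE on a naive forecast. It assumes this scheme purely for illustration and does not reproduce the internals of pipeline.backtest; all function names are hypothetical.

```python
def expanding_window_folds(n_points, n_folds, horizon):
    """Yield (train_indices, test_indices); every fold tests on the
    `horizon` points that follow its training window."""
    folds = []
    for fold in range(n_folds):
        test_end = n_points - (n_folds - 1 - fold) * horizon
        test_start = test_end - horizon
        if test_start <= 0:
            raise ValueError("series is too short for this many folds")
        folds.append((range(0, test_start), range(test_start, test_end)))
    return folds


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def naive_forecast(train, horizon):
    """Repeat the last observed value; stands in for a real model."""
    return [train[-1]] * horizon


series = [float(v) for v in range(30)]
scores = []
for train_idx, test_idx in expanding_window_folds(len(series), n_folds=3, horizon=5):
    train = [series[i] for i in train_idx]
    test = [series[i] for i in test_idx]
    scores.append(mae(test, naive_forecast(train, horizon=len(test))))
print(scores)
```

Comparing per-fold scores rather than a single number shows whether a pipeline is consistently good or just lucky on one stretch of the history.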
We wanted users to concentrate on their task rather than on the technical side of the experiment, so we implemented several auxiliary tools in ETNA:
A set of loggers for integration with W&B, saving intermediate results to S3 or a local file, and writing the log to the console.
A convenient CLI that lets you configure and run experiments through a YAML config without writing any code.
Results analytics tools: regression metrics and methods for visualizing forecasts and their confidence intervals.
How we entered open source and how we plan to develop further
We created the library for internal needs. We ran a lot of experiments and wanted a convenient and flexible way to work with code, so we decided to structure and optimize the process.
As a result, we got a tool that makes the process of solving our problems simple, understandable and convenient, so we wanted to share it with the community.
Our team sees many advantages in this:
Practicality for experimentation: logging, configurations, built-in data preprocessing and feature extraction.
The ability for users to give us feedback or tell us about their needs.
A single interface for all models, which makes the framework convenient.
The ability to make ETNA more convenient and versatile by addressing user requests, which in turn improves our own experience and develops the library.
Expertise growth. The wider the audience we interact with, the more the framework and the community develop. We can share stories and life hacks and exchange expertise with each other.
Our colleague Roman Sedov has already written about the benefits of open source for the company: https://habr.com/ru/company/tinkoff/blog/593643/
In the near future, we are going to focus on the internal implementation of the library. The first items on our ETNA to-do list are speeding up forecasting pipelines and optimizing work with large datasets. New methods of analytics and feature generation will not be ignored either. In addition, we plan to prepare a large number of articles and examples showing how series can be forecast and which features can help.