How we automate training and updating models at Alfa Bank

Or: how to free Data Scientists from the routine of training and updating models and re-deploying them to production.

Hi all! I am Nastya Bondareva, senior Data Scientist at Alfa-Bank Legal Entities Hub, leading the ARTEML (AutoReTrainable ML) initiative. In this article I will tell you how we simplified our work and automated routine tasks, the number of which snowballed as the number of models in use grew.

What is AutoReTrainable ML and why is it needed?

In articles and at ML conferences, the life cycle and pipeline of models are usually represented by a classical linear structure: collect a target, collect and select features, train and validate the model, deploy the resulting model to production and set it up for regular calculations.

Open-source AutoML tools usually implement exactly this concept. Most often, they include the stages of feature selection, hyperparameter tuning, and training the final model. Some open-source implementations also include validation of the final model quality, but, as a rule, this is not enough and additional analysis has to be carried out.

But in reality, such a linear pipeline does not cover all of the organization's business needs, and here's why. As the developed ML models are used in business, over time they begin to degrade: new dependencies appear in the data, some dependencies lose their relevance, and as a result the model's quality metric ends up very different from the one obtained when the model was trained.

Using one of our models as an example, let's look at how the precision@k metric changes over time: the share of product sales after marketing communications falls by 38% per year. This happens because over time the model degrades and ranks the customers inclined to purchase this product after a marketing communication worse and worse.


This case of model degradation is not an exception; it is a pattern.

It is worth emphasizing that most models used for marketing communications are prone to becoming obsolete. In our case, for example, models on average lose up to tens of percentage points of accuracy per year.

There are several reasons for this.

Feature Drift. Most often, model performance degrades due to changes in the features on which the model was trained. For example, user data or product consumption data have changed for various reasons: seasonality or economic instability, the emergence of new products and, as a result, new interests in the market, etc. The model was used to "seeing" one picture, but now it has changed, and the accuracy of its forecasts has drifted. Besides old patterns becoming obsolete, new ones can also appear, and thanks to them the quality of the model can even be improved if the model is updated in time.

Label Drift. A change in the target label distribution over time. A clear example is a change in the size of the average check per client due to accelerated inflation, which directly affects the quality of the model's predictions.


Data Quality Issues. Problems related to the quality of the supplied data. For example, a source is disconnected or malfunctioning, causing noise/gaps in the data.
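Drift of this kind is usually quantified with the Population Stability Index (PSI), which also appears later in the monitoring config (psi_features, psi_score, psi_target). Below is a minimal numpy sketch of PSI over quantile bins, as an illustration of the metric rather than the ARTEML implementation:

import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, num_bins: int = 10) -> float:
    """Population Stability Index between a reference sample and a fresh one."""
    edges = np.quantile(expected, np.linspace(0, 1, num_bins + 1))

    def bin_shares(x: np.ndarray) -> np.ndarray:
        # quantile bins of the reference distribution; out-of-range values land in the edge bins
        idx = np.digitize(x, edges[1:-1])
        return np.bincount(idx, minlength=num_bins) / len(x)

    expected_pct = np.clip(bin_shares(expected), 1e-6, None)  # avoid log(0)
    actual_pct = np.clip(bin_shares(actual), 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# values above roughly 0.1-0.2 are usually treated as a sign of noticeable drift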

Degradation of models affects business performance. Let's look at a "toy" example that demonstrates this. Suppose a company carries out marketing communications, offering a product to the clients recommended by the model, with the following parameters:

  • marketing channel capacity – 10,000 calls per month;

  • cost of 1 call – 400 rubles;

  • income from the sale of one unit of product – 800 rubles.

With an accuracy of 69%, the income from using the model is:

10,000 * 0.69 * 800 – 10,000 * 400 = 1,520,000 rubles

When accuracy drops to 34%, the business process becomes unprofitable for the bank:

10,000 * 0.34 * 800 – 10,000 * 400 = – 1,280,000 rubles

In our case, the break-even limit for the model metric is 50% accuracy.
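For clarity, here is the same arithmetic as a small Python sketch (all numbers are taken from the toy example above):

CALLS_PER_MONTH = 10_000  # marketing channel capacity
COST_PER_CALL = 400       # rubles
INCOME_PER_SALE = 800     # rubles

def campaign_profit(precision: float) -> float:
    """Monthly profit when `precision` is the share of calls that end in a sale."""
    return CALLS_PER_MONTH * (precision * INCOME_PER_SALE - COST_PER_CALL)

print(campaign_profit(0.69))            #  1520000 rubles
print(campaign_profit(0.34))            # -1280000 rubles
print(COST_PER_CALL / INCOME_PER_SALE)  #  0.5, the break-even accuracy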


The main way to fight degradation is to retrain models; it is one of the key methods for keeping model quality at an acceptable level.

But the retraining process also comes with some challenges.

Why did we decide to retrain models automatically?

Retraining or updating a model manually is often an inefficient and tedious process.

The expert's time is spent on interpreting the results of monitoring and retraining, on making the decision to retrain (comparing the quality of the newly retrained model with the current production model), and on deploying the model to the production execution system.

Each iteration takes time to write code, and then to review and adjust it. The larger the portfolio of models, the more DS resources are needed to update them. In 2023, updating models at Alfa Bank took up to 10% of our DS time; given the growth in the number of models, in 2024 it would take 25 to 35% of that time.

Also, when updating a model that is already running in production, it is necessary to go through a routine IT process: review, functional and load testing, and rolling out the model. At each stage, in addition to the DS, the corresponding specialists are involved: an ML engineer, testers, and the support team responsible for the production environment.

But such a process can be optimized through templates.

This is possible through the creation of tools that unify and systematize common methods within a single library and a new model rollout pipeline.

  • Creating such a library will allow you to train and retrain models faster and with minimal expenditure of Data Scientist resources, thereby freeing up time.

  • Creating such a rollout pipeline will put the model update process onto rails.

These prerequisites prompted us to create the AutoReTrainable ML Framework.

AutoReTrainable ML Framework is a system that can automate the processes of creating (training) and updating (retraining) models for business problems.

In this paradigm, the ML pipeline looks like this:

The diagram shows the framework for retraining models in production. It consists of the AutoReTrainable ML (ARTEML) library, with whose help all the listed steps are performed, and the MLOps pipeline for rolling models out to the execution system.

As a result, with these two tools combined, during initial development the DS only needs to:

  • train the model and evaluate the quality and adequacy of the result using ARTEML;

  • roll out the resulting model through the MLOps auto-retraining pipeline.

The subsequent procedures for monitoring and updating the model are already automated.

The rollout itself goes through an accelerated process, since under the hood it uses the ARTEML tool, which is verified and covered by tests.

How does the ARTEML framework work?

This is a set of tools designed to simplify and speed up the work of DS by automating many routine stages of training, inference, monitoring and comparison of models, with the steps and parameters of a model described through a low-code interface. ARTEML is written in Python and, in addition to the ML logic, also contains interfaces for working with banking systems and data warehouses.

Config

DS interacts with the ARTEML framework via YAML config. An example of the config will be discussed below.

Essentially, it is an instruction that contains all the necessary information: where to get the target and features, for which periods to take data and how to sample it, how to select features, how to train the model, and so on.

Accordingly, the first step for the user is to create such a config, the starting point for performing the remaining steps.

There is a function for generating a basic config, in which you only need to fill in the required parameters, after which the DS can configure and add any additional parameters needed. Along with the basic config, a file is also generated with the logic for collecting samples for the inference, monitoring and model comparison steps; the DS can change it as well, if necessary, adding extra filters or radically changing the logic to suit their task.
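The generator itself lives inside the library; purely to illustrate the idea, here is a rough sketch of what such a function could look like (the function name, fields and defaults below are hypothetical and do not reproduce the actual ARTEML API):

import yaml

def generate_base_config(model_name: str, target: str, key_metric: str = "roc_auc",
                         path: str = "config.yaml") -> dict:
    """Write a YAML skeleton that the DS then fills in and extends (hypothetical illustration)."""
    config = {
        "model_name": model_name,
        "version": 1,
        "key_metric": key_metric,
        "log_to_mlflow": True,
        "target": target,
        "features": None,          # to be filled in by the DS
        "preselect_features": {},  # feature pre-selection parameters
        "train": {},               # data ranges, feature selection and model parameters
        "inference": {},
        "monitoring": {},
        "quality_gate": {},
    }
    with open(path, "w") as f:
        yaml.safe_dump(config, f, sort_keys=False)
    return config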

Training

After creating the config, we move on to training the model. We won't go too deep here: we use fairly standard steps shown in the diagram above, and all the information needed to set up the training process is taken from the config. Let's look at an example config for a model that has already been rolled out.

model_name: automl_model
version: 1
model_info: ...
key_metric: roc_auc
log_to_mlflow: true
report_path: report
target: ...
features: ...
preselect_features:
  data:
    dt: ...
    range_dt: true
    sampling_parameters:
      sampling_n_rows: 200000
  feature_selection_params:
    options: 
    - by_nan_rate
    - split
    nan_rate_acceptable: 0.98
    n_top_features:
      split: 200
train:
  data:
    train: ...
    val: ...
  feature_selection_params:
    options: 
    - by_nan_rate
    - by_psi
    - shap
    - permutation
    nan_rate_acceptable: 0.98
    psi_params: ...
    n_top_features: ...
  model_path: model
  model_train_parameters:
    init_params: ...
    fit_params: ...
calibration:
  data: ...
  parameters:
    use_score: score_raw
    method: auto
    num_bins: 10
    split_type: quantile
test:
  data: ...
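A side note on the calibration block above: with split_type: quantile and num_bins: 10, raw scores are essentially mapped to the empirical target rate within quantile bins. A rough numpy sketch of this binning idea (an illustration, not the ARTEML implementation):

import numpy as np

def fit_bin_calibration(score_raw: np.ndarray, target: np.ndarray, num_bins: int = 10):
    """Fit quantile-bin calibration: bin edges plus the empirical target rate per bin."""
    edges = np.quantile(score_raw, np.linspace(0, 1, num_bins + 1))
    bin_idx = np.digitize(score_raw, edges[1:-1])
    rates = np.array([target[bin_idx == b].mean() for b in range(num_bins)])
    return edges, rates

def apply_bin_calibration(score_raw: np.ndarray, edges: np.ndarray, rates: np.ndarray) -> np.ndarray:
    """Map raw scores to calibrated probabilities via the target rate of their bin."""
    return rates[np.digitize(score_raw, edges[1:-1])]

This sketch assumes a continuous score without heavy ties; with many tied scores, some quantile bins can end up empty and the logic would need adjusting.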

Our team ran many experiments with various well-known AutoML libraries on different models and compared the quality metrics obtained on them.

Based on the results of experiments and evaluation of quality, speed and convenience metrics, we chose the open source solution from Amazon – AutoGluon as the main ML training core in ARTEML. Under the hood, it combines well-known model types: CatBoost, LightGBM, XGBoost, Random Forest, Linear Regression and others. You can also set options to exclude certain types of models, set time limits, etc.
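For reference, the core training call in AutoGluon looks roughly like this; a minimal sketch with illustrative paths and limits (ARTEML assembles the actual parameters from the config):

import pandas as pd
from autogluon.tabular import TabularPredictor

train_df = pd.read_parquet("train.parquet")  # illustrative paths; features plus a binary "target" column
val_df = pd.read_parquet("val.parquet")

predictor = TabularPredictor(
    label="target",
    eval_metric="roc_auc",  # the key_metric from the config
    path="model",           # the model_path from the config
).fit(
    train_data=train_df,
    tuning_data=val_df,
    time_limit=3600,                           # optional time budget, seconds
    excluded_model_types=["KNN", "NN_TORCH"],  # optionally exclude certain model types
)

print(predictor.leaderboard())  # per-model validation scores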

In ARTEML, an experiment named after the model is created in MLFlow for each model, and the model itself is registered in the MLFlow Model Registry. During training we log many artifacts, from which we can draw conclusions about the training process and the results obtained.

The artifacts are:

  • links to datasets saved on Hadoop so that they can be retrieved via the MLFlow API;

  • parameters of the final model;

  • metrics obtained during training;

  • config from which this training started for reproducibility of results;

  • received training logs;

  • information on feature selection;

  • the model itself and the leaderboard of the resulting model;

  • various artifacts of model quality testing: metrics calculated on full samples and by date, distribution graphs of these metrics, target distribution by model percentiles, etc.;

  • model calibration artifacts: obtained coefficients, distribution graphs of calibration coefficients.

Ideally, the DS should not have to calculate anything by hand after training is complete: if the need arises, we extend the current set of logged artifacts with new ones.
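Such logging is built on the standard MLflow API; below is a simplified sketch of what a training run might log (the model, metric values and paths are stand-ins, not the real ARTEML code):

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

mlflow.set_experiment("automl_model")  # experiment named after the model

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for the real training sample
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run(run_name="train_v1"):
    mlflow.log_params({"key_metric": "roc_auc", "sampling_n_rows": 200_000})
    mlflow.log_metric("roc_auc_train", roc_auc_score(y, model.predict_proba(X)[:, 1]))
    mlflow.log_text("hdfs://.../train_sample", "datasets/train_path.txt")  # link to the dataset on Hadoop
    mlflow.log_dict({"selected_features": ["feature_1", "feature_2"]}, "feature_selection.json")
    mlflow.sklearn.log_model(model, "model", registered_model_name="automl_model")  # Model Registry entry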

A logged run in MLFlow for the new version of the model

Example of a created artifact: graph of top features by permutation importance

Example of a created artifact: ROC curves on the training, validation and test samples

Inference

Here, again, all the necessary information is contained in the config: the path to the file with the sample collection logic, and where and in what form to save the resulting scores.

inference:
  sample_collection_script_path: scripts/sample_collection.py
  sample_function_name: inference_sample
  table: ...
  score_calibration: true

To run inference, the production version of the model and the necessary artifacts (calibration coefficients) are pulled from MLFlow based on the logged experiments. The configs make it clear which data marts are used to generate the features. Again, everything is automated; the only parameters to change are the date for which scoring is needed and the environment (test or production).
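Pulling the production version from the registry comes down to a standard MLflow call; a minimal sketch (the model name and sample path are illustrative):

import mlflow.pyfunc
import pandas as pd

# load whichever version of the model is currently in the Production stage of the Model Registry
model = mlflow.pyfunc.load_model("models:/automl_model/Production")

features = pd.read_parquet("inference_sample.parquet")  # sample built by the sample collection script
scores = model.predict(features)  # raw scores; calibration is applied afterwards if score_calibration is true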

Controller

The controller combines the steps of monitoring and comparing models (Quality Gate), since the functionality in both cases is similar – you need to calculate metrics on control samples. Let's look at each of the steps separately.

A. Monitoring

The decision to retrain the model is made based on the retraining flag in monitoring, which becomes positive when triggers fire, that is, when quality metrics no longer meet business needs. This module calculates and logs metrics for the current production model; then, by comparing the obtained metric with a baseline, a decision is made on whether retraining is needed.

Each model is unique and requires the calculation of different metrics with different parameters, which the DS can set through the config. For example, the main ones for propensity models are ROC-AUC, Precision@k, Recall@k and the PSI of the score (responsible for the stability of the obtained predictions). Here is an example of a config with the monitoring parameters that the DS can control.

monitoring:
  sample_collection_script_path: scripts/sample_collection.py
  sample_function_name: monitoring_sample
  metric_table: ...
  dt: ...
  key_metric: roc_auc
  base_score: 0.85
  epsilon: 0.03
  metrics:
  - roc_auc
  - precision_at_k
  metrics_params:
    precision_at_k:
      k_range:
      - 0.1
      - 0.2
      - 0.3
      mode: share
  additional_info:
    psi_features: 0.1
    psi_score: 0.1
    psi_target: 0.1
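The trigger itself is threshold logic on the key metric from the config above (key_metric, base_score, epsilon); a small sketch of the idea together with a precision@k calculation of the kind used for propensity models (helper names are mine, not ARTEML's):

import numpy as np

def precision_at_k(y_true: np.ndarray, score: np.ndarray, k: float = 0.1) -> float:
    """Share of actual positives among the top-k fraction of clients ranked by score."""
    n_top = max(1, int(len(score) * k))
    top_idx = np.argsort(-score)[:n_top]
    return float(y_true[top_idx].mean())

def needs_retraining(current_metric: float, base_score: float, epsilon: float) -> bool:
    """Raise the retraining flag when the key metric falls below the baseline by more than epsilon."""
    return current_metric < base_score - epsilon

# illustrative values matching the config above: base_score: 0.85, epsilon: 0.03
print(needs_retraining(current_metric=0.80, base_score=0.85, epsilon=0.03))  # True -> retrain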

B. Model Comparison (Quality Gate)

The decision to update the model to a new one is made based on the results of comparing two models (the current one and the new retrained one) with each other according to metrics. Accordingly, here we also count and log metrics, only for two models. At the end, we compare them: if the new model is superior in quality to the previous one, then we deploy it to production. The parameters in the Quality Gate config are quite similar to the monitoring parameters.

quality_gate:
  sample_collection_script_path: scripts/sample_collection.py
  sample_function_name: quality_gate_sample
  metric_table: ...
  dt: ...
  key_metric: roc_auc
  epsilon: 0.02
  metrics: ...
  metrics_params: ...
  additional_info:
    student_test: true
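The comparison itself boils down to checking whether the retrained model beats the current one by more than epsilon on the key metric; the student_test flag suggests a significance check on top of the raw difference. A rough sketch of such a check over per-date metrics with a paired Student's t-test (my illustration, not the exact ARTEML logic):

import numpy as np
from scipy import stats

def promote_new_model(metrics_old: np.ndarray, metrics_new: np.ndarray,
                      epsilon: float = 0.02, alpha: float = 0.05) -> bool:
    """Promote the new model if it beats the old one by more than epsilon on average
    and the improvement is statistically significant (paired Student's t-test)."""
    diff = metrics_new.mean() - metrics_old.mean()
    t_stat, p_value = stats.ttest_rel(metrics_new, metrics_old)
    return diff > epsilon and t_stat > 0 and p_value < alpha

# illustrative per-date ROC-AUC values for the current and the retrained model
old = np.array([0.78, 0.77, 0.79, 0.76, 0.78])
new = np.array([0.83, 0.80, 0.84, 0.79, 0.82])
print(promote_new_model(old, new))  # True -> deploy the new version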

How does it all work and fit together?

Before starting to roll out the model, the DS trains a baseline model. If the results are not satisfactory (suspicious or leaky features show up, the quality of the resulting model is unstable, the model is overfitted), additional research and reconfiguration of the current config is required, enriching it with additional parameters.

After obtaining a high-quality model, it may be necessary to run a pilot. To do this, the DS can obtain scores using the inference module and pass them to decision-making systems.

If the final quality satisfies the DS and the business unit, the model can be deployed to the execution system using the MLOps auto-retraining pipeline. You can read more about how the solution works at the infrastructure level in the article Automatic retraining of models in Production. Here we will briefly go over the basic concept and what is required from the DS for rollout.

I would like to note that in order to create the pipeline, the MLOps team had to try very hard to change the IT regulations in the bank and convince everyone of the need for such a system.

Previously, the model image was built in the development environment and sent to the image store, after which it was retrieved for each stage and could only be rebuilt in the development environment. Now there is a universal image, and the required image can be built directly in the production system at execution time, replacing the old model with a new one.

The image below shows what main components the MLOps pipeline consists of, and how these components interact with each other.


Main components of the MLOps pipeline

The pipeline contains components for training, inference, monitoring and comparison of models. In the deployment repository, all these modules are wrapped by the ARTEML library and provided as an initial template (if necessary, the DS always retains the ability to customize the code).

In essence, the current target process for rolling out standard models does not require writing code; all of these steps are controlled through the config. If necessary, the DS replaces the sample collection file with their own. The logic for updating and generating dates for training, monitoring and inference, depending on the current execution date, is likewise controlled by changing parameters in the auto-retraining config of the model repository. After that, the DS deploys the pipelines to the test environment and can also simulate the retraining process retrospectively:

  • Train the model on retro dates.

  • Then mark the resulting model as a production one.

  • Start monitoring on current dates; when the model degrades, a trigger should fire indicating that monitoring failed on the metric, and retraining of the model on current dates starts. After training, the model comparison task is launched automatically, and if the new version of the model is better, it is automatically promoted.

  • After that, the DS can run inference and look at the resulting model scores.

To set up production calculations, the DS sets a schedule for monitoring and inference runs. Most often, monitoring runs several days before inference, so that the model training pipeline and subsequent promotion can complete automatically and a fresh slice of the target event is included in the calculations. When the inference DAG runs on schedule, the production version of the model is always used.

Also, every time monitoring or the quality gate fails or passes, the DS receives an email notification and can study the logs and results (metric tables, the artifacts logged in MLFlow for the new training run).

Deployment of a new model version without changes to the program code happens without review and functional testing procedures, via a fast (automatic) process. If errors are detected in an automatically updated model, a quick rollback to the previous version is available.
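Promotion of a new version and rollback to the previous one can be expressed as stage transitions in the MLflow Model Registry; a simplified sketch with the standard client API (version numbers are illustrative, and the actual mechanics in the execution system may differ):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# promote the freshly retrained version to Production (the previous one is archived)
client.transition_model_version_stage(
    name="automl_model", version=2, stage="Production", archive_existing_versions=True,
)

# quick rollback: return the previous version to Production if errors are found in the new one
client.transition_model_version_stage(
    name="automl_model", version=1, stage="Production", archive_existing_versions=True,
)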

Necessary prerequisites

Automation of ML processes and building an auto-retraining system require a certain technological maturity. During 2022-2023 we put in place the machine learning infrastructure listed below; without such automation tools, it makes little sense to start building an auto-retraining process right away.

  • Feature Store. All incoming data is automatically processed by ETL processes, and features are extracted from it. There are now about 25,000 of them in total. Features are organized into long lists by customer group (for example, the long list for legal entities contains over 3,000 features), and any model can "look" into the corresponding list when retraining is needed.

  • Target Store. It contains all the necessary target events related to certain models for regular calculation.

  • Model development platform — service for DS (Jupyter notebook, MLFlow, Bitbucket, Airflow and others)

  • Unified Model Execution System — SIM, which facilitates the implementation of models and provides the possibility of test deployment of the model.

What did we get?

As a result, we have the AutoReTrainable ML Framework – the first fully automatic system for retraining models put into operation in the banking sector in Russia.

With its help, we reduced the resource costs of DS and other involved specialists for routine training and retraining of models, freeing up resources for research tasks.

AutoReTrainable ML Framework has three unique characteristics that are not found among other solutions for automating machine learning in banks in Russia.

  • It is a system for continuous monitoring of models running in production systems, with a feedback loop.

  • End-to-end automation – from feature selection to seamlessly replacing the old model with a new one in a working business system, with minimal human participation in decision making.

  • But most importantly, the new retrained model is immediately rolled out into the industrial system – now there is no need to return it to the development environment and go through all the stages again.

The implementation increased the average annual quality of models by 5-7%. Without the use of retraining, the accuracy of model forecasts begins to decline after six months of use. Regular monitoring and timely retraining returns metrics to their original level. Typically one or two such procedures are required per year.

The projected financial effect for 2024 is savings of 200 million rubles.

In addition, a significant amount of time was freed up for data analysts and machine learning engineers, which in turn helped expand the number and range of tasks they solve and sped up many processes. For clients, the good performance of the bank's models shows up as better service: they are offered exactly the services they are interested in, along with reliable protection of the funds in their accounts.

How work has changed for DS: comparison table.

Was: Development of standard models in Jupyter notebooks, manual validation with a high probability of error, mandatory review by the team lead.
It became: Development of standard models through the library; the entry point is a standard config.

Was: The choice of algorithms, preprocessing, artifact logging and pipeline reproducibility lie entirely with the DS.
It became: Algorithms and modeling steps are standardized at the library level; logging of stages and reproducibility of pipelines are ensured.

Was: Manual analysis of monitoring metrics and manual decision-making on retraining.
It became: Automatic calculation of metrics and evaluation of retraining triggers at the monitoring stage.

Was: Writing model deployment code and going through all the required stages to bring the model into production; in the case of a manual model update – manual re-deployment of the model.
It became: Rollout via auto-retraining pipelines, with the required logic configured through the config (no code changes).

In other words, the way of developing and rolling out the model has changed for DS. It has become simpler. And faster.

Acknowledgments

We express our gratitude to everyone involved in the creation of AutoReTrainable ML Framework.

  • Advanced Analytics Department team: Konstantin Chetin, Maxim Tyurikov, Andrey Miroshnichenko, Kirill Anpilov, Dmitry Tyrin, Timofey Lisochenko.

  • MLOps colleagues: Ekaterina Lazaricheva, Mark Kuznetsov, Maxim Sinyugin, Dmitry Goncharov, Ilya Myasnikov, Alexander Egorov, Artem Solovyov.
