Automatic retraining of models in production

Machine learning models are becoming critical to businesses, helping to streamline processes and make more informed decisions. However, their relevance and accuracy can quickly decline as data changes. Automatic retraining of models in production solves this problem, ensuring models are updated and improved without significant time investment.

In this article, we will look at the process of automatically retraining ML models in production using MLOps tools. We'll discuss integrating tools like AirFlow and Spark with CI/CD pipelines, as well as creating a configuration module that allows developers to focus on models without delving into infrastructure details.

Typically, DevOps pipelines are built around CI/CD tasks such as building images and deploying them to Kubernetes. In the context of machine learning, however, more specialized tools are required, such as AirFlow and Spark.

To integrate Spark with Kubernetes, we use dedicated operators that control the launch of SparkApplication resources and the creation of Spark sessions inside Pods. AirFlow works with DAGs: directed acyclic graphs described in Python. Each of these technologies is launched in its own way, from individual Kubernetes manifests to Python scripts.

So, to train and run models, we developed a Python module that, based on an input configuration from the developer, allows models to be trained and launched without deep knowledge of the infrastructure. The module serves as a link between the data scientist's (DS) knowledge of the model and the infrastructure.

The DS creates a configuration file in their repository to work with our module. The file contains information about the model: the time and interval for launching DAGs, parameters for the Spark session, the Python version for running the model, and other settings for training, inference and monitoring. The module's task is to take this user configuration, create a SparkApplication and a DAG from it, and hand everything to the AirFlow Scheduler as part of a pipeline written in Jenkins.
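
To make this more concrete, here is a minimal sketch of what such a configuration might look like and how the module could read it. The schema and field names are hypothetical illustrations, not the exact format of our module.

```python
# Hypothetical example of a DS configuration file and how the module might load it.
# The schema below is illustrative only; the real module defines its own fields.
import yaml  # requires PyYAML

EXAMPLE_CONFIG = """
model_name: churn_model
schedule: "0 3 * * *"            # when and how often the DAG is launched
python_version: "3.10"
spark:
  driver_memory: 2g
  executor_memory: 4g
  executor_instances: 2
training:
  entrypoint: train.py
  quality_gate:
    metric: roc_auc
    threshold: 0.85
inference:
  entrypoint: inference.py
monitoring:
  entrypoint: monitoring.py
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
print(config["spark"]["executor_memory"])  # -> 4g
```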


Jenkins pipeline for generating manifests and DAGs

The process starts with pushing the code and testing it. Once the code passes the checks, the model environment is built and an archive of it is saved to S3 storage.

You might be surprised and ask, “Where are the Docker images?” Here we abandoned the standard approach of building Docker images in favor of storing tar.gz archives for the following reasons (a packaging sketch follows the list):

  • Speed: Building and downloading the archive takes less time than creating and downloading large Docker images.

  • Debugging: with an environment archive, you can easily download it into JupyterLab, for example, unpack it into a fresh environment, and test the model.
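
For illustration, here is a rough sketch of how an environment can be packed into a tar.gz archive and uploaded to S3. It assumes conda-pack and boto3; the environment and bucket names are hypothetical.

```python
# A rough sketch of packing a model environment into a tar.gz archive and
# uploading it to S3 instead of building a Docker image. Environment and
# bucket names are hypothetical.
import subprocess

import boto3

ENV_NAME = "churn_model_env"
ARCHIVE = f"{ENV_NAME}.tar.gz"

# Pack the conda environment into a relocatable tar.gz archive (conda-pack).
subprocess.run(["conda", "pack", "-n", ENV_NAME, "-o", ARCHIVE, "--force"], check=True)

# Upload the archive so training/inference Pods (or JupyterLab) can fetch it later.
s3 = boto3.client("s3")
s3.upload_file(ARCHIVE, "ml-environments", f"{ENV_NAME}/{ARCHIVE}")
```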

After the environment is prepared and saved, the MLOps module creates the necessary manifests based on the user configuration and passes them to AirFlow. The Jenkins pipeline finishes its work here, and all further actions take place in AirFlow.
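
As a toy illustration of this step (an assumption about how such a module could be structured, not its actual code), the user configuration can be rendered into a heavily abbreviated SparkApplication manifest with a template engine such as Jinja2:

```python
# Toy illustration: render a (heavily abbreviated) SparkApplication manifest
# from the user configuration with Jinja2. Field values are hypothetical.
from jinja2 import Template

config = {
    "model_name": "churn_model",
    "spark": {"driver_memory": "2g", "executor_memory": "4g", "executor_instances": 2},
}

SPARK_APP_TEMPLATE = Template("""
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: {{ model_name }}-training
spec:
  type: Python
  mode: cluster
  driver:
    memory: {{ spark.driver_memory }}
  executor:
    instances: {{ spark.executor_instances }}
    memory: {{ spark.executor_memory }}
""")

manifest = SPARK_APP_TEMPLATE.render(model_name=config["model_name"], spark=config["spark"])
print(manifest)
```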

Main processes inside DAG

Now we have a ready-made environment for the model, a DAG for AirFlow, and a manifest for running a SparkApplication in the cluster. We can start training.

As part of the DAG, we use the popular Spark Operator, which creates Spark sessions and waits for them to complete (a minimal code sketch follows the list):

  • The first task launches a Spark session: a SparkApplication resource is created in Kubernetes from the manifest that was passed to the Scheduler at the previous stage. The SparkApplication creates a Pod in which model training begins.

  • The second task, Spark Wait, monitors the state of the SparkApplication after it is created.
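
Below is a minimal sketch of this pair of tasks using the SparkKubernetesOperator and SparkKubernetesSensor from the apache-airflow-providers-cncf-kubernetes package. The DAG id, namespace, file name and connection id are hypothetical, and exact parameters vary between provider versions.

```python
# Minimal sketch of the "submit + wait" pair of tasks; names and parameters
# are hypothetical and may differ between provider versions.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator
from airflow.providers.cncf.kubernetes.sensors.spark_kubernetes import SparkKubernetesSensor

with DAG(
    dag_id="churn_model_training",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",  # taken from the DS configuration
    catchup=False,
) as dag:
    # Task 1: submit the SparkApplication manifest generated by the MLOps module.
    spark_submit = SparkKubernetesOperator(
        task_id="spark_submit",
        namespace="ml-training",
        application_file="spark_application.yaml",
        kubernetes_conn_id="kubernetes_default",
        do_xcom_push=True,
    )

    # Task 2 ("Spark Wait"): watch the SparkApplication until it finishes.
    spark_wait = SparkKubernetesSensor(
        task_id="spark_wait",
        namespace="ml-training",
        application_name=(
            "{{ task_instance.xcom_pull(task_ids='spark_submit')['metadata']['name'] }}"
        ),
        kubernetes_conn_id="kubernetes_default",
        attach_log=True,
    )

    spark_submit >> spark_wait
```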

After training completes, the model's accuracy is checked (a quality gate). If the model passes the check, it is marked in MLFlow with the appropriate label and can participate in further rounds of inference and retraining. The DAGs for inference and monitoring have a similar structure.
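
A simplified sketch of such a quality gate is shown below. It assumes the training task logs its metrics to MLFlow; the metric name, threshold and tag values are hypothetical.

```python
# Simplified quality-gate sketch: compare a logged metric against a threshold
# and tag the registered model version accordingly. Names are hypothetical.
import mlflow
from mlflow.tracking import MlflowClient


def apply_quality_gate(run_id: str, model_name: str, version: str,
                       metric: str = "roc_auc", threshold: float = 0.85) -> bool:
    """Return True if the model passes the gate; tag the version either way."""
    client = MlflowClient()
    value = mlflow.get_run(run_id).data.metrics.get(metric, 0.0)
    passed = value >= threshold
    client.set_model_version_tag(
        model_name, version, "quality_gate", "success" if passed else "failed"
    )
    return passed
```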


Let's take another look at what is going on at the top level.

We have a configuration file from DS, on the basis of which we prepare a DAG in the Jenkins pipeline. In our approach, the repository with the DS model contains scripts for training, inference, quality control and monitoring.

If a training pipeline is running, the DAG launches training and the subsequent quality check. If an inference pipeline is running, the corresponding script is launched, and the same goes for monitoring. Parameters for all DAGs and SparkApplications are set according to what is specified in the DS configuration.
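
Schematically (an assumption about how such routing could look, not our exact code), the module only needs to check which sections are present in the configuration to decide which scripts the DAGs should launch:

```python
# Schematic routing: pick the entrypoint script for each pipeline type that is
# declared in the DS configuration. Field names follow the earlier config sketch.
PIPELINE_TYPES = ("training", "inference", "monitoring")


def entrypoints_from_config(config: dict) -> dict:
    """Map each declared pipeline type to the script its DAG should launch."""
    return {
        pipeline: config[pipeline]["entrypoint"]
        for pipeline in PIPELINE_TYPES
        if pipeline in config
    }

# e.g. {"training": "train.py", "inference": "inference.py", "monitoring": "monitoring.py"}
```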

Thus, we have implemented both training and launching of models together with quality checks. If a model loses accuracy, the monitoring DAG can detect this and subsequently trigger its automatic retraining and recalculation.


However, once these processes are set up, a question arises: how do we coordinate the pipelines across several environments so that retraining can be applied in production? Here we are helped by a technical pipeline that promotes models between environments.

As previously mentioned, the model repository contains the DS scripts implementing the logic of training, inference, quality control and monitoring. After training and the quality check, the script returns the result and calls our promoter pipeline.

In the latest versions of MLFlow, tags and aliases have been added for model versions. Tags in MLFlow allow you to attach metadata to models and their versions to describe and classify them, making them easier to find and manage. Aliases create convenient references to specific model versions, which makes it easier to switch between them and manage their states (for example, production or staging).
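
For example, with the alias API an inference job can always pick up whichever version currently carries the environment's alias. A small illustration with hypothetical model, alias and version names:

```python
# Small illustration of MLFlow aliases: point an alias at a version, then load
# the model through the alias. Model name, alias and version are hypothetical.
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.set_registered_model_alias("churn_model", "production", version="3")

# Inference code does not need to know the version number, only the alias.
model = mlflow.pyfunc.load_model("models:/churn_model@production")
```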

The promoter's logic is simple (a code sketch follows the list):

  • If the model is trained well, the success tag is set and an alias with the name of the environment is assigned. An alias cannot be repeated across versions of the same model, so it unambiguously marks the current version. Next, the rollout pipeline is launched and the environment archive is copied to the S3 storage of the next environment.

  • If a model does not pass the quality check, it is marked with the failed tag but continues to exist.
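
Here is a rough code sketch of this promoter logic. The environment names, buckets and the S3 copy step are hypothetical illustrations of the flow, not our exact implementation.

```python
# Rough sketch of the promoter: tag the version, pin the environment alias,
# and copy the environment archive towards the next environment's S3 storage.
import boto3
from mlflow.tracking import MlflowClient


def promote(model_name: str, version: str, passed: bool, next_env: str = "sim_int") -> None:
    client = MlflowClient()
    if not passed:
        # The version is kept, but marked so it is never promoted.
        client.set_model_version_tag(model_name, version, "quality_gate", "failed")
        return

    client.set_model_version_tag(model_name, version, "quality_gate", "success")
    # An alias is unique within a registered model, so it unambiguously marks
    # the version currently used on the given environment.
    client.set_registered_model_alias(model_name, next_env, version)

    # Copy the packed environment archive to the next environment's S3 storage.
    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket=f"ml-environments-{next_env}",
        Key=f"{model_name}/env.tar.gz",
        CopySource={"Bucket": "ml-environments-dev", "Key": f"{model_name}/env.tar.gz"},
    )
```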

Life cycle of models across environments

We have four environments: MDP (model development platform) for development, two SIM (system of inference models) environments for testing, INT and LT, and SIM Prom for production. A model is created in the DS development environment and its training is started. After the quality check, the promoter launches inference and simultaneously sends the environment archive to the next environment.

The model is launched and tested in the development and testing environments and is successively promoted towards production. When it passes testing and withstands the load, it is launched in production and operates normally with periodic monitoring checks. If accuracy drops, monitoring detects this, and the model is retrained and redeployed.


Automatically retraining machine learning models in production is an important step toward keeping them up to date and accurate. In this article, we examined the key aspects of implementing this process on top of the popular combination of AirFlow and Spark.

We looked at how the infrastructure and integration of specialized tools can be organized to minimize developer involvement in infrastructure tasks and allow them to focus on creating and improving models.

The use of these practices and tools can significantly improve the performance and reliability of machine learning models, as well as speed up the process of their implementation in production. This, in turn, contributes to more efficient use of data and improved quality of business processes.

We hope that the recommendations and examples provided will help you implement automatic model retraining in your organization so that you can get the most out of it.
