Introduction to MLflow

MLflow is a tool for managing the machine learning lifecycle: tracking experiments, managing and deploying models and projects. In this guide, we will look at how to organize experiments and runs, optimize hyperparameters with Optuna, compare models, and choose the best parameters. We will also cover logging models, using them in different formats, packaging a project as an MLproject, and setting up a remote MLflow Tracking Server.

You can simply read the article or repeat all the steps locally. You can also follow the repository link; it contains an extended version of this guide in English in the README.

Preparing the environment

First we need to install conda and create a new environment:

name: mlflow-example
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11.4
  - pip=24.0
  - mlflow=2.14.2
  - xgboost=2.0.3
  - jupyter=1.0.0
  - loguru=0.7.2
  - shap=0.45.1
  - pandas=2.2.2
  - scikit-learn=1.5.1
  - scipy=1.14.0
  - numpy=1.26.4
  - jupytext=1.16.3
  - psutil=6.0.0
  - boto3=1.34.148
  - psycopg2-binary=2.9.9
  - pip:
    - optuna==3.6.1
    - mlserver==1.3.5
    - mlserver-mlflow==1.3.5
    - mlserver-xgboost==1.3.5

You can install the packages manually or save this file locally as conda.yaml and run:

conda env create -f conda.yaml

Now you can use this environment in the IDE or in the terminal:

conda activate mlflow-example

MLflow UI

Run mlflow ui in the terminal to start MLflow on localhost:5000.

MLflow will run in local mode; by default, it creates an mlruns folder for storing artifacts and meta-information.


MLflow Projects

MLflow Project is an approach to structuring a project, following conventions that make it easy for other developers (or automated tools) to run the project. Each project is a directory of files or a git repository containing your code. MLflow can run these projects based on certain rules for organizing files in the directory.

The MLproject file helps MLflow and other users understand and run your project by specifying the environment, entry points, and possible settings to configure:

name: Cancer_Modeling

conda_env: conda.yaml

entry_points:
  data-preprocessing:
    parameters:
      test-size: {type: float, default: 0.33}
    command: "python data_preprocessing.py --test-size {test-size}"
  hyperparameters-tuning:
    parameters:
      n-trials: {type: int, default: 10}
    command: "python hyperparameters_tuning.py --n-trials {n-trials}"
  model-training:
    command: "python model_training.py"
  data-evaluation:
    parameters:
      eval-dataset: {type: str}
    command: "python data_evaluation.py --eval-dataset {eval-dataset}"

We can run the project's entry points using either the command line interface (CLI) or the Python API.
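For example, the data-preprocessing entry point defined above can be launched from Python as well; a minimal sketch, assuming the MLproject file lives in the mlproject directory as in the CLI examples used later:

import mlflow

# CLI equivalent:
#   mlflow run mlproject --entry-point data-preprocessing --env-manager local -P test-size=0.33

# Python API:
mlflow.run(
    uri="mlproject",
    entry_point="data-preprocessing",
    env_manager="local",
    experiment_name="Cancer_Classification",
    run_name="Data_Preprocessing",
    parameters={"test-size": 0.33},
)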

MLflow Experiments and MLflow Runs

MLflow Experiments and MLflow Runs are the main abstractions for structuring a project. Let's look at an example of developing a model on the open breast cancer dataset.

Data preprocessing

This step is usually more complex, but here we simply load the dataset, split it into training and testing sets, log some metrics to MLflow (number of samples and features), and save the datasets themselves as MLflow artifacts.

import sys
import argparse
import mlflow
import warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets
from loguru import logger

from config import config

# set up logging
logger.remove()
logger.add(sys.stdout, format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")
warnings.filterwarnings('ignore')

def get_cancer_df():
    cancer = datasets.load_breast_cancer()
    X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
    y = pd.Series(cancer.target)
    logger.info(f'Cancer data downloaded')
    return X, y


if __name__ == '__main__':
    
    parser = argparse.ArgumentParser()
    parser.add_argument("--test-size", default=config.default_test_size, type=float)
    TEST_SIZE = parser.parse_args().test_size
        
    logger.info(f'Data preprocessing started with test size: {TEST_SIZE}')
        
    # download cancer dataset
    X, y = get_cancer_df()

    # add additional features
    X['additional_feature'] = X['mean symmetry'] / X['mean texture']
    logger.info('Additional features added')

    # log dataset size and features count
    mlflow.log_metric('full_data_size', X.shape[0])
    mlflow.log_metric('features_count', X.shape[1])

    # split dataset to train and test part and log sizes to mlflow
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE)
    mlflow.log_metric('train_size', X_train.shape[0])
    mlflow.log_metric('test_size', X_test.shape[0])
    
    # log and register datasets
    train = X_train.assign(target=y_train)
    mlflow.log_text(train.to_csv(index=False),'datasets/train.csv')
    dataset_source_link = mlflow.get_artifact_uri('datasets/train.csv')
    dataset = mlflow.data.from_pandas(train, name="train", targets="target", source=dataset_source_link)
    mlflow.log_input(dataset)

    test = X_test.assign(target=y_test)
    mlflow.log_text(test.to_csv(index=False),'datasets/test.csv')
    dataset_source_link = mlflow.get_artifact_uri('datasets/test.csv')
    dataset = mlflow.data.from_pandas(test, name="test", targets="target", source=dataset_source_link)
    mlflow.log_input(dataset)
    
    logger.info('Data preprocessing finished')

To execute this code, save it to a file data_preprocessing.py next to the MLproject file and call from the command line:

mlflow run mlproject --entry-point data-preprocessing --env-manager local --experiment-name Cancer_Classification --run-name Data_Preprocessing -P test-size=0.33

This run should now be visible in the UI. The "Datasets used" section is populated thanks to the mlflow.data API. However, it is important to understand that this is metadata about the data, not the data itself.

The mlflow.data module tracks information about datasets during model training and evaluation: features, targets, predictions, name, schema, and source. This metadata is logged using mlflow.log_input().
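The logged dataset metadata can later be read back from the run; a small sketch, where run_id is assumed to be the ID of the preprocessing run:

import mlflow

run = mlflow.get_run(run_id)  # run_id of the Data_Preprocessing run (assumed available)
for dataset_input in run.inputs.dataset_inputs:
    ds = dataset_input.dataset
    # name, digest and the artifact link we passed as `source`
    print(ds.name, ds.digest, ds.source)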

Hyperparameter tuning

In this part we use Optuna to find the best hyperparameters for XGBoost, with XGBoost's built-in cross-validation to train and evaluate the model; a sketch of the tuning script is shown below. We will also look at how to collect metrics during the training process.
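The tuning script itself is not reproduced here; a minimal sketch of what hyperparameters_tuning.py might look like (the parameter ranges and metric names are illustrative assumptions, and dtrain is the DMatrix built from the training split):

import mlflow
import optuna
import xgboost as xgb

def objective(trial, dtrain):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 2, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
    }
    # each trial is logged as a nested run under the main Hyperparameters_Search run
    with mlflow.start_run(nested=True):
        cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5, early_stopping_rounds=10)
        auc = cv_results['test-auc-mean'].iloc[-1]
        mlflow.log_params(params)
        mlflow.log_metric('cv_auc', auc)
    return auc

with mlflow.start_run():
    study = optuna.create_study(direction='maximize')
    study.optimize(lambda trial: objective(trial, dtrain), n_trials=10)
    mlflow.log_params(study.best_params)
    mlflow.log_metric('best_cv_auc', study.best_value)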

Here we use conda as the environment manager: MLflow will automatically create a new environment for us from the file specified in the MLproject config.

mlflow run mlproject --entry-point hyperparameters-tuning --env-manager conda --experiment-name Cancer_Classification --run-name Hyperparameters_Search -P n-trials=10

Let's look at the results in the MLflow UI. You can use nested runs to structure your project: there is one main run for the hyperparameter tuning, and all the trials are collected as nested runs. MLflow also provides the ability to customize the columns and row order in this view:

In the chart view, you can compare runs and customize different charts. Using XGBoost callbacks to log metrics during model training allows you to create charts with the number of trees on the x-axis.
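One way to collect these per-round metrics is a custom XGBoost TrainingCallback that pushes the evaluation results to MLflow after every boosting round; a sketch (the metric names mirror the eval_metric values used in training):

import mlflow
import xgboost as xgb

class MLflowMetricsCallback(xgb.callback.TrainingCallback):
    """Log every evaluation metric to MLflow after each boosting round."""
    def after_iteration(self, model, epoch, evals_log):
        # evals_log looks like {'test': {'auc': [...], 'error': [...]}}
        for data_name, metrics in evals_log.items():
            for metric_name, values in metrics.items():
                mlflow.log_metric(f'{data_name}-{metric_name}', values[-1], step=epoch)
        return False  # False means "do not stop training"

# usage: xgb.train(params, dtrain, evals=[(dtest, 'test')], callbacks=[MLflowMetricsCallback()])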

Select multiple runs, click the compare button, and select the most useful view. The mlflow compare function can be useful when optimizing hyperparameters, as it helps refine and adjust the boundaries of possible intervals based on the comparison results.

System metrics can also be monitored throughout the run. While this does not provide an accurate estimate of the actual requirements, it can still be useful in some cases.
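In MLflow 2.x, system metrics logging relies on psutil (already pinned in conda.yaml) and can be switched on globally or per run; a minimal example:

import mlflow

mlflow.enable_system_metrics_logging()            # enable for all subsequent runs, or
with mlflow.start_run(log_system_metrics=True):   # enable for a single run
    ...  # training code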

Log and register model

It is possible, but not necessary, to save the model for each experiment and run. In most cases, it is better to save the parameters and then, having selected the best ones, perform an additional run to save the model. Here we follow the same logic: use the parameters from the best run to save the final model and register it for versioning and use via a short link.

import os
import sys
import tempfile
import mlflow
import warnings
import logging
import xgboost as xgb
import pandas as pd
from loguru import logger


# set up logging
warnings.filterwarnings('ignore')
logging.getLogger('mlflow').setLevel(logging.ERROR)
logger.remove()
logger.add(sys.stdout, format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")


if __name__ == '__main__':

    logger.info('Model training started')
 
    mlflow.xgboost.autolog()

    with mlflow.start_run() as run:

        experiment_id = run.info.experiment_id
        
        run_id = run.info.run_id
        logger.info(f'Start mlflow run: {run_id}')
        
        # get last finished run for data preprocessing
        last_data_run_id = mlflow.search_runs(
            experiment_ids=[experiment_id],
            filter_string=f"tags.mlflow.runName="Data_Preprocessing" and status="FINISHED"",
            order_by=["start_time DESC"]
        ).loc[0, 'run_id']
    
        # download train and test data from last run
        with tempfile.TemporaryDirectory() as tmpdir:
            mlflow.artifacts.download_artifacts(run_id=last_data_run_id, artifact_path="datasets/train.csv", dst_path=tmpdir)
            mlflow.artifacts.download_artifacts(run_id=last_data_run_id, artifact_path="datasets/test.csv", dst_path=tmpdir)
            train = pd.read_csv(os.path.join(tmpdir, 'datasets/train.csv'))
            test = pd.read_csv(os.path.join(tmpdir, 'datasets/test.csv'))

        # convert to DMatrix format
        features = [i for i in train.columns if i != 'target']
        dtrain = xgb.DMatrix(data=train.loc[:, features], label=train['target'])
        dtest = xgb.DMatrix(data=test.loc[:, features], label=test['target'])

        # get last finished run for hyperparameters tuning
        last_tuning_run = mlflow.search_runs(
            experiment_ids=[experiment_id],
            filter_string=f"tags.mlflow.runName="Hyperparameters_Search" and status="FINISHED"",
            order_by=["start_time DESC"]
        ).loc[0, :]
        
        # get best params
        params = {col.split('.')[1]: last_tuning_run[col] for col in last_tuning_run.index if 'params' in col}
        params.update(eval_metric=['auc', 'error'])

        mlflow.log_params(params)
        
        model = xgb.train(
            dtrain=dtrain,
            num_boost_round=int(params["num_boost_round"]),
            params=params,
            evals=[(dtest, 'test')],
            verbose_eval=False,
            early_stopping_rounds=10
        )

        mlflow.log_metric("accuracy", 1 - model.best_score)
        
        # Log model as Booster
        input_example = test.loc[0:10, features]
        predictions_example = pd.DataFrame(model.predict(xgb.DMatrix(input_example)), columns=['predictions'])
        mlflow.xgboost.log_model(model, "booster", input_example=input_example)
        mlflow.log_text(predictions_example.to_json(orient="split", index=False), 'booster/predictions_example.json')

        # Register model
        model_uri = f"runs:/{run.info.run_id}/booster"
        mlflow.register_model(model_uri, 'CancerModelBooster')
        
        # Log model as sklearn-compatible XGBClassifier
        params.update(num_boost_round=model.best_iteration)
        model = xgb.XGBClassifier(**params)
        model.fit(train.loc[:, features], train['target'])
        mlflow.xgboost.log_model(model, "model", input_example=input_example)

        # log datasets
        mlflow.log_text(train.to_csv(index=False), 'datasets/train.csv')
        mlflow.log_text(test.to_csv(index=False),'datasets/test.csv')

        logger.info('Model training finished')

        # Register the model
        model_uri = f"runs:/{run.info.run_id}/model"
        mlflow.register_model(model_uri, 'CancerModel')
        
        logger.info('Model registered')

To execute this code, save it as model_training.py next to the MLproject file and run:

mlflow run mlproject --entry-point model-training --env-manager conda --experiment-name Cancer_Classification --run-name Model_Training

Thanks to mlflow.xgboost.autolog(), all metrics are logged automatically, including during the training process:

Once the model is saved, we can access the artifacts page inside the run. Artifacts can store any type of file, such as custom graphs, text files, images, datasets, Python scripts, and more. For example, I converted a work notebook to HTML and saved it as an artifact:
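A hedged sketch of how such a notebook can be converted and attached to an existing run (the notebook name and run_id here are illustrative, not part of the project files):

import subprocess
import mlflow

# convert the working notebook to HTML with nbconvert
subprocess.run(['jupyter', 'nbconvert', '--to', 'html', 'work_notebook.ipynb'], check=True)

# attach the resulting file to the training run
with mlflow.start_run(run_id=run_id):  # run_id of the Model_Training run (assumed available)
    mlflow.log_artifact('work_notebook.html', artifact_path='notebooks')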

For each model, MLflow automatically creates a YAML configuration file called MLmodel. This file can be viewed in the MLflow interface or downloaded and examined.

artifact_path: model
flavors:
  python_function:
    data: model.xgb
    env:
      conda: conda.yaml
      virtualenv: python_env.yaml
    loader_module: mlflow.xgboost
    python_version: 3.11.4
  xgboost:
    code: null
    data: model.xgb
    model_class: xgboost.sklearn.XGBClassifier
    model_format: xgb
    xgb_version: 2.0.3
mlflow_version: 2.14.2
model_size_bytes: 35040
model_uuid: 516954aae7c94e91adeed9df76cb4052
run_id: bf212703d1874eee9dcb37c8a92a6de6
saved_input_example_info:
  artifact_path: input_example.json
  pandas_orient: split
  type: dataframe
signature:
  inputs: '[{"type": "double", "name": "mean radius", "required": true}, {"type":
    "double", "name": "mean texture", "required": true}, {"type": "double", "name":
    "mean perimeter", "required": true}, {"type": "double", "name": "mean area", "required":
    true}, {"type": "double", "name": "mean smoothness", "required": true}, {"type":
    "double", "name": "mean compactness", "required": true}, {"type": "double", "name":
    "mean concavity", "required": true}, {"type": "double", "name": "mean concave
    points", "required": true}, {"type": "double", "name": "mean symmetry", "required":
    true}, {"type": "double", "name": "mean fractal dimension", "required": true},
    {"type": "double", "name": "radius error", "required": true}, {"type": "double",
    "name": "texture error", "required": true}, {"type": "double", "name": "perimeter
    error", "required": true}, {"type": "double", "name": "area error", "required":
    true}, {"type": "double", "name": "smoothness error", "required": true}, {"type":
    "double", "name": "compactness error", "required": true}, {"type": "double", "name":
    "concavity error", "required": true}, {"type": "double", "name": "concave points
    error", "required": true}, {"type": "double", "name": "symmetry error", "required":
    true}, {"type": "double", "name": "fractal dimension error", "required": true},
    {"type": "double", "name": "worst radius", "required": true}, {"type": "double",
    "name": "worst texture", "required": true}, {"type": "double", "name": "worst
    perimeter", "required": true}, {"type": "double", "name": "worst area", "required":
    true}, {"type": "double", "name": "worst smoothness", "required": true}, {"type":
    "double", "name": "worst compactness", "required": true}, {"type": "double", "name":
    "worst concavity", "required": true}, {"type": "double", "name": "worst concave
    points", "required": true}, {"type": "double", "name": "worst symmetry", "required":
    true}, {"type": "double", "name": "worst fractal dimension", "required": true},
    {"type": "double", "name": "additional_feature", "required": true}]'
  outputs: '[{"type": "long", "required": true}]'
  params: null
utc_time_created: '2024-07-30 08:38:11.745511'

The MLmodel file contains information about the flavors, here python_function and xgboost. It also includes environment settings for conda (conda.yaml) and virtualenv (python_env.yaml). The model is an XGBoost classifier compatible with the sklearn API, saved in the XGBoost 2.0.3 format. The file tracks details such as the model size, UUID, run ID, and creation time. There is also a sample input for the model and its signature – the specification of the inputs and outputs. The signature can be created and saved manually with the model, but MLflow generates it automatically when provided with a sample input.

In the MLflow ecosystem, flavors are wrappers for specific machine learning libraries that allow models to be saved, logged, and retrieved in a single format. This ensures consistent predict behavior across frameworks, making it easier to manage and deploy models.

Adding predictions for sample input data can be useful as it allows you to immediately test the model after loading it and ensure that it works correctly in different environments.

We kept two versions of the model, both using XGBoost, each with two flavors: python_function and xgboost. The difference lies in the model class: for the booster it is xgboost.core.Booster, and for the model it is xgboost.sklearn.XGBClassifier, which supports a scikit-learn-compatible API. These differences affect how the predict method behaves, so it is important to review the MLmodel file and check the model signature before using a model. There may also be slight differences in performance: python_function is usually a little slower.

When loading a booster model using xgboost, the model expects the input to be in the form of a DMatrix object, and the predict method in our case will output scores (not classes).
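A short illustration of the difference, assuming the test DataFrame and features list from the training step are still in scope:

import mlflow
import xgboost as xgb

# sklearn-compatible model: accepts a DataFrame, predict returns class labels
clf_model = mlflow.pyfunc.load_model('models:/CancerModel/1')
print(clf_model.predict(test.loc[:, features])[:5])

# raw Booster: expects a DMatrix, predict returns scores
booster = mlflow.xgboost.load_model('models:/CancerModelBooster/1')
print(booster.predict(xgb.DMatrix(test.loc[:, features]))[:5])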

Built-in evaluation

The built-in mlflow.evaluate function allows you to evaluate models on additional datasets:

import sys
import os
import argparse
import warnings
import logging
import mlflow
import pandas as pd
from loguru import logger


logger.remove()
logger.add(sys.stdout, format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")
warnings.filterwarnings('ignore')
logging.getLogger('mlflow').setLevel(logging.ERROR)


if __name__ == '__main__':

    logger.info('Evaluation started')

    parser = argparse.ArgumentParser()
    parser.add_argument("--eval-dataset", type=str)
    eval_dataset = pd.read_csv(parser.parse_args().eval_dataset)
        
    with mlflow.start_run() as run:
        
        eval_dataset = mlflow.data.from_pandas(
            eval_dataset, targets="target"
        )
        last_version = mlflow.MlflowClient().get_registered_model('CancerModel').latest_versions[0].version
        mlflow.evaluate(
            data=eval_dataset, model_type="classifier", model=f'models:/CancerModel/{last_version}'
        )
        logger.success('Evaluation finished')

To execute this code, save it as data_evaluation.py next to the MLproject file and run:

mlflow run mlproject --entry-point data-evaluation --env-manager conda --experiment-name Cancer_Classification --run-name Data_Evaluation -P eval-dataset="test.csv"

The results can be viewed in the MLflow interface, where various metrics and graphs are presented, including ROC-AUC, confusion matrices, and SHAP plots (if SHAP is installed).

Model Serving

MLflow has built-in capabilities for deploying models. Deploying a model using Flask is quite simple: mlflow models serve -m models:/CancerModel/1 --env-manager local. But we can also use mlserver.

The mlserver package facilitates efficient deployment and serving of machine learning models, supporting multiple frameworks over REST and gRPC interfaces. It integrates with Seldon Core for scalable and reliable model management and monitoring.

You may need to install the following packages if you are using your own environment: mlserver, mlserver-mlflow, mlserver-xgboost. After that, we can set up a configuration file (model-settings.json) for MLServer. This allows us to change how the API works; here we simply set up an alias for the model:

{
    "name": "cancer-model",
    "implementation": "mlserver_mlflow.MLflowRuntime",
    "parameters": {
        "uri": "models:/CancerModel/1"
    }
}

To start MLServer in a local environment, run mlserver start . in the directory with the configuration file.

Now we have a working API with OpenAPI documentation, request validation, HTTP and gRPC servers, and a metrics endpoint for Prometheus. And all this without writing any code: all you need is a simple configuration in JSON format.

We can check the documentation for our model and examine the expected data structure via Swagger at /v2/models/cancer-model/docs.

We can access the metrics endpoint directly or configure Prometheus to scrape it.

Now you can send queries to the model:

import requests
import json

url = "http://127.0.0.1:8080/invocations"

# convert df to split format and then to json
input_data = json.dumps({
    "params": {
      'method': 'proba',  
    },
    'dataframe_split': {
        "columns": test.columns.tolist(),
        "data": test.values.tolist()
    }
})

# Send a POST request to the MLflow model server
response = requests.post(url, data=input_data, headers={"Content-Type": "application/json"})

if response.status_code == 200:
    prediction = response.json()
    print("Prediction:", prediction)
else:
    print("Error:", response.status_code, response.text)
Prediction: {'predictions': [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]}

Customize model

We can customize our model to provide probabilities or include additional logging. To do this, first download the model and then wrap it in a custom wrapper:

import mlflow
import mlflow.xgboost
import mlflow.pyfunc

# Step 1: Download the Existing Model from MLflow
model_uri = "models:/CancerModel/1"
model = mlflow.xgboost.load_model(model_uri)


# Step 2: Define the Custom PyFunc Model with `loguru` Setup in `load_context`
class CustomPyFuncModel(mlflow.pyfunc.PythonModel):
    
    def __init__(self, model):
        self.model = model
        
    def get_logger(self):
        from loguru import logger
        logger.remove()
        logger.add("mlserve/mlserver_logs.log", format="{time:YYYY-MM-DD HH:mm:ss} | {level} | {message}")
        return logger
        
    def load_context(self, context):
        self.logger = self.get_logger()

    def predict(self, context, model_input):
        
        self.logger.info(f"start request")
        self.logger.info(f"batch size: {len(model_input)}")
        
        predict =  self.model.predict_proba(model_input)[:,1]
        
        self.logger.success(f"Finish request")
        
        return predict
        

# Step 3: Save the Wrapped Model Back to MLflow
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="custom_model",
        python_model=CustomPyFuncModel(model),
        registered_model_name="CustomCancerModel",
    )

Now let's rewrite the configuration file and restart mlserver:

{
    "name": "cancer-model",
    "implementation": "mlserver_mlflow.MLflowRuntime",
    "parameters": {
        "uri": "models:/CustomCancerModel/1"
    }
}

Let's send requests to the new API:

# Send a POST request to the MLflow model server
response = requests.post(url, data=input_data, headers={"Content-Type": "application/json"})

if response.status_code == 200:
    prediction = response.json()
    print("Prediction:", prediction['predictions'][:10])
else:
    print("Error:", response.status_code, response.text)
Prediction: [0.9406537413597107, 0.9998677968978882, 0.9991995692253113, 0.00043031785753555596, 0.9973010420799255, 0.9998010993003845, 0.9995433688163757, 0.9998323917388916, 0.0019207964651286602, 0.0004339178267400712]

We can also check that the additional logs are written to the file:

2024-07-30 12:02:38 | INFO | start request
2024-07-30 12:02:38 | INFO | batch size: 188
2024-07-30 12:02:38 | SUCCESS | Finish request

An alternative to this approach is to write a new API from scratch. This may be better when we need more flexibility or additional functionality, or when response time is critical, as MLServer may be a bit slower.
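For comparison, a minimal hand-rolled API might look like this (a sketch using Flask, which ships as an MLflow dependency; the endpoint path and port are arbitrary):

import mlflow
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = mlflow.pyfunc.load_model('models:/CustomCancerModel/1')

@app.route('/predict', methods=['POST'])
def predict():
    # accept the same dataframe_split payload format used above
    payload = request.get_json()['dataframe_split']
    df = pd.DataFrame(data=payload['data'], columns=payload['columns'])
    return jsonify(predictions=model.predict(df).tolist())

if __name__ == '__main__':
    app.run(port=8080)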

If the model is periodically retrained, we need to update the model in the service or restart the deployment; I did not find such functionality in the open-source version of MLflow. The Databricks version includes a webhooks feature that allows MLflow to notify the API about new versions. Another option is to trigger the deployment when the model is updated. We can also open an additional endpoint in the server and call it within the DAG, or configure the server to periodically request updates from MLflow.
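As an illustration of the polling option, here is a hedged sketch of a wrapper that periodically checks the registry and reloads the model when a newer version appears (the class and polling interval are illustrative, not MLflow or MLServer features):

import threading
import time
import mlflow

class ReloadingModel:
    """Reload the registered model whenever a newer version shows up."""
    def __init__(self, name='CancerModel', poll_seconds=300):
        self.name, self.version, self.model = name, None, None
        self._reload_if_newer()
        threading.Thread(target=self._poll, args=(poll_seconds,), daemon=True).start()

    def _latest_version(self):
        versions = mlflow.MlflowClient().search_model_versions(f"name = '{self.name}'")
        return max(int(v.version) for v in versions)

    def _reload_if_newer(self):
        latest = self._latest_version()
        if latest != self.version:
            self.model = mlflow.pyfunc.load_model(f'models:/{self.name}/{latest}')
            self.version = latest

    def _poll(self, poll_seconds):
        while True:
            time.sleep(poll_seconds)
            self._reload_if_newer()

    def predict(self, data):
        return self.model.predict(data)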

MLflow Tracking Server

MLflow Local Setup

Before this, we worked with MLflow locally: metadata and artifacts are stored in the default mlruns folder. You can check this folder yourself if you have successfully completed all the previous steps. We can change where MLflow stores its metadata by specifying another backend-store-uri when starting the MLflow UI. For example, to use a different folder (mlruns_new), run mlflow ui --backend-store-uri ./mlruns_new and change the tracking URI in the project: mlflow.set_tracking_uri("file:./mlruns_new").

Remote Tracking

In production, we typically set up a remote tracking server with an artifact store and a database for MLflow metadata. We can simulate this setup using MinIO for artifact storage and PostgreSQL for the database. Here is a simple Docker Compose file with 4 services:

  1. MinIO as S3-compatible storage

  2. MinIO client (minio/mc) to create a bucket for MLflow

  3. PostgreSQL as a database for meta-information

  4. MLflow tracking server with the UI

services:

  s3:
    image: minio/minio
    container_name: mlflow_s3
    ports:
      - 9000:9000
      - 9001:9001
    command: server /data --console-address ':9001'
    environment:
      - MINIO_ROOT_USER=mlflow
      - MINIO_ROOT_PASSWORD=password
    volumes:
      - minio_data:/data

  init_s3:
    image: minio/mc
    depends_on:
      - s3
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc alias set minio http://s3:9000 mlflow password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc mb minio/mlflow;
      exit 0;
      "

  postgres:
    image: postgres:latest
    ports:
      - 5432:5432
    environment:
      - POSTGRES_USER=mlflow
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres_data:/var/lib/postgresql/data

  mlflow:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - 5050:5000
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://s3:9000
      - AWS_ACCESS_KEY_ID=mlflow
      - AWS_SECRET_ACCESS_KEY=password
    command: >
      mlflow server
        --backend-store-uri postgresql://mlflow:password@postgres:5432/mlflow
        --default-artifact-root s3://mlflow/
        --artifacts-destination s3://mlflow/
        --host 0.0.0.0
    depends_on:
      - s3
      - postgres
      - init_s3

volumes:
  postgres_data:
  minio_data:

Run these commands to build the image and launch the containers: docker compose -f tracking_server/docker-compose.yml build and docker compose -f tracking_server/docker-compose.yml up.

Instead of handling artifacts itself, MLflow gives the client a link so that it can save and load artifacts directly from the artifact store. Therefore, you need to configure the access keys and the tracking server host on the client side to log artifacts correctly.

import os
import mlflow

os.environ['AWS_ACCESS_KEY_ID'] = 'mlflow'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'password'
os.environ['MLFLOW_TRACKING_URI'] = 'http://localhost:5050'
os.environ['MLFLOW_S3_ENDPOINT_URL'] = 'http://localhost:9000'

# run data preprocessing step one more time
mlflow.run(
    uri = '.',
    entry_point="data-preprocessing",
    env_manager="local",
    experiment_name="Cancer_Classification",
    run_name="Data_Preprocessing",
    parameters={'test-size': 0.5},
)

After this step, we can go to the MLflow UI to verify that the run and artifacts were successfully registered:

And verify via the MinIO interface that the artifacts were successfully saved to the bucket:

And also query our PostgreSQL database to make sure it is used to store metadata:

import psycopg2
import pandas as pd

conn = psycopg2.connect(dbname="mlflow", user="mlflow", password='password', host="localhost", port="5432")
try:
    query = "SELECT experiment_id, name FROM experiments"
    experiments_df = pd.read_sql(query, conn)
except Exception as e:
    print(e)
else:
    print(experiments_df)
finally:
    conn.close()

In production, authentication, different levels of user access, and monitoring for MLflow, bucket, database, and specific artifacts may be required. We won't cover these aspects here, and some of them don't have built-in support in MLflow.

We looked at how you can use MLflow both locally for yourself and for collaborative work on projects, briefly went over the main functionality, and applied it in examples. You can see the full code in this repository; you can also take a look at the MLflow documentation.
