How to easily change and run ML experiments

Introduction

Typically, working on ML projects involves a large number of experiments with various data sets, model architectures, parameters and optimization methods. In order to effectively manage all these changes and results, it is necessary to use tools that will help organize and structure the process.

It helps to have a quick, simple way to change a few parameters, look at the results, and record the changes if they turn out well. Sometimes a person who was not involved in developing the project has to do the same, so the simpler the process, the better.

An approximate work cycle is shown in Fig. 1:

Fig. 1. The main work cycle.

It is assumed that the project already uses Git as its version control system and that the reader has some knowledge of the tools discussed.

The remaining question is how best to store configuration files and organize the launch of experiments.

Running experiments

You can write your own logic for running experiments, but that takes time, and if the approach is to be reused across projects it also has to be made generic. Another option is to use ready-made tools from full-featured MLOps platforms, which is usually justified when you need distributed experiments or support for complex scenarios.

This article covers running experiments locally, with minimal setup and few dependencies on external systems. DVC is a good fit for this. It can be used for several purposes:

  • Working with data storage and versioning.

  • Launching pipelines.

  • Tracking metrics, parameters and comparing them.

  • Managing and running experiments.

In this article, the main interest is DVC's capabilities for managing and running experiments. In DVC, experiments can be organized as a task queue, where each task runs stages from dvc.yaml that execute some code. When all tasks have finished, all that remains is to pick the best experiment and apply its changes to the main code base. With the right project organization, most of this can be automated. Some information on working with experiments is available in a separate article.

Let's say a model has a parameter alpha. The code is already written and a stage named train has been defined. We need to run train for different values of this parameter.

The basic way to configure the experiments is to change settings in configuration files. To make this easy, as much as possible should be customizable through configuration files, without changes to the code. This requires a flexible application configuration system.

Project configuration management

Typically, configuration is stored as a set of files in popular formats: JSON, YAML, TOML, INI, etc. These files hold settings for the entire project. In general, it is easier to manage one file than a collection of them.

The set of requirements for a configuration management solution is as follows:

  • All final settings should be in one file.

  • The ability to logically group files and create a final file from parts. Usually it is easier for a person to navigate and change some individual settings in small files than in one large file.

  • Reusing some values within the configuration file to reduce duplication.

Hydra does exactly this, and DVC integrates with it.

It works as follows. Suppose the configuration files are organized like this:

├───configs
│   │   train.yaml
│   │
│   ├───dataset
│   │       bin_cls.yaml
│   │       bin_cls_noise.yaml
│   │
│   └───model
│           dec_tree.yaml
│           log_reg.yaml

The train.yaml file specifies which parts to use:

...
defaults:
  - model: log_reg
  - dataset: bin_cls
  - _self_

...

Hydra will take a set of files and transform it as if it were a single file with the following structure:

model:
    ... # contents of log_reg.yaml
...

dataset:
    ... # contents of bin_cls.yaml

...
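The composition step can be pictured with a small sketch in plain Python. It is a simplified emulation and not Hydra's actual implementation: the dictionary stands in for the files under configs, the label_noise and seed keys are hypothetical, and only top-level merging is modeled.

```python
from copy import deepcopy

# Stand-in for the files in configs/. The key label_noise is hypothetical,
# used only to make the two dataset options differ.
CONFIG_GROUPS = {
    "model": {
        "log_reg": {"penalty": "l1", "solver": "liblinear", "C": 0.01},
        "dec_tree": {"max_depth": 6},
    },
    "dataset": {
        "bin_cls": {"n_samples": 1000, "n_features": 10},
        "bin_cls_noise": {"n_samples": 1000, "n_features": 10, "label_noise": 0.1},
    },
}


def compose(defaults, own_values):
    """Place each selected option under its group name, then apply own values."""
    config = {}
    for group, option in defaults.items():
        config[group] = deepcopy(CONFIG_GROUPS[group][option])
    # _self_ comes last in the defaults list, so top-level values from
    # train.yaml are applied after the groups (shallow merge in this sketch).
    config.update(deepcopy(own_values))
    return config


# "seed" is a hypothetical top-level setting from train.yaml.
config = compose({"model": "log_reg", "dataset": "bin_cls"}, {"seed": 42})
print(config)
```

Switching the model then amounts to changing one entry in the defaults mapping, which is what an override such as model=dec_tree does.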

Hydra is built on OmegaConf, so some OmegaConf features also work in Hydra.

For example, values from different parts of the configuration can be reused.
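To illustrate value reuse, here is a minimal standard-library sketch of how references written as ${key} can be resolved. It only emulates the idea: OmegaConf's real resolver does more (type preservation, custom resolvers, cycle detection), and the values of data_dir and exp_dir are assumptions.

```python
import re

# Matches references of the form ${some.dotted.key}.
PATTERN = re.compile(r"\$\{([^}]+)\}")


def lookup(root, dotted_key):
    """Follow a dotted key such as 'dataset.path' through nested dicts."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve(node, root):
    """Return a copy of node with every ${...} reference replaced by its value."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, str):
        # Referenced values are resolved recursively and inserted as text.
        return PATTERN.sub(
            lambda match: str(resolve(lookup(root, match.group(1)), root)), node
        )
    return node


# Hypothetical values; only the key names data_dir and exp_dir come from the article.
config = {
    "data_dir": "data",
    "exp_dir": "experiments",
    "dataset": {"path": "${data_dir}/bin_cls.csv"},
    "model": {"output": "${exp_dir}/model.pkl"},
}
resolved = resolve(config, config)
print(resolved["dataset"]["path"])
```

Defining data_dir once and referring to it elsewhere keeps the configuration free of duplicated paths.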

More details about other aspects are available in the Hydra documentation.

By adding new files to the appropriate groups, you can expand the project within certain limits without rewriting the code.

Case Study

The entire example is available in the GitHub repository.

The repository has a small example of organizing a project with DVC and Hydra to solve a binary classification problem on a synthetic data set. The structure of the configuration files is as follows:

│   train.yaml
│
├───dataset
│       bin_cls.yaml
│       bin_cls_noise.yaml
│
└───model
        dec_tree.yaml
        log_reg.yaml

There are two data sets for binary classification: a basic one (bin_cls) and one with noisy class labels (bin_cls_noise). There are two models: a decision tree (dec_tree) and logistic regression (log_reg). The main configuration file is train.yaml.

We want to test how well different models perform under different parameters on the bin_cls dataset or bin_cls_noise. There are two options for how to organize this in DVC:

  1. Use Hydra composition.

  2. Don't use Hydra composition and add an intermediate stage that creates the final configuration file.

Both options have advantages and disadvantages. In either case, a new model can be added easily as long as it follows the same interface.
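One way such an interface can work is through the _target_ key that appears in the model parameters: it holds the dotted path of the class to build, and the remaining keys are passed as keyword arguments. Hydra provides hydra.utils.instantiate for this. Whether the repository's train.py uses it is an assumption, and the sketch below is only a standard-library emulation of the idea.

```python
from importlib import import_module


def instantiate(spec):
    """Build an object from a dict with a dotted _target_ path and keyword arguments."""
    kwargs = dict(spec)
    module_path, _, attr_name = kwargs.pop("_target_").rpartition(".")
    target = getattr(import_module(module_path), attr_name)
    return target(**kwargs)


# A standard-library class stands in for a model so the sketch runs without
# scikit-learn. In the project, the target would be a class such as
# sklearn.tree.DecisionTreeClassifier with max_depth as a keyword argument.
value = instantiate({"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4})
print(value)  # 3/4
```

With this pattern, adding a model means adding one more YAML file to the model group, and the training code stays unchanged as long as the new class exposes the same methods.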

Hydra composition

When this mode is enabled, DVC uses Hydra to compose the final configuration file before running the specified stage and saves it to params.yaml.

Pros:

  • The complete configuration file will be generated automatically.

  • In dvc.yaml you can use the values of all parameters that are in the file. For example:

stages:
  train:
    cmd: python ./train.py
    deps:
        - ${data_dir} # from params.yaml
    outs:
        - ${exp_dir} # from params.yaml

  • When any parameter changes, the full configuration file is updated automatically, and DVC knows when to rerun the stage if params.yaml is listed under deps in dvc.yaml.

  • You can use the params section to specify which parts of the configuration file are parameters.

  • When you run dvc exp run -S ... with a list of parameter values, params.yaml will contain the values actually used.

Cons:

  • No full control over how the configuration file is created and how references inside it are resolved.

  • Some values cannot be set in advance, for example because they are determined dynamically from the data. In this case you have to set the value to ???, and it will stay that way in params.yaml, so you have to handle it in code.

  • If there are several stages, all of them have to accept the complete configuration file and take the part they need. In some cases you can work around this by substituting values in dvc.yaml.
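Handling the ??? marker in code could look roughly like the sketch below. It assumes params.yaml has already been loaded into nested dictionaries, where the marker appears as the plain string "???". The n_classes key and the label values are hypothetical. With OmegaConf, the same check is available as OmegaConf.is_missing.

```python
MISSING = "???"


def find_missing(node, prefix=""):
    """Return dotted keys whose value is still the mandatory-value marker."""
    missing = []
    for key, value in node.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            missing.extend(find_missing(value, prefix=f"{path}."))
        elif value == MISSING:
            missing.append(path)
    return missing


# Hypothetical params: n_classes is only known once the data is loaded.
params = {"dataset": {"n_samples": 1000, "n_classes": "???"}, "model": {"max_depth": 6}}
print(find_missing(params))  # ['dataset.n_classes']

labels = [0, 1, 1, 0]  # stand-in for labels read from the data set
params["dataset"]["n_classes"] = len(set(labels))
print(find_missing(params))  # []
```

Failing early when find_missing returns a non-empty list is usually safer than letting the marker reach the model code.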

Without Hydra composition, with an intermediate stage

Here you need to write an additional stage that emulates what DVC does with Hydra composition.

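Sample code to get the full configuration file:

import hydra
from omegaconf import DictConfig, OmegaConf


# Hydra composes the files under configs/ into a single config object.
@hydra.main(version_base="1.3", config_path="configs", config_name="main")
def main(config: DictConfig) -> None:
    ...
    # Save the composed config with interpolations resolved
    # (the output path is elided here).
    OmegaConf.save(config, ..., resolve=True)


if __name__ == "__main__":
    main()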

Next, in dvc.yaml the resulting file is listed both as a dependency and in the vars section, so that its values can be used:

stages:
  train:
    vars:
      - path_to_generated_file
    cmd: python ./train.py
    deps:
        - path_to_generated_file

The main advantages are the same as with Hydra composition. In addition, it becomes possible to define some values dynamically and to influence how the final configuration file is built.

This approach can be combined with separate stages that are not tied to the main configuration file. For example, suppose dvc.yaml consists of the stages shown in Fig. 2.

Fig. 2. Some stages.

Stages A and B can accept their own configuration files; for example, they may be actions common to the whole project. Stage C generates the complete configuration file. Stage D is the stage that is run through dvc exp run ....

Cons:

  • When starting experiments with dvc exp run, you need to specify which file to change; otherwise a configuration file with the same values will be generated.

  • If there are several stages, all of them have to accept the complete configuration file and take the part they need.

  • You may have to run dvc exp run with --force-downstream so that the stage responsible for creating the configuration file always runs and produces an up-to-date file.

In the example from the repository, the Hydra composition option is used as it is a good starting point.

We need to check how the models behave with different sets of parameters. Because the structure of the configuration file changes between models and their parameters differ, all the experiments cannot be queued with a single command: at the time of writing, the only supported way to run experiments was a full grid search over all combinations of the listed values. It can be done in several commands instead:

dvc exp run --queue -S model=log_reg -S model.C=1.0e-2,100  train
dvc exp run --queue -S model=dec_tree -S model.max_depth=6,12  train

The output will be something like this:

Queueing with overrides '{'params.yaml': ['model=log_reg', 'model.C=0.01']}'.
Queued experiment 'lyric-feel' for future execution.
Queueing with overrides '{'params.yaml': ['model=log_reg', 'model.C=100']}'.
Queued experiment 'kooky-taka' for future execution.
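The grid expansion behind these commands can be sketched with itertools.product. This is an illustration of the idea only: it handles plain comma-separated lists, keeps values as text, and does not cover the rest of Hydra's sweep syntax. The function name expand_overrides is hypothetical.

```python
from itertools import product


def expand_overrides(overrides):
    """Expand overrides such as 'model.C=1.0e-2,100' into every combination."""
    keys, choices = [], []
    for override in overrides:
        key, _, values = override.partition("=")
        keys.append(key)
        choices.append(values.split(","))
    return [
        [f"{key}={value}" for key, value in zip(keys, combination)]
        for combination in product(*choices)
    ]


# One experiment is queued per combination of the listed values.
for experiment in expand_overrides(["model=log_reg", "model.C=1.0e-2,100"]):
    print(experiment)
```

Two overrides with two values each would produce four experiments, which is why the number of queued tasks grows quickly as more parameters are listed.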

All experiments have been added to the task queue. You can view the status via dvc queue status. Sample output:

Task     Name        Created    Status
52dbbd4  lyric-feel  01:21 PM   Queued
6ef169b  kooky-taka  01:21 PM   Queued
610131c  meaty-name  01:21 PM   Queued
3e6aab7  famed-life  01:21 PM   Queued

Worker status: 0 active, 0 idle

There are currently 0 active workers. To start the calculations, you need to start workers:

dvc queue start -j 2

To speed things up, we start two workers. After that, the output of dvc queue status will look something like this:

Task     Name        Created    Status
52dbbd4  lyric-feel  01:21 PM   Running
6ef169b  kooky-taka  01:21 PM   Running
610131c  meaty-name  01:21 PM   Queued
3e6aab7  famed-life  01:21 PM   Queued

Worker status: 2 active, 0 idle

When all tasks finish successfully, their status is Success (a task that fails gets a corresponding status instead):

Task     Name        Created    Status
52dbbd4  lyric-feel  01:21 PM   Success
6ef169b  kooky-taka  01:21 PM   Success
610131c  meaty-name  01:21 PM   Success
3e6aab7  famed-life  01:21 PM   Success

Worker status: 0 active, 0 idle

If you need to look at the logs, run dvc queue logs <name>, where <name> is the unique name of the task. The name is also available in the DVC_EXP_NAME environment variable, which is useful when an external logging system is used in parallel.

Experiments can be compared with each other. In dvc.yaml, a metrics section specifies a metric, the proportion of correct answers (accuracy). It is shown in the comparison and indicates which experiment turned out better. For a quick report, you can run:
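As a sketch of how the experiment name might be passed to a logging setup, the code below reads DVC_EXP_NAME and uses it as the logger name. The fallback value "workspace" for runs started outside dvc exp run is an assumption made for this example, as is the log format.

```python
import logging
import os


def experiment_name(default="workspace"):
    """Return the DVC experiment name, or a fallback when the variable is not set."""
    return os.environ.get("DVC_EXP_NAME", default)


logging.basicConfig(format="%(asctime)s [%(name)s] %(message)s", level=logging.INFO)
logger = logging.getLogger(experiment_name())
logger.info("training started")
```

Tagging every record with the experiment name makes it possible to match entries in an external system to the rows reported by DVC.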

dvc exp show --only-changed --sort-by accuracy --sort-order desc

If desired, the report can be saved in other formats. Saved as Markdown, the table looks approximately like this (not all columns are shown):

| Experiment | Created | accuracy | model._target_ | model.penalty | model.solver | model.C | model.max_depth |
|---|---|---|---|---|---|---|---|
| workspace | - | 0.87667 | sklearn.linear_model.LogisticRegression | l1 | liblinear | 0.01 | - |
| master | 01:20 PM | 0.92333 | sklearn.linear_model.LogisticRegression | l1 | liblinear | 0.01 | - |
| ├── 998d475 [famed-life] | 01:26 PM | 0.97333 | sklearn.tree.DecisionTreeClassifier | - | - | - | 12 |
| ├── 266d93e [meaty-name] | 01:26 PM | 0.97 | sklearn.tree.DecisionTreeClassifier | - | - | - | 6 |
| ├── 7a03577 [lyric-feel] | 01:26 PM | 0.92333 | sklearn.linear_model.LogisticRegression | l1 | liblinear | 0.01 | - |
| └── c5f9a0b [kooky-taka] | 01:26 PM | 0.87 | sklearn.linear_model.LogisticRegression | l1 | liblinear | 100 | - |

It lists the completed experiments sorted by accuracy, along with the existing branches and the current workspace. If you use VS Code, you can install the DVC extension, which shows the same basic information in a graphical interface.

Once an experiment is selected, you can quickly save it as a separate branch:

dvc exp branch wormy-yate branch-from-exp

After creating the branch, you can see that the new parameters are contained only in params.yaml; the diff looks like this:

diff --git a/params.yaml b/params.yaml
index 36234f8..7218b6e 100644
--- a/params.yaml
+++ b/params.yaml
@@ -1,8 +1,6 @@
 model:
-  _target_: sklearn.linear_model.LogisticRegression
-  penalty: l1
-  solver: liblinear
-  C: 0.01
+  _target_: sklearn.tree.DecisionTreeClassifier
+  max_depth: 6
 dataset:
   n_samples: 1000
   n_features: 10

The values in the original configuration files, in configs, remain the same. This can cause errors if the configuration file is rebuilt instead of using the values from params.yaml, so you need to keep track of which files a stage is called with when you rerun it. The author does not know of any simple method other than manually updating the original files so that their state is always current. In this setup, the files in configs can be treated as a starting point that defines the overall structure, while the modified parameter values are stored in params.yaml.

To clean up experiments and queued tasks that are no longer needed, run:

dvc exp remove -A
dvc queue remove --all

Conclusions

This article looked at an example of organizing an ML project using DVC and Hydra to quickly change basic settings and launch local experiments with various parameters. This approach allows you to manage changes and the results of experiments. This makes the process more organized and structured, making it easier to record changes when successful.

The drawbacks are the need to adapt to the way DVC works, the need to keep all configuration files in the project up to date, and performance: a separate process is started for every parameter combination, which in some cases is not very efficient. In other cases the cost is acceptable, for example in projects that train neural networks. Sometimes several training runs cannot be executed in parallel, so a task queue forms naturally. With this approach you can plan training runs that test different ideas, leave them to compute overnight, and check during the day which results are in and what is still pending.

DVC and Hydra have more functionality than has been discussed. For more information, please refer to the documentation. Perhaps readers will have their own version of a combination of these tools for better organization of work.
