Hyperparameter optimization in 5 seconds?

While some people burn machine time brute-forcing the hyperparameters of their neural networks with Scikit-learn, real time-management geniuses choose TPE and Optuna.

In this article we will look at the most popular optimization methods, Grid Search and Random Search, at the principles of Bayesian/probabilistic optimization, and at TPE. At the end there is a small dictionary of the framework's functions, attributes and objects, plus a clear usage example.

Hyperparameter optimization is the foundation of a neural network: Grid Search and Random Search

Hyperparameters are the “big” settings: the criteria that determine the structure of the model and the way it is trained. A comparison can be made with character customization in a game, where nudging the “heavy build” slider stretches hundreds of the polygons the character consists of and completely changes its appearance. That is why they are called hyper-parameters: global settings that define the entire system and the final result of its work.

For machine learning, selecting hyperparameters is laying the foundation on which the neural network will be built. Hyperparameters determine how well the model trains and how badly it overfits. They are set before training, while the “parameters”, i.e. the weights/coefficients, are the result of the network's internal work and are determined during training.

In more scientific terms, hyperparameter optimization is the search for a tuple of hyperparameters at which the model minimizes a predefined loss function on independent (held-out) data: root mean square error or the coefficient of determination in regression, cross-entropy in classification, and so on.
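
Written a little more formally (just a sketch of the idea, with λ denoting a tuple of hyperparameters and Λ the space we search over):

λ* = argmin over λ ∈ Λ of loss(model_λ, held-out data)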

A real mountain conglomerate. There are many local minima and maxima. Finding hyperparameters is complicated.

Optimization is a kind of add-on to training itself, an extra training phase. The performance metric/objective function can be visualized as a heat map or a surface in (n+1)-dimensional space, as in the picture above.

Accordingly, the more uneven and unstable that surface is, the harder it is to find the hyperparameters we need.

Hyperparameters can be the activation functions themselves (the functions applied to each neuron) or the number of epochs, i.e. passes through the data set. For a random forest: the number of trees, the minimum number of samples per leaf, the number of features considered when splitting a node.

Each architecture has its own set of hyperparameters.

It is important to choose a combination such that we get an adequate result at the output and at the same time do not burn through all the electricity in the neighborhood frying our render farm of 50 GPUs.

The Scikit-learn library offers two ways to search over hyperparameters: GridSearch and RandomSearch.

GridSearch is the most proven, but also the “dumbest”, way to search for hyperparameters.

It is a brute-force method: for each hyperparameter a set of values is chosen, the model is trained on every combination and evaluated on validation data. You will get the ideal option in the end, but you will have to wait, to put it mildly.

Let's walk through a simple recipe for searching the hyperparameters of a random forest.

We need two parameters: the number of trees in the forest (n_estimators) and the maximum depth of each tree (max_depth).

We define a set of possible values for each of these hyperparameters: n_estimators = [50, 100, 200] and max_depth = [10, 20, 30].

GridSearch then enumerates all possible combinations of these values (for example, [(50, 10), (50, 20), (50, 30), (100, 10), …, (200, 30)]).

For each combination of hyperparameters, a random forest model is trained on the training dataset and its performance is then evaluated on a validation set or with cross-validation.

The best combination is selected. For example, if the model with 100 trees and a maximum depth of 20 levels gives the best performance, those hyperparameters are taken as optimal. Maximum quality and maximum time costs.
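
Here is a minimal sketch of that procedure with scikit-learn's GridSearchCV (assuming X_train and y_train are already loaded; the estimator and its settings are only an illustration):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,        # 5-fold cross-validation for every combination
    n_jobs=-1,   # use all available CPU cores
)
grid_search.fit(X_train, y_train)   # X_train, y_train are assumed to exist

print(grid_search.best_params_)     # e.g. {'max_depth': 20, 'n_estimators': 100}
print(grid_search.best_score_)      # mean cross-validated score of that combination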

The other option is Random Search. The name speaks for itself: a random selection of combinations, in which you may miss the very best options for the network, but acceptable ones are found in an acceptable amount of time.

We import it: from sklearn.model_selection import RandomizedSearchCV

After defining a grid of parameters, for example for the same random forest, and initializing the model, we create the search object:

random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100, cv=5, verbose=2, random_state=42, n_jobs=-1)

The random search itself takes the model, the parameter grid, the number of iterations and the number of cross-validation folds, plus verbose, which controls how much information is printed while the algorithm runs.
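
A slightly fuller sketch of the same call in context (model and param_grid here are illustrative; scipy's randint shows that distributions can be passed instead of fixed lists):

from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': randint(50, 200),    # sampled, not enumerated
    'max_depth': [10, 20, 30, None],
}

random_search = RandomizedSearchCV(estimator=model,
                                   param_distributions=param_grid,
                                   n_iter=100, cv=5, verbose=2,
                                   random_state=42, n_jobs=-1)
random_search.fit(X_train, y_train)      # X_train, y_train are assumed to exist
print(random_search.best_params_)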

On the other hand, however “exhaustive” grid search may seem, the random approach is often the better deal. James Bergstra and Yoshua Bengio discuss its benefits in detail in their paper. It is the option for those who work with many parameters or a high-dimensional grid.

The point is that not all hyperparameters matter equally for model training and the loss function, so the chances of randomly hitting the right points are often high.

We took the visualization from the article above.

In the second case, there are significantly more “hits” than in the example with grid search.

Those are the “most” popular methods of working with hyperparameters. But as we have already seen, they are two ends of the same stick: both are rather crude ways of selecting hyperparameters. What we need is to track how effective the search over values actually is. And since hyperparameter selection is a probabilistic matter, a method that predicts the most promising direction to search for a suitable combination of values is exactly what lets us pick parameters quickly.

And today we are talking about a framework that will help squeeze the most out of your model without spending several hours grinding through candidate sets of values.

Bayesian statistics: a probabilistic approach to model optimization

Mathematics gives us a predictive mechanism here. If we have data after trying a couple of combinations, it can tell us where it is better to move in the subsequent search and which part of the grid to pick, and thus reach good sets of hyperparameters faster. This is where Bayesian optimization helps.

It is enough to estimate the degree of uncertainty and the expected shape of the relationship between the hyperparameters and the objective function in order to get closer to the desired values.

We feed the results of each iteration back into the prediction and obtain an adequate final result.

The goal of Bayesian optimization is to find the extremum of the objective function with as few evaluations as possible.

A good analogy for the Bayesian method is interpolation. Suppose part of a graph is undefined and we are given only some values; we try to patch over the uncertainty and restore the graph/function. It is as if we were reconstructing a car's route from where it parked and its intermediate stops, or rather from the probability that it went one way and not another.

Here, though, we aim to clarify the relationship between the hyperparameters and an objective function (a loss function or metric), explore the space using information from previous evaluations of the objective, and from that infer the next set of hyperparameters.

To begin with, an a priori (prior) probabilistic model is built that describes our initial understanding of the relationship between the hyperparameters and the objective function. The more evaluations of the objective we collect, the better our knowledge of the hyperparameter space and of how the objective depends on it.

Roughly speaking, the Bayesian optimization method is a modernized random search.

This allows us to more accurately select the next sets of hyperparameters to evaluate and thereby improve the optimization process. Luckily there isn't much math involved.

In its simplest form, Bayesian optimization picks the next point by maximizing the expected improvement over the current best result:

x_next = argmax over x of E[improvement(x)]

The improvement function improvement(x) is usually defined as the difference between the current best value of the objective function and the value predicted for a given point x:

improvement(x) = max(0, best_value − predicted_value(x))

where best_value is the best value of the objective function found so far, and predicted_value(x) is the predicted value of the objective function at point x.

Thus, Bayesian optimization selects the next point to evaluate by maximizing the expected improvement in the objective function. This approach allows you to efficiently explore hyperparameter space, focusing more on areas where the greatest improvement is expected and avoiding areas with low improvement potential.
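
To make this concrete, here is a toy sketch of a single step (our own illustration, not Optuna's internals): a Gaussian-process surrogate from scikit-learn is fit to the points evaluated so far, and the next candidate is the one with the largest expected improvement under a minimization objective.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(candidates, gp, best_value):
    # Predictive mean and standard deviation of the surrogate at each candidate
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)              # avoid division by zero
    # We are minimizing, so the improvement at x is best_value - mu(x)
    z = (best_value - mu) / sigma
    return (best_value - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Hyperparameter values already tried (1-D for simplicity) and their losses
X_seen = np.array([[0.1], [0.4], [0.9]])
y_seen = np.array([0.80, 0.35, 0.60])

gp = GaussianProcessRegressor().fit(X_seen, y_seen)          # surrogate model
candidates = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
ei = expected_improvement(candidates, gp, best_value=y_seen.min())
next_point = candidates[np.argmax(ei)]                       # evaluate this point next
print(next_point)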

A classic dilemma arises: exploration or exploitation? Should we run another iteration to gather more information, or do we already know enough to settle on the optimal hyperparameters?

The simplest approach here is Sequential Model-Based Optimization (SMBO): a probabilistic surrogate model is trained on the data obtained from the objective function, and an acquisition function based on the surrogate's predictive distribution estimates how useful it is to keep scouting particular points. That is exploration vs. exploitation: gathering new information versus using what we already have.

Remember when we said that hyperparameter optimization is a tool that serves model training? Bayesian optimization strives to make it as computationally cheap as possible, i.e. to minimize the number of objective evaluations.

Based on this optimization method, quite a lot of other methods appear, but for us the most important approach is TPE, which is used in Optuna.

It differs mainly in its use of Parzen windows, which estimate probability densities and thereby give even more accurate predictions of which hyperparameters are good.

Unlike the classic Bayesian approach, when modeling the probability distributions over hyperparameters TPE uses Parzen (kernel density) estimators as an approximation to the data we care about. Moving through iterations and scouting points, we find out where the more “optimal” values are located.

At the beginning of the TPE optimization process, an a priori model is built: a probability distribution for each hyperparameter over its space. This is where Parzen estimation comes in, approximating the distribution from previous evaluations of the objective function.

Next, the observed evaluations are split into two parts: one corresponds to “successful” evaluations of the objective function (sets of hyperparameters that gave good results), the other to “unsuccessful” ones (sets that gave bad results).

Then unsuccessful hyperparameters are pushed out and the search is steered: the direction of further selection is chosen by comparing the two probability densities, effectively asking how likely a given set of hyperparameters is to succeed.

After several iterations we find a near-optimal set of hyperparameters and start training our network.
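
As a rough illustration of the idea (our own toy sketch with a made-up 1-D “loss”, not how Optuna implements TPE): the observed trials are split by a quantile of the loss, a kernel density is fit to each group, and candidates where the “good” density l(x) outweighs the “bad” density g(x) are preferred.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
observed_x = rng.uniform(0.0, 1.0, 50)                              # a 1-D hyperparameter
observed_loss = (observed_x - 0.3) ** 2 + rng.normal(0, 0.01, 50)   # toy loss values

threshold = np.quantile(observed_loss, 0.25)      # the best 25% count as "good"
good = observed_x[observed_loss <= threshold]
bad = observed_x[observed_loss > threshold]

l = gaussian_kde(good)      # density of hyperparameter values that worked well
g = gaussian_kde(bad)       # density of hyperparameter values that worked poorly

candidates = np.linspace(0.0, 1.0, 201)
ratio = l(candidates) / (g(candidates) + 1e-12)
next_x = candidates[np.argmax(ratio)]             # the most promising value to try next
print(next_x)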

Optuna, or a based framework for fans of fast hyperparameter selection

Today, TPE is used precisely in Optuna, and since the optimization process cannot be avoided, many users are gradually switching to this framework.

How do you work with it? The framework provides not only TPE and the usual Random/Grid Search, but also search “methods” based on genetic algorithms, which we will not cover in this article.

We will limit ourselves to the built-in TPE, which is quite enough for most networks. Let's look at how to work with TPE in Optuna, because that method is what everyone comes to the framework for. After you read our article, be sure to take a look at the documentation.

It is a little complicated, so below we will provide at least a partial dictionary of the functions and objects of the framework.

create_study: Creates a Study object for hyperparameter optimization.

study.optimize: Starts the hyperparameter optimization process.

study.best_params: Returns the best hyperparameters found.

study.best_value: Returns the best metric value found.

study.trials: Returns a list of all trials within an optimization study.

study.enqueue_trial: Adds a trial to the queue but does not execute it immediately.

study.remove_trial: Removes a trial from the trial list.

study.trial: Retrieves a trial by its identifier.

study.trial_callbacks: Sets the callbacks to be called before a trial starts and after it completes.

study.set_user_attr: Sets a custom Study attribute.

study.user_attrs: Returns a dictionary of the Study's custom attributes.

study.load: Loads the Study state from the database.

study.save: Stores the Study state in the database.

delete_study: Deletes the Study object and all its data.

delete_all_study: Removes all studies from the database.

delete_trials: Removes trials from the database.

create_trial: Creates a trial and adds it to the database.

enable_pruning: Enables the pruning mechanism for trials in a Study.

disable_pruning: Disables trial pruning in a Study.

get_pruned_trials: Returns trials that have been pruned.

delete_pruned_trials: Removes pruned trials from the database.

study_name_exists: Checks whether a study with the given name exists.

get_storage: Gets the storage object used by the Study object.

And, of course, a list of objects built into the framework.

1. Study – an optimization study. It contains information about the trials, their results, and the optimization strategies used. It has a few attributes and a method of its own:

optimize(func, n_trials, …) – method for running optimization with a certain number of trials.

best_params – an attribute containing the best hyperparameters found.

best_value – attribute containing the best metric value found.

trials – an attribute containing a list of all trials within the optimization study.

2. Trial – a single optimization “attempt”. It provides methods for suggesting hyperparameter values and logging results. Four key methods (a small combined sketch follows the full list of objects below):

suggest_categorical(name, choices) – method for selecting a categorical value.

suggest_uniform(name, low, high) – method for selecting a value from a uniform distribution.

suggest_loguniform(name, low, high) – method for selecting a value from a log-uniform distribution.

report(value, step) – method for reporting intermediate trial results.

3. Sampler – setting up the selection of hyperparameters. Selects which combinations of hyperparameters will be tested at each iteration.

RandomSampler – Random sampling.

TPESampler – TPE Strategy.

GridSampler – Grid sampling.

Yes, these are the same methods of randomized search, using a grid and TPE technology. For us, it is the second option that is relevant.

4. FrozenTrial – a frozen trial containing the information from a finished iteration. It is used to analyze and visualize the results after optimization completes.

5. StudySummary – a description of the study. Provides a brief summary of the study with its main characteristics, such as the number of trials, the strategies used, etc.

6. StudyDirection – an enumeration defining the direction of optimization (minimization or maximization).

7. Summary – a summary of the research. Provides a brief description of the study with its main characteristics, such as the number of trials, the strategies used, etc.

8. TrialState – trial status tracking. It indicates the state of a trial (for example, running, completed, or pruned).
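
To tie these objects together, here is a small hypothetical example (the hyperparameter names and the dummy score are made up; note that in recent Optuna versions suggest_float(..., log=True) supersedes the older suggest_uniform/suggest_loguniform):

import optuna

def objective(trial):
    # Trial: one suggest_* call per hyperparameter; the names are arbitrary labels
    booster = trial.suggest_categorical('booster', ['gbtree', 'dart'])
    depth = trial.suggest_int('depth', 2, 16)
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)   # log-uniform range
    # A dummy score standing in for real training and validation
    return (depth - 8) ** 2 * lr + (0.1 if booster == 'dart' else 0.0)

# Sampler: TPESampler is the TPE strategy (also Optuna's default sampler)
study = optuna.create_study(direction='minimize',
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)

print(study.best_trial.params)    # the parameters of the best (frozen) trial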

So no, the framework is not that complicated and works almost automatically, while often finding good solutions roughly twice as fast as the same Grid or Random Search.

Optimizing a model using a live example

At the micro level, optimization is simple. First you need to import the Optuna library:

import optuna

Next, define the function you want to optimize.

def objective(trial):
    # Define the hyperparameters to optimize
    param1 = trial.suggest_float('param1', 0.0, 1.0)
    param2 = trial.suggest_int('param2', 1, 100)

    # Evaluate model performance with the chosen hyperparameters
    # (evaluate_model is a placeholder for your own training/validation code)
    score = evaluate_model(param1, param2)
    return score

Running Optimization: After defining the objective function, you can run the optimization process using TPE in Optuna:

study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=100)

Here direction='maximize' indicates that we are aiming to maximize the model's performance estimate, and n_trials=100 determines the number of optimization iterations.

Let's print the optimization results separately:

print('Best parameters:', study.best_params)
print('Best score:', study.best_value)

This will allow you to know the best hyperparameters found as a result of optimization and immediately evaluate the performance.

The create_study() function creates a study object in which the optimization process runs. The optimize() method starts the optimization for a given objective function.

The optimization results, including the best hyperparameters and objective function values, are available through Study attributes such as study.best_params and study.best_value.

The Trial class represents a separate optimization iteration and provides methods for proposing different types of hyperparameter values.

Individual optimization iterations are accessible through the study object using the attribute study.trials.
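
For example, one quick way to look through them (each element of study.trials is a FrozenTrial with its number, state, value and params):

# Print a short summary of every trial recorded in the study
for t in study.trials:
    print(t.number, t.state, t.value, t.params)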

Let's try to do all these steps using a specific model as an example.

We import the libraries for our model and, of course, Optuna.

import optuna
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

Optuna is for optimization, the train_test_split function for splitting the data, the RandomForestClassifier class from scikit-learn for building a random forest model, and accuracy_score for evaluating the model's performance.

Let's define the evaluation function:

def objective(trial):
    # Split the data (X and y are assumed to be loaded beforehand)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

    # Choose the hyperparameters to optimize
    n_estimators = trial.suggest_int('n_estimators', 10, 100)
    max_depth = trial.suggest_int('max_depth', 2, 32)

    # Train the model
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth)
    model.fit(X_train, y_train)

    # Evaluate on the validation set
    val_preds = model.predict(X_val)
    accuracy = accuracy_score(y_val, val_preds)

    return accuracy

This is the objective function we want to optimize. It takes a trial object as input, which is used to suggest hyperparameter values.

The function splits the data, picks the hyperparameters (the number of trees and the maximum depth), trains a RandomForestClassifier model and returns its accuracy on the validation set.

Let's create the Study object we need with the TPE optimization method:

study = optuna.create_study(direction='maximize',
                            sampler=optuna.samplers.TPESampler())

Here we create a Study object, specifying direction='maximize' to maximize the target metric (accuracy) and passing TPESampler to use the TPE optimization strategy.

Let's start optimization!

study.optimize(objective, n_trials=100)

This code starts the optimization process via the optimize() method. We pass the objective function as the argument to be optimized and specify the number of trials (n_trials) Optuna must run.

Let's summarize.

best_params = study.best_params
best_accuracy = study.best_value

We get the hyperparameters for our forest and their accuracy on the validation data from the best_params and best_value attributes of the Study object.
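
As a natural follow-up (a sketch assuming X and y are the same full dataset used inside the objective), the best parameters can be plugged straight back into a final model:

# Retrain a final forest on the full data with the best hyperparameters found
final_model = RandomForestClassifier(**study.best_params)
final_model.fit(X, y)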

We remind you that the developers have their own documentation describing the framework's full vocabulary and which models it applies to; read it before you use it.
