Synthetic Minority Oversampling Technique

In data science, everyone knows the importance of data for the success of any machine learning project. The data itself is often far more valuable than the model trained on it, since obtaining that data can be much more difficult, dangerous, and expensive than training the model. Dataset generation is therefore gaining popularity, and dedicated frameworks are being built for it. Today we will talk about one such technique: SMOTE, or Synthetic Minority Oversampling Technique. A great deal of material has accumulated on this technique over the last two decades; the key difference of this article is in the experiments conducted while studying the performance of this type of oversampling.

Statement of the problem

Anyone who has ever worked with machine learning is familiar with the concept of class imbalance. Balanced datasets are rare, unless we construct a balanced sample ourselves. Many have also probably heard that class imbalance can negatively affect model training, so the question of how to deal with this problem comes up constantly.

Sample with and without class balance

There are several ways to supplement data, but today we will focus on correcting the imbalance with “synthetics”, that is, artificially generating additional samples while taking the nature of the data into account. There are two ways to equalize classes: oversampling and undersampling. Oversampling adds samples to the under-represented class; undersampling trims the dominant class, thereby evening out the imbalance. SMOTE specifically is an oversampling technique: it augments the under-represented class to correct the imbalance in the data.
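To make the distinction concrete, here is a minimal sketch using imbalanced-learn's random resamplers (the simplest form of each approach; the library choice is mine, not prescribed by the article):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# A toy 9:1 imbalanced binary dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly {0: 900, 1: 100}

# Oversampling: duplicate minority samples until the classes match.
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y_over))  # both classes at the majority count

# Undersampling: discard majority samples until the classes match.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_under))  # both classes at the minority count
```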

An example of oversampling and undersampling in action

Classic SMOTE

There are quite a few SMOTE variants. This article describes the three main ones and then moves on to problems and experiments. Let's start with the classic one. The process of creating synthetic examples is quite simple and is based on the nearest-neighbors algorithm.

We take a minority sample and one of its neighbors, chosen at random, and interpolate between these two points. The result is a new sample that lies somewhere between them in feature space, inheriting the properties of both. This interpolation generates new data that preserves the essential characteristics of the minority class while adding a bit of variety.
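In formula form (the interpolation rule from the original SMOTE paper by Chawla et al.), for a minority sample $x_i$ and a randomly chosen minority neighbor $x_{zi}$:

$$x_{\text{new}} = x_i + \lambda \, (x_{zi} - x_i), \qquad \lambda \sim U(0, 1)$$

With $\lambda$ drawn uniformly, the new point lands at a random position on the segment between the two samples.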

An example of the classic SMOTE in action

This is what it will look like after the new data is generated:

The result of the classic SMOTE
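For reference, a minimal sketch of classic SMOTE via imbalanced-learn (class and parameter names are from the imblearn API; the dataset here is synthetic):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# A toy 9:1 imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# k_neighbors=5 (the default): each synthetic point is interpolated between
# a minority sample and one of its 5 nearest minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```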

Borderline-SMOTE

Borderline-SMOTE targets samples located at the boundary between classes. The idea is that these samples are often the hardest to classify and therefore the most important for model training. It first identifies the borderline samples of the minority class, i.e. those surrounded by many neighbors from the dominant class. SMOTE is then applied exclusively to these borderline samples, increasing their number and thus reinforcing the decision boundary around them.

Example of Borderline-SMOTE in action

Mathematical description of how Borderline-SMOTE works:
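A sketch of the selection rule from the original Borderline-SMOTE paper (Han et al., 2005). For each minority sample $p_i$, find its $m$ nearest neighbors in the full training set, and let $m'$ be how many of them belong to the majority class:

$$p_i \in \begin{cases} \text{NOISE}, & m' = m,\\ \text{DANGER}, & m/2 \le m' < m,\\ \text{SAFE}, & 0 \le m' < m/2. \end{cases}$$

SMOTE interpolation is then applied only to the DANGER set, i.e. to the borderline samples.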

SVM-SMOTE

SVM-SMOTE uses a support vector machine to select the samples to augment. SVM is known for its effectiveness at finding optimal decision boundaries between classes. We first train an SVM classifier; then the samples closest to the decision boundary (the support vectors) are used to create new synthetic data.

Example of SVM-SMOTE operation

Mathematical description of the operation of SVM-SMOTE:
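A rough sketch, following the original SVM-SMOTE paper (Nguyen et al., 2011): an SVM is trained on the original data, and the minority-class support vectors $x_{sv}$ serve as generation anchors. New points are created along the segments to the nearest minority neighbors $x_{nn}$: by interpolation when majority samples dominate the neighborhood, and by extrapolation (pushing the minority region outward) when they do not:

$$x_{\text{new}} = x_{sv} + \lambda\,(x_{nn} - x_{sv}) \ \ \text{(interpolation)}, \qquad x_{\text{new}} = x_{sv} + \lambda\,(x_{sv} - x_{nn}) \ \ \text{(extrapolation)}, \qquad \lambda \in [0,1].$$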

ADASYN – Adaptive Synthetic Sampling

ADASYN is not strictly a SMOTE variant, but it is very similar. It focuses on the samples that are hardest to classify and creates more synthetic data for them. The amount of synthetic data for each sample is determined by the number of neighbors from the opposite class: the more neighbors from the other class, the more synthetic data is created.

Example of ADASYN in action

Mathematical description of how ADASYN works:
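The core of ADASYN (He et al., 2008): let $m_s$ and $m_l$ be the minority and majority class sizes, and $G = (m_l - m_s)\,\beta$ the total number of synthetic samples to generate, where $\beta \in [0,1]$ sets the desired final balance. For each minority sample $x_i$ with $\Delta_i$ majority samples among its $K$ nearest neighbors:

$$r_i = \frac{\Delta_i}{K}, \qquad \hat{r}_i = \frac{r_i}{\sum_j r_j}, \qquad g_i = \hat{r}_i \, G.$$

Each $x_i$ then receives $g_i$ synthetic samples, generated by the same interpolation as classic SMOTE, so the harder samples (those with more majority neighbors) get more synthetic data.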

To SMOTE or not to SMOTE?

In May 2022, a paper titled “To SMOTE, or not to SMOTE?” was published. In it, the authors tested the performance of this technique on different machine learning models. The study showed that only so-called weak learners, i.e. simple models such as a decision tree, a perceptron, or a support vector machine, benefited from such oversampling. The state-of-the-art models (XGBoost, LightGBM, CatBoost) received little or no improvement in their metrics.

The metrics are presented below; we are interested in the purple circles. They show that only the weak models improved, while the best ML models barely responded to SMOTE. Each circle is the metric averaged over all datasets in the study; 73 datasets were used in total.

Metrics on weak and SOTA models

Legend for the figures:

  • SVM – Support Vector Machines

  • DT – Decision Tree

  • LGBM – Light Gradient-Boosting Machine

  • XGBoost – Extreme Gradient-Boosting

  • CatBoost – Categorical Boosting

  • MLP – Multilayer Perceptron

Using MLP as an example, here is what the labels on the chart mean:

  • Rank – the ranking of all model + SMOTE combinations used; lower is better.

  • MLP+Default – the model with default hyperparameters, without oversampling.

  • MLP – the model with the best hyperparameters, without oversampling. The entries below it are models with oversampling.

  • MLP+Random – the model with the best hyperparameters plus Random Oversampling (random duplication of minority-class examples).

  • MLP+SMOTE, MLP+SVM-SM, MLP+ADASYN – models with the best hyperparameters using the oversampling variants described above.

  • MLP+Poly – the model with the best hyperparameters plus Polynomfit SMOTE (synthetic generation by fitting a polynomial to the minority-class feature space).

Setting up your own experiments

After reading this paper, I decided to verify its results myself: run at least a few classifiers and SMOTE variants on a single dataset and look at the results.

Experimental conditions:

  • Classic Wine Quality dataset (wine quality scored from 1 to 10)

  • One categorical column was removed (SMOTE expects continuous features and cannot handle categorical values correctly)

  • No undersampling was performed

  • Wines with a score of 9 were removed (there were only 5 of them, too few for SMOTE to generate data from)

  • 20 experiments were conducted (4 models × 3 SMOTE variants, plus 4 runs without oversampling and 4 runs with ADASYN)

  • The test set consists only of real samples, to evaluate the classifiers' real performance

  • No hyperparameter tuning was performed; all models were trained with default settings (a sketch of the setup follows this list)
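A minimal sketch of this setup, assuming a local winequality.csv with a categorical type column and a quality target (the file name and column names are my assumptions; only gradient boosting is shown, the other models plug in the same way):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN

# Assumed file/column names: winequality.csv with 'type' (categorical)
# and 'quality' (target score) columns.
df = pd.read_csv("winequality.csv")
df = df.drop(columns=["type"])      # drop the categorical column
df = df[df["quality"] != 9]         # only 5 samples had a score of 9

X, y = df.drop(columns=["quality"]), df["quality"]

# The test set contains only real samples; oversampling touches the train set only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

samplers = {
    "no oversampling": None,
    "SMOTE": SMOTE(random_state=42),
    "Borderline-SMOTE": BorderlineSMOTE(random_state=42),
    "SVM-SMOTE": SVMSMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}

for name, sampler in samplers.items():
    X_tr, y_tr = (X_train, y_train) if sampler is None \
        else sampler.fit_resample(X_train, y_train)
    model = GradientBoostingClassifier()   # default settings, no tuning
    model.fit(X_tr, y_tr)
    print(f"--- {name} ---")
    print(classification_report(y_test, model.predict(X_test), zero_division=0))
```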

Experiments – Gradient Boosting

So, let's start with gradient boosting. It is considered one of the advanced algorithms, so we expect prediction quality to stay the same or get worse. A quick look at the metrics shows they are almost identical to those of default boosting. Under different SMOTE variants, different classes pull ahead, but the overall picture does not change: everything is more or less the same.

Metrics on gradient boosting

Experiments – Random Forest

Random forest is also a fairly strong model, so we expect results similar to gradient boosting. And so it turned out: a quick look at the metrics shows the same picture, roughly the same as the default random forest. As in the previous section, different classes come out ahead under different SMOTE variants.

Metrics on random forest

Experiments – Multilayer Perceptron

This is where things get interesting. In the 2022 study, perceptrons with SMOTE were significantly ahead of the default perceptron in metrics. What we see here: metrics began to grow on classes that previously had zero scores, for example 3 and 8, although metrics on the other classes declined somewhat. SVM-SMOTE performed best here, which is visible in the F1 score when compared to the other SMOTE variants.

Metrics on the perceptron

Experiments – Support Vector Machines

SVM is also quite interesting. With default settings it could detect only 2 classes out of 6. The same situation as with the perceptron: SVM began to see classes that previously had zero metrics. And again SVM-SMOTE performed best in combination with the base SVM classifier: with it, 4 classes could be detected.

Metrics on support vector machines

When do we use it?

  • when using lightweight models for simple classification

  • I have also seen in another article that model probability calibration breaks down after oversampling, so this also needs to be taken into account

When don't we use it?

  • high dimensionality (distances between points become less distinguishable, the concept of a nearest neighbor loses its meaning, and there is a risk of creating non-representative samples)

  • a lot of noise in the data (a large number of noisy samples will generate a correspondingly large amount of anomalous synthetic data)

  • distribution issues (these can result in samples that do not represent the true distribution of the class)

  • presence of categorical features (SMOTE may create invalid samples for them)


Author of the article: Nikolay Chudinovskikh, Junior Researcher at the UDV Group Research Center


