Down with randomness, or looking for the best settings for text augmentation

Hello everyone, this is Igor Buyanov, Senior Developer at MTS AI. This post is a text version of the talk I gave last Friday at PyCon 2024. I will tell you how we optimized the parameters of augmentations for text data and what came out of it. The text is intended for a wide range of readers, so if you are hearing about augmentations for the first time, do not be afraid, we will figure it out together.

What are augmentations and why are they needed?

The job of every ML engineer is to make their model better. To achieve this, you either work on the model or improve the quality and quantity of the data. We will consider the second path. What problems can arise? Sometimes the data is so specific and rare that getting a new portion is simply impossible, or the data is still out for annotation while the model needs to be improved “right now”. Or maybe we are a small startup and simply have no money for annotation. What do we do in such cases?

We can take an existing example from the dataset and change it a little bit, just a tiny bit, so that its connection to the class is not lost. Even if the changes are small, we effectively get a new example to train on. This is called augmentation. You might ask why this works at all. Let's look at this picture.

I bet the picture on the left is typical of well-known CV datasets like ImageNet: correct, beautiful, high quality, clearly taken by a photographer in a studio. The problem arises when a model trained on such tidy pictures starts working with photos from ordinary users. A tilted horizon, cropped shots, defocused images, and other defects are what the model will actually have to deal with. And if the model has never seen that a cat can appear only half in the frame, its prediction confidence can drop significantly, or it may predict something completely wrong.

We can fix this by producing such “incorrect” photos ourselves, with a transformation that takes one line of code, as in the picture above on the right. This way we reduce the impact of the effect using nothing but compute.
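
To make the “one line of code” idea concrete, here is a minimal sketch with torchvision; any image augmentation library will do, and the specific transforms and probabilities here are purely illustrative:

    from torchvision import transforms

    # A stack of "spoiling" transforms: every call produces a slightly different
    # version of the same image, so the model gets to see tilted, cropped and jittered cats.
    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),                # mirror the image
        transforms.RandomRotation(degrees=15),                 # tilt the horizon a bit
        transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),   # crop the cat "half-sized"
        transforms.ColorJitter(brightness=0.3, contrast=0.3),  # lighting defects
    ])

    # augmented_image = augment(image)  # image is a PIL image or a tensor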

In general, augmentations expand the diversity of our dataset and thereby help the model generalize its knowledge about the nature of the data and cope with the task better.

What augmentations are there for text?

Above we looked at the CV example because it is super visual. Now let's look at what augmentations there are for text. I divided them into two groups: algorithmic and generative.

Algorithmic augmentations

This group includes augmentations that we write by hand. They usually perform one very simple action. A classic example is the so-called Easy Data Augmentation (EDA) set, which will come up later in the post.

Another example is typos. The idea is somewhat similar to the cat example. If you work with texts that people type, especially on smartphones, it is only logical to expect typos in them. For a model, a word with a typo is a completely different word, so do not be surprised when it behaves strangely on such inputs. But we can introduce typos ourselves. There are many ways: random character replacement, emulating a QWERTY keyboard, or even imitating the mistakes real users make. For the latter there is a good library, Sage.
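
To show the mechanics, here is a toy sketch of typo augmentation; this is not the Sage API, just an illustration of replacing characters with their QWERTY neighbours:

    import random

    # A tiny neighbour map; a real implementation would cover the whole keyboard.
    QWERTY_NEIGHBOURS = {
        "a": "qwsz", "s": "awedxz", "d": "serfcx",
        "e": "wsdr", "o": "iklp", "i": "ujko",
    }

    def add_typos(text: str, prob: float = 0.1) -> str:
        result = []
        for ch in text:
            if ch.lower() in QWERTY_NEIGHBOURS and random.random() < prob:
                result.append(random.choice(QWERTY_NEIGHBOURS[ch.lower()]))  # make a typo
            else:
                result.append(ch)
        return "".join(result)

    print(add_typos("How do I top up my SIM card account"))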

Generative augmentations

The second group includes everything that uses a generative model. A popular example is back translation: translate the source text into another language, and then translate it back. Here is how it works on a line from a classic:

“I remember a wonderful moment: you appeared before me” → translated into another language → back again as something like “I recall a wonderful instant: you appeared in front of me”

As you can see, the meaning is preserved, but the wording is different. In the era before transformers there was a popular bit of fun: run a text through a chain of ten different languages and bring it back to Russian. Sometimes the meaning was lost completely, sometimes partially, and sometimes it produced genuinely funny texts. I have not managed to reproduce the effect as well nowadays; translators have become much better than ten years ago. Then again, I did not try for long.
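
If you want to try back translation yourself, a rough sketch with two public machine translation models from Hugging Face looks like this; the model choice is just an example, any decent pair will do:

    from transformers import pipeline

    # Russian -> English -> Russian round trip.
    ru_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ru-en")
    en_ru = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ru")

    def back_translate(text: str) -> str:
        english = ru_en(text)[0]["translation_text"]
        return en_ru(english)[0]["translation_text"]

    print(back_translate("Как мне пополнить счёт сим-карты"))  # a Russian utterance, since our data is Russian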

Of course, any method of augmenting a dataset with ChatGPT also falls into this category. The special thing about language models is that you can synthesize entire datasets directly, but keep in mind that this is still not a solution to every problem, although a lot of work is being done in this direction.

Our case and idea

One of my tasks is to build and improve intent classifiers, the models that understand (or at least should understand) what the user wants. We have hundreds of intents, and they are naturally imbalanced: there are super popular intents, and there are those that get a mere handful of requests per month. We call such intents small or thin; the cutoff for them is 100 examples in the training set.

Here is the problem with them. As a rule, they are specific variants of more general intents that have plenty of data; roughly speaking, the difference can come down to a single phrase or even a single word. Simply because the model sees the more general intent much more often, it starts stuffing all these specific intents into the general one. It occurred to me that if we increase the volume of the small intents, they will be able to push back against the large ones instead of dissolving in the gradients.

As I thought about the problem, I had three thoughts in my head:

  • Augmentations usually have two parameters: the probability of being applied and/or the number of applications.

  • By changing the augmentation parameters, we change the final data, and therefore the quality of the model.

  • Why not chain augmentations into sequences to get more variety?

From these a common question emerged:

WHAT SETTINGS AND SEQUENCE OF AUGMENTATIONS WILL GIVE THE HIGHEST QUALITY MODEL?

Then it struck me that models have their own parameters that we tune to squeeze out the best possible result. They are not learned during training, we set them ourselves, which is why we call them hyperparameters. Having drawn the parallel, the answer suggested itself:

LET'S REPRESENT AUGMENTATION AS A HYPERPARAMETER

The next question was how we would search for these hyperparameters. Three methods came to mind:

  • grid search – a complete enumeration of the entire search space. Reliable, but prohibitively long;

  • random search – just randomly select parameters and see what happens. Faster than grid search, but unpredictable;

  • Bayesian search — search using the Bayesian optimization framework. Trusting the mathematical machinery, we can say this is the best option. But, to be honest, I mostly just wanted to get a feel for it.

Okay, the search method is settled, but there is one small problem. Our training set is so big that training one model takes six hours on four GPUs, and the optimization needs dozens of iterations to produce anything intelligible. We would have been waiting for strong AI to arrive if we had used our production model. We got around this by using a proxy model in the form of logistic regression.

So, our general setup is as follows:

  • Proxy model for quality assessment is logistic regression.

  • We use the corporate BERT as a vectorizer.

  • The objective function for optimization is 1 − F1_macro. This odd form is needed because the optimization framework can only search for a minimum. We use the macro average because it sags more than micro and weighted (a sketch of the objective is given right after this list).

  • The maximum number of steps is 100. Just because it’s 100.

  • Training is carried out only on the small classes, not on the entire dataset. This is also a proxy measure for the sake of training time.
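
A minimal sketch of that objective, assuming the texts are vectorized by some embed() function that stands in for the corporate BERT:

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    def objective(train_texts, train_labels, val_texts, val_labels, embed):
        # embed() is a placeholder for the corporate BERT vectorizer
        clf = LogisticRegression(max_iter=1000)
        clf.fit(embed(train_texts), train_labels)
        preds = clf.predict(embed(val_texts))
        # the optimization framework looks for a minimum, hence 1 - F1_macro
        return 1 - f1_score(val_labels, preds, average="macro")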

What augmentations were used?

Now let's talk about the set of augmentations we used. Almost all of them come from the EDA set, plus one custom addition.

Token deletion

We go through the tokens and spin the random machine. If the random machine produces a number greater than the threshold, then we delete the token. We can do this several times. Example:

“How do I top up my SIM card account” → “How ___ ___ top up ___ SIM card account”
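
A sketch of this random machine; the default threshold here is just an illustrative value:

    import random

    def delete_tokens(text: str, threshold: float = 0.85) -> str:
        tokens = text.split()
        # drop a token whenever the draw exceeds the threshold
        kept = [tok for tok in tokens if random.random() <= threshold]
        return " ".join(kept) if kept else text  # never return an empty string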

Token shuffling

Randomly select two tokens and swap them. We can do this several times. Example:

“How do I top up my SIM card account” → “account do I top up my SIM card How”
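
The same idea in code, with the number of swaps as a parameter:

    import random

    def swap_tokens(text: str, n_swaps: int = 1) -> str:
        tokens = text.split()
        if len(tokens) < 2:
            return text
        for _ in range(n_swaps):
            i, j = random.sample(range(len(tokens)), 2)  # pick two random positions
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return " ".join(tokens)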

Synonym replacement

We go over the words in the text and replace some of them with a synonym. We can do this several times. This augmentation needs an external resource: a dictionary with a set of synonyms for each word. Such a resource is called a thesaurus. For the Russian language there is RuWordNet.

“How do I top up my SIM card account” → “How do I enrich my SIM card account”
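
A sketch of the mechanics; a tiny hand-written dictionary stands in here for RuWordNet, whose actual lookup is not shown:

    import random

    # A toy stand-in for a real thesaurus such as RuWordNet.
    SYNONYMS = {
        "account": ["balance"],
        "top": ["fill", "load"],
    }

    def replace_synonyms(text: str, prob: float = 0.2) -> str:
        tokens = text.split()
        for i, tok in enumerate(tokens):
            candidates = SYNONYMS.get(tok.lower())
            if candidates and random.random() < prob:
                tokens[i] = random.choice(candidates)  # swap the word for a synonym
        return " ".join(tokens)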

Inserting speech pauses

This is our custom augmentation. I work on a voice chatbot, which means my texts are recognized speech. Speech pauses are all those “uh”, “aa”, “mm” and other filler sounds that people make while they think on the go. We noticed at some point that the presence of such a pause can seriously confuse the model, so we decided to turn it into an augmentation. It works like this: we walk over the gaps between tokens and spin the random machine. If the number is greater than the threshold, we randomly pick a pause from the set “uh”, “aa”, “mm” (the most frequent ones in our data) and insert it. We can do this several times.

“How do I top up my SIM card account” → “uh how do I top up my mm SIM card account”
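
A sketch of the pause insertion; the filler set matches the one above, the threshold is illustrative:

    import random

    PAUSES = ["uh", "aa", "mm"]  # the most frequent fillers in our data

    def insert_pauses(text: str, threshold: float = 0.8) -> str:
        result = []
        for tok in text.split():
            # spin the random machine for the gap before this token
            if random.random() > threshold:
                result.append(random.choice(PAUSES))
            result.append(tok)
        return " ".join(result)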

Token doubling

The same random-machine mechanics, but this time we walk over the tokens themselves and, if we are “lucky”, simply duplicate the token. We can also do this several times.

“How do I top up my SIM card account” → “How do I top up my SIM card account account”
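
And a sketch of the doubling:

    import random

    def double_tokens(text: str, threshold: float = 0.85) -> str:
        result = []
        for tok in text.split():
            result.append(tok)
            if random.random() > threshold:
                result.append(tok)  # the "lucky" draw: repeat the token
        return " ".join(result)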

Augmentation sequences

Unfortunately, I did not have enough time to code proper enumeration of the sequences, so I had to pick by hand three sequences that seemed reasonable to try. Here they are on the slide below:

It is funny how the sequences look almost as if they were simply shifted one position to the right.


Tools

I will briefly describe the two tools that helped implement all of this.

NePS

This is an open library for hyperparameter search. I chose it solely because I came across it in one of the papers I read while studying the topic. You can pick any other framework you are familiar with, for example Optuna. Here is a code snippet that shows the main components:

We define a function inside which we do everything needed to compute the objective; it must return the objective value. Next comes a dictionary in which we define the hyperparameter search space, and finally the launch itself.
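
Since the snippet itself lives on the slide, here is a sketch of its shape: the hyperparameter names and ranges are placeholders, train_and_evaluate is our own code (essentially the objective described above), and the exact NePS interface may differ between versions, so check the current docs; with Optuna the structure would be very similar:

    import neps

    # 1) The objective: build the augmented data with the given parameters,
    #    train the proxy model and return 1 - F1_macro (the search looks for a minimum).
    def run_pipeline(pause_threshold: float, n_doublings: int) -> float:
        return train_and_evaluate(pause_threshold, n_doublings)  # our own code, not shown here

    # 2) The search space: one entry per augmentation hyperparameter.
    pipeline_space = dict(
        pause_threshold=neps.FloatParameter(lower=0.5, upper=0.95),
        n_doublings=neps.IntegerParameter(lower=1, upper=5),
    )

    # 3) The launch itself.
    neps.run(
        run_pipeline=run_pipeline,
        pipeline_space=pipeline_space,
        root_directory="neps_results",
        max_evaluations_total=100,
    )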

textfab

This is my humble open library, where I initially collected all sorts of text preprocessing methods because one day I got tired of writing the same functions over and over. I also wanted a way to log the preprocessing, so every factory (an object that performs the preprocessing) can be serialized to a YAML file and then stored in DVC, wandb, ClearML, and so on. Later I added augmentations as well. Here is a snippet of code to give you the idea:
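
The actual interface is best checked in the repository; below is only a schematic of the factory idea (an ordered list of transforms that can be dumped to YAML), reusing the toy augmentations sketched above, and not the real textfab API:

    import yaml

    class ToyFactory:
        """Not textfab itself: a schematic pipeline of named text transforms."""

        def __init__(self, steps):
            self.steps = steps  # list of (name, callable) pairs, applied in order

        def __call__(self, text: str) -> str:
            for _, fn in self.steps:
                text = fn(text)
            return text

        def to_yaml(self) -> str:
            # only the step names go to the log, which is enough to reproduce the config
            return yaml.safe_dump({"steps": [name for name, _ in self.steps]})

    factory = ToyFactory([("insert_pauses", insert_pauses), ("double_tokens", double_tokens)])
    print(factory("How do I top up my SIM card account"))
    print(factory.to_yaml())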

Full algorithm

The slide below shows the whole algorithm assembled.

Here is a snippet of the code that implements this algorithm:

The augmentation bank is declared at the top, then the sequences are fed into permutations. After that everything is simple: the factory is assembled, the entire training set is run through it, and the augmented copy is concatenated with the original data. Finally, the model is trained and the metric is computed on the validation set.
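
A condensed sketch of that loop, under the same assumptions as before: the toy augmentation functions from the previous sections, a placeholder embed() for the corporate BERT, and illustrative names throughout:

    from itertools import permutations

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    # The augmentation bank (the toy functions sketched earlier).
    AUG_BANK = {
        "pauses": insert_pauses,
        "doubling": double_tokens,
        "shuffle": swap_tokens,
    }

    def evaluate_sequence(sequence, train_texts, train_labels,
                          val_texts, val_labels, embed):
        # assemble the factory for this sequence and run the training set through it
        augmented = list(train_texts)
        for name in sequence:
            augmented = [AUG_BANK[name](text) for text in augmented]
        # concatenate the augmented copy with the original data
        texts = list(train_texts) + augmented
        labels = list(train_labels) * 2
        # train the proxy model and score it on the validation set
        clf = LogisticRegression(max_iter=1000).fit(embed(texts), labels)
        preds = clf.predict(embed(val_texts))
        return 1 - f1_score(val_labels, preds, average="macro")

    # candidate orderings of the bank (in my case three sequences were picked by hand)
    sequences = list(permutations(AUG_BANK))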

Results of the work

At the output, NePS gives us a table where each row shows the selected parameters, the running time, and the value of the loss function (I sorted the rows by loss).

The final configuration turned out to be like this:

It is noteworthy that the winning sequence contains no token deletion. To get at least an eyeball sense of what the texts look like after this augmentation, here are a few examples:

Now the most interesting part is the results of the model itself.

We compared the macro F1 of the proxy model (logistic regression on top of the BERT encoder) in three training setups:

  • On clean data of the small classes. Result — 0.902;

  • On the worst case of small class augmentation. Result — 0.887 (-0.01);

  • On the best option of small class augmentation. Result — 0.933 (+0.03).

What does this tell us? First, augmentations can actually do harm, and that is important to know. Second, we managed to find an augmentation setup that gave a gain of as much as 3 points. Here I will note that, generally speaking, the closer the quality is to one, the harder it is to improve the model; at such F1 values, three points is a lot. And all of it costs nothing but compute and a relatively small amount of time. Yes, we are talking about a proxy model here, but if you imagine that this model were the target one, then wow.

On the production model, after running small classes through the best augmentation, we got an increase of two points, which is also quite a lot.

Let's look at the statistics of the optimization results:

What we can say:

  • You would have to be very unlucky to get a negative result.

  • If you randomly selected augmentations, you would most likely get a 2 point increase (we are talking about a proxy model here, remember).

  • It seems the hyperparameter search was justified, because stumbling onto the best configuration at random would itself have required luck.

One could say it was not worth all this bother for the sake of one extra point. But in fact you only need to bother once, to write the enumeration logic. After that you can leave the experiments to run overnight and simply take the best result. As I wrote above, at these quality levels a difference of 1 point is a lot.

Which direction can you move in:

  • The simplest thing that catches the eye is proper enumeration of the sequences. It would not take long; I simply ran out of time.

  • Verify that results transfer from the proxy model to the real one. The fact that we got a 2-point gain on the production model does not yet prove we were not just lucky. Ideally, we should run a dozen experiments with different augmentations and look at the correlation between the quality of the production model and that of the proxy model.

  • Or even bring a real model into the loop, in a simplified form. For example, you can freeze the encoder or use a distilled version of BERT. It is still not the production model itself, but the approximation is clearly better than logistic regression.

  • You can also expand the set of augmentations to your taste. We used algorithmic ones because of their simplicity, but nothing stops you from using generative ones.

  • For our domain, RuWordNet is not a very suitable thesaurus, so you could try building your own, even a fairly noisy one.

  • You can also handle keywords with care, so that you do not accidentally delete or distort them too much; otherwise the model will have nothing to rely on.

I can predict a couple of questions that might arise in your mind:

  • “So what happened to the small classes themselves?” I did check this during my test runs, but unfortunately the detailed per-class metrics were lost.

  • “How many augmentations can be added?” Indeed, I wrote that I simply ran the entire set of thin-class texts through the pipeline once, but is that optimal? There is no simple answer; entire papers are devoted to this question. But nothing stops you from treating it as yet another hyperparameter to search over.

That's all I have for now. A big thank you to my colleagues at MTS AI who helped me prepare the talk, and to the organizers of PyCon for a great conference.

And yes, here is the link to my Telegram channel, where I write more often than here.
