How selective forgetting helps AI learn

Removing certain information during training helps machine learning models learn new languages faster and better.

A group of computer scientists has come up with a more flexible kind of machine learning model. What makes it special: the model must periodically forget some of what it knows. The new approach won't replace the huge models, but it may tell us more about how such models understand natural language.

The problem with language models

Nowadays, natural language processing is most often done with neural networks. Each “neuron” in a network is a mathematical function that receives signals from other neurons, performs a calculation, and passes the result on through successive layers. At first the flow of information is more or less chaotic; during training, the network adapts to the data, and the flow of information between neurons becomes ordered and refined.
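For illustration only (this is not code from the study), a single artificial neuron can be sketched in Python as a small function: it weights its incoming signals, adds a bias, and squashes the result with a nonlinearity before passing it on to the next layer.

```python
import math

def neuron(inputs, weights, bias):
    """One artificial 'neuron': weight the incoming signals, add a bias,
    and squash the result with a sigmoid before passing it on."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Example: a neuron with two incoming signals
print(neuron([0.5, -1.2], weights=[0.8, 0.3], bias=0.1))
```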

Let's say an AI researcher wants to build a bilingual model that translates from German to Chinese. To do this, they train the model on large amounts of text in both languages. Training builds up neural connections so that the model learns to match text in one language with the corresponding words in the other.

Such training requires enormous computing power. And if the model does not perform well, or if the user's needs change, adapting it is difficult.

“Let's say you have a model that covers 100 languages, but the language you need is not among them. You could start training from scratch, but that is not the best option,” says Mikel Artetxe, a co-author of the new study and founder of the AI startup Reka.

How to teach a language model to “forget”

Artetxe and his colleagues tried to get around this limitation. A few years ago, they trained a neural network in one language and then erased what it knew about the building blocks of words, called tokens. These tokens are stored in the first layer of the network, the embedding layer; the remaining layers of the model were left untouched. The researchers then retrained the model on a different language, and the tokens of that new language populated the embedding layer.
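A minimal sketch of that idea, assuming a PyTorch-style model (the class, layer sizes, and function names below are illustrative, not the researchers' actual code): only the embedding layer is swapped out for the new language, while the deeper layers are kept as they are.

```python
import torch.nn as nn

class ToyLanguageModel(nn.Module):
    """Toy encoder: a token embedding layer followed by deeper, shared layers."""
    def __init__(self, vocab_size, hidden_dim=256, num_layers=4):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden_dim)
        block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(block, num_layers=num_layers)

    def forward(self, token_ids):
        return self.body(self.embeddings(token_ids))

def reset_embeddings_for_new_language(model, new_vocab_size, hidden_dim=256):
    """Replace only the embedding layer (the tokens of the new language)
    and keep the deeper layers, which are assumed to hold the more abstract,
    language-independent knowledge."""
    model.embeddings = nn.Embedding(new_vocab_size, hidden_dim)
    return model
```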

The retraining bore fruit: the model was able to learn and process the new language. The researchers suggested that the embedding layer stores information about the words used in a particular language, while the deeper layers store more abstract information about the concepts behind human language, and that this is what helps the model pick up a second language.

“In every language we call the same things by different names, but we live in the same world. That is why high-level reasoning mechanisms appear in the model: an apple is something juicy and sweet, not just a word,” explains Yihong Chen, lead author of the recent study.

Although the forgetting method effectively added a new language to an already trained model, retraining was still expensive: it required a large amount of linguistic data and computing power. Chen suggested a small trick: instead of training the model, wiping the embedding layer, and then retraining, you can periodically reset the embedding layer during the early stages of the initial training.

“That way the whole model gets used to the resets. If you later want to add a new language, it is much easier, because the model is already accustomed to this behavior,” explains Artetxe.
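A rough sketch of this trick, again with a hypothetical PyTorch model like the one above (the reset interval, initialization, and helper names are assumptions, not values from the paper): during pretraining, the embedding weights are re-initialized every few thousand steps while the deeper layers keep what they have learned.

```python
import torch
import torch.nn as nn

def pretrain_with_periodic_forgetting(model, batches, loss_fn, reset_every=1000, lr=1e-4):
    """Pretraining loop that periodically 'forgets' the embedding layer:
    every `reset_every` steps its weights are re-initialized, while the
    deeper layers keep everything they have learned so far."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (token_ids, targets) in enumerate(batches, start=1):
        optimizer.zero_grad()
        loss = loss_fn(model(token_ids), targets)
        loss.backward()
        optimizer.step()
        if step % reset_every == 0:
            # Wipe only the first (embedding) layer; deeper layers are untouched.
            with torch.no_grad():
                nn.init.normal_(model.embeddings.weight, mean=0.0, std=0.02)
```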

Testing the forgetting model

The researchers took RoBERTa, a widely used language model, and trained it with the periodic forgetting method, then compared its performance with that of the same model trained in the standard way, without forgetting. The forgetting model turned out slightly worse than the classical one, scoring 85.1 points on a common measure of language accuracy (for comparison, the standard model scored 86.1 points).

They then retrained both models on other languages using much smaller data sets: just five million tokens instead of 70 billion. The accuracy of the standard model dropped to an average of 53.3, while the forgetting model fell only to 62.7.

The forgetting model also fared much better when the team imposed computational limits during retraining. When they cut the training length from 125,000 steps to 5,000, the accuracy of the forgetting model dropped to an average of 57.8, while the classical model fell to 37.2, which is no better than random guessing.

Why forgetting models learn better

The researchers suggested that if language models understand language at all, they do so at a deeper level than merely memorizing individual words. The human brain seems to work in a similar way.

“Human memory is generally not good at storing large amounts of detailed information. Instead, people tend to remember the gist of what happens, abstracting and extrapolating. One way to get flexible performance out of AI is to build more human-like processes, such as adaptive forgetting, into the model,” explains Benjamin Levy, a neuroscientist at the University of San Francisco.

Artetxe hopes that more flexible, forgetting language models will not only tell us something about how understanding works, but also help bring the latest advances in AI to more languages. AI models work well with English and Spanish, two languages with abundant training material, but far less well with his native Basque, a regional language of northeastern Spain.

“Most models from big tech companies handle Basque poorly. Adapting existing models to Basque is the best available option,” the researcher admits.

Yihong Chen also looks forward to a world with more diverse AI models.

“I imagine a situation in which the world no longer needs a single big language model; after all, we already have so many of them. If there is a factory producing language models, this kind of technology will be useful to it: one base model that can quickly adapt to new domains,” she says.



