YandexGPT 2 is a major update to the Yandex language model

Today at Practical ML Conf we presented YandexGPT 2, a new version of our large language model. It is already working in Alice's "Let's think of it" skill, where it helps structure information, generate ideas, write texts, and much more. The new model answers better than the old one in 67% of cases, and in some scenarios it wins by an even greater margin. We achieved this result thanks to improvements at every stage of model training, but the key change is the new pretrain.

In this post I'll briefly cover what has changed in how we train the model, in which scenarios those changes have had the greatest effect, and what we plan to do next.

In which scenarios is the new model particularly useful?

First, a few words about how we compare models. The same model can be strong in one scenario but weak in another. How, then, do we determine whether the model as a whole has become smarter?

We solved this problem as follows: we collected 500 user tasks chosen to be as diverse as possible. We then gave them to both the old and the new model and counted how often the new model's answer turned out to be better. If the new model wins on more tasks, we consider it smarter. YandexGPT 2 outperformed the previous version in 67% of cases.

So much for evaluating the model as a whole. But how does it behave in specific scenarios that are popular with users? To find out, we split the same 500 tasks into groups corresponding to different scenarios and measured how the model's quality changed in each of them:

  • text generation – wins in 69% of cases;
  • retelling and text analysis – 68%;
  • brainstorming ideas – 66%;
  • styling text for an audience or character – 62%;
  • answering questions – 62%.
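As a toy illustration of how such a side-by-side tally works (this is not our actual evaluation pipeline, and the judgment records below are made-up placeholders):

```python
from collections import defaultdict

# Hypothetical side-by-side judgments: for each task an assessor recorded
# which model's answer they preferred and the task's scenario label.
judgments = [
    {"scenario": "text generation", "winner": "new"},
    {"scenario": "answering questions", "winner": "old"},
    {"scenario": "text generation", "winner": "new"},
    # ... one record per task in the 500-example basket
]

# Overall win rate of the new model.
wins = sum(j["winner"] == "new" for j in judgments)
print(f"overall: {wins / len(judgments):.0%}")

# Per-scenario win rates, as in the list above.
per_scenario = defaultdict(lambda: [0, 0])  # scenario -> [new wins, total]
for j in judgments:
    bucket = per_scenario[j["scenario"]]
    bucket[1] += 1
    bucket[0] += j["winner"] == "new"
for scenario, (w, n) in per_scenario.items():
    print(f"{scenario}: {w / n:.0%}")
```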

Here are some examples for different scenarios:



(Screenshots in the original post: a YaGPT 2 text-styling example, an answer to a question in the form of a table, an idea-generation example, a text-analysis example, and a text-generation example.)

What has changed in training the new model

Model training has two main stages: pretraining and fine-tuning. In the first stage, the neural network builds up its erudition and general knowledge of the world, language, and tasks; in the second, it learns to fulfill requests and to follow the expected format and style of the response. I already covered these stages in the previous article about launching YaGPT in Alice. The main thing to remember is that a problem of one stage cannot be solved by improving the other.

In the story about launching the first model, the focus was on collecting data for fine-tuning. Now I'll tell you more about the pretrain.

The task of the pretrain is to absorb all the useful knowledge of the Internet. The biggest challenge at this stage is selecting the most useful training data from an endless stream of it. But how do we know whether the dataset is getting better with each change? Retraining the large model after every dataset change and measuring its quality is incredibly time-consuming and expensive; if we did that, we would move at a snail's pace. A more realistic approach is to accumulate many changes in the dataset and only then retrain the model, but then there is a significant risk of picking the wrong direction and getting a drop in quality instead of a gain.

Previously, we reviewed changes ourselves and even built tools for manually searching through the pretrain data. Assembling the dataset was akin to an art. And the better the dataset became, the harder it was to find problems manually. So we took a different path.

Now, when the dataset changes, we train a small, fast model on the new version and compare it with the same model trained on the old version of the dataset. If the result is positive, the change is accepted. This way we spend resources only on dataset changes that actually improve the quality of the base model, which means we can test far more hypotheses per unit of time. The key difficulty is that a small model does not fully reflect the properties of a large one, so this way of measuring is not ideal, but in practice we are better off with it than without it.
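A minimal sketch of this proxy-model loop (not our actual pipeline; the training and evaluation functions here are placeholders to be supplied by the caller):

```python
def evaluate_dataset_change(old_dataset, new_dataset, eval_set,
                            train_small_model, evaluate):
    """Decide whether a dataset change is worth keeping by comparing
    two small proxy models instead of retraining the large model.

    train_small_model and evaluate are placeholders for whatever
    training and quality-measurement code is actually used.
    """
    baseline_model = train_small_model(old_dataset)
    candidate_model = train_small_model(new_dataset)

    baseline_score = evaluate(baseline_model, eval_set)
    candidate_score = evaluate(candidate_model, eval_set)

    # Accept the change only if the proxy model got better.
    return candidate_score > baseline_score
```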

We were able to test a lot of ideas and accept those that proved useful. Here are some of them:

  • We trained a classifier for low-quality text. Such text may contain encoding errors, HTML markup, repeated sentences, and the like.
  • We trained a usefulness classifier. A text may look fine yet be useless to the user; we consider a text useful if it contains answers to real queries from Yandex Search users.
  • We increased the proportion of highly cited texts.
  • We improved the deduplication algorithm: duplicates now make up less than 0.5% of the data (a toy near-duplicate check is sketched after this list).
  • We built a separate tool for assessing "factual completeness": we took real factual queries from Search and measured what proportion of them can be answered from the pretrain data. We raised this share from 70% to 80%.
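For illustration, here is a toy near-duplicate filter in the spirit of the deduplication item above; the shingle size, similarity threshold, and quadratic scan are simplifications for readability, not the algorithm actually used in the pipeline (a production system would use sketching such as MinHash/LSH to scale to web-sized corpora):

```python
def shingles(text: str, n: int = 5) -> set[str]:
    """Word n-gram shingles used to compare documents for near-duplication."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to any already-kept one."""
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, kept_sh) < threshold for _, kept_sh in kept):
            kept.append((doc, sh))
    return [doc for doc, _ in kept]
```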

Where to try it and what to expect in the future

The new model is already running in Alice's "Let's think of it" skill. It is available on Yandex Stations, on TVs with Alice, in the Yandex app and Yandex Browser, on the search results page, and on ya.ru. By the way, in Search the chat window with the neural network can now be expanded to full screen for more convenient use.

What’s next? We will keep improving the quality of the pretrain and fine-tuning datasets, since we continue to see a strong effect from high-quality examples. We have not yet implemented RLHF, but we are preparing for that step. And, of course, we will keep rolling YaGPT out across Yandex services, but only where it will be useful.
