DialoGPT in Russian

Hello everyone. At the end of 2019, one of the notable works on GPT-2 came out: Microsoft engineers trained the standard GPT-2 to conduct dialogue. After reading their article, I was very impressed and set myself the goal of training the same kind of model, but in Russian.

Time passed, and a year later Sberbank did a very good job, releasing several smaller GPT-3 models trained on Russian into the public domain.

So, when all the stars aligned, after spending a couple of months of nights collecting, processing and cleaning the dataset, I finally trained Sber's GPT-3 medium model to conduct dialogues. Many thanks to DevAlone for creating the Pikastat project, without which I would have spent years collecting data. The model was trained for 2 epochs on 32 GB of text using the Hugging Face Transformers library. Training took 12 days on 4x RTX 2080 Ti.
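For reference, here is a minimal fine-tuning sketch along these lines with Transformers. It is not the actual training script: the base model id (sberbank-ai/rugpt3medium_based_on_gpt2), the toy data and the hyperparameters other than the 2 epochs are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

class DialogueDataset(Dataset):
    """Wraps dialogue strings as token sequences capped at 256 tokens."""
    def __init__(self, texts, tokenizer, max_length=256):
        self.examples = [
            tokenizer(t, truncation=True, max_length=max_length,
                      return_tensors="pt")["input_ids"][0]
            for t in texts
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return {"input_ids": self.examples[idx]}

base_model = "sberbank-ai/rugpt3medium_based_on_gpt2"  # assumed Sber checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# toy data; the real dataset is 41M comment-chain samples in the
# speaker-id/length format described later in the post
train_texts = ["Привет! Как дела? Нормально, а у тебя?"]
train_dataset = DialogueDataset(train_texts, tokenizer)

args = TrainingArguments(
    output_dir="rudialogpt3-medium",
    num_train_epochs=2,                    # as in the post
    per_device_train_batch_size=4,         # assumed
    fp16=torch.cuda.is_available(),
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```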

I express my gratitude to ICL Services, which provided the computing power to train the model.

The dataset consists of comment chains (~92% Pikabu, ~6% YouTube, ~2% VK). The number of training samples is 41 million. I did not specifically clean profanity out of the dataset, so be prepared for “unexpected” responses.

Below are examples of dialogues with the trained model. Where the responses from GPT are long (over 50 tokens), I chose the best of three generated responses. I also allowed myself two attempts to build each dialogue. Otherwise, I did not change or tweak anything.

In the screenshots, the GPT lines were produced entirely by the model. Therefore, “all names and events are fictitious, any coincidences are accidental.”


The average number of replies per dialogue in training was 4, so long dialogues (ten or more replies) are difficult for the model. Also keep in mind that the model’s sequence length is 256 tokens.

During training I only tracked the validation loss. So, to evaluate the quality of the model “by eye”, I put together a web application. I’m not particularly strong in JavaScript, but I implemented the minimum functionality I had in mind. The application repository is here, along with instructions for launching it.

The app consists of two parts:

  1. The site itself, in Flask (Python)

  2. A generator service, in FastAPI (Python) — a rough sketch follows below
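As a rough illustration of what the generator service might look like (the endpoint path, payload fields and checkpoint path here are assumptions for the sketch, not the actual API of the repository):

```python
# generator_service.py — a minimal sketch of a FastAPI generation service
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM

app = FastAPI()

MODEL_PATH = "path/to/trained/checkpoint"  # local checkpoint directory (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

class GenerateRequest(BaseModel):
    context: str               # dialogue history encoded as described below
    max_new_tokens: int = 50

@app.post("/generate")
def generate(req: GenerateRequest):
    input_ids = tokenizer.encode(req.context, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_length=input_ids.shape[-1] + req.max_new_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    # return only the newly generated reply, without the context
    reply = tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                             skip_special_tokens=True)
    return {"reply": reply}
```

Such a service would be started with something like `uvicorn generator_service:app --port 8000`, and the Flask site would call the `/generate` endpoint.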

After starting the service and application, the interaction will look something like this:

The quality of the responses strongly depends on the generation parameters (there is a good introductory article explaining them). When requesting a specific generation length (“Length generate” in the block on the left), some other parameters should also be adjusted for better output. I am now working on optimizing them, as well as on a classifier that will pick the most “successful” answer among the long ones.
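For context, these are the kinds of sampling parameters involved; the snippet reuses `model`, `tokenizer` and `input_ids` from the sketch above, and the specific values are illustrative, not the ones used by the web application:

```python
# illustrative sampling settings for transformers' generate()
output_ids = model.generate(
    input_ids,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.9,         # flatten/sharpen the token distribution
    top_k=50,                # keep only the 50 most likely tokens
    top_p=0.95,              # nucleus sampling: smallest set covering 95% of probability
    repetition_penalty=1.2,  # discourage repeating the context
    max_length=256,          # the model's sequence length
    num_return_sequences=3,  # generate several candidates and pick the best
    pad_token_id=tokenizer.eos_token_id,
)
```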

Now let’s look at the form in which the input string is encoded for the model:

The 0 or 1 highlighted in green is the speaker id; it shows which reply belongs to whom. Highlighted in red is the parameter responsible for the generation length, which takes the following values: [“-”, 1, 2, 3]:

  • “-” – we do not expect any specific generation length from the model

  • “1” – the model will try to generate a short answer; during training, this range was up to 15 tokens

  • “2” – the model will try to generate a medium-length answer; during training, this range was 15 to 50 tokens

  • “3” – the model will try to generate a long answer; during training, this range was 50 to 256 tokens

In this example, we expect a “long” answer from the system to the question “What’s new?”.
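Here is a sketch of how a dialogue history could be assembled into such a string. The bucket boundaries match the description above, but the exact delimiters and layout are my assumption; check the model card for the precise format.

```python
def get_length_param(text: str, tokenizer) -> str:
    """Map a reply to a length bucket: '1' short, '2' medium, '3' long."""
    n_tokens = len(tokenizer.encode(text))
    if n_tokens <= 15:
        return "1"
    if n_tokens <= 50:
        return "2"
    return "3"

def encode_dialogue(replies, tokenizer, next_len="3"):
    """Prefix every reply with |speaker_id|length_param| and open the
    model's turn with the desired length of the next answer."""
    text = ""
    for i, reply in enumerate(replies):
        speaker_id = i % 2  # 0 = user, 1 = model
        text += f"|{speaker_id}|{get_length_param(reply, tokenizer)}|{reply}"
    text += f"|1|{next_len}|"
    return text

# usage: we expect a long ("3") answer to the question "Что нового?"
# (tokenizer is the model's tokenizer, loaded as shown above)
context = encode_dialogue(["Привет!", "Привет, как дела?", "Что нового?"],
                          tokenizer, next_len="3")
```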

I highly recommend talking to the model yourself. I did not create an open site for chatting with it, since generation is quite a resource-intensive task. When I pulled the checkpoints and talked to it, I felt emotions I had not experienced in a long time: it makes me laugh, it makes me sad, and it swears. In general, everything as it should be for a modern, youthful model.

I also noticed that it answers philosophical questions well.

But this version of the model also has quite a few drawbacks: training on a smaller dataset (41 million samples, versus Microsoft’s 147 million), poor quality of long answers, “vague” answers, undertraining of the model, and weak “symbiosis” with Sber’s weights.

The model is available on the Hugging Face Model Hub. You can also download it from Google Drive.
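A minimal sketch of loading the model from the Hub and generating one reply; the model id Grossmend/rudialogpt3_medium_based_on_gpt2 and the hard-coded context format are assumptions based on this post, so check the Model Hub page for the exact names.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Grossmend/rudialogpt3_medium_based_on_gpt2"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# one user reply, and we ask the model for a short ("1") answer
context = "|0|1|Привет! Как дела?|1|1|"
input_ids = tokenizer.encode(context, return_tensors="pt")

output_ids = model.generate(
    input_ids,
    max_length=128,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:],
                       skip_special_tokens=True))
```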

And finally: as a child I really liked the movie “I, Robot”. Let’s see how the model answers Detective Spooner’s questions:

It is clear that Sunny’s level is still a long way off. And that is good, since there is room for improvement.

For any questions and suggestions, write to grossmend@gmail.com. In the future I will try to improve this model and post updated versions. Until next time!
