Hi! We recently presented a project called “AI da Pushkin” (“AI and Pushkin”). Powered by neural networks, the poet generates quatrains from the opening words a user suggests, then reads the result aloud. You can see and hear how this happens on the project website. In this article we explain how we solved the lack-of-rhyme problem using controlled text generation, and which technologies made the project work.
How the product works
“AI da Pushkin” is an experimental project. We wanted to show users and the market how AI technologies work at Tinkoff, and at the same time test two new technologies on a large audience. The service works like this: the user comes up with the opening of a poem, and the neuropoet produces a quatrain with rhythm and rhyme that fits the given context. The resulting poem can then be read aloud, and at every stage the user sees an animation of the poet’s face.
All of this happens thanks to three Tinkoff technologies. A generative poetry model creates the quatrains, VoiceKit voices them, and Thara, our technology for facial-expression synthesis and portrait animation, animates the poet’s portrait. Our main task was to combine the NLP and Thara technologies into a full-fledged character with good speech synthesis and high-quality animation. In this article we cover poem generation in detail, and then briefly describe how VoiceKit and Thara were used.
How to teach a model to rhyme when it doesn’t want to: we try two approaches and choose the better one
The technology behind our generative dialogue model is already familiar to users of the Tinkoff mobile app. The app previously had a chat skill built on it, which we used to test demand for such a feature. The model was good at generating replies to users, but it could not compose poetry, nor was it supposed to. We had to train it.
To do this, we collected a dataset. So that the model would have a large vocabulary and be comfortable with contemporary topics, the training sample included not only Pushkin’s poems but also works by modern authors. We ended up with about 60 million poems gathered from open sources.
Training the model on this dataset gave an intermediate result that did not satisfy us. Firstly, we had missed some issues when cleaning the dataset, and the model picked up bad habits: it generated quatrains with obscene words, foreign-language words, and strange characters. Here is an example of such a poem:
I don’t know what will happen to us.
I don’t know how fate will turn out.
Secondly, the output could hardly be called poetry. We taught the model to handle rhythm and rhyme using a set of rules rather than just stressed vowels, but in the first iteration it never learned to rhyme. Here is an example of a poem without rhyme:
And I’ll sit in a convertible and fly away.
Where? Don’t know. I know one thing
That I will be where no one is waiting for me.
And this place is called paradise.
The first problem was easy to solve: we cleaned the dataset again, removing about 10 million unusable poems. The lack of rhyme was harder to fix. We had never penalized the model for ignoring rhyme, so it had no incentive to produce real poetry. We did some research, studied papers and existing methods, and developed two approaches of our own. Let’s look at both in detail.
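To give a sense of what this round of cleaning involved, here is a minimal, hypothetical filtering sketch. The banned-word list and the allowed-character set below are illustrative assumptions; the real filters were far more extensive.

```python
import re

# Hypothetical word list; the real one was much larger.
BANNED_WORDS = {"badword"}

# Allow Cyrillic and basic Latin letters, digits, and common punctuation.
ALLOWED = re.compile(r"^[А-Яа-яЁёA-Za-z0-9\s.,;:!?\-'\"()]+$")

def is_clean(poem: str) -> bool:
    """Reject poems containing banned words or unexpected characters."""
    if not ALLOWED.match(poem):
        return False
    words = {w.lower().strip(".,;:!?") for w in poem.split()}
    return words.isdisjoint(BANNED_WORDS)

def clean_dataset(poems):
    """Keep only the poems that pass all filters."""
    return [p for p in poems if is_clean(p)]
```

Filters like these are cheap to run over tens of millions of poems and remove both the strange-character and obscenity failure modes at once.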
First approach. You can read more about it in the paper “Recipes for Building an Open-Domain Chatbot”. The approach was invented to improve conversational bots: originally it taught a bot to stay within the context of a dialogue and reply with more informative, useful remarks. We modified it so that our model would understand what rhyme is and produce rhyming lines. Note that both of our approaches used the GPT-2 neural network.
We split the lines of the poems into tokens and deliberately showed the model rhyme patterns. At the beginning of each line, in 20% of cases we inserted the word that ended the previous line, and in 80% of cases a word that rhymes with it. The model thus learned that the last word of a line must rhyme with the hint at its start. During inference we searched for the rhymes ourselves and fed them to the model as hints. Below are examples of poems with hints:
A storm covers the sky with mist,
Whirlwinds of snow twisting;
[Моет] (“washes”) Like a beast she will howl
A storm covers the sky with mist,
Whirlwinds of snow twisting;
[Завоет] (“it will howl”) Like a beast, she will howl
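The hint-insertion step can be sketched as a small data-augmentation function. This is our minimal reconstruction: we assume hints are drawn from the immediately preceding line, whereas the real data follows the quatrain’s rhyme scheme, and `find_rhyme` stands in for a rhyme dictionary.

```python
import random

def add_rhyme_hints(poem_lines, find_rhyme, p_copy=0.2, seed=None):
    """Prepend a bracketed hint word to each line after the first.

    With probability p_copy the hint is the previous line's last word;
    otherwise it is a word that rhymes with it (via `find_rhyme`).
    """
    rng = random.Random(seed)
    hinted = [poem_lines[0]]
    for prev, line in zip(poem_lines, poem_lines[1:]):
        last_word = prev.rstrip(".,;:!?").split()[-1]
        hint = last_word if rng.random() < p_copy else find_rhyme(last_word)
        hinted.append(f"[{hint}] {line}")
    return hinted
```

The 20/80 split means the model occasionally sees the trivial case (the same word repeated), which teaches it that the hint and the line ending belong to the same rhyme class rather than being identical.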
To make the approach work, we wrote a dedicated rhyme-search module. During generation, the module offered the model candidate rhymes: the model could accept a rhyme we proposed or offer its own based on ours. This approach lived for a while as a Telegram bot that composed poems, and beta testers liked it.
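We haven’t described exactly how the module’s suggestions reach the model; one common way to implement this kind of soft hinting is to bias token scores before sampling, so that the proposed rhymes become likely but not mandatory. A toy sketch under that assumption (not the actual implementation):

```python
def apply_rhyme_bias(scores, rhyme_candidates, boost=5.0):
    """scores: token -> model score for the line-final word.

    Add `boost` to the tokens the rhyme module proposed, so the model
    prefers them but can still choose a different ending on its own.
    """
    return {tok: s + (boost if tok in rhyme_candidates else 0.0)
            for tok, s in scores.items()}
```

With a large boost the hint almost always wins; a smaller boost leaves the model room to propose its own ending, matching the accept-or-override behavior described above.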
Second approach. Here we trained the model (again GPT-2) to predict the next word and additionally supplied information about the line number in this format: @LINE_1@ the storm covers the sky with mist @LINE_2@.
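Formatting a training example with these control tokens is straightforward. A minimal sketch (the token format is taken from the text above; the helper itself is our assumption):

```python
def format_with_line_tokens(lines):
    """Join poem lines, prefixing each with its line-number token,
    e.g. '@LINE_1@ first line @LINE_2@ second line'."""
    return " ".join(f"@LINE_{i}@ {line.strip()}"
                    for i, line in enumerate(lines, start=1))
```

Giving the model explicit line numbers lets it learn position-dependent structure, such as where a quatrain ends and which lines are expected to rhyme.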
We then turned to the rhyme problem. We knew that if we generated a single line, the probability of getting a rhyme was about 50%. With ten candidate lines, the probability that at least one of them rhymes approaches 100% (1 − 0.5¹⁰ ≈ 99.9%, assuming independent attempts). So we decided to generate many lines and choose the best one: the one that rhymes.
At first we considered writing a separate classifier to select the best option. But after inspecting the model’s output we realized it wasn’t necessary, and rolling out another model would have slowed the project down. Everything now runs on heuristics: they do an excellent job of choosing the best line without slowing down the pipeline.
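We won’t reproduce the heuristics themselves here, so below is a deliberately crude stand-in: generate candidates, then keep the first one whose ending rhymes with the target line, where “rhymes” is approximated by a shared word suffix. The suffix comparison is our illustrative assumption; a production check would be phonetic.

```python
def ends_rhyme(a, b, suffix_len=3):
    """Crude rhyme test: compare the trailing letters of the last words.
    (A real heuristic would compare phonetic endings, not spelling.)"""
    wa = a.rstrip(".,;:!?").split()[-1].lower()
    wb = b.rstrip(".,;:!?").split()[-1].lower()
    return wa != wb and wa[-suffix_len:] == wb[-suffix_len:]

def pick_rhyming_line(target_line, candidates):
    """Return the first candidate whose ending rhymes with target_line,
    or None if no candidate rhymes."""
    for cand in candidates:
        if ends_rhyme(target_line, cand):
            return cand
    return None
```

A `None` result simply means generating another batch of candidates, which is cheap compared to running a second model.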
Comparison of approaches. Both approaches produced rhyming poems. To summarize: in the first approach, we controlled generation by teaching the model to follow the rhyme that the rhyme-search module supplied for each line. In the second, control came from the observation that among ten generated candidates a rhyming line appears very often, so we effectively steer generation by picking the desired line each time. To choose the better approach, we sent the generation results to Toloka: people were shown two quatrains and asked to pick the better one. This is what the task looked like on Toloka.
As a result, the second approach won: most Toloka users preferred the quatrains it produced.
VoiceKit and Thara: how our neuropoet got a voice and a “living” portrait
We have devoted most of this article to controlled poem generation, but “AI da Pushkin” as a product would not have been possible without speech synthesis and the animation of the poet’s portrait.
The VoiceKit team gave the neuropoet a voice. VoiceKit is Tinkoff’s family of speech recognition and synthesis technologies; you can learn more about it online. We went through the same path as any external client: we chose a voice that seemed to suit the poet and started uploading texts to the platform. For “AI da Pushkin” we used ready-made solutions, so we won’t describe how they work in detail here. You can learn how speech synthesis works at Tinkoff from this video. If you still have questions, share them in the comments and we will try to answer them in future articles.
Thara is the technology that brought the poet’s appearance to life. It is relatively young for Tinkoff: it was created by an experimental-products team of the same name, so we can’t talk about what is under the hood yet. But we can describe the overall animation process.
Thara can work with any appearance, so we started by creating a portrait of Pushkin. The illustrator received a detailed brief that we put together with all the technical constraints of the future product in mind. We also gave the artist access to the technology so that he could check how the animation behaved as he worked. The portrait took a long time to draw: we wanted our Pushkin to be recognizable yet modern, while also meeting users’ expectations of what a neuropoet should look like. Several versions of the portrait resulted; after gathering user opinions and voting within the team, we chose the final one.
Then Thara stepped in. The technology breaks down into two components: encoding the audio and decoding it into an image. Similar solutions exist on the market, but most are tailored to specific characters, and they cannot synthesize both quickly and with high quality. Many have problems with movement: when the head turns, artifacts appear and the animation starts to look unnatural, sometimes even frightening. With Thara we managed to avoid this by refining the technology: the model convincingly reconstructs the ears, jaw, and even hair in motion.
Expressive lip movement was a separate task. It was important to us that the poet say exactly what the poetry model generated, so we had to achieve tight synchronization of all the project’s components: a real product challenge for the teams. We used a model that solves the lip-sync problem, namely a modified version of the wav2lip network that runs faster than the original. We needed it to work with a minimum of artifacts around the lips and teeth, without distorting the lip shape, lip color, or skin texture, which required a lot of work on correctly rendering the lips and teeth in the portrait of Pushkin that we feed into the model. The process now looks like this: first, we capture head and eye movements from a real person and transfer them onto the portrait of our Pushkin using the First Order Motion Model. Then the modified wav2lip applies animated lips to the video for the spoken speech, and on top of that a modified version of the GPEN model refines the texture and overall style of the video.
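To make the ordering of the stages concrete, here is a toy, runnable sketch of the pipeline. Every function below is a hypothetical stand-in that merely records which stage ran; in the real system each stage is a neural model (First Order Motion Model, the modified wav2lip, and the modified GPEN).

```python
def transfer_motion(portrait, driving_video):
    # Stage 1 (First Order Motion Model): copy head and eye motion
    # from the driving video of a real person onto the portrait.
    return {"frames": portrait, "motion": driving_video, "stages": ["motion"]}

def sync_lips(video, audio):
    # Stage 2 (modified wav2lip): redraw the lip region to match the audio.
    video["audio"] = audio
    video["stages"].append("lipsync")
    return video

def enhance_frames(video):
    # Stage 3 (modified GPEN): restore texture and overall style
    # after the earlier edits.
    video["stages"].append("enhance")
    return video

def animate_portrait(portrait, driving_video, speech_audio):
    """Run the three stages in the order described above."""
    moving = transfer_motion(portrait, driving_video)
    lipsynced = sync_lips(moving, speech_audio)
    return enhance_frames(lipsynced)
```

The ordering matters: motion transfer and lip editing introduce the artifacts that the final enhancement stage is there to clean up.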
We also taught “AI da Pushkin” emotions. The Thara team recorded several videos with the required emotions, such as an eye roll or a smile, and the model learned to read them and transfer them to the neuropoet’s facial expressions.
All of this happens quickly: the animation runs at 20–25 FPS (frames per second). “AI da Pushkin” currently works in HD resolution while taking up little GPU capacity.
As a result, we got a technically interesting product that about 30 people from different teams worked on. Hundreds of thousands of users have composed poems with “AI da Pushkin”, generating 3.7 poems each on average and more than 800 thousand poems in total. Interestingly, users preferred to enter their own opening text rather than use our prompts. After generating a poem, they sent friends a link to the project and screenshots of the results, and this became our main channel for attracting new users. We took it as a sign that the audience liked the project.
We are now preparing to bring the technologies behind “AI da Pushkin” into other Tinkoff services and products. For example, controlled generation, which we discussed in detail in this article, will help improve the quality of answers from Oleg, our financial assistant. We will try to cover the other technologies and how they work in future blog posts.