What it is, why it matters, and how it works

Some Voicebox clients want the bot to speak in a special voice. Well, if you need that, you can have it, and soon adaptive synthesis will be available to everyone. In the meantime, we are experimenting with voices in test mode, and there is something I want to share with you in this article.

What is adaptive synthesis

Adaptive synthesis refers to something simple: generating a voice from provided speech samples. Anyone who wants the bot to speak in their voice records a certain number of phrases, and the program builds the bot's voice from them. Some of the words are variables, that is, replaceable parts. These variables, which appear in the bot's speech, are synthesized by the program based on the recorded phrases, so that the generated voice sounds almost indistinguishable from the original.

Why is it important

The main problem with robots is that people do not want to talk to them. Once most people realize they are speaking with a robot, they hang up. One way around this is simply to have a person record all the necessary phrases. But what if there are hundreds or thousands of them? Recording just the first names and patronymics alone would take a long time, and there are also amounts, goods, and addresses, as in the previous example.

This is where adaptive synthesis comes to the rescue, generating the variables in the same voice as the announcer. As a result, the robot becomes much harder to recognize, as the examples below will show, and the number of hang-ups drops significantly. In addition, there is no need to record thousands of words and phrases, which also benefits the customer.

Yes, the development of AI has produced a wide range of text-to-speech (TTS) tools, but they are no longer an innovation. At best, TTS generators like Murf.ai, Beyondwords, Play.ht, Lyrebird AI, Lovo.ai, and Speechify help with building voice assistants and voicing text; making the speech more human with such tools is not possible.

ChatGPT, DALL-E and VALL-E

After the ChatGPT boom, some enthusiasts published guides on adding voiceover features to it. And, of course, the appearance of VALL-E was only a matter of time. This is a Microsoft tool for the same TTS task, but one that can imitate a human voice. Reportedly, a three-second recording of someone's voice is enough for it to reproduce that voice, turning any written text into speech with realistic intonation and emotion.

The service was announced in January but is not yet publicly available, although it will likely be able to voice any text in any voice.

VALL-E is based on the EnCodec technology, which was introduced in October 2022. An unofficial PyTorch implementation of VALL-E built on this tokenizer is already available on GitHub.

Unlike other tools, VALL-E generates discrete audio codec codes from phoneme and acoustic prompts. The technology can also be combined with GPT-3. In essence, VALL-E analyzes a recording of a person's voice, EnCodec breaks it down into discrete components ("tokens"), and the model, drawing on its training data, tries to render other phrases in the same voice.
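
To make the idea of discrete "tokens" concrete, here is a minimal sketch using the open-source encodec package. This is our own illustration, not code from VALL-E itself, and the file name sample.wav is just a placeholder for a short voice recording:

```python
# pip install encodec torchaudio
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the 24 kHz EnCodec model and pick a target bandwidth;
# higher bandwidth means more codebooks per audio frame.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)

# "sample.wav" stands in for a short recording of the speaker.
wav, sr = torchaudio.load("sample.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

# Encode the waveform into discrete codec tokens.
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))

# codes has shape [batch, n_codebooks, n_frames]; these integer tokens
# are what a language-model-style decoder learns to predict.
codes = torch.cat([frame_codes for frame_codes, _ in encoded_frames], dim=-1)
print(codes.shape)
```

The reverse call, model.decode, turns such frames back into a waveform, which is roughly how generated codes become audible speech.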

How it works for Voicebox

We are already working on bringing such solutions into business. For adaptive synthesis in Voicebox we chose a promising technology, Brand Voice Call Center. Its advantage over the alternatives is that speech is generated as a whole rather than glued together from pre-recorded templates and a variable part. At the same time, it handles intonation better, which makes the speech sound more lively.

There is, of course, a small limitation: synthesis requires short texts. Phrases should be split up so that each stays under roughly 24 seconds of audio, and a phrase should not exceed 250 characters, including the variable part.
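
Purely as an illustration of that limit, here is a tiny pre-flight check we might run on our side; the function name and example values are assumptions, not part of any SpeechKit API:

```python
# Hypothetical pre-flight check: the 250-character limit applies to the
# whole phrase, including the substituted variable part.
MAX_CHARS = 250

def check_phrase(template: str, **variables: str) -> str:
    """Fill the template and fail loudly if the resulting phrase is too long."""
    phrase = template.format(**variables)
    if len(phrase) > MAX_CHARS:
        raise ValueError(
            f"Phrase is {len(phrase)} characters long; split it so each part "
            f"stays under {MAX_CHARS} characters (and roughly 24 seconds of audio)."
        )
    return phrase

# Example: a long product list in {order} could push a phrase over the limit.
check_phrase("The order includes the following items: {order}.", order="a sweater")
```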

In our case, a few phrases are enough for the robot. The SpeechKit Brand Voice model copies the voice from a template (an audio file in which the announcer says a certain phrase) and voices the variable part. The result is whole synthesized sentences instead of the announcer's voice glued to the standard robot voice.

Now let me show you how adaptive synthesis works, and then I’ll briefly explain why it’s so important for companies.

Examples

Let's take a couple of the scenarios described in our previous articles and see how the bot synthesizes the variables. The synthesized recordings are available via the link.

Scenario #1. Secretary

We took the following phrases for voicing:

  • Hello, please introduce yourself! Nice to meet you, {name}! Would you like to leave a message for the director? I'm listening, {name}, go ahead!

We synthesized several options for the {name} variable.

Scenario #2. Online store manager

We took the following phrases for voicing:

  • Good afternoon, {name}! You have placed an order with us for {amount} rubles. The order includes the following items: {order}.

  • We will deliver the order to {address}. Thank you, {name}! The order will be delivered to you on {day} at {time}.

We synthesized the following options for several of the variables:

  • {name} 1) Ilya Yuryevich 2) Andrey Petrovich

  • {amount} 1) seven thousand eight hundred 2) nineteen thousand five hundred

  • {order} 1) a sweater 2) sneakers

  • {address} 1) Lenina Street, building one, apartment two 2) Schastlivaya Street, building five, apartment twelve

  • {day} 1) the first of April 2) the fifth of December

  • {time} 1) ten o'clock 2) eight in the evening

In the same way, the robot can voice any other words and phrases written into the variables, whether in advance or during the dialogue.
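
To illustrate what the synthesis step is given, here is a rough sketch of how the Scenario #2 templates and variable values above could be represented; the structure and names are our own illustration, not the Voicebox or SpeechKit API:

```python
# Templates from Scenario #2; the placeholders are the variable parts
# that adaptive synthesis voices in the announcer's voice.
TEMPLATES = [
    "Good afternoon, {name}! You have placed an order with us for {amount} rubles. "
    "The order includes the following items: {order}.",
    "We will deliver the order to {address}. Thank you, {name}! "
    "The order will be delivered to you on {day} at {time}.",
]

# Variable values are written out as words, the way they should be spoken.
ORDER = {
    "name": "Ilya Yuryevich",
    "amount": "seven thousand eight hundred",
    "order": "a sweater",
    "address": "Lenina Street, building one, apartment two",
    "day": "the first of April",
    "time": "ten o'clock",
}

# Fill each template with this order's values; the resulting strings are
# what the model voices as whole sentences, not as glued-together fragments.
for template in TEMPLATES:
    print(template.format(**ORDER))
```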

Instead of a conclusion

So, what do we get using adaptive synthesis technology? Two important things:

  • Personalization. Adaptive speech synthesis makes it possible to generate phrases that are unique to each client. Thanks to this technology, the bot can address everyone by first name and patronymic, describe the order, and name the delivery time. Often a person will not even notice they are talking to a robot, which makes the conversation more natural and pleasant and, of course, reduces the number of hang-ups.

  • Cost reduction. Adaptive speech synthesis can significantly reduce the cost of voicing content for a store or service. It lets you quickly generate natural-sounding speech with no restrictions on the number of catalog items, names, and so on. Imagine how long it would take an announcer to voice thousands of phrases; with adaptive synthesis, recording a few speech samples is enough.

Thus, adaptive synthesis increases customer loyalty and helps significantly reduce costs, especially in terms of time.

Author: Roman Andreev
