Text-to-speech. Analysis of open speech synthesis solutions

Silero ships as a repository of pretrained models. The list of models is extremely extensive; the presence and sheer number of models for languages that use their own variants of the Cyrillic alphabet, for example Uzbek or Kalmyk, is impressive. Here is an example of generating Kalmyk speech.

In my half-Kalmyk opinion, it sounds great. You can hear the distinctive character of Kalmyk speech, and the voice itself sounds Kalmyk, whatever that means.

Silero supports generation at 3 sampling rates (8,000, 24,000, and 48,000 Hz). For comparison:

The 4th version of the model offers 5 voices (aidar, baya, kseniya, xenia, eugene) plus a random voice, which can be saved and reused for further generation. In addition, Silero supports SSML markup for controlling stress and prosody.
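For reference, here is a minimal sketch of using Silero through torch.hub, assuming the v4 Russian model and the API described in the snakers4/silero-models README (verify the model and speaker names against the current repository):

```python
import torch
import torchaudio

# Load the v4 Russian Silero TTS model from the snakers4/silero-models repo
model, example_text = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_tts",
    language="ru",
    speaker="v4_ru",
)
model.to(torch.device("cpu"))

# Synthesize with one of the built-in voices at 48 kHz
audio = model.apply_tts(
    text="Привет, это тестовая генерация речи.",
    speaker="xenia",
    sample_rate=48000,
    put_accent=True,  # automatic stress placement
    put_yo=True,      # restore the letter "ё"
)

# `audio` is a 1-D float tensor; save it as a WAV file
torchaudio.save("silero_out.wav", audio.unsqueeze(0), 48000)
```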

Comparison of solutions

| | coqui (XTTS) | silero | bark |
| --- | --- | --- | --- |
| Distinctive features | voice cloning from a reference recording (see the sketch below the table) | support for extended Cyrillic; generation of accented speech (models trained on the voices of speakers of various nationalities in Russia and the CIS) | generative model that can produce non-speech sounds (laughter, etc.); large community |
| Streaming | Yes | No | No |
| Support and documentation | official support has ended; the model is available on Hugging Face | documentation site | community on Discord |
| Model type | text-to-speech | text-to-speech | text-to-audio |
| Sampling rates | 24 kHz | 8, 24, 48 kHz | 24 kHz |
| Built-in voices | cloning from a reference + 58 voices | 5 voices and a random voice | 10 voices and a random voice |
| Architecture | GPT-based diffusion model, Perceiver + HiFi-GAN | presumably the Tacotron family | GPT architecture |
| Nature of generation | autoregressive | end-to-end | — |
| Generation quality | good | good | poor |
| Fine-tuning | Yes | No | No* |
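To illustrate the reference-based cloning noted for coqui (XTTS) in the table, here is a minimal sketch using the Coqui TTS Python package; the model identifier follows its documentation, and `reference.wav` is a placeholder for your own recording:

```python
from TTS.api import TTS

# Load the multilingual XTTS v2 model (weights are downloaded on first run)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Clone the voice from a short reference recording and synthesize Russian speech
tts.tts_to_file(
    text="Привет! Это пример клонирования голоса по референсной записи.",
    speaker_wav="reference.wav",  # placeholder: path to your reference audio
    language="ru",
    file_path="xtts_out.wav",
)
```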

Vosk-tts

A model based on VITS. Generation is available both from the command line and through the Python API. There are 5 Russian voices (3 female and 2 male). It generates at a sampling rate of 22 kHz and does so relatively quickly.
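A minimal sketch of the Python API, assuming the `vosk_tts` package and one of the published Russian multi-speaker models (check the exact model name and version against the project's model list):

```python
from vosk_tts import Model, Synth

# Load a Russian multi-speaker model (name/version is an assumption; see the model list)
model = Model(model_name="vosk-model-tts-ru-0.7-multi")
synth = Synth(model)

# Synthesize to a WAV file with one of the built-in speakers
synth.synth("Привет, мир!", "vosk_out.wav", speaker_id=2)
```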

As you can hear, the model handles intonation well. In addition, there is a detailed pipeline for fine-tuning the model on a small amount of data, so if you have access to at least one GPU, you can try training a synthesizer with your own voice.

Sova-tts

A solution based on the Tacotron 2 architecture. It ships as a REST API service and is quick to install. The interface is minimalistic; you can change the speed, pitch, and volume during generation. In theory, you can train your own voices and add them to the service. Generation examples:
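Since Sova-tts runs as a REST service, usage boils down to an HTTP request. The sketch below is purely illustrative: the port, endpoint path, field names, and response handling are hypothetical placeholders rather than the actual Sova API, so check the project's README for the real contract:

```python
import requests

# Hypothetical call to a locally running Sova-tts instance.
# The URL, endpoint, and JSON fields below are placeholders, not the real API.
response = requests.post(
    "http://localhost:8898/synthesize",  # placeholder endpoint
    json={
        "text": "Проверка синтеза речи.",
        "voice": "natasha",  # placeholder voice name
        "rate": 1.0,         # speed
        "pitch": 1.0,        # pitch
        "volume": 1.0,       # volume
    },
    timeout=60,
)
response.raise_for_status()

# Assumes the service returns raw WAV bytes; adjust to the real response format
with open("sova_out.wav", "wb") as f:
    f.write(response.content)
```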

Mimic3

A solution from Mycroft AI based on VITS. The Russian-language model is multilingual, with 1 female and 2 male voices. An interesting detail: the solution can be deployed on a Raspberry Pi. The generation quality for Russian is average. There is documentation, as well as support for SSML markup in the text.

During generation you can adjust the speaking rate and other parameters. Example:

Overall, the generator does not drop words and hardly distorts them: simple and tasteful. It does, however, make mistakes in intonation and stress, and it also cuts off sounds at the end of a sentence.
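For reference, a sketch of calling the mimic3 CLI from Python; the voice and speaker name `ru_RU/multi_low#hajdurova` is my recollection of the Russian multilingual voice, so verify it with `mimic3 --voices`:

```python
import subprocess

# Invoke the mimic3 CLI; it writes WAV data to stdout, which we redirect to a file.
# --length-scale > 1.0 slows the speech down, < 1.0 speeds it up.
text = "Проверка синтеза речи."
with open("mimic3_out.wav", "wb") as out:
    subprocess.run(
        ["mimic3",
         "--voice", "ru_RU/multi_low#hajdurova",  # assumed Russian voice/speaker
         "--length-scale", "1.1",
         text],
        stdout=out,
        check=True,
    )
```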

Hint: at the moment the models are not downloaded via Git LFS, so you won't be able to run them right away. But if you spend a little time, you can find these models on Hugging Face.

Piper

Another solution that can run on a Raspberry Pi, from the developer of the previous project. For Russian there are 3 male voices and 1 female voice. Sampling rate: 22,050 Hz. The generation quality is about the same as Mimic 3 (but, unlike it, Piper does not cut off the ends of sentences).
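A minimal sketch of driving the piper CLI from Python, feeding text on stdin as its documentation shows; `ru_RU-irina-medium` is one of the published Russian voices as far as I recall, so substitute whichever voice file you actually downloaded:

```python
import subprocess

# Invoke the piper CLI: text comes in on stdin, audio is written to a WAV file
text = "Проверка синтеза речи."
subprocess.run(
    ["piper",
     "--model", "ru_RU-irina-medium.onnx",  # assumed Russian voice file
     "--output_file", "piper_out.wav"],
    input=text.encode("utf-8"),
    check=True,
)
```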

SeamlessM4T-v2

The model can generate Russian speech directly, and it also implements a TTS pipeline for Russian by generating from source English text with subsequent translation into Russian. Overall, a very interesting implementation. For Russian, the quality is below average and, as it seemed to me, an accent carries over into the output.

Sampling rate: 16 kHz.
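A minimal sketch of direct Russian generation through the Hugging Face transformers API (the model name and calls follow the SeamlessM4T-v2 documentation as I understand it; adjust as needed):

```python
import scipy.io.wavfile
from transformers import AutoProcessor, SeamlessM4Tv2Model

# Load the model and processor from Hugging Face (large download on first run)
processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

# Direct generation: Russian text in, Russian speech out
inputs = processor(text="Проверка синтеза речи.", src_lang="rus", return_tensors="pt")
audio = model.generate(**inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()

# The model outputs audio at a 16 kHz sampling rate
scipy.io.wavfile.write("seamless_out.wav", rate=16000, data=audio)
```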

CONCLUSION

Speech generation models come in many architectural implementations, each with its own pros and cons. Whether you can reuse a model someone else has already built depends on how it is implemented and whether it is available; pre-trained models are often difficult or impossible to fine-tune on your own. However, if the task is straightforward and will not require future improvements in quality, speed, or other specific properties, then the projects presented above are potentially suitable for Russian speech generation.

P.S. If you know of good projects for generating Russian speech that were not covered in this article, feel free to share links to them in the comments.
