Text-to-speech. Analysis of open speech synthesis solutions
In my half-Kalmyk opinion, it sounds great. You can hear the specificity of Kalmyk speech, and the voice itself is Kalmyk, whatever that means.
Silero supports generation at three sampling rates: 8,000, 24,000, and 48,000 Hz. For comparison:
Version 4 of the model offers five voices (aidar, baya, kseniya, xenia, eugene) plus a random voice, which can be saved and reused for later generation. In addition, Silero also provides an SSML model for stress and prosody placement.
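A minimal sketch of how the voice and sampling-rate options above might be used from Python. It assumes the `silero_tts` entry point of the `snakers4/silero-models` repository on `torch.hub` and the `v4_ru` model id; `check_request` and `synthesize` are hypothetical helper names, and the model is only downloaded when `synthesize` is called.

```python
# Options listed in the text for the v4 Russian model.
SAMPLE_RATES = (8000, 24000, 48000)
VOICES = ("aidar", "baya", "kseniya", "xenia", "eugene", "random")


def check_request(speaker: str, sample_rate: int) -> None:
    """Raise ValueError if the voice or sampling rate is not one listed above."""
    if speaker not in VOICES:
        raise ValueError(f"unknown voice {speaker!r}; expected one of {VOICES}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(
            f"unsupported sampling rate {sample_rate}; expected one of {SAMPLE_RATES}"
        )


def synthesize(text: str, speaker: str = "xenia", sample_rate: int = 48000):
    """Return a 1-D audio tensor for `text` (hypothetical helper)."""
    check_request(speaker, sample_rate)
    import torch  # imported lazily so the checks above work without PyTorch

    # Assumption: hub entry point and model id as published in silero-models.
    model, _ = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_tts",
        language="ru",
        speaker="v4_ru",
    )
    return model.apply_tts(text=text, speaker=speaker, sample_rate=sample_rate)
```

Validating the request before loading the model keeps a typo in the voice name from costing a full model download.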
Comparison of solutions
criterion | coqui (xtts) | silero | bark |
distinguishing features | voice generation from a reference sample | support for extended Cyrillic; speech with an accent (models trained on the voices of speakers of various nationalities in Russia and the CIS) | generative model that can also produce non-speech sounds (laughter, etc.); large community |
streaming | Yes | No | No |
support and documentation | website support has ended; the model is available on Hugging Face | documentation site | community on Discord |
model type | text-to-speech | text-to-speech | text-to-audio |
sampling rates | 24 kHz | 8, 24, 48 kHz | 24 kHz |
built-in voices | reference-based + 58 voices | 5 voices and random | 10 and random |
architecture | diffusion model based on GPT, Perceiver + HiFi-GAN | possibly the Tacotron family | GPT architecture |
nature of generation | autoregressive | – | end-to-end |
generation quality | good | good | poor |
fine-tuning | Yes | No | No* |
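To shortlist a candidate from the table programmatically, a few of its rows can be encoded as plain data and filtered. This is only an illustrative sketch: the `Solution` record and the `pick` helper are hypothetical names, the values are copied from the table above, and `fine_tuning` stands for the table's last row, i.e. whether the model can be further trained on your own data (Bark's "No*" is treated as `False` here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Solution:
    name: str
    streaming: bool
    fine_tuning: bool
    sample_rates_khz: tuple


# Values transcribed from the comparison table.
SOLUTIONS = (
    Solution("coqui (xtts)", streaming=True, fine_tuning=True, sample_rates_khz=(24,)),
    Solution("silero", streaming=False, fine_tuning=False, sample_rates_khz=(8, 24, 48)),
    Solution("bark", streaming=False, fine_tuning=False, sample_rates_khz=(24,)),
)


def pick(min_rate_khz=0, need_streaming=False, need_fine_tuning=False):
    """Return the names of solutions that satisfy every requested requirement."""
    return [
        s.name
        for s in SOLUTIONS
        if max(s.sample_rates_khz) >= min_rate_khz
        and (s.streaming or not need_streaming)
        and (s.fine_tuning or not need_fine_tuning)
    ]
```

For example, `pick(min_rate_khz=48)` narrows the list to the only entry that offers 48 kHz output, while `pick(need_streaming=True)` keeps only the entry with streaming support.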
Vosk-tts
A model based on VITS. Speech can be generated from both the command line and the Python API. There are five Russian voices (three female and two male). Output is at a sampling rate of 22 kHz, and generation is relatively fast.
As we hear, the model's intonation is fine. There is also a detailed pipeline for fine-tuning the model on a small amount of data, so if you have access to at least one GPU, you can try to train a speaker with your own voice.
Sova-tts
A solution based on the Tacotron 2 architecture. It is a REST API service that is quick to install. The interface is minimalistic; you can change the speed, pitch, and volume during generation. In theory, you can train your own voices and add them to the service. Generation examples:
Mimic3
A solution from Mycroft AI based on VITS. The Russian model is multi-speaker, with one female and two male voices. Notably, the solution can run on a Raspberry Pi. The generation quality for Russian is average. There is documentation, as well as the ability to add SSML markup to the text.
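As an illustration of the SSML markup mentioned above, the sketch below builds a small document with Python's standard library. Only generic SSML elements (`speak`, `s`, `break`) are used; which elements and attributes Mimic 3 actually honours should be checked against its documentation, and `build_ssml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_ssml(sentences, pause_ms=300):
    """Wrap each sentence in <s> and separate sentences with <break> pauses."""
    speak = ET.Element("speak")
    for i, sentence in enumerate(sentences):
        if i > 0:
            ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
        ET.SubElement(speak, "s").text = sentence
    # ElementTree escapes characters such as & and < in the sentence text.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml(["Привет, мир!", "Это проверка."])
```

Building the markup through an XML library rather than string concatenation keeps the document well-formed even when the input text contains special characters. The resulting string can then be handed to the engine through its SSML input mode.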
During generation, you can adjust the speech rate, among other settings. Example:
In general, the generator does not skip words and almost never distorts them: simple and tasteful. It makes mistakes in intonation and stress, and it also cuts off sounds at the end of a sentence.
Tip: at the moment, the models are not downloaded via Git LFS, so you won't be able to run them right away. With a little effort, though, you can find these models on Hugging Face.
Piper
Another solution that can run on a Raspberry Pi, from the developer of the previous project. For Russian there are three male voices and one female voice. Sampling rate: 22,050 Hz. The generation quality is about the same as Mimic 3, but Piper does not cut off the ends of sentences.
SeamlessM4T-v2
The model generates Russian speech and implements the TTS pipeline for Russian by taking English input text and translating it into Russian. Overall, a very interesting implementation. In Russian, the quality is below average and, as it seemed to me, an accent carries over.
Sampling rate: 16 kHz.
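Because the tools in this article output different sampling rates (16 kHz here, 22,050 Hz for Piper, up to 48 kHz for Silero), it can help to check what a generated file actually contains before comparing or mixing results. A minimal sketch with the standard `wave` module, assuming the output is an uncompressed PCM WAV file; `wav_info` and the file name are made up for illustration.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnframes() / rate


# Self-check: write one second of 16-bit mono silence at 16 kHz and read it back.
with wave.open("silence_16k.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
    wf.setframerate(16000)
    wf.writeframes(b"\x00\x00" * 16000)

rate, seconds = wav_info("silence_16k.wav")
```

If two engines report different rates, one of the files has to be resampled before a fair side-by-side listening test.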
CONCLUSION
Voice generation models come in many architectures, each with its own pros and cons. How usable a ready-made model is depends on the solution its authors chose and on its availability; pre-trained models are often difficult or impossible to fine-tune on your own. However, if the task is simple and will not require future improvements in quality, speed, or additional specifics, the projects presented above are potentially suitable for Russian speech generation.
P.S. If you know of good projects for generating Russian speech that are not covered in this article, please share links to them in the comments.