What is TTS technology, how does it work and in what areas is speech synthesis used?

Speech synthesis is a technology that converts written text into an audio signal. The program analyzes words and creates sounds that imitate the human voice.

The technology goes by different names: speech generation, Text-to-Voice (T2V), Text-to-Speech (TTS), but the essence is the same. The main purpose of TTS is to voice text so that it can be listened to instead of read.

Together with Grigory Sterling, the leader of the TTS team at SberDevices, we will understand how the technology works, how speech synthesizers are developed and what you need to know to work in this area.

How speech synthesis developed

The technology dates back to the 18th century, when the Hungarian inventor Wolfgang von Kempelen created a “talking machine”. It was a system of bellows, tubes and membranes that produced sounds reminiscent of speech.

An exact replica of the “talking machine”, created in 2007–2009 at Saarland University in Germany. Source

In 1937, engineer Homer Dudley of Bell Labs developed the Voder, the first electronic speech synthesis device based on electrical signals.

By the 1960s, TTS began to move into the digital age. One of the first computer systems, DECtalk, was based on digital speech synthesis methods developed at MIT. However, due to limited computing power, the synthesized speech sounded artificial. Only in the 1990s, with the advent of more powerful computers, was it possible to improve the quality of voices and bring them closer to human speech.

Two main approaches were used for speech synthesis:

  1. Concatenative synthesis. Based on connecting fragments of recorded speech (for example, individual sounds, syllables or words). They were “glued together” in the right order to create entire words and sentences.

    Although this method produced a more natural sound than earlier approaches, it had disadvantages: unnatural transitions between audio segments and high data acquisition costs.

    This type of synthesis began to become obsolete with the advent of neural network methods in the mid-2010s.

For example, suppose we have a recording where the announcer says «Привет» (“Hello”). The audio track is divided into pieces: individual sounds from which a different word, «Ветер» (“Wind”), can be assembled. The more data there is in our sound catalog, and the more appropriate the fragments inserted in each position, the better the result.

Grigory Sterling, leader of the TTS team at SberDevices

  2. Parametric synthesis. The synthesis problem is solved in two stages: an acoustic model and a vocoder. First, certain speech parameters, most often a spectrogram, are predicted from the text. The spectrogram is then turned into sound by another model, the vocoder.

    For a long time, this method produced “droning” and “ringing” audio. But in 2016 the generative model WaveNet, which used a parametric approach, was able to synthesize speech that sounds very close to human.

Neural networks have since completely replaced the old methods of speech synthesis. Although generation was quite slow in the early stages, neural models were the best at conveying natural intonation.

New models such as GPT-4o are able to imitate “human” behavior, for example humming tunes or cooing at dogs.

Grigory Sterling, leader of the TTS team at SberDevices

What are the capabilities of TTS technology?

Speech synthesis is not limited to simply converting text into audio. There are several related tasks that extend the capabilities of the technology:

  • voice cloning – reproducing a voice from a short sample, sometimes only a few seconds long;

  • emotional synthesis – with the addition of different emotional shades and styles;

  • multilingual synthesis – taking into account the characteristics of each language: grammar, intonation, stress, as well as complex phonetic rules;

  • dialect synthesis – taking into account regional characteristics, accents and dialects. This can be useful for creating more authentic voices, such as British, Australian or American accents;

  • personalized synthesis – creating a voice for specific needs and preferences. Unlike cloning, this synthesis can be based not on a sample of a specific person but on pre-selected voice characteristics (timbre, intonation, speaking rate). It is often used to voice characters in audiobooks;

  • whisper or shout synthesis – a separate task in which speech should be quiet (a whisper) or, conversely, louder and more emphatic. This requires special processing, since not only the acoustic parameters change but also the intonation.

Where is speech synthesis used?

Here are some of the main areas in which TTS is used:

  • Virtual assistants: for example, Salute from Sber, Siri from Apple, Alice from Yandex, which voice responses to the user.

  • Voicing audiobooks.

  • Navigation assistance in GPS systems.

  • Inclusive technologies for people with visual impairments (voiceover of interfaces, screen readers).

  • Contact centers and voice bots to automate communication with clients.

  • Videos, advertising and games, where lines need to be voiced.

With the help of synthesis technologies you can create not only speech but also music. For example, the startup Suno AI generates music in different genres, such as rap and pop. A Russian example is the SymFormer model for generating musical works, developed by SberDevices. It was trained on a dataset of 160 thousand compositions, from classical to modern electronic music and rock.

There are many examples of neural network music on YouTube and TikTok. Speech and music synthesis technologies are similar in that both use neural networks to convert discrete elements (letters or notes) into sound. But the specifics of the tasks differ: speech synthesis must take into account the biology of the vocal tract, while music synthesis relies on the theory and structure of compositions.

Grigory Sterling, leader of the TTS team at SberDevices

How does the speech synthesis process work?

The process includes several stages; let’s look at each of them in more detail.

Stage 1 – Text Normalization

The text that is fed into the model must first be brought into a normalized form: numbers and abbreviations must be expanded, and stress must be placed on every word.

  • Normalization. Numbers and abbreviations written in their usual form are expanded into full words, as they would be pronounced aloud. For example, “I was born in 1992 in St. Petersburg” is transformed into “I was born in nineteen ninety-two in the city of St. Petersburg.” This problem is usually solved with rules, but modern NLP technologies make it possible to use neural networks (a minimal code sketch follows this list).

  • Placing stress marks. Every word in a sentence must be assigned stress. Phonetic dictionaries are usually used for this, but rare or new words require a separate model. Russian also has homographs: words that are spelled the same but are stressed differently depending on context, for example «о́кна» (“windows”) versus «из окна́» (“out of the window”). There are open Python libraries for stress placement, for example ruaccent.

  • Some TTS engines require phonemization of the text, that is, rewriting letters as the sounds they represent. For example, «январь» (“January”) → “j a n v A r'”. A phonetic alphabet is a linguistic convention, and there is no single established one. It is recommended to use the IPA (International Phonetic Alphabet) and the open-source phonemizers that work with it.

  • Modeling prosody. Prosody is the intonation with which a text is pronounced: for example, which words receive logical stress (emphasis) and where pauses fall in speech. Speech synthesis is of higher quality and more controllable if a separate NLP model is responsible for predicting these parameters.
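
Below is a minimal, illustrative sketch of the normalization and stress-placement step. It assumes the third-party num2words package for number expansion and uses a toy stress dictionary; a production front end would rely on full rule sets or neural models and, for Russian, would also handle year forms, cases and homographs.

    import re

    # Assumption: the third-party num2words package is installed (pip install num2words);
    # it is used here only to illustrate number expansion.
    from num2words import num2words

    # Toy stress dictionary: word -> index of the stressed vowel (illustrative only).
    STRESS = {"hello": 1, "window": 0}

    def normalize(text: str) -> str:
        """Expand digits into words, as a TTS front end does before synthesis."""
        return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="en"), text)

    def mark_stress(word: str) -> str:
        """Mark the stressed vowel with an apostrophe before it, e.g. hell'o."""
        idx = STRESS.get(word.lower())
        if idx is None:
            return word  # unknown word: a real system falls back to a model
        vowel_positions = [i for i, ch in enumerate(word) if ch.lower() in "aeiouy"]
        if idx >= len(vowel_positions):
            return word
        pos = vowel_positions[idx]
        return word[:pos] + "'" + word[pos:]

    print(normalize("I was born in 1992"))  # digits become words
    print(mark_stress("hello"))             # -> hell'o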

Stage 2 – Converting Elements via a Text Encoder

An encoder is a component of a machine learning model that transforms input data (in this case, text) into an embedding: a numerical representation that preserves the meaning and context of the information.

The embedding also contains information about the voice, emotional context and stress.

Methods:

  • Recurrent neural networks (RNN). Models process text sequentially, passing information about previous elements at each step. This helps to better maintain context, especially in long texts.

    Standard RNNs can lose important connections over long distances, so improved versions are used, such as long short-term memory (LSTM) and gated recurrent units (GRU).

    An example of the application of such networks is the Tacotron 2 architecture, which uses sequential processing for speech synthesis.

  • Convolutional neural networks (CNN). CNNs are used in architectures like WaveNet to capture local dependencies such as phonemes and words, and they extract the features of the text that matter for further speech synthesis.

  • Transformers. Models such as BERT use the self-attention mechanism for parallel text processing. This way they can take into account both local and global dependencies in the data, which speeds up training and speech synthesis compared to RNNs (a minimal PyTorch sketch follows this list).
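
As an illustration of the encoder stage, here is a minimal PyTorch sketch of a transformer text encoder. It assumes the phonemes have already been mapped to integer IDs; real TTS encoders (for example in FastSpeech or Tacotron 2) add positional information, more layers and conditioning on speaker and style.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Toy transformer encoder: phoneme IDs -> contextual embeddings."""

        def __init__(self, vocab_size: int = 100, d_model: int = 256, n_layers: int = 4):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(
                d_model=d_model, nhead=4, dim_feedforward=1024, batch_first=True
            )
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

        def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
            # phoneme_ids: (batch, seq_len) -> (batch, seq_len, d_model)
            x = self.embed(phoneme_ids)
            return self.encoder(x)  # self-attention mixes local and global context

    ids = torch.randint(0, 100, (1, 12))  # a fake sequence of 12 phoneme IDs
    print(TextEncoder()(ids).shape)       # torch.Size([1, 12, 256])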

Stage 3 – Duration Prediction

A duration predictor is a component of speech synthesis models that is responsible for predicting the timing of each phoneme in the text. Essentially, it “decides” how long each sound should be pronounced so that the speech sounds natural.

The sequence of numbers (representation of phonemes) received from the encoder is “stretched” according to the predicted duration of each phoneme.

A speech synthesis process in which the model predicts the duration of each sound: first the encoder creates an intermediate representation, then the duration prediction module (Duration model) determines how many frames are required for each character and repeats this representation the required number of times. Source

This module is needed only for non-autoregressive models (FastSpeech, VITS) and is not needed for autoregressive ones (Tacotron 2, VALL-E). In Tacotron 2, the duration of phonemes is controlled directly through the attention mechanism: the model decides when to move from one phoneme to the next based on context.

Methods:

  • Transformers. Models such as FastSpeech predict the duration of phonemes using self-attention mechanisms. In FastSpeech 2, pitch and energy (loudness and intensity) are predicted in addition to duration, which makes the speech more expressive and natural (a small sketch of a duration predictor and length regulator follows).
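
A minimal sketch of a FastSpeech-style duration predictor and length regulator is shown below. The layer sizes are assumptions chosen for illustration; in real models the predictor is a convolutional stack trained on durations obtained from an aligner, and pitch and energy predictors are built in the same way.

    import torch
    import torch.nn as nn

    class DurationPredictor(nn.Module):
        """Toy predictor: one duration (in mel frames) per phoneme embedding."""

        def __init__(self, d_model: int = 256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
            )

        def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, d_model) -> integer frame counts (batch, seq_len)
            return torch.clamp(self.net(encoder_out).squeeze(-1), min=1).round().long()

    def length_regulate(encoder_out: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        """Repeat each phoneme embedding by its predicted number of frames (batch size 1)."""
        return torch.repeat_interleave(encoder_out[0], durations[0], dim=0).unsqueeze(0)

    enc = torch.randn(1, 5, 256)            # five phoneme embeddings from the encoder
    dur = DurationPredictor()(enc)          # predicted frame counts per phoneme
    print(length_regulate(enc, dur).shape)  # (1, total_frames, 256)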

Stage 4 – Generating a Visual Representation of the Sound Using a Decoder

The decoder is the part of the neural network that converts embeddings into a mel spectrogram. A mel spectrogram is a visual representation of an audio signal that shows its frequencies over time. It serves as the basis for the subsequent conversion into an audio signal.

Mel spectrogram of a melody, built using the melspectrogram() function from the librosa library (Python)
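
For illustration, a mel spectrogram can be computed with a few lines of librosa code; the file name here is a placeholder, and the STFT/mel parameters are just typical values.

    import librosa
    import numpy as np

    y, sr = librosa.load("speech.wav", sr=22050)  # waveform and sample rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as acoustic models usually expect
    print(mel_db.shape)                            # (80, number_of_frames)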

Methods: The same layers are used for decoding as for encoding: CNN, RNN and Transformers.

Stage 5 – Generating Sound Using a Vocoder

A vocoder is an algorithm that converts the mel spectrogram received from the decoder into the final audio signal. This is the last step in the speech synthesis process (a simple non-neural baseline is sketched after the list of methods below).

Methods:

  • Flow-based models. For example, the WaveGlow model uses flow-based generative modeling (normalizing flows) to convert mel spectrograms into high-quality audio signals.

  • Generative adversarial networks (GANs). Architectures such as MelGAN and HiFi-GAN use the GAN framework, in which one model generates sound and another evaluates its quality.
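
Neural vocoders are too large to sketch here, but the spectrogram-to-waveform step itself can be illustrated with the classic Griffin-Lim inversion available in librosa. This is only a non-neural baseline, not how MelGAN or HiFi-GAN work, and its quality is noticeably lower; the file path and parameters are placeholders.

    import librosa
    import soundfile as sf  # assumed installed for writing the result to disk

    # Compute a mel spectrogram, then invert it back to audio with Griffin-Lim.
    y, sr = librosa.load("speech.wav", sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

    audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
    sf.write("reconstructed.wav", audio, sr)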

Recently it has become popular to build end-to-end models in which all modules are connected and trained together. This improves quality but requires more computing resources.

What tools and technologies are used for speech synthesis?

Speech synthesizers are systems consisting of an encoder, decoder and vocoder. Some models cover the entire synthesis cycle, while others focus only on certain stages – acoustic processing or vocoding. Let's figure out how they differ.

Models for the full synthesis cycle

  1. VITS (Variational Inference Text-to-Speech). Uses a variational autoencoder and a GAN for end-to-end generation of the audio signal.

  2. GradTTS. One of the first models to apply denoising diffusion to the speech synthesis problem. In this approach, heavy noise is gradually converted into a clean mel spectrogram by repeated application of a “denoising” neural network. The result is then voiced with a vocoder.

  3. Tacotron 2 + WaveGlow. The input text is passed to the Tacotron 2 model, which encodes it into an embedding. The Tacotron 2 decoder then generates a mel spectrogram from this representation, and WaveGlow voices the spectrogram, converting it into an audio signal.

Models for individual synthesis steps

  1. FastSpeech 2. Performs encoding and decoding. It also uses a variance adaptor, which adds extra information to the embedding: duration, pitch and energy.

    Application: encoding, decoding.

  2. MelGAN. Consists of a generator and a discriminator that are trained together: the generator creates audio and tries to fool the discriminator, which learns to distinguish fakes from real audio signals.

    Application: vocoding.

  3. HiFi-GAN. Uses two types of discriminators to evaluate and improve the quality of the generated audio:

  • multiscale, which analyzes audio signals at different time scales;

  • multi-period, which focuses on periodic aspects of sound such as harmonics (unique timbre and emphasis) and cyclical patterns (rhythmic and tonal repetitions).

At the moment there is no single universal model for speech synthesis. The parametric synthesis algorithm Tacotron 2, popular since 2017, uses an acoustic model and a vocoder to convert text to speech. However, research groups are developing other approaches, such as VITS, GradTTS, FastSpeech, NaturalSpeech 2, MatchaTTS and VALL-E, each of which has its own characteristics and advantages.

Grigory Sterling, leader of the TTS team at SberDevices

How to develop speech synthesis

To write code for speech synthesis systems, the easiest way is to take open-source implementations of well-known architectures (FastSpeech, Tacotron 2 etc.).

To train models, pairs of text and corresponding audio are needed. These recordings must be made by professional announcers with the correct intonation and style. In commercial applications, you need to use licensed datasets or create your own.

Python is usually used to develop synthesis systems and to adapt data and models for specific purposes (a short usage sketch follows the list):

  • the librosa library is used to compute mel spectrograms;

  • the torchaudio library has almost everything you need to work with audio;

  • the YIN algorithm or the REAPER library is used to extract the fundamental frequency (pitch).
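
As a small usage sketch of these libraries (the file name is a placeholder): loading audio with torchaudio and extracting the fundamental frequency with librosa's implementation of the YIN algorithm.

    import librosa
    import torchaudio

    # Load audio with torchaudio: returns a (channels, samples) tensor and the sample rate.
    waveform, sample_rate = torchaudio.load("speech.wav")

    # Extract the fundamental frequency (pitch) with the YIN algorithm from librosa.
    y, sr = librosa.load("speech.wav", sr=22050)
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr)
    print(f0.shape)  # one F0 estimate per analysis frame, in Hz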

The TensorFlow and PyTorch frameworks are used to develop and train deep learning models. The scikit-learn library is also used for machine learning and data processing.

Improving the model

Any model can be improved in different ways:

  • Feature engineering (feature generation): creating new features from existing data, transforming and normalizing them in order to give the model more useful information for training.

  • Modules and loss functions: adding new layers to the model or changing its structure, as well as using loss functions that help the model cope with the task better.

  • Data preparation: improving data quality, labeling and increasing its volume. This includes cleaning up errors in the data and addressing problems with unbalanced samples so that the model is trained on a better set.

Performance Metrics

When assessing model performance, the Real Time Factor (RTF) metric is often used. It is the ratio of synthesis time to the duration of the generated audio; for example, an RTF of 0.1 means that 10 seconds of audio are generated in one second of real time.

In practice, an equally important metric is first chunk latency, the delay before playback starts, which we will look at in the example below.

Suppose there are two models that differ in streaming support and RTF:

  • RTF = 0.1: 10 seconds of audio are generated in one second, but the model does not support streaming. For a 10-second utterance, the user will hear the sound after one second.

  • RTF = 0.5: 2 seconds of audio are generated in one second, but the model supports streaming, which allows latency to be reduced to very small values, for example 100 ms. When the model runs in streaming mode, speech is generated as the text is processed: playback begins before all the text has been fully processed.

Even though the first model is more performant, users will prefer the second one because playback starts sooner.
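
To make the comparison concrete, here is a toy calculation under the assumptions above (a 10-second utterance and 200 ms streaming chunks), treating RTF as synthesis time divided by audio duration.

    # Toy latency comparison for the two models described above.
    audio_seconds = 10.0

    # Model A: RTF = 0.1, no streaming -> the user waits for the whole utterance.
    latency_a = 0.1 * audio_seconds        # 1.0 s before playback starts

    # Model B: RTF = 0.5 with streaming -> playback starts after the first chunk is ready.
    first_chunk_seconds = 0.2              # assumed chunk size: 200 ms of audio
    latency_b = 0.5 * first_chunk_seconds  # 0.1 s (100 ms) before playback starts

    print(f"Model A first-chunk latency: {latency_a:.2f} s")
    print(f"Model B first-chunk latency: {latency_b:.2f} s")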

Controllability of synthesis

Synthesis controllability is the ability of modern TTS systems to control intonation, stress and emotion so that speech sounds natural and conveys the desired meaning. For example, the phrase “Will you pay tomorrow?” may sound different depending on the context. In Russian, a special rising intonation contour is responsible for interrogative intonation. It is important to be able to control this in the speech synthesis engine in order to correct errors.

Tools for correcting errors and improving synthesis quality are also important, such as Resemble Enhance, which helps eliminate artifacts and improve the sound of speech.

The development of synthesis models is carried out mainly by startups and corporations specializing in speech synthesis, as well as research groups.

Full-fledged work requires large computing resources and investments in data collection and labeling. Many GPUs are needed not only for training but also in production, for running the models.

For small companies that need speech synthesis only as a way to interact with users, it is more profitable to use APIs from external vendors. Larger companies can deploy a TTS system in their own IT infrastructure.

Grigory Sterling, leader of the TTS team at SberDevices

What are the prospects for the development of speech synthesis?

The development of speech synthesis technologies depends on whether they are intended for businesses (B2B) or end users (B2C).

  • B2B (business-to-business)
    In the business segment, speech synthesis is often used in call centers. The main goal is to create an inexpensive system with good quality. A slightly “robotic” voice is sometimes acceptable, because it helps clients immediately understand that they are talking to a machine.

  • B2C (business-to-consumer)
    B2C industries, such as voice assistants and voice-overs, require high-quality speech. Users expect technology to perfectly imitate a live voice, add emotion, and instantly generate sound. Achieving this level of realism and quality requires a lot of work: experimentation, data collection, and model training. Often, such models require modern GPUs to operate, and the cost of synthesis is higher.

What you need to learn to work with Text-to-Speech

1. Python Basics

Building simple speech synthesis applications requires basic Python skills. It is important to be able to use libraries and ready-made code from open repositories.

  • Master key language constructs: variables, loops, functions.

  • Learn to work with deep learning libraries such as PyTorch or TensorFlow to create synthesis models.

2. Mathematics

Deep knowledge of mathematics will help you better understand the mechanisms of speech synthesis models.

  • Explore linear algebra to understand the operation of neural networks and data transformations.

  • Understand probability theory and statistics, to evaluate models and work with their predictions.

  • Master numerical methods to optimize models and improve synthesis accuracy.

3. Digital Signal Processing (DSP)

Signal processing is a very important skill for working with audio.

  • Learn to convert and analyze audio signals using digital signal processing (DSP) techniques.

  • Explore how sound waves can be represented as data for speech synthesis.

4. Deep Learning

Deep learning is at the core of modern TTS systems.

  • Explore neural network architectures such as CNNs and RNNs, which are widely used for data analysis and generation.

  • Explore specific architectures designed for speech synthesis, such as Tacotron 2, FastSpeech and VITS.

  • Explore modern approaches to generative neural networks, for example GANs, diffusion models, normalizing flows and LLMs.

5. Linguistics

Knowledge of linguistics helps create more natural speech synthesis.

  • Learn the basics of phonetics, including phonemes, intonation and rhythm.

  • Understand stress, pauses and speech rhythm to create more natural voices.

6. Physics of sound and biology

Understanding the physics of sound and the biology of hearing is important for producing high-quality speech.

  • Learn the basics of acoustics: how sound waves are formed, transmitted and perceived.

  • Understand how the vocal cords work and how different sound signals are perceived by the human ear.

After studying the basics, you can begin to master speech technologies. I recommend reading articles, for example about the models presented on TTS Arena. It is also advisable to have an understanding of low-level work with CUDA, frameworks for model inference (ONNX, TensorRT) and crowdsourced data labeling.

Grigory Sterling, leader of the TTS team at SberDevices

Speech synthesis – what you need to remember

  • Speech synthesis is the process of converting text into audio using AI.

  • It is used in the development of virtual assistants, navigation systems, audiobooks and inclusive technologies.

  • The synthesis process involves preparing and normalizing the text and then processing the data with an encoder, decoder and vocoder.

  • The encoder converts the text into a numerical representation, the decoder generates a mel spectrogram, and the vocoder turns it into an audio signal.

  • Sometimes a duration predictor is included in the process, which predicts the duration of each audio element for more accurate synchronization.

  • Models can cover the entire synthesis process or focus on specific stages such as acoustic processing or vocoding.

  • The technology stack for TTS development includes Python with libraries and frameworks such as librosa for audio, NLTK for text processing, and PyTorch for model building.

