Augmentation of expressive audio data based on TTS

In this article, we will talk about TTS-based (Text-to-Speech) voice cloning systems, which we use at the ITMO corporate human-machine interaction laboratory to augment speech databases as part of the task of multimodal recognition of speaker dominance in polylogs. I would like to note that this text is rather a brief overview of modern methods and technologies that can be useful in solving such problems. It is assumed that the reader has at least a basic knowledge of machine learning.

So, I would like to start from the very basics: why we need to augment audio data and how to frame TTS as a machine learning task, and then move on to the technologies we have used to solve our laboratory's problems.

Why is audio data augmentation necessary?

Modern solutions in the field of speech technologies are based on neural networks and therefore require large training samples. To solve problems such as speech and emotion recognition, speaker identification, and audio synthesis, corpora of expressive speech are needed. It is easy to guess what problems arise when collecting such data. First, the desired emotion has to be evoked in a person, yet not all reactions can be elicited in simple and natural ways. Second, using recordings of professional actors can incur significant costs and often yields "acted" rather than genuine emotions.

For these reasons, there is currently a shortage of expressive speech data in Russian available to the broad developer community. Still, there are corpora that can be used to tune machine learning models, such as the acoustic datasets Dusha [https://github.com/salute-developers/golos/tree/master/dusha#dusha-dataset] (250 hours of audio plus 100 hours of markup without audio, 4 emotions) and Aniemore [https://github.com/aniemore/Aniemore] (4 hours, 7 emotions). However, despite the high quality of recording and annotation, the problem of emotion imbalance remains.

There are two groups of approaches to these problems. The first consists of standard audio transformations: time stretching, volume change, tempo change, pitch shifting, adding background noise or silence, frequency and time masking, and combinations of these methods. The second group, which we will consider in more detail, is based on generating synthetic data with machine learning methods, for example TTS models built on generative adversarial networks (GANs) and variational autoencoders (VAEs).
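Before moving on to the second group, here is a minimal sketch of a few transformations from the first group, using librosa and numpy. The file name, parameter values and the SNR-based noise mixing are arbitrary illustrative choices, not a recipe we are prescribing.

```python
import numpy as np
import librosa

# Load an utterance (file name is arbitrary here).
y, sr = librosa.load("utterance.wav", sr=22050)

# Time stretching: speed up by 10% without changing pitch.
y_fast = librosa.effects.time_stretch(y, rate=1.1)

# Pitch shift: move the signal up by 2 semitones.
y_high = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

# Volume change: a simple gain in the time domain.
y_quiet = 0.5 * y

# Additive background noise at a chosen signal-to-noise ratio.
snr_db = 20.0
noise = np.random.randn(len(y))
noise *= np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
y_noisy = y + noise
```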

How to present TTS as a machine learning task?

TTS speech synthesis systems allow a human voice to be generated by computers. The main requirements for a TTS system are naturalness and intelligibility of speech. Before discussing how modern TTS systems actually work, let's look at the process of human speech production to understand the main features that matter for generating audio data.

In human speech production, the nervous system transmits nerve impulses to muscle fibers that set the organs of the vocal tract in motion. As the speech organs work and the vocal tract constantly changes its configuration, the air flow from the lungs is converted into a periodic complex sound. The output signal consists of the fundamental frequency (pitch) F_0 and its harmonics, which give the voice its timbre. The contour of F_0 defines intonational features and even the emotional state of the speaker. The spectral picture at the output is shaped by the vibrations of the vocal folds in the larynx and by the vocal tract, which acts as a resonator that amplifies certain frequencies. The distribution of signal energy along the frequency axis is characterized by formants, the resonant frequencies of the vocal tract.

Human vocal tract

All of the features listed above can be seen on a spectrogram. A spectrogram is a plot of signal energy versus frequency at time t, obtained by taking the squared magnitude of the short-time Fourier transform (STFT) of the audio signal. In the discrete case, the essence of the STFT is that the original time-domain signal is divided into overlapping segments, the Fourier spectrum is computed on each of them, and the window is then shifted by a given step.
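As a small illustration (not tied to any particular model), a power spectrogram can be computed from the STFT with scipy; the window length and hop size here are arbitrary choices.

```python
import numpy as np
import librosa
from scipy.signal import stft

# Load a mono recording (file name and parameters are arbitrary).
audio, sr = librosa.load("utterance.wav", sr=22050)

# Windowed Fourier transform: 25 ms windows with a 10 ms step.
win = int(0.025 * sr)
hop = int(0.010 * sr)
freqs, times, Z = stft(audio, fs=sr, nperseg=win, noverlap=win - hop)

# The squared magnitude of the STFT gives the power spectrogram.
power_spec = np.abs(Z) ** 2                      # shape: (n_freqs, n_frames)
log_spec = 10 * np.log10(power_spec + 1e-10)     # dB scale, convenient for plotting
```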

The classic TTS formulation is a prime example of supervised learning. The original TTS problem can be described in machine learning terms as follows. Let u_j be the j-th sound uttered by the speaker and x_j its feature description, where 1 \le j \le M. The synthesis model \mathscr{S} computes an estimate \tilde{x}_j=\mathscr{S}(t_j|\theta_\mathscr{S}), where t_j is the phoneme corresponding to u_j and \theta_\mathscr{S} are the synthesizer parameters. The vocoder \mathscr{V} then estimates \tilde{u}_j=\mathscr{V}(\tilde{x}_j|\theta_\mathscr{V}). In general, the following objective must be optimized:

\min_{\theta_\mathscr{V},\,\theta_\mathscr{S}} \mathscr{L}\big(u_j, \mathscr{V}(\mathscr{S}(t_j|\theta_\mathscr{S})|\theta_\mathscr{V})\big),

where \mathscr{L} is the loss function.

This is an objective in the time domain defined on a composition of models, which, generally speaking, can make it harder for the optimization algorithm to converge to a local optimum. Therefore, the models are trained separately, in the time-frequency and time domains respectively:

\min_{\theta_\mathscr{S}} \mathscr{L}_\mathscr{S}\big(x_j, \mathscr{S}(t_j|\theta_\mathscr{S})\big), \qquad \min_{\theta_\mathscr{V}} \mathscr{L}_\mathscr{V}\big(u_j, \mathscr{V}(\tilde{x}_j|\theta_\mathscr{V})\big).
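To make the separation of the two objectives concrete, here is a toy PyTorch sketch with placeholder networks. The shapes, architectures and losses are illustrative assumptions, not those of any real TTS system; the point is only that the synthesizer and the vocoder are optimized with their own losses.

```python
import torch
import torch.nn as nn

# Placeholder networks: S maps phoneme features to mel frames,
# V maps mel frames to waveform samples (both purely illustrative).
S = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 80))   # synthesizer
V = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 200))  # vocoder

opt_s = torch.optim.Adam(S.parameters(), lr=1e-3)
opt_v = torch.optim.Adam(V.parameters(), lr=1e-3)
l1, l2 = nn.L1Loss(), nn.MSELoss()

def train_step(t_j, x_j, u_j):
    """One step of the decoupled objectives:
    L_S(x_j, S(t_j)) in the time-frequency domain,
    L_V(u_j, V(x_j)) in the time domain (teacher-forced with the real mel)."""
    # Synthesizer step: predict the mel frame from phoneme features.
    opt_s.zero_grad()
    loss_s = l1(S(t_j), x_j)
    loss_s.backward()
    opt_s.step()

    # Vocoder step: reconstruct the waveform chunk from the ground-truth mel.
    opt_v.zero_grad()
    loss_v = l2(V(x_j), u_j)
    loss_v.backward()
    opt_v.step()
    return loss_s.item(), loss_v.item()

# Dummy batch: 8 phoneme feature vectors, mel frames, and waveform chunks.
t_j, x_j, u_j = torch.randn(8, 64), torch.randn(8, 80), torch.randn(8, 200)
print(train_step(t_j, x_j, u_j))
```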

From this follows the basic training scheme of most modern pipelined TTS models:

Speech synthesis process

The first stage is linguistic analysis, which includes text normalization, word segmentation, morphological tagging, grapheme-to-phoneme conversion (G2P), and extraction of various linguistic features. The second stage converts the input phoneme sequence into a spectrogram, a representation of the signal in the time-frequency domain. The last stage is the reconstruction of the sound wave from the spectrogram; a special algorithm, a vocoder, is usually used for this.
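Modern systems typically use neural vocoders for this last step, but the classic Griffin-Lim algorithm makes it concrete; below is a minimal librosa sketch (file name and parameters are arbitrary) that builds a mel spectrogram and then inverts it back into a waveform.

```python
import librosa
import soundfile as sf

# Load audio and build a mel spectrogram (a stand-in for the acoustic model output).
y, sr = librosa.load("utterance.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)

# Classic (non-neural) vocoder: invert the mel spectrogram with Griffin-Lim.
y_rec = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60
)
sf.write("reconstructed.wav", y_rec, sr)
```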

The power of generative modeling

The quality of modern state-of-the-art (SOTA) adaptive synthesis models is now comparable to real human speech. In many ways this was achieved through end-to-end TTS models and data-driven methods based on generative modeling. It is this approach, built on generative architectures, that our team used in its work.

In fact, there are subtle differences between the various deep generative models and their training methods. For example, a VAE defines a probabilistic model of the joint distribution of latent variables z \sim p(z) and data x \sim p(x|z), whereas GANs do not produce latent representations of the training data and do not model the data distribution explicitly, although one can say that a distribution model is trained implicitly. We will not dwell on this in detail. The main idea is to assume that there exists a joint distribution P(x) or P(x,y) covering an unbounded number of feature-label pairs (x,y), including the samples from the training and test sets. We can then generate new data samples of the same kind as those in the training set.

We started our research with the FlowTron model. It is a TTS network that can control the style and variability of the speaker's speech using a generative mechanism called a normalizing flow. Frame generation in this model is conditioned on the previous frames, p(x)=\prod_t p(x_t|x_{[1:t-1]}), which makes FlowTron an autoregressive model. FlowTron learns an invertible mapping of the data (the distribution of mel spectrograms) into a parameterized latent space that can be manipulated to control many aspects of speech (pitch, tone, tempo). Taking as input a vectorized mel spectrogram, treated as a random variable sampled from an unknown distribution, the flow mechanism computes the probability density of the original variable:

p_X(x) = p_Z\big(f(x)\big)\left|\det\frac{\partial f(x)}{\partial x}\right|,

where Z is a random variable with a tractable probability density, X=f^{-1}(Z), f^{-1} is the generative mapping that transforms the noise p_Z into a complex distribution, and f is the flow that transforms the complex distribution into a simpler one (noise).

It is worth noting that, compared to VAEs and GANs, flows give a more accurate estimate of the density of the observed data. This mechanism is trained in the standard way, by maximizing the total log-likelihood of the data. In practice, the latent variable is usually assumed to be normally distributed, z \sim \mathcal{N}(z;0,I). In this case the composition x=f_0 \circ f_1 \circ \dots \circ f_k(z) is called a normalizing flow. Flow step m in the FlowTron model consists of an affine coupling layer and takes as input the mel spectrogram of the phrase spoken by a given speaker (speaker_id):

x_t^{(m+1)} = f_m\big(x_t^{(m)}\big) = s_t^{(m)} \odot x_t^{(m)} + b_t^{(m)},

where (\log s_t^{(m)}, b_t^{(m)}) = NN(x_{[1:t-1]}^{(m)}, \text{text}, \text{speaker\_id}), NN(\cdot) is an autoregressive transformation, s_t^{(m)} is the scaling factor, b_t^{(m)} is the bias, and \odot denotes the Hadamard (element-wise) product.
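To make the idea of an affine coupling step tangible, here is a compact PyTorch sketch (not the actual FlowTron code): a small network predicts \log s and b from part of the input plus a conditioning vector, the log-determinant needed for the likelihood is simply the sum of \log s, and the inverse pass matches the generation formula given below. All sizes and the conditioning scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine coupling step: x_b -> s * x_b + b, conditioned on x_a and a
    context vector (e.g. text/speaker embeddings). Same idea as FlowTron's
    coupling layers, but a simplified stand-in rather than the real thing."""
    def __init__(self, channels, context_dim, hidden=256):
        super().__init__()
        self.half = channels // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * self.half),  # outputs (log s, b)
        )

    def forward(self, x, context):
        x_a, x_b = x[:, : self.half], x[:, self.half :]
        log_s, b = self.net(torch.cat([x_a, context], dim=-1)).chunk(2, dim=-1)
        y_b = torch.exp(log_s) * x_b + b        # affine transform
        log_det = log_s.sum(dim=-1)             # contribution to the log-likelihood
        return torch.cat([x_a, y_b], dim=-1), log_det

    def inverse(self, y, context):
        y_a, y_b = y[:, : self.half], y[:, self.half :]
        log_s, b = self.net(torch.cat([y_a, context], dim=-1)).chunk(2, dim=-1)
        x_b = (y_b - b) * torch.exp(-log_s)     # invert the affine transform
        return torch.cat([y_a, x_b], dim=-1)

# Quick check: inverse(forward(x)) recovers x.
layer = AffineCoupling(channels=80, context_dim=16)
x, ctx = torch.randn(4, 80), torch.randn(4, 16)
y, log_det = layer(x, ctx)
assert torch.allclose(layer.inverse(y, ctx), x, atol=1e-5)
```

During training the log-determinants of all flow steps are summed into the log-likelihood together with the Gaussian prior on z; during inference the inverse direction maps sampled noise (plus text and speaker conditioning) back into mel frames.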

Note that the spectrogram used in FlowTron is not a plain one: it is weighted with mel filters, which are used because of the psychophysiological peculiarities of human hearing: the human ear is more sensitive to changes at low frequencies than at high ones.
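For reference, the mel weighting itself is just a fixed matrix of triangular filters applied to the power spectrogram; a small librosa sketch (parameters arbitrary) shows this explicitly.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 22050, 1024, 80
y, _ = librosa.load("utterance.wav", sr=sr)

# Power spectrogram from the STFT.
power_spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=256)) ** 2

# Mel filterbank: n_mels triangular filters, spaced densely at low frequencies.
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1 + n_fft // 2)

# Weighting the spectrogram with these filters gives the mel spectrogram.
mel_spec = mel_fb @ power_spec
log_mel = np.log(mel_spec + 1e-6)
```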

FlowTron architecture

The generation process based on text and speaker is represented as:

x_t^{(m)} = \frac{x_t^{(m-1)} - b_t^{(m)}}{s_t^{(m)}},

where (\log s_t^{(m)}, b_t^{(m)}) = NN(x_{[1:t-1]}^{(m-1)}, \text{text}, \text{speaker\_id}).

Now we can sample from the desired regions of the latent space, corresponding to particular speech characteristics in the space of mel spectrograms. And that is not all: as it turned out, FlowTron learns a latent representation that stores non-textual (paralinguistic) information, makes it possible to increase the variation of the mel spectra, and allows speaker style transfer.

After training, we started experimenting with a data corpus based on the Russian voice acting of the game The Elder Scrolls V: Skyrim. The table below shows the quality scores of the resulting synthesis:

Experiment                                            MOS
Content change                                        4.0996
Speaker replacement, gender preserved                 4.4828
Speaker replacement with gender change                4.3103
Style transfer to an utterance of the same speaker    4.4828
Style transfer to another speaker's utterance         3.9310

MOS (mean opinion score) is computed as the arithmetic mean of quality scores given to the synthesized speech by human raters on a scale from 1 to 5. Note that this score is not absolute, since it is subjective; comparing experiments performed by different groups of people at different times on different data is therefore incorrect.
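Just to make the formula explicit, a tiny sketch with hypothetical listener scores (the numbers are placeholders, not our real ratings):

```python
import numpy as np

# Hypothetical listener scores (1-5) for one experiment; real values would
# come from the subjective evaluation described above.
scores = np.array([4, 5, 4, 3, 5, 4, 4, 5])

mos = scores.mean()                                      # mean opinion score
ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))  # rough 95% confidence interval
print(f"MOS = {mos:.2f} ± {ci95:.2f}")
```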

At first we thought the augmentation problem was solved, since we had a model that showed good quality. However, we ran into several difficulties:

  • With a significant discrepancy in the length of the phrase, the sound is greatly distorted

  • Style transfer is unstable

  • The quality of the data used in the training phase has a critical impact on the final result when changing the speaker

Then it became clear that FlowTron was not quite suitable for us, and we began to look for other models that could potentially be used to solve our problem.

One of the possible systems that could provide a scalable solution to the voice cloning problem with good performance seemed to be VITS, a parallel end-to-end TTS system. VITS uses a variational autoencoder (VAE) to connect the acoustic model to the vocoder through a latent (hidden) representation. This approach generates high-quality audio by improving the expressive power of the network with a normalizing flow mechanism and adversarial training in the time domain of the signal, and it also adds the ability to pronounce text with different variations thanks to modeling the uncertainty of the latent state and a stochastic duration predictor.

The VITS model

VITS has notable features:

  • Synthesizing a time-domain signal directly from text without additional input conditions

  • Uses the MAS (Monotonic Alignment Search) algorithm, a variant of DTW (dynamic time warping), to find the alignment between sequences without computing additional network losses. Alignment is very important in the model, since it gives an idea of which parts of the input are most relevant for generating each frame of the mel spectrogram (see the sketch after this list)

  • Signal generation occurs in parallel

  • VITS outperforms two-stage pipeline models
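To give an idea of what MAS does, here is a compact dynamic-programming sketch that finds the most likely monotonic, non-skipping alignment between N text tokens and T mel frames given a log-likelihood matrix. It follows the general idea of the algorithm rather than the exact VITS implementation, and the input matrix in the example is random toy data.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p[i, t]: log-likelihood of text token i generating mel frame t.
    Returns a hard alignment A (N x T, one active token per frame) that is
    monotonic and non-skipping, maximizing the total log-likelihood."""
    n_text, n_frames = log_p.shape
    neg_inf = -1e9
    Q = np.full((n_text, n_frames), neg_inf)
    Q[0, 0] = log_p[0, 0]
    # Forward pass: each frame either stays on the same token or advances by one.
    for t in range(1, n_frames):
        for i in range(n_text):
            stay = Q[i, t - 1]
            advance = Q[i - 1, t - 1] if i > 0 else neg_inf
            Q[i, t] = max(stay, advance) + log_p[i, t]
    # Backtracking from the last token at the last frame.
    A = np.zeros_like(log_p, dtype=np.int32)
    i = n_text - 1
    for t in range(n_frames - 1, -1, -1):
        A[i, t] = 1
        if t > 0 and i > 0 and Q[i - 1, t - 1] > Q[i, t - 1]:
            i -= 1
    return A

# Toy example: 3 tokens aligned to 7 frames.
rng = np.random.default_rng(0)
print(monotonic_alignment_search(rng.standard_normal((3, 7))))
```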

The figure above shows that VITS consists of a prior encoder, a stochastic duration predictor, a posterior encoder, and a decoder. During training, VITS uses the ELBO (evidence lower bound) objective, which consists of the expected log-likelihood and a regularizer that penalizes the posterior distribution for deviating too far from the prior:

\log p_\theta(x|c) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z) - \log\frac{q_\phi(z|x)}{p_\theta(z|c)}\right],

where p_\theta(x|c) is the marginal likelihood of the data, p_\theta(x|z) is the data likelihood modeled by the decoder, q_\phi(z|x) is the posterior distribution modeled by the encoder, \log p_\theta(x|z) gives the reconstruction term, \log q_\phi(z|x) - \log p_\theta(z|c) gives the Kullback-Leibler divergence, and p_\theta(z|c) is the prior distribution of the latent state, computed through the normalizing flow mechanism.
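To make the two terms of this objective concrete, here is a generic (not VITS-specific) PyTorch sketch of an ELBO-style loss with a diagonal-Gaussian posterior and a conditional Gaussian prior; all module shapes are placeholders.

```python
import torch
import torch.nn as nn

class ToyCVAE(nn.Module):
    """Minimal conditional VAE to illustrate the ELBO terms; not the VITS model."""
    def __init__(self, x_dim=80, c_dim=16, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)    # q_phi(z|x): mean and log-variance
        self.prior = nn.Linear(c_dim, 2 * z_dim)  # p_theta(z|c): conditional prior
        self.dec = nn.Linear(z_dim, x_dim)        # p_theta(x|z): decoder mean

    def forward(self, x, c):
        mu_q, logvar_q = self.enc(x).chunk(2, dim=-1)
        mu_p, logvar_p = self.prior(c).chunk(2, dim=-1)
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)  # reparameterization
        x_hat = self.dec(z)
        # Reconstruction term: E_q[log p(x|z)] up to a constant (Gaussian likelihood).
        recon = ((x - x_hat) ** 2).sum(dim=-1)
        # KL(q_phi(z|x) || p_theta(z|c)) between two diagonal Gaussians.
        kl = 0.5 * (
            logvar_p - logvar_q
            + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
            - 1.0
        ).sum(dim=-1)
        return (recon + kl).mean()                # minimize -ELBO

model = ToyCVAE()
loss = model(torch.randn(8, 80), torch.randn(8, 16))
loss.backward()
```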

But we went further and used a modification of VITS based on transfer learning: instead of phonemes, the text encoder takes as input representations computed with HuBERT (Hidden-Unit BERT). HuBERT applies K-means clustering to short audio segments (25 milliseconds) and uses the resulting clusters as discrete hidden units that a BERT-style model learns to predict.

HuBERT

VITS modification
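For illustration, content features of this kind can be extracted with a pre-trained HuBERT from the transformers library and discretized with K-means. The checkpoint name, layer index, and number of clusters below are assumptions for the sketch, not our exact setup.

```python
import torch
import librosa
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel

# Pre-trained HuBERT; the checkpoint and layer choice are illustrative.
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def content_vectors(path):
    wav, _ = librosa.load(path, sr=16000)         # HuBERT expects 16 kHz audio
    x = torch.tensor(wav).unsqueeze(0)            # (1, samples)
    with torch.no_grad():
        out = hubert(input_values=x, output_hidden_states=True)
    return out.hidden_states[6][0]                # (frames, 768), one vector per ~20 ms

# Discretize the content vectors into "hidden units" with K-means.
feats = content_vectors("utterance.wav").numpy()
units = MiniBatchKMeans(n_clusters=100).fit_predict(feats)
```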

Thanks to this modification, we obtained a number of improvements. First, high-quality and reliable adaptive synthesis that preserves pronunciation and emotions. Second, no text annotation is needed. Third, only a minimal amount of audio data is required for training. And fourth, we now have the ability to modify F_0. However, the presented model does not work in zero-shot mode (i.e., with speakers not seen during training), unlike YourTTS, which uses a full-fledged neural speaker encoder and is also based on VITS.

So how do we perform voice conversion? Everything is quite transparent here. During training, we deliberately remove the speaker IDs so that the text encoder learns representations e that are independent of the speaker. For a given speaker s, the spectrogram x_{lin} is computed, and using the posterior encoder and the prior-encoder flow, a speaker-independent representation is obtained: z \sim q_\phi(z|x_{lin}, s), e = f_\theta(z|s). Voice substitution is then described as \tilde{y} = \mathrm{Decoder}(f_\theta^{-1}(e|\tilde{s})|\tilde{s}). Now everything is limited only by our imagination, because we can condition not only on the speaker but also, for example, on emotion. True, we are still working on this.
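Schematically, and with hypothetical module names (this is a sketch of the conversion path just described, not code from a specific repository), the procedure looks like this:

```python
import torch

def convert_voice(x_lin, src_speaker, tgt_speaker, posterior_encoder, flow, decoder):
    """Sketch of the conversion path; posterior_encoder, flow and decoder are
    hypothetical stand-ins for the trained VITS-style components."""
    with torch.no_grad():
        # z ~ q_phi(z | x_lin, s): sample the latent from the posterior encoder
        # given the source spectrogram and the source speaker embedding.
        z = posterior_encoder.sample(x_lin, src_speaker)
        # e = f_theta(z | s): the flow strips speaker information from z.
        e = flow(z, src_speaker)
        # y~ = Decoder(f_theta^{-1}(e | s~) | s~): run the flow backwards with the
        # target speaker and decode the result into a waveform.
        z_tgt = flow.inverse(e, tgt_speaker)
        return decoder(z_tgt, tgt_speaker)
```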

We decided to use this approach to augment the MELD dataset for further work on multimodal recognition of speaker dominance and valence in polylogs. We then asked ourselves: how can we change the content (the specific words) in a generated audio recording if we want to? A method was proposed that makes this possible, although the resulting quality still leaves much to be desired. The method is as follows. First, we align the text with the content vectors at the output of HuBERT for the recording, and cluster the content vectors to label the input graphemes. Next, a pre-trained seq-to-seq (sequence-to-sequence) model is used to translate the input grapheme sequence into cluster indices, which allows the content vectors in the HuBERT model to be replaced. It remains only to predict F_0 for the new content vectors; fortunately, there are existing tools for this, for example CREPE.

Content change scheme
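As an example of the last step, F_0 can be predicted with the crepe package; a minimal sketch (the file name, step size, and confidence threshold are arbitrary choices):

```python
import crepe
import librosa

# CREPE works on 16 kHz mono audio; the file name here is arbitrary.
audio, sr = librosa.load("generated_utterance.wav", sr=16000)

# Predict F0 every 10 ms; Viterbi decoding smooths the pitch track.
time, frequency, confidence, activation = crepe.predict(
    audio, sr, step_size=10, viterbi=True
)

# Keep only frames where the model is reasonably confident (voiced regions).
f0 = [(t, f) for t, f, c in zip(time, frequency, confidence) if c > 0.5]
```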

Conclusion

It is no secret that, with the growth of audio data and the development of semi-automated learning methods, deep learning models are becoming dominant in the field of TTS. This is largely due to the ability of deep architectures to learn hidden representations, which helps overcome the fundamental limitations of methods that rely on predefined features of the processed data. This approach has improved the generalization of models to different speakers and previously unseen words, which has had a positive effect on adaptive synthesis. Generative modeling, which complements neural models with an adversarial training process, has allowed TTS systems to increase their expressive capabilities.

We carried out experiments confirming that FlowTron can be successfully applied to speech data augmentation. The model performs well in content change, speaker replacement, and style transfer. However, FlowTron has some drawbacks, such as jumps in synthesis quality and a strong dependence on the "purity" of the training data. The modification of VITS using the HuBERT model is of higher quality and more reliable for adaptive synthesis, but the content modification scheme still needs improvement. We are going to use this approach for data augmentation in the multimodal speaker dominance recognition problem. Our team also plans to add the ability to change the speech tempo and the speaker's emotions, as well as to modify F_0 according to intonation patterns (placement of intonation and logical stress).
