Synthesis of emotions. Inhale-exhale model

I decided to try writing a few articles about speech synthesis with emotion support.

It all started when I decided to build a simple neural-network-based MVP, an online tutor for improving spoken foreign-language skills, since I struggle with language learning myself.

During implementation I tried different models, from FastPitch and Tacotron 2 to Bark from Suno. When I tested my first MVP, listening to the synthetic voice for long stretches became painful and irritating, especially when the voice acting did not match the context. An analogue of the "uncanny valley" effect arose, only for sound.

This made me look for a solution that would make the voice more emotional. Here I will describe how I began transferring a biological model into speech synthesis.

My first step was to develop the "inhale-exhale" model. The idea is that 99.999% of people speak exclusively on the exhale (this applies to animals as well).

Inhale-exhale model

The model boils down to:

  • breaking a sentence into phrases, each corresponding to one exhalation. At the time I assumed the maximum number of words a person can say per exhalation is 7, plus punctuation (this is not far from the truth, though I later abandoned this approach).

  • calculating the length of the inhalation; at that time it was drawn at random from a range that depended on the synthesized emotion.

  • calculating the exhalation, which depended on the number of phonemes and the strength of the emotion. Say a synthesized phrase is 3 seconds long but the exhalation time is 2.5 seconds: the audio then has to be sped up or, conversely, stretched. The emotion strength, a value in [0, 1], both sped up and slowed down those 2.5 seconds.

  • for each emotion (and my palette held about 30 emotions), its own normal exhalation length was computed. The sign of the emotion strength and the way it entered the calculation also depended on the emotion (amplifying vs. inhibiting).

  • there is the duration of the synthesized audio (Ta) and the duration estimated from the phonemes (Tn):
    if Ta > Tn, speed the audio up by a factor of Ta / Tn;
    if Ta < Tn, we can speak louder (there is more air than the short exhalation needs), so increase the volume by Tn / Ta.

  • volume and pitch are also affected by the emotion and its strength. For example, for amplifying emotions it looked like this:

        # baseline pitch of the emotion in a calm state
        pitch = emotion_pitch[em_name]
        # then volume and pitch were computed;
        # emotion_volume is the maximum loudness of the emotion --
        # the faster we speak, the stronger the emotion
        # is assumed to be, and thus the louder we speak
        volume = emotion_volume[em_name] / Ta
        # finally, volume and pitch were amplified or attenuated by the emotion strength
        volume = (em_score - 0.5 + random.uniform(-0.1, 0.1)) * volume
        pitch = (em_score - 0.5 + random.uniform(-0.1, 0.1)) * pitch
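The steps above can be collected into a single sketch. This is my reconstruction, not the author's exact code: the function name `exhale_adjust` and the `emotion_pitch`/`emotion_volume` table values are assumptions for illustration; only the Ta/Tn rules and the strength-based scaling come from the text.

```python
import random

# Hypothetical per-emotion tables (the names exist in the article,
# the concrete values here are illustrative assumptions)
emotion_pitch = {"fear": 1.3}    # baseline pitch multiplier in a calm state
emotion_volume = {"fear": 4.0}   # maximum loudness of the emotion

def exhale_adjust(Ta, Tn, em_name, em_score):
    """Fit a synthesized phrase into one exhalation.

    Ta       -- duration of the synthesized audio, seconds
    Tn       -- exhalation time estimated from phonemes and emotion strength
    em_score -- emotion strength in [0, 1]
    Returns (speed, volume, pitch) multipliers to apply to the audio.
    """
    if Ta > Tn:
        # not enough air: speed the audio up to fit the exhalation
        speed = Ta / Tn
        air_gain = 1.0
    else:
        # air to spare: keep the tempo and speak louder instead
        speed = 1.0
        air_gain = Tn / Ta

    # faster speech is treated as a stronger, hence louder, emotion
    volume = emotion_volume[em_name] / Ta * air_gain
    pitch = emotion_pitch[em_name]

    # amplify or attenuate around the neutral point 0.5, with a little jitter
    volume = (em_score - 0.5 + random.uniform(-0.1, 0.1)) * volume
    pitch = (em_score - 0.5 + random.uniform(-0.1, 0.1)) * pitch
    return speed, volume, pitch
```

For a 3-second phrase and a 2.5-second exhalation this yields a 1.2x speed-up; for a 2-second phrase it instead boosts the volume by 1.25x.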

Thus, the volume was influenced by the speed of inhalation and exhalation and the strength of the emotion.

This model was good because it could be layered on top of any synthesis model: recognize the emotions in the text, split the text into phrases, then apply the model to the resulting audio. Even such a simple post-processing step makes the resulting speech more emotional.
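The first stage of that pipeline, splitting text into per-exhalation phrases, can be sketched as follows. This is a minimal illustration of the approach described above (split at punctuation, cap at 7 words), not the author's implementation; the function name is hypothetical.

```python
import re

MAX_WORDS = 7  # early assumption: at most 7 words per exhalation

def split_into_exhalations(text, max_words=MAX_WORDS):
    """Split text into phrase chunks, one chunk per exhalation.

    First split at punctuation (keeping it attached to its phrase),
    then cap each piece at max_words words.
    """
    phrases = []
    for piece in re.split(r'(?<=[.,;:!?])\s+', text.strip()):
        words = piece.split()
        for i in range(0, len(words), max_words):
            phrases.append(" ".join(words[i:i + max_words]))
    return phrases
```

Each resulting phrase would then be synthesized and time/volume-adjusted independently, with a pause for the inhalation inserted between phrases.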

For example, here is this model superimposed on Silero:

Example of text with the "confused" emotion, without the inhale-exhale model

Example of text with the "confused" emotion, with the inhale-exhale model

Example of text with the "fear" emotion, without the inhale-exhale model

Example of text with the "fear" emotion, with the inhale-exhale model

The final version based on this model:

PS: Let's see whether there is interest in my continuing to describe how this model evolved into the "lungs-oxygen", "heart-lungs", and "speech tract" models.
Here you can take a listen and decide where I am going with this and whether I should continue the series of articles.


From when I built the model, there are examples in the comments: https://t.me/greenruff/1741
Example description: https://t.me/greenruff/1792
