GigaAM: a class of open models for speech processing

Earlier we announced GigaAM (Giga Acoustic Model), a model pretrained on Russian speech, together with its fine-tuned versions for speech recognition (GigaAM-CTC) and emotion recognition (GigaAM-Emo). Today we are sharing the model weights and usage examples with the community.

We invite you to dive into self-supervised learning for speech and to evaluate the capabilities of the pre-trained models!

Self-Supervised Learning

One of the challenges in machine learning is collecting training data. For speech tasks this issue is even more acute, since the data are complex in nature. For example, it is difficult for a person to determine the speaker's emotion from an audio recording or to make out the content of speech in noisy conditions. To ensure data quality, the same audio recording is labeled by several annotators, which slows down the labeling process and increases its cost.

A promising approach to reducing the amount of labeled data is Self-Supervised Learning. In this paradigm, creating a model for an applied task consists of two stages. In the first, the model learns general patterns in speech from a large representative corpus. In the second, it is fine-tuned on a small amount of labeled data for the target task.

Self-Supervised Learning for Speech

Speech data shares properties with both text and images. On the one hand, text and speech have a sequential structure and a temporal component. On the other hand, images and speech are continuous signals. As a consequence, most self-supervised approaches for speech are based on ideas from Natural Language Processing and/or Computer Vision.

To pretrain GigaAM we used the wav2vec2.0 approach, but before describing it, we will look at HuBERT and BEST-RQ, which will help build intuition about pretraining.

Predictive: HuBERT, BEST-RQ

The first approach we'll look at aims to transfer the idea of BERT training to speech: use the Transformer architecture and the Masked Language Modeling (MLM) task. The main obstacle to applying this idea directly is the lack of a vocabulary of discrete tokens to predict.

The authors of HuBERT overcome this limitation by applying k-means clustering to features extracted from the audio recordings. Once the clustering algorithm is trained, each audio segment's feature vector is mapped to the index of the cluster it belongs to.

Thus, during training the transformer encoder predicts cluster labels for masked regions of the input signal, which encourages the model to reconstruct the missing speech content. The described scheme is illustrated in Image 1.

Image 1. HuBERT

It is worth noting that model training consists of two stages. In the first, clustering is performed on MFCC features, which encode low-level properties of the signal. In the second stage, outputs of the model's intermediate layers are passed through the clustering algorithm to build a new vocabulary, on which the encoder is trained again. This adds more semantic information to the target variables.
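
To make the first stage concrete, here is a minimal sketch of building HuBERT-style pseudo-labels: MFCC features are extracted and clustered with k-means, and each frame's cluster index becomes the target for masked prediction. The file name, number of clusters, and feature settings are illustrative, not the values used in HuBERT or GigaAM.

import librosa
from sklearn.cluster import MiniBatchKMeans

# Stage 1: low-level MFCC features as input to the clustering algorithm.
waveform, sr = librosa.load("sample.wav", sr=16000)        # hypothetical audio file
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # (13, n_frames)
frames = mfcc.T                                            # one feature vector per frame

# In practice k-means is fitted on frames from the whole unlabeled corpus;
# a single recording is used here only for brevity.
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(frames)

# The cluster index of each frame is the discrete pseudo-label that the
# transformer encoder learns to predict for masked regions.
pseudo_labels = kmeans.predict(frames)                     # (n_frames,)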

The method shows excellent results in few-shot learning scenarios and inspired further development of the predictive approach, namely BEST-RQ.

In BEST-RQ, the authors proposed to simplify the vocabulary construction process. The idea of the approach is illustrated in Image 2.

Image 2. BEST-RQ

Instead of clustering, BEST-RQ uses a quantization module, which works as follows (a sketch is given after the list):

  1. The feature vector of the audio recording segment is projected into the codebook space. The latter is a randomly initialized real matrix.

  2. The cosine similarity between the projection of the audio segment and each codebook vector is calculated.

  3. The index of the codebook vector with the highest cosine similarity is used as the target variable.
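
A minimal sketch of this quantization module in PyTorch; the feature dimension, projection dimension, and codebook size below are illustrative rather than the values from the BEST-RQ paper.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, proj_dim, codebook_size = 80, 16, 8192

# Both the projection matrix and the codebook are randomly initialized and kept frozen.
projection = torch.randn(feat_dim, proj_dim)
codebook = torch.randn(codebook_size, proj_dim)

def quantize(features: torch.Tensor) -> torch.Tensor:
    """features: (n_frames, feat_dim) -> target indices: (n_frames,)"""
    projected = features @ projection                                         # 1. project into the codebook space
    sims = F.normalize(projected, dim=-1) @ F.normalize(codebook, dim=-1).T   # 2. cosine similarities
    return sims.argmax(dim=-1)                                                # 3. index of the closest codebook vector

targets = quantize(torch.randn(100, feat_dim))  # discrete MLM targets for 100 frames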

After constructing discrete target variables, the encoder is trained for the MLM task. Training in the BEST-RQ scenario takes place in one stage and, more importantly, is stable and scalable.

Contrastive: Wav2Vec2.0

In the wav2vec2.0 approach, the encoder is trained with a contrastive task instead of MLM. The idea: part of the input features (vector representations obtained after the convolutional layers) is masked, and the encoder is trained to restore the missing feature segments from the context (Image 3).

Image 3. Wav2Vec2.0

The contrastive loss for a masked position j:

\mathcal{L} = -\log \frac{\exp(\operatorname{sim}(c_j, q_j)/\kappa)}{\sum_{\hat{q} \sim Q}\exp(\operatorname{sim}(c_j, \hat{q})/\kappa)}

Here \operatorname{sim} is the cosine similarity between vectors, c_j is the encoder output at masked position j, q_j is the corresponding quantized target, Q is the set of candidates consisting of q_j and distractors sampled from other masked positions, and \kappa is a temperature parameter.
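
A sketch of this loss for a single masked position; the vector dimension, number of distractors, and temperature value are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(c_j, q_j, distractors, kappa=0.1):
    """c_j: encoder output at a masked step, q_j: its quantized target,
    distractors: (K, dim) quantized vectors from other masked steps."""
    candidates = torch.cat([q_j.unsqueeze(0), distractors], dim=0)       # true target at index 0
    sims = F.cosine_similarity(c_j.unsqueeze(0), candidates, dim=-1) / kappa
    # Cross-entropy with target index 0 is exactly -log softmax_0(sims), i.e. the loss above.
    return F.cross_entropy(sims.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = contrastive_loss(torch.randn(256), torch.randn(256), torch.randn(100, 256))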

Unlike BEST-RQ, the wav2vec2.0 approach uses a learnable quantizer. To let gradients flow through the non-differentiable choice of a single codebook vector, Gumbel Softmax is used.
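
A sketch of such a learnable quantizer with a single codebook (wav2vec2.0 itself uses several codebook groups; the sizes here are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelQuantizer(nn.Module):
    def __init__(self, feat_dim=512, codebook_size=320, code_dim=256):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, codebook_size)  # a score per codebook entry
        self.codebook = nn.Parameter(torch.randn(codebook_size, code_dim))

    def forward(self, x):                                    # x: (batch, frames, feat_dim)
        logits = self.to_logits(x)
        # hard=True selects exactly one entry in the forward pass, while gradients
        # flow through the relaxed (soft) probabilities in the backward pass.
        one_hot = F.gumbel_softmax(logits, tau=2.0, hard=True, dim=-1)
        return one_hot @ self.codebook                       # (batch, frames, code_dim)

quantized = GumbelQuantizer()(torch.randn(2, 50, 512))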

Training wav2vec2.0 has its pitfalls; we talked about them at the Salute, Gigachat conference, and a recording of the talk is available at the link.

GigaAM training

There is a large number of open-source pre-trained models, for example wav2vec2-XLS-R, HuBERT, and WavLM. However, these models are trained either on English speech or on several languages at once, which limits their quality on Russian-language datasets.

To achieve maximum quality on Russian, we collected a dataset of 50 thousand hours of diverse Russian-language audio.

Of the approaches described above, we chose wav2vec2.0 for pretraining, since we had already used it to improve production speech recognition and have experience stabilizing its training.

We chose the Conformer architecture, since this encoder shows results close to state-of-the-art in speech processing tasks. The selected model contains about 240 million parameters, a compromise between model expressiveness and ease of fine-tuning under limited computational resources.

The pretraining loss and metrics are poorly interpretable in terms of quality on applied tasks. Therefore, below we describe fine-tuning scenarios and the results obtained for speech recognition and speaker emotion recognition.

Fine-tuning for the speech recognition task

One application of a universal audio encoder is speech recognition (ASR: Automatic Speech Recognition): transcribing the content of speech without regard to the speaker's gender, age, or emotions.

The main ASR quality metric is the Word Error Rate (WER), which is computed between the model's hypothesis and the ground-truth transcription using the Levenshtein distance and can be interpreted as the proportion of incorrectly recognized words.
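
For reference, a self-contained sketch of computing WER with the word-level Levenshtein distance (in practice, ready-made libraries such as jiwer are usually used):

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of word edits turning ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("всем привет это тест", "всем привет этот тест"))  # 0.25: one of four words is wrong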

There are several approaches to end-to-end speech recognition, for example CTC, RNN-T, and LAS. We conducted experiments with the CTC approach, which is context-independent: token predictions do not depend on previously predicted tokens, and no language information is taken into account. This choice was made to demonstrate the capabilities of the encoder-only model. To further improve quality, readers can train the model with another decoder or use Shallow Fusion with an external language model.
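
The context-independence of CTC is easy to see in greedy decoding: each frame's token is an argmax over the CTC head outputs that ignores neighboring predictions, after which repeats are collapsed and blanks are removed. A sketch with the character vocabulary from the configuration below:

import torch

def ctc_greedy_decode(log_probs: torch.Tensor, vocabulary: list, blank_id: int) -> str:
    """log_probs: (n_frames, n_classes) CTC head outputs for one utterance."""
    frame_ids = log_probs.argmax(dim=-1).tolist()   # per-frame argmax, no language context
    tokens, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank_id:         # collapse repeats, drop blank tokens
            tokens.append(vocabulary[idx])
        prev = idx
    return "".join(tokens)

vocab = [" "] + list("абвгдежзийклмнопрстуфхцчшщъыьэюя")   # 33 symbols, as in the config below
decoded = ctc_greedy_decode(torch.randn(50, len(vocab) + 1), vocab, blank_id=len(vocab))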

Most approaches to speech recognition require parallel data: audio recordings with corresponding transcriptions. For fine-tuning, we settled on open parallel corpora; the datasets and their sampling proportions are listed below.

Our goal was to create a multi-domain model that could recognize both an audiobook and a request to a virtual assistant. To adapt GigaAM to multiple domains at once, we used weighted sampling of training examples. Each batch of training data contained examples from 4 datasets in the following proportions:

{
	"golos": 0.6,
	"sova": 0.2,
	"common_voice": 0.1,
	"librispeech": 0.1
}
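
A minimal sketch of this weighted sampling; the dataset contents are placeholders, and NeMo has its own mechanisms for dataset mixing, so this only illustrates the idea:

import random

dataset_weights = {"golos": 0.6, "sova": 0.2, "common_voice": 0.1, "librispeech": 0.1}
datasets = {name: [f"{name}_utt_{i}" for i in range(1000)] for name in dataset_weights}  # placeholder examples

def sample_batch(batch_size: int = 10):
    names = list(dataset_weights)
    weights = [dataset_weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        # The source dataset of every example is drawn according to the mixing proportions.
        source = random.choices(names, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[source]))
    return batch

print(sample_batch())  # on average: 60% golos, 20% sova, 10% common_voice, 10% librispeech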

For training we used the Nvidia NeMo framework. Fine-tuning on 64 Nvidia V100 GPUs with mixed precision and Distributed Data Parallel took about three days.

Detailed training configuration
trainer:
  gpus: 8
  num_nodes: 8
  accelerator: gpu
  strategy: ddp
  max_steps: 100000
  gradient_clip_val: 20
  precision: 16
  sync_batchnorm: true
  accumulate_grad_batches: 4
model:
  train_ds:
    batch_size: 10
    max_duration: 25.0
    min_duration: 0.1
  spec_augment:
    _target_: nemo.collections.asr.modules.SpectrogramAugmentation
    freq_masks: 2
    time_masks: 10
    freq_width: 27
    time_width: 0.05
  encoder:
    _target_: nemo.collections.asr.modules.ConformerEncoder
    feat_in: 64
    feat_out: -1
    n_layers: 16
    d_model: 768
    subsampling: striding
    subsampling_factor: 4
    subsampling_conv_channels: 768
    ff_expansion_factor: 4
    self_attention_model: rel_pos
    pos_emb_max_len: 5000
    n_heads: 16
    xscaling: false
    untie_biases: true
    conv_kernel_size: 31
    dropout: 0.1
    dropout_emb: 0.1
    dropout_att: 0.1
  decoder:
    _target_: nemo.collections.asr.modules.ConvASRDecoder
    feat_in: 768
    num_classes: 33
    vocabulary:
    - ' '
    - а
    - б
    - в
    - г
    - д
    - е
    - ж
    - з
    - и
    - й
    - к
    - л
    - м
    - н
    - о
    - п
    - р
    - с
    - т
    - у
    - ф
    - х
    - ц
    - ч
    - ш    
    - щ
    - ъ
    - ы
    - ь
    - э
    - ю
    - я
  optim:
    name: adamw
    lr: 5.0e-05
    betas:
    - 0.9
    - 0.98
    weight_decay: 0.01
    sched:
      name: CosineAnnealing
      warmup_steps: 10000
      warmup_ratio: null
      min_lr: 1.0e-07

As expected, after pretraining the model quickly adapted, reaching WER < 10%. Quality was assessed on a held-out set, a mixture of open datasets validated internally by assessors.

To further improve quality, we used a semi-supervised learning approach (pseudo-labeling): transcriptions were generated for ≈ 5700 hours of speech with the recognition-adapted GigaAM-CTC. The model was then further trained on the following data mixture:

  • supervised (≈ 2000 hours), weight=0.85;

  • semi-supervised (≈ 5700 hours), weight=0.15.

This approach improved quality on a representative validation set: WER 8.0% → 7.5%. It is also worth noting that augmentation with semi-supervised data can be useful under domain shift, for example when the task is to recognize podcasts.

For the final quality assessment, we used 7 test slices, including OpenSTT datasets that did not participate in training. Each slice consists of short Russian audio recordings (up to 25 seconds) with reference transcriptions. We used this many slices to evaluate the model across a wide range of applications: from audiobook recognition (Russian LibriSpeech) to smart-speaker queries from a distance of several meters (Golos Farfield).

Table 1 presents the results of comparing GigaAM-CTC with popular open-source models in terms of Word Error Rate.

On average, our model makes 20% fewer errors than NeMo Conformer-RNNT and 37% fewer than Whisper-large-v3. It is worth noting that both NeMo Conformer and Whisper contain a more expressive decoder, which adds language information to the model and allows it to better learn the alignment between audio and text. Thus, fine-tuning GigaAM with an RNNT or Transformer decoder could further improve recognition quality.

SaluteSpeech App

The comparison above covers open models. Even better recognition across a wide range of domains is available in the SaluteSpeech API. For example, call centers can be automated on top of our technologies.

At the same time, SaluteSpeech technologies can be useful not only in business but also in everyday tasks: recognizing meeting recordings, drawing up meeting minutes, and transcribing lectures in educational institutions.

To address these tasks, we built the SaluteSpeech App, available for Windows and macOS. With it, you can transcribe any audio/video files, summarize a meeting, and highlight key points.

Note that after registration a freemium mode is available to all users, so you can test the application and evaluate how well the technologies fit your tasks.

Fine-tuning for the emotion recognition task

Another application of a pre-trained audio encoder is recognizing the speaker's emotions. For fine-tuning in this scenario, we settled on the Dusha dataset, to date the largest dataset for emotion classification. It consists of short speech recordings (up to 20 seconds). Some of the audio was obtained with the help of voice actors (Crowd domain); the rest was collected from various podcasts (Podcast domain). Each audio recording is labeled with one of four speaker emotions: anger, sadness, neutral, or happiness.

Emotion recognition is characterized by class imbalance: most of the time the speaker talks calmly, without expressing any emotion. This effect appears in all datasets except those collected in artificial conditions (for example, Dusha Crowd). To assess model quality correctly, taking the specifics of the task into account, we chose the following metrics (a sketch of computing them follows the list):

  1. Unweighted Accuracy is the proportion of answers correctly predicted by the model.

  2. Weighted Accuracy is the arithmetic mean of the proportion of correct answers for each class.

  3. Macro F1 is the arithmetic mean of the per-class F1 scores.
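
With the definitions above, all three metrics map directly onto standard scikit-learn calls; a sketch on toy labels:

from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

y_true = ["neutral", "neutral", "neutral", "anger", "sadness", "happiness"]
y_pred = ["neutral", "neutral", "anger",   "anger", "sadness", "neutral"]

unweighted_accuracy = accuracy_score(y_true, y_pred)         # share of correct predictions
weighted_accuracy = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
macro_f1 = f1_score(y_true, y_pred, average="macro")         # mean of per-class F1 scores

print(unweighted_accuracy, weighted_accuracy, macro_f1)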

To solve this task, we added a classification module on top of the pre-trained audio encoder. Embeddings at the encoder output are averaged along the time axis and projected to the number of target classes. We deliberately chose the simplest possible classification module to demonstrate the quality of the encoder representations as clearly as possible.
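
A sketch of such a classification head; the encoder itself is stubbed out, the hidden size matches the d_model of the configuration above, and the four classes correspond to the four emotions:

import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    def __init__(self, encoder_dim: int = 768, num_emotions: int = 4):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, num_emotions)

    def forward(self, encoder_outputs: torch.Tensor) -> torch.Tensor:
        # encoder_outputs: (batch, frames, encoder_dim) from the pre-trained audio encoder.
        pooled = encoder_outputs.mean(dim=1)   # average embeddings along the time axis
        return self.proj(pooled)               # (batch, num_emotions) class logits

logits = EmotionHead()(torch.randn(2, 100, 768))  # stubbed encoder outputs for 2 utterances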

Some training details:

  • No layers of the model were frozen during training.

  • Batches were formed so that, on average, each class contributed the same number of examples per batch.

  • Cross-entropy was used as the loss function.

  • The training took one day on one Nvidia A100 video card.

Within a few epochs, the model converged to its best metrics, which are presented in Table 2.

Table 2. Emotion recognition quality on the Dusha dataset (Crowd and Podcast domains).

                                                Crowd                                        Podcast
Model                                           Unweighted Acc.  Weighted Acc.  Macro F1     Unweighted Acc.  Weighted Acc.  Macro F1
DUSHA baseline (MobileNetV2 + Self-Attention)   0.83             0.76           0.77         0.89             0.53           0.54
ABK (TIM-Net)                                   0.84             0.77           0.78         0.90             0.50           0.55
GigaAM-Emo                                      0.90             0.87           0.84         0.90             0.76           0.67

Conclusion

In this article we presented a family of models for speech processing:

  • GigaAM is pre-trained on diverse Russian speech and can be quickly adapted to different tasks (speech/emotion/speaker recognition) and domains (call center, podcasts, farfield).

  • GigaAM-CTC makes 20–37% fewer word errors on short Russian-language queries than popular solutions such as NeMo-Conformer-RNNT and Whisper-large-v3.

  • GigaAM-Emo shows that GigaAM can be adapted beyond speech recognition: during fine-tuning, the model quickly picks up the signal patterns responsible for the speaker's emotional state.

We hope that the open-source models and the described fine-tuning approaches will accelerate progress in speech technologies and prove useful for dissertations and research papers.

ML specialists from the SberDevices speech technologies department worked on the project and article: Alexander Maksimenko, Nikita Koryagin, Georgy Gospodinov, Pavel Bogomolov. We express special gratitude to Oleg Kutuzov and Alexander Glagolev for their invaluable assistance in collecting data.

If you would like to work with us on improving speech recognition, from smart speakers to high-load call centers, building a single speech model for several tasks, quantizing models to less than a byte per parameter, and solving non-trivial NLP problems in summarization and search over long dialogues, feel free to write to Yulia!

If you would like to dive deeper into the challenges of speech technologies in general and self-supervised / semi-supervised approaches in particular, we invite you to the course from our team!

We also invite you to the Salute AI Telegram channel, where SberDevices ML specialists share their developments in NLP, CV, Speech, and other areas!
