Comparison of Russian speech recognition systems, 2024

After a significant pause, it is time once again to update our research (the previous one and the one before that) on the quality of Russian speech recognition systems. Once again, we did not think we would get around to it, and the results surprised us.

This time the situation is like this:

  • This time we did not test Tinkoff and VK;

  • We did not test any foreign vendors, for obvious reasons;

  • This time, a “noise” domain was added to the validation list: we check what percentage of a dataset of assorted noises each model “ignores” rather than producing supposedly “recognized” speech;

  • This time we added to the comparison not only closed proprietary systems, but also the so-called open “foundation model” from Sber – GIGA AM;

  • We did not test the various publicly available recurrent models, since, in our opinion, they are either not sophisticated enough or their small versions are of too low quality;

  • Testing of our models, Sber, and Yandex was carried out quite recently, in early October;

  • This time, for a fair comparison, we present two of our models: a fast GPU model and a slightly slower GPU model.

Methodology Changes

The methodology has not changed; we only added a noise dataset to test how the different systems react to noise. We did not touch the annotation, even where it is imperfect, to keep the results comparable with previous studies.

This time there were almost no problems, except that for Sber we again had to send the audio to the streaming interface, because the regular interface lacked the necessary flags.

In short, we compare only denormalized text (letters only), without the letter ё, because we are comparing recognition, not how well text-normalization rules match the pattern in the ground truth.
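The exact preprocessing code is not shown in the article; conceptually it comes down to something like this minimal Python sketch (the function name and the regex are illustrative assumptions, not the authors' actual pipeline):

```python
import re

def denormalize(text: str) -> str:
    """Lowercase, merge ё into е, and keep only Cyrillic letters and spaces,
    so that only the recognized words themselves are compared."""
    text = text.lower().replace("ё", "е")
    text = re.sub(r"[^а-я ]", " ", text)   # drop digits, punctuation, Latin letters
    return re.sub(r"\s+", " ", text).strip()

print(denormalize("Ёлка, дом 5!"))  # -> "елка дом"
```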

WER is calculated over the dataset as a whole, not as the arithmetic mean of per-audio WER values.
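The distinction matters: pooled WER weights every word equally, while averaging per-file WER over-weights short utterances. A sketch of both variants, using the editdistance package purely for illustration (any Levenshtein implementation would do):

```python
import editdistance  # assumption: pip install editdistance

def pooled_wer(refs: list[str], hyps: list[str]) -> float:
    """WER for the whole dataset: total word errors / total reference words."""
    errors = sum(editdistance.eval(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words

def mean_file_wer(refs: list[str], hyps: list[str]) -> float:
    """The variant NOT used here: arithmetic mean of per-file WER values."""
    per_file = [
        editdistance.eval(r.split(), h.split()) / max(len(r.split()), 1)
        for r, h in zip(refs, hyps)
    ]
    return 100.0 * sum(per_file) / len(per_file)
```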

Dry metrics

The main metric is WER (word error rate), expressed as a percentage for clarity. For noise, instead of WER, we simply count the percentage of audio files (by count) for which the model produced anything other than an empty transcript.
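For the noise domain the “metric” is therefore just a false-trigger rate; a hypothetical helper might look like this:

```python
def noise_trigger_rate(hypotheses: list[str]) -> float:
    """Share of noise-only audio files for which a system returned
    anything other than an empty transcript, as a percentage."""
    non_blank = sum(1 for h in hypotheses if h.strip())
    return 100.0 * non_blank / len(hypotheses)

print(noise_trigger_rate(["", "алло", "", ""]))  # -> 25.0
```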

Brief Analysis

I admit, we made this comparison with a certain existential dread, because we can only guess what clusters of thousands or tens of thousands of GPUs stand behind these results and APIs, and convergence and entropy are very cruel things.

Firstly, over the past year and a half, all services have again noticeably improved their metrics.

GIGA AM, as a public non-recurrent model, shows very impressive results, roughly at the level of Sber's paid models from a year or two ago. At the same time, it is fairly clear what it was trained on, yet it shows impressive generalization, primarily in complex domains.

Again, the paid services and our “fast” model (a GPU model, like the paid services) show some convergence of results. At the same time, it is clear that Sber actively uses external data for its paid model as well. Our fast model has historically done poorly on smart-speaker data.

Yandex pleasantly surprised us with its address recognition (the dataset here consists mostly of all sorts of tricky and complex addresses, not just the most frequent ones). This is probably also why Yandex does not provide its speech recognition to other taxi companies. On a sample weighted toward more frequent addresses (taxi), the difference is practically invisible.

The large gap between the Sber and Yandex metrics on the smart-speaker datasets is also striking. In principle it is logical and predictable: each of them has a smart speaker of its own.

Our fast model and Sber have some problems with noise. We used the synchronous API for querying, but in practice there are other ways to suppress “noise”.

The biggest surprise for us was the metrics of our highest-quality model. Naturally, it is not recurrent, but it was a pleasant surprise. We literally did not know what it was capable of until the last moment and only recently rolled it out into a demo.

Funny pictures

While generating the cover for the article, a number of funny pictures came out. They are under the spoiler. Try to guess what I mixed with what. For the title I kept the picture with headphones to make it more obvious.

