A hammer drill is no obstacle: enjoying “clean” sound with the SaluteJazz denoiser

Imagine the situation: you work remotely and have to join an important video conference, and at that very moment your cat decides to race around the apartment, knocking over everything in its path. A child starts crying in the next room, and the neighbor chooses this moment to drill a few holes for a new shelf. To avert disaster, we built our own solution that keeps your interlocutors from noticing anything.

My name is Artem Sokolov. My colleagues and I work on improving sound in the SaluteSpeech team. At SberDevices we build and develop a whole line of B2B solutions, from speech services to video conferencing, and we strive to use our own technologies in all our products.

One of our flagship products is the video conferencing service SaluteJazz. It is aimed primarily at business communications, which require high-quality sound without extraneous noise.

With hybrid work now common in many companies, and modern life fairly dynamic in general, it is not always possible to call colleagues or partners from a perfectly quiet room, so a situation like the one described at the beginning of the article can easily occur. For this purpose, SaluteJazz has a “noise suppressor” that removes extraneous sounds.

The first noise reduction solution in SaluteJazz came from a third-party company and, as you can easily guess, it did not involve running a model on the device. As mentioned above, it is important for us to use solutions we build ourselves, so in a fairly short time we prepared and integrated our own “noise suppressor” (also known as a denoiser). This article is about it.

Introduction

The noise suppression task itself is simple to state. You need to remove, as effectively as possible, all additive noise: cars on the street, background conversations in a coffee shop or restaurant, noisy children or pets, a neighbor’s hammer drill upstairs, and so on. At the same time, the speaker’s speech must stay undamaged, or be damaged only minimally (imperceptibly to the human ear).

A digital signal is a sequence of numbers that describe the amplitude of the vibrations of a microphone’s membrane. If you plot them, you get a waveform (oscillogram) similar to this one.

Such plots are also called time-domain representations because they show how the signal fluctuates over time.

In the time domain, the noise removal problem is formulated mathematically as follows:
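
A standard way to write this down, assuming the additive noise model described above, is sketched here. The symbols $y$, $s$, $n$ and the denoising function $f_\theta$ are notation chosen for illustration and are not necessarily the authors' own:

\[
y[t] = s[t] + n[t], \qquad t = 1, \dots, T,
\]

where $y$ is the recorded signal, $s$ is the clean speech and $n$ is the noise. The denoiser is then a function $f_\theta$ that produces an estimate of the clean speech from the noisy recording,

\[
\hat{s} = f_\theta(y), \qquad \hat{s}[t] \approx s[t],
\]

with the estimate required to be as close as possible to $s$ under some chosen measure of distortion.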

Using a Python script, we fed the audio dataset into a virtual input channel and captured the cleaned audio from a virtual output channel. For cleaning we used a trial version of Krisp for macOS, where, according to “Performant Real Time Audio ML in the Browser”, the model can occupy 48 MB or 5.6 MB. The solution operates at a sampling rate of 32 kHz, so audio with a higher sampling rate is presumably resampled automatically.

The application uses either model depending on the desired level of cleaning, the CPU load and the platform; the web version uses the light model. We know nothing about the mobile version of Krisp, but it is logical to assume that it also uses the lightweight model. Our solution consists of three models depending on the platform: 20 MB, 10 MB and 6.7 MB. The main sampling rate is 24 kHz; higher rates are supported through resampling. The difference in quality between 24 kHz and 32 kHz is barely noticeable to the average video conferencing user but, to be fair, it is there.

Thus, the line of sound cleaning solutions can be compared as follows:

Model	Desktop, MB	Mobile, MB	Web, MB
Krisp, 32 kHz	48 / 5.6	5.6 (assumed)	5.6
Jazz, 24 kHz	20 / 10	10	6.7

Our test data covers the following cases:

Recordings with low and medium noise (signal-to-noise ratio (SNR) from 20 to 5 dB);
Recordings with high noise levels (SNR from 4 to -15 dB);
Recordings with noise on top of reverberation (echoes of sound reflected from the walls, as when you use a laptop in a conference room);
English recordings from a DNS test dataset with low to moderate noise.
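
To make the SNR ranges above concrete, here is a minimal sketch of how SNR in decibels is defined and how a noise recording can be scaled to reach a target SNR before being added to speech. The article does not say how the test recordings were assembled, so the function names, the synthetic sine "speech" and the Gaussian noise are all assumptions made purely for illustration.

```python
import math
import random


def power(signal):
    """Mean power (average squared amplitude) of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that the speech-to-noise ratio equals the target,
    then add it to the speech. Returns (mixture, scaled_noise)."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Hypothetical stand-ins for real recordings: one second at 24 kHz.
rng = random.Random(0)
sample_rate = 24_000
speech = [math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

for target in (20, 5, -15):
    mixture, scaled = mix_at_snr(speech, noise, target)
    print(f"target {target:>4} dB, measured {snr_db(speech, scaled):.2f} dB")
```

At 20 dB the speech carries 100 times more power than the noise; at -15 dB the noise carries roughly 32 times more power than the speech, which is why the second test set is the hard one.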

We ran our 4 test sets through both Krisp models using the method described above.

The collected data was posted on the Yandex Toloka crowdsourcing platform as a side-by-side (SBS) survey. Listeners voted on each pair of recordings to indicate which one they liked better. We compared results pairwise: our large model against Krisp's large model, and our small model against Krisp's small model. Our mobile solution was compared with both Krisp models.
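
As an illustration of how such SBS votes can be turned into per-model shares, here is a small sketch. The aggregation rule (majority vote per pair, with ties counted as "no preference"), the labels `model_a`, `model_b` and `none`, and the toy votes are all assumptions for the example; they are not the actual Toloka data or the exact rule used in the study.

```python
from collections import Counter

LABELS = ("model_a", "model_b", "none")


def pair_verdict(votes):
    """Majority verdict for one pair of recordings.

    A tie between the two most frequent answers (or an empty vote list)
    is treated as 'none', i.e. no preference.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return "none"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "none"
    return ranked[0][0]


def sbs_shares(votes_per_pair):
    """Share of pairs won by each label, as fractions that sum to 1."""
    verdicts = Counter(pair_verdict(votes) for votes in votes_per_pair)
    total = len(votes_per_pair)
    return {label: verdicts.get(label, 0) / total for label in LABELS}


# Made-up votes for four pairs of recordings (illustration only).
toy_votes = [
    ["model_a", "model_a", "model_b"],
    ["model_b", "model_b", "model_b"],
    ["model_a", "model_b"],
    ["model_a", "none", "model_a"],
]

print(sbs_shares(toy_votes))
```

The three resulting fractions correspond to the three colored segments of an SBS bar: pairs won by the first model, pairs won by the second, and pairs with no preference.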

The comparison results are shown in the diagram, which gives an example comparison of the two large models on the data set with low and medium noise levels. Lilac and orange indicate the share of recording pairs for which listeners preferred our solution and Krisp, respectively. Green indicates the share of pairs where neither model was preferred. Here you can see that the vast majority of respondents prefer the audio cleaned by our solution.

We summarized the results in a table, preserving the SBS color scheme to make the results easier to understand.

The results clearly show that in the vast majority of cases our solution copes better with typical low to medium noise, even on the English data set. As the noise level increases, Krisp and the SaluteJazz noise suppressor show approximately the same result, with a slight advantage for our solution. Unfortunately, our “noise suppressor” copes worse with reverberation. The recordings show that the Jazz denoiser removes the noise but leaves the reverberant conference-room sound intact, whereas Krisp improves that sound. Most interesting of all, Krisp's small model performs better than its large one in reverberant conditions.

Overall, we are very, very pleased with the result of the comparison.

Conclusion

Our solution, with its different model options, runs on-device in SaluteJazz clients on Android, the Web and native applications for Windows, macOS and Linux. We are very careful about quality and did a lot of research before releasing our “noise reduction” in a product for general use, SaluteJazz. The comparison with Krisp gives us confidence: next to a product that is popular worldwide and of proven quality, we look very respectable specifically in noise removal. We still need to learn to handle reverberation better, and we are working in that direction.

Your video conference now has high-quality noise reduction, which means the person on the other side of the screen will hear only what you want to say: the crying child and the hammer drill will stay behind the scenes.

The latest SaluteJazz releases already include a pair of our models: a lightweight, energy-efficient one, and a stronger but slightly more resource-demanding one. In the settings menu you can choose the mode that suits you best (we strongly recommend trying the strong noise reduction mode if your device allows it). Depending on the platform, different versions of our models are used; we described how they were created here.

SaluteJazz is available to all users, so anyone can test our denoiser and the service as a whole.

I look forward to your feedback in the comments. Ask questions and share your opinion of our solution and your experience using it.
