How to improve the quality of WebRTC calls using the example of the VK Calls service

Hello! I'm Ivan Shafran, I've been working with WebRTC for 4 years now, and I lead the Android calls team at VKontakte. In this article, using VK Calls as an example, I will tell you what can be done to improve the quality of an audio and video communication service. We'll discuss the advantages and disadvantages of WebRTC, look at how to work with audio, video and screen sharing, and go over the options for collecting statistics.

The article was written based on the report “VK Calls: Raising the bar for the quality of WebRTC calls” at Mobius Spring 2024.

What features does VK Calls have?

VK Calls is used by about 20 million people a month. There are no restrictions on the number of participants or the duration of a meeting. The service features joint video viewing, 4K screen sharing, animated avatars, smart noise suppression and AR background replacement. There is also a standard set of tools: broadcasts, scheduling, administration and call recording.

But all these features are meaningless if you can’t hear someone on the call, the video is slow, and the screen sharing is blurry. Let's look at a few techniques that will help avoid this.

Users who experienced poor call quality

WebRTC

Our calls use WebRTC technology. It has been part of Google's Chromium project since 2011. Later it appeared in Yandex Browser, Google Chrome, Microsoft Edge, Safari and Mozilla Firefox.

The WebRTC SDK is available on iOS, Android and desktop operating systems. If desired, the tool can even be built into a refrigerator or robot vacuum cleaner.

WebRTC is a real-time communication technology. It is an open-source framework that provides a set of basic building blocks from which you can assemble a working service yourself. However, to achieve high quality, the product will have to be refined.

Quality and optimization

Let's look at several practical techniques for optimizing audio and video, as well as screen sharing mode. We'll start with the simple ones and move on to more complex tips so you can find something useful for any project.

Audio

Audio is the one thing without which a call cannot happen. Users are willing to tolerate a temporary loss of video, but not of sound.

Speaker highlight

Look at the image below. How can you tell if someone is talking on a call? Highlighting the speaker on the screen helps with this.

A standard technique is to outline the speakers: for example, in green, as in the image below.

WebRTC exposes a set of technical characteristics of the connection that let you find out the volume level of each audio stream and identify the speaker. The RTCAudioSourceStats class and its audioLevel field are used for this.

The graph shows that the volume of an audio stream is not a binary attribute, and most often its level is above zero due to background noise.

You can handle this by tracking the average volume level over the course of the call – for example, with a moving average that adapts to changing conditions. In practice this works well, but you can also plug in the Voice Activity Detector from the WebRTC distribution. By default it runs in the output part of the audio pipeline, but it can also be applied to input. In group calls it is more efficient to compute this information on the backend to reduce the load on clients.
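
As a rough illustration, here is a minimal sketch (the class and thresholds are ours, not VK's production code) of reading audioLevel via getStats() in the Android SDK and smoothing it with an exponential moving average:

```kotlin
import org.webrtc.PeerConnection
import org.webrtc.RTCStatsReport

// Poll getStats() periodically, read audioLevel from media-source / inbound-rtp
// entries and smooth it with a moving average so background noise alone does not
// trigger the speaker highlight.
class SpeakerDetector(private val peerConnection: PeerConnection) {
    private var smoothedLevel = 0.0
    private val alpha = 0.2   // smoothing factor, tune experimentally
    private val margin = 0.05 // how far above the average counts as "speaking"

    fun poll(onSpeaking: (Boolean) -> Unit) {
        peerConnection.getStats { report: RTCStatsReport ->
            val level = report.statsMap.values
                .filter { it.type == "media-source" || it.type == "inbound-rtp" }
                .mapNotNull { it.members["audioLevel"] as? Double }
                .maxOrNull() ?: return@getStats

            smoothedLevel = alpha * level + (1 - alpha) * smoothedLevel
            onSpeaking(level > smoothedLevel + margin)
        }
    }
}
```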

The next UX task is to make it clear to the user that they are speaking with the microphone turned off. You can do this with the approach described above, and it works because muting the microphone does not stop capture in the calling application – it only stops sending the data. This also speeds up unmuting: otherwise the start of speech would be eaten by the initialization delay.

Call Permissions on Android

The following permissions may be required for calls to work:

• access to audio;

• access to video;

• screen sharing;

• access to device memory;

• access to connected Bluetooth devices.

All permissions except audio can be requested as they become needed, so the user doesn't have to answer a pile of prompts when starting a call. But if we requested audio access only when the microphone is turned on with the button, the other side would start losing the connection.

This is because we use the audio channel to identify network problems. If the audio data is not coming through, something is wrong.

To fix the problem, we:

• made a fork of WebRTC;

• found the org.webrtc.audio.WebRtcAudioRecord class;

• began sending silence until permission to use the microphone was granted.
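
A simplified illustration of the idea (the real class is Java inside the WebRTC fork; the helper below is a hypothetical Kotlin equivalent, not the actual fork code):

```kotlin
import android.media.AudioRecord
import java.nio.ByteBuffer

// Until RECORD_AUDIO is granted, feed zero-filled buffers into the pipeline instead
// of reading from AudioRecord, so the audio channel keeps flowing and the other side
// does not interpret the missing audio as a connection problem.
fun readAudioFrame(buffer: ByteBuffer, record: AudioRecord?, hasMicPermission: Boolean): Int {
    return if (hasMicPermission && record != null) {
        record.read(buffer, buffer.capacity())
    } else {
        buffer.clear()
        buffer.put(ByteArray(buffer.capacity())) // silence of the expected frame size
        buffer.capacity()
    }
}
```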

Audio codecs

Audio codecs in WebRTC are easy to deal with. There is a free, open codec, Opus, which works well across different bitrates.

Opus has a mechanism for protecting against packet loss – FEC, or Forward Error Correction. It works by attaching fragments of previous packets to outgoing audio packets, so if something is lost, the audio can be restored.

This is a useful mechanism, but standard FEC has a number of disadvantages:

• it kicks in only after losses have already occurred;

• it limits the share of redundant data within a packet;

• the redundancy is counted as part of the bitrate, which makes configuration harder.
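
For reference, in-band FEC itself is controlled by an fmtp parameter of the Opus payload in the SDP (useinbandfec, RFC 7587). Here is a minimal sketch of switching it on by adjusting the local SDP string before passing it to setLocalDescription – the helper is ours and the payload-type lookup is simplified:

```kotlin
// Append useinbandfec=1 to the Opus fmtp line of an SDP offer/answer.
fun enableOpusInbandFec(sdp: String): String {
    // Find the Opus payload type from its rtpmap line, e.g. "a=rtpmap:111 opus/48000/2".
    val opusPt = Regex("a=rtpmap:(\\d+) opus/48000").find(sdp)?.groupValues?.get(1) ?: return sdp
    return sdp.lines().joinToString("\r\n") { line ->
        if (line.startsWith("a=fmtp:$opusPt ") && !line.contains("useinbandfec")) {
            "$line;useinbandfec=1"
        } else {
            line
        }
    }
}
```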

As an alternative to FEC, you can try the RED mechanism – REDundant coding.

Its operating principle is similar to FEC, but RED improves audio quality by attaching entire previous packets to outgoing audio packets – the extra traffic is small compared to video and to modern network capacity. RED performs especially well on networks with a large share of packet loss.

The graph above shows the number of synthesized audio packets: WebRTC synthesizes packets when losses occur to mask noticeable gaps in speech.

During A/B testing we saw a significant drop in the number of synthesized audio packets, while the other metrics barely changed.

Video

Video is the second most important communication channel between users.

To reduce the amount of transferred data, video codecs encode the differences between frames, and this puts load on the device – that's why phones often get warm during video calls. Let's look at several ways to deal with this.

Basic settings

There are more codecs for video than for audio. The most common and reliable is the outdated H.264. It is supported on most devices.

Another basic setting worth paying attention to is resolution. For desktops and mobile devices, 720p (HD) or lower is usually sufficient. Lowering the resolution lightens the entire pipeline – from capturing frames with the camera to sending them over the network. Don't neglect this setting if your service's usage scenario allows it.
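
In the Android SDK this is essentially one call on the capturer; a minimal sketch:

```kotlin
import org.webrtc.VideoCapturer

// Capping capture at 720p/30 fps lightens the whole pipeline, from grabbing frames
// to encoding and sending them over the network. The numbers are illustrative.
fun startCaptureAt720p(capturer: VideoCapturer) {
    capturer.startCapture(1280, 720, 30)
}
```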

Experimenting with codecs

If the channel has enough bandwidth, the difference between codecs is hardly noticeable, since we don't show video in high resolution anyway. It is on poor connections that newer codecs win: they compress video better while losing only a little quality.

Newer popular codecs have better transmission quality. They also compress better on poor networks. However, their novelty has a disadvantage: most Android devices only have their software implementation, which runs on the CPU. Not all devices have hardware support—encoding and decoding on a dedicated chip.

We use the newer codecs in hardware when hardware support is available. When it isn't, we switch to their software implementations only when network bandwidth is poor: the picture is better, but the battery drains faster. The CPU load can be eased further by lowering the frame rate and the resolution – this still looks better than a broken, blocky H.264 frame on a poor network.
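
A rough sketch of that availability check on Android, assuming we pick VP9 only when a hardware encoder exists (the MIME type and the policy itself are illustrative, not VK's exact logic):

```kotlin
import android.media.MediaCodecList
import android.os.Build

// Returns true if the device advertises a hardware VP9 encoder.
fun hasHardwareVp9Encoder(): Boolean =
    MediaCodecList(MediaCodecList.ALL_CODECS).codecInfos.any { info ->
        info.isEncoder &&
            info.supportedTypes.any { it.equals("video/x-vnd.on2.vp9", ignoreCase = true) } &&
            // isHardwareAccelerated() exists only on Android 10+; older versions need
            // name-based heuristics instead (e.g. excluding "OMX.google." codecs).
            (Build.VERSION.SDK_INT >= Build.VERSION_CODES.Q && info.isHardwareAccelerated)
    }
```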

Problems with large calls

For one-on-one calls, almost all features work out of the box without errors. But once more than six videos appear on the screen, hardware codecs can become a problem: they are implemented on a chip and limit the number of simultaneously running decoders. So after a certain number of videos in a call, you have to fall back to software decoding.

If we go even further, OpenGL becomes the bottleneck: we run out of the so-called contexts that hold the engine state and are needed by the threads performing OpenGL work. Devices typically provide about 30 contexts, and by default two OpenGL contexts are allocated per video – one for decoding and one for drawing. In practice we can show only about 10 videos instead of the theoretical 15, because contexts are released with a delay and some are also used for encoding.

To solve the problem with contexts, you can rewrite the rendering so that it runs on one thread for all videos.

For large calls with 100 or more participants, there is plenty of room for optimization. We analyzed the VK Calls solutions at Mobius; the recording is available via the link.

Screen sharing

Screen sharing is one of the most popular additional call features. Depending on the scenario, the priority may be smooth motion for games and dynamic content, or crisp text and graphics for presentations.

WebRTC on Android already has a screen-sharing class, ScreenCapturerAndroid. However, it performs mediocrely in both scenarios: the hardest thing is getting crisp text, and there is no way to transmit an audio track from the phone. Let's discuss options for solving these problems.

Audio transmission

By default, WebRTC itself allocates media channels for microphone, camera and screen sharing. However, there is no channel for audio from the phone. This may be due to the fact that the API for accessing such audio appeared only in Android 10, and WebRTC for Android was implemented long before that.

As with screen sharing, you need to ask the user for permission to capture this audio. The API is available via the link.

After that we work with the AudioRecord class, which provides access to buffers of audio data. We need to build a pipeline around it: request data on a separate thread and hand it over to the media track.
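
A minimal sketch of the capture side, assuming the Android 10 AudioPlaybackCapture API is the one referred to above (sample rate and buffer size are illustrative; RECORD_AUDIO and a valid MediaProjection are required):

```kotlin
import android.media.AudioAttributes
import android.media.AudioFormat
import android.media.AudioPlaybackCaptureConfiguration
import android.media.AudioRecord
import android.media.projection.MediaProjection

// Android 10+ (API 29) only: build an AudioRecord that captures the audio the device
// is playing, through the MediaProjection already obtained for screen sharing.
fun createPlaybackCapture(projection: MediaProjection): AudioRecord {
    val config = AudioPlaybackCaptureConfiguration.Builder(projection)
        .addMatchingUsage(AudioAttributes.USAGE_MEDIA) // music / video playback
        .addMatchingUsage(AudioAttributes.USAGE_GAME)
        .build()

    val format = AudioFormat.Builder()
        .setSampleRate(48000)
        .setEncoding(AudioFormat.ENCODING_PCM_16BIT)
        .setChannelMask(AudioFormat.CHANNEL_IN_STEREO)
        .build()

    return AudioRecord.Builder()
        .setAudioPlaybackCaptureConfig(config)
        .setAudioFormat(format)
        .setBufferSizeInBytes(48000 * 2 * 2) // ~1 second of stereo 16-bit audio
        .build()
}
```

The buffers read from this AudioRecord are then pushed into the audio pipeline on a dedicated thread, as described above.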

Then we can follow one of two scenarios. The first is to mix the audio from the phone into the microphone's media stream. This solution requires no changes on other clients. The downside is the noise suppressor, which may cut the music out as noise. You also have to keep sending data when the microphone is muted.

The second way is to allocate a separate media channel and support its reception on all clients. For group calls on VK Video, we mix all the audio on the server and get one audio stream.

Vision test

With standard screen sharing we often end up with blurry text. When slides change, the picture takes time to stabilize, and on a low-bandwidth network we can't read anything at all. Let's figure out what can be done about it.

Hints for WebRTC

The WebRTC framework provides hints for cases where image clarity matters. With ContentHint we can signal that we expect to show text by setting it to TEXT. With DegradationPreference we can ask to preserve resolution – MAINTAIN_RESOLUTION – and sacrifice frame rate instead.
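
A hedged sketch of applying the second hint with the Android SDK – this assumes a libwebrtc build that exposes degradationPreference in RtpParameters; availability of both hints varies between versions, and the helper name is ours:

```kotlin
import org.webrtc.RtpParameters
import org.webrtc.RtpSender

// Ask the sender to keep resolution and drop frame rate instead when bandwidth is
// tight. The content hint ("text") is set on the video track in builds that expose it.
fun preferSharpScreenShare(sender: RtpSender) {
    val params = sender.parameters
    params.degradationPreference = RtpParameters.DegradationPreference.MAINTAIN_RESOLUTION
    sender.setParameters(params)
}
```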

Keep in mind that hints do not guarantee that a presentation with text will reach the required clarity on screen.

Your own version of screen sharing

If we are not satisfied with the result of the screen sharing, we can write our own version. The hardest part will be taken care of by ready-made libraries.

The solution will have two subsystems. One will send frames, the second will receive and display. The full data path looks like this:

• we receive a frame from the camera (this part works within the old logic);

• encode the frame using a codec;

• cut the encoded frame into packets for sending;

• send the packets;

• the other side receives the packets;

• the encoded frame is reassembled from these packets;

• a codec is used for decoding;

• the frame is handed to the processing pipeline;

• the data is displayed on the screen.

To encode and decode frames we use the MediaCodec class, asking it for a VP9 codec. MediaCodec has two kinds of API: synchronous and asynchronous. Either works, but with the asynchronous API it is easier to tell whether the codec keeps up with the work – it hands out input and output buffers itself when it is ready.
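
A minimal sketch of the asynchronous path with MediaCodec (the format parameters are illustrative, and a real implementation also needs error handling and a matching decoder on the receiving side):

```kotlin
import android.media.MediaCodec
import android.media.MediaCodecInfo
import android.media.MediaFormat

// Create a VP9 encoder in asynchronous mode: the codec itself reports when input and
// output buffers are available, which makes it easy to notice when it stops keeping up.
fun createVp9Encoder(onEncodedFrame: (ByteArray) -> Unit): MediaCodec {
    val format = MediaFormat.createVideoFormat(MediaFormat.MIMETYPE_VIDEO_VP9, 1280, 720).apply {
        setInteger(MediaFormat.KEY_BIT_RATE, 1_500_000)
        setInteger(MediaFormat.KEY_FRAME_RATE, 15)
        setInteger(MediaFormat.KEY_I_FRAME_INTERVAL, 2)
        setInteger(MediaFormat.KEY_COLOR_FORMAT,
            MediaCodecInfo.CodecCapabilities.COLOR_FormatYUV420Flexible)
    }
    val codec = MediaCodec.createEncoderByType(MediaFormat.MIMETYPE_VIDEO_VP9)
    // In asynchronous mode the callback must be set before configure().
    codec.setCallback(object : MediaCodec.Callback() {
        override fun onInputBufferAvailable(codec: MediaCodec, index: Int) {
            // Copy the next raw YUV frame into codec.getInputBuffer(index)
            // and submit it with codec.queueInputBuffer(...).
        }
        override fun onOutputBufferAvailable(codec: MediaCodec, index: Int, info: MediaCodec.BufferInfo) {
            val buffer = codec.getOutputBuffer(index) ?: return
            buffer.position(info.offset)
            val bytes = ByteArray(info.size).also { buffer.get(it) }
            onEncodedFrame(bytes) // hand the encoded frame to the packetizer
            codec.releaseOutputBuffer(index, false)
        }
        override fun onError(codec: MediaCodec, e: MediaCodec.CodecException) { /* report and restart */ }
        override fun onOutputFormatChanged(codec: MediaCodec, format: MediaFormat) { /* not needed here */ }
    })
    codec.configure(format, null, null, MediaCodec.CONFIGURE_FLAG_ENCODE)
    return codec
}
```

After configure() the caller still has to call start(); the decoder side is built the same way with createDecoderByType().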

To transmit the packets we use WebRTC's DataChannel, since media tracks are not available in this custom transmission path. The data channel can be thought of as something like WebSocket, but with the ability to transfer binary data and configure the sending queue.

The codec handles encoding and decoding; the data channel handles transmission. What remains for the developer is to cut the data into packets and reassemble them on the receiving side, and to handle congestion control ourselves when we cannot keep up because of heavy load or low network bandwidth. This is both a challenge and an advantage: unlike a standard media track, we don't have to reduce the resolution until the text becomes pixelated – we reduce the frame rate first.
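
A hedged sketch of that transport layer (the channel label, chunk size and back-off threshold are illustrative; a real implementation also needs frame headers so the receiver can reassemble and order the chunks):

```kotlin
import org.webrtc.DataChannel
import org.webrtc.PeerConnection
import java.nio.ByteBuffer

const val CHUNK_SIZE = 16 * 1024
const val MAX_BUFFERED = 1 * 1024 * 1024

// Create an ordered DataChannel dedicated to screen-share frames.
fun createScreenShareChannel(pc: PeerConnection): DataChannel =
    pc.createDataChannel("screen-share", DataChannel.Init().apply { ordered = true })

// Cut an encoded frame into chunks and use bufferedAmount() as a crude congestion
// signal: if too much data is queued, skip the frame and let the caller lower the fps.
fun sendFrame(channel: DataChannel, encodedFrame: ByteArray): Boolean {
    if (channel.bufferedAmount() > MAX_BUFFERED) return false
    var offset = 0
    while (offset < encodedFrame.size) {
        val len = minOf(CHUNK_SIZE, encodedFrame.size - offset)
        channel.send(DataChannel.Buffer(ByteBuffer.wrap(encodedFrame, offset, len), true))
        offset += len
    }
    return true
}
```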

As a result, we will get clear text on the video.

Statistics

I recommend measuring all experiments with audio, video or screen sharing. Without measurements, we will not know whether the quality of calls has improved or worsened.

Local debugging

For local debugging you can use WebRTC logs. A nice bonus is the built-in debugging page in Chromium-based browsers: in Google Chrome, go to chrome://webrtc-internals and launch VK Calls alongside it. To test, create an empty call and then open its link in an incognito window. The call will then have two participants, a connection will be established, and graphs of the connection parameters will appear on the debugging page.

Measurements on a large audience

WebRTC offers two getStats methods for reading media track metrics.

The first returns a set of ready-to-use metrics, although the exact set may vary between browsers (for Android this is not a problem). However, the method is marked as deprecated and will be removed in future versions, so it should be avoided.

The second getStats method is standardized, although in practice not everything is implemented in strict accordance with the documentation; the current description of the fields can be found via the link. Unlike the old method, it returns "raw" data that you have to combine into meaningful metrics yourself. You also need to be careful with data types: values are keyed by string names and can arrive as Int, Long, Float or even BigInteger, depending on the data sent.
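
For example, here is a small sketch (our own helper) of turning the raw inbound-rtp counters into an audio loss rate while normalizing the numeric types:

```kotlin
import org.webrtc.PeerConnection

// Combine raw packetsReceived / packetsLost counters into a loss rate. Values may
// arrive as Integer, Long, BigInteger or Double, so normalize them through Number.
fun collectAudioLossRate(pc: PeerConnection, onResult: (Double) -> Unit) {
    pc.getStats { report ->
        val inboundAudio = report.statsMap.values.firstOrNull {
            it.type == "inbound-rtp" && it.members["kind"] == "audio"
        } ?: return@getStats
        val received = (inboundAudio.members["packetsReceived"] as? Number)?.toDouble() ?: 0.0
        val lost = (inboundAudio.members["packetsLost"] as? Number)?.toDouble() ?: 0.0
        onResult(if (received + lost > 0) lost / (received + lost) else 0.0)
    }
}
```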

Conclusions

WebRTC provides basic functionality that works well out of the box. We can improve the performance of individual components using simple tweaks. If advanced features are important, you will have to write your own solutions.
