RED: Improving sound quality with redundancy

Back in April 2020 Citizenlab reported rather weak encryption on Zoom and stated that Zoom uses the SILK audio codec. Unfortunately, the article did not contain the initial data to confirm this and give me the opportunity to refer to it in the future. However, thanks to Natalie Silvanovich of Google Project Zero and to the Frida tracing tool, I was able to get a dump of some raw SILK frames. Their analysis inspired me to take a look at how WebRTC handles audio. When it comes to perceived call quality in general, it is the audio quality that affects the most, as we tend to notice even small glitches. Just ten seconds of analysis was enough to set off on a real adventure looking for possible audio quality improvements provided by WebRTC.

I dealt with a native Zoom client back in 2017 (before post DataChannel) and noticed that his audio packages were sometimes very large compared to the packages of solutions based on WebRTC:

The graph above shows the number of packets with a specific UDP payload length. Packets between 150 and 300 bytes are unusual when compared to a typical WebRTC call. They are much longer than the packets we usually get from Opus. We suspected that there was forward error control (FEC) or redundancy, but without access to unencrypted frames, it was difficult to draw further conclusions or do something.

Unencrypted SILK frames in the new dump showed a very similar distribution. After converting the frames to a file and then playing a short message (thanks to Giacomo Vacca for very helpful blog postdescribing the necessary steps) I went back to Wireshark and looked at the packages. Here is a three-package example that I found particularly interesting:

packet 7:
e9e4ab17ad8b9b5176b1659995972ac9b63737f8aa4d83ffc3073d3037b452fe6e1ee
5e6e68e6bcd73adbd59d3d31ea5fdda955cbb7f

packet 8: 
e790ba4908639115e02b457676ea75bfe50727bb1c44144d37f74756f90e1ab926ef
930a3ffc36c6a8e773a780202af790acfbd6a4dff79698ea2d96365271c3dff86ce6396
203453951f00065ec7d26a03420496f

packet 9:
e93997d503c0601e918d1445e5e985d2f57736614e7f1201711760e4772b020212dc
854000ac6a80fb9a5538741ddd2b5159070ebbf79d5d83363be59f10ef
e790ba4908639115e02b457676ea75bfe50727bb1c44144d37f74756f90e1ab926ef
930a3ffc36c6a8e773a780202af790acfbd6a4dff79698ea2d96365271c3dff86ce6396
203453951f00065ec7d26a03420496f
e9e4ab17ad8b9b5176b1659995972ac9b63737f8aa4d83ffc3073d3037b452fe6e1ee
5e6e68e6bcd73adbd59d3d31ea5fdda955cbaef

Package 9 contains two previous packages, package 8 – 1 previous package. This redundancy is caused by the use of the LBRR – Low Bit-Rate Redundancy format, as shown by a deep study of the SILK decoder (which can be found in Internet project presented by the Skype team or in GitHub repos):

Zoom uses SKP_SILK_LBRR_VER1 but with two fallback packages. If each UDP packet contains not only the current audio frame, but also the two previous ones, it will be robust even if you lose two of the three packets. So maybe the key to Zoom sound quality is Grandma’s Skype secret recipe?

Opus FEC

How can I achieve the same with WebRTC? The next obvious step was to consider Opus FEC.

LBRR, Low Rate Reservation, from SILK, is also found in Opus (remember that Opus is a hybrid codec that uses SILK for the lower end of the bitrate range). However, Opus SILK is very different from the original SILK, which was opened by Skype, as is the part of LBRR that is used in error control mode.

In Opus, error control is not simply added after the original audio frame, it precedes it and is encoded in the bitstream. We tried experimenting with adding our own error control using Insertable Streams APIbut this required a complete transcoding to insert the information into the bitstream before the actual packet.

Although the efforts were unsuccessful, they did generate some statistics on the impact of LBRR, which are shown in the figure above. LBRR uses up to 10 kbps (or two-thirds of the data rate) bitrate for high packet loss. The repository is available link… These statistics are not displayed when calling WebRTC getStats() APIso the results were quite interesting.

In addition to the problem with the need for transcoding, Opus FEC in WebRTC, as it turned out, is configured somewhat uselessly:

It only activates on packet loss, and we would like the backup information to be kept until any problems arise. The guys at Slack wrote about this back in 2016… This means that we cannot enable it by default and protect ourselves from accidental irregular losses.
Forward error correction is limited to 25%. It is ineffective above this value.
The bitrate for FEC is subtracted from the target maximum bitrate (see here).

Subtracting the FEC bitrate from the target maximum bitrate does not make sense at all – FEC is actively reducing the bitrate of the main stream. A lower bitrate stream usually results in lower quality. If there is no packet loss that can be corrected with FEC, then FEC will only degrade the quality, not improve it. Why is this happening? The main theory is that congestion is one of the reasons for packet loss. If you are experiencing congestion, you will not want to send more data because that will only make the problem worse. However, as Emil Ivov describes in his great performance at KrankyGeek from 2017congestion is not always the cause of packet loss. In addition, this approach also ignores any accompanying video streams. The congestion-based FEC strategy for Opus audio doesn’t make much sense when you are sending hundreds of kilobits of video alongside a relatively small 50kbps Opus stream. Perhaps in the future we will see some changes in libopus, but for now I would like to try to disable it, because it is currently enabled in WebRTC by default…

We conclude that this does not work …

RED

If we want real redundancy, RTP has a solution called RTP Payload for Redundant Audio Data, or RED. It’s pretty old RFC 2198 was written in 1997… The solution allows multiple RTP payloads with different timestamps to be put into the same RTP packet at a relatively low cost.

Using RED to put one or two redundant audio frames in each packet would be much more robust against packet loss than Opus FEC. But this is only possible by doubling or tripling the audio bitrate from 30 kbps to 60 or 90 kbps (with an additional 10 kbps for the header). Compared to more than 1 megabit of video data per second, though, that’s not too bad.

The WebRTC library included a second encoder and decoder for RED, which is now redundant! Despite attempts remove unused audio-RED-code, I was able to apply this encoder with relatively little effort. Complete solution history available in the WebRTC bug tracking system.

And it is available as a trial that is included when Chrome starts with the following flags:

--force-fieldtrials=WebRTC-Audio-Red-For-Opus/Enabled/

Then RED can be enabled via SDP negotiation, it will display as follows:

a=rtpmap:someid red/48000/2

It is not enabled by default as there are environments where using extra bandwidth is not a good idea. To use RED, change the order of the codecs so that it comes before the Opus codec. This can be done using the API RTCRtpTransceiver.setCodecPreferences, as shown here… Obviously another alternative is to manually change the SDP. The SDP format could also provide a way to configure the maximum level of redundancy, but the RFC 2198 offer-response semantics were not completely clear, so I decided to postpone this for a while.

You can demonstrate how this all works by running in audio example… This is what the early version looks like with one backup package:

By default, the payload bitrate (red line) is almost double that without redundancy at almost 60 kbps. DTX (Discontinuous Transfer) is a bandwidth conserving mechanism that only sends packets when voice is detected. As expected, when using DTX, the effect of the bitrate softens somewhat, as we can see at the end of the call.

Checking the packet length shows the expected result: the packets are, on average, twice as long (taller) compared to the normal distribution of the payload length shown below.

This is still slightly different from what Zoom does, where we saw fractional reservations. Let’s revisit the Zoom packet length graph shown earlier to see a comparison:

Adding Voice Activity Detection (VAD) Support

Opus FEC sends backup data only if there is voice activity in the packet. The same should be applied to the RED implementation. For this, the Opus encoder must be changed to displaying correct VAD informationwhich is defined at the SILK level. With this setting, the bitrate reaches 60 kbps only in the presence of speech (compared to constant 60+ kbps):

and the “spectrum” becomes more like what we saw with Zoom:

Change to achieve thishasn’t appeared yet.

Finding the right distance

Distance is the number of backup packets, that is, the number of previous packets in the current one. In the process of working to find the right distance, we found that if RED at distance 1 is cool, then RED at distance 2 is even cooler. Our laboratory estimate simulated a random packet loss of 60%. In this environment, Opus + RED performed excellent sound, while Opus without RED performed much worse. The WebRTC getStats () API provides a very useful ability to measure this by comparing the percentage of hidden samples that is obtained by dividing concealedSamples on the totalSamplesReceived…

On the audio samples page this data is easily retrieved using this JavaScript snippet pasted into the console:

(await pc2.getReceivers()[0].getStats()).forEach(report => {
  if(report.type === "track") console.log(report.concealmentEvents, report.concealedSamples, report.totalSamplesReceived, report.concealedSamples / report.totalSamplesReceived)})

I ran a couple of packet loss tests using a not-so-famous but very useful flag WebRTCFakeNetworkReceiveLossPercent:

--force-fieldtrials=WebRTC-Audio-Red-For-Opus/Enabled/WebRTCFakeNetworkReceiveLossPercent/20/

At 20% packet loss and FEC enabled by default, there was not much difference in audio quality, but there was a slight difference in metric:

scenario	loss percentage
without red	18%
no red, FEC disabled	20%
red with distance 1	4%
red with distance 2	0.7%

Without RED or FEC, the metric almost matches the requested packet loss. There is an effect of FEC, but it is small.

Without RED, at 60% loss, the sound quality became rather poor, a little metallic, and the words difficult to understand:

scenario	loss percentage
without red	60%
red with distance 1	32%
red with distance 2	18%

There were some audible artifacts at RED with distance = 1, but almost perfect sound with distance 2 (which is the amount of redundancy currently in use).
There is a feeling that the human brain can withstand a certain level of silence that occurs irregularly.

(AND Google Duo appears to use a machine learning algorithmto fill the silence with something).

Measuring performance in the real world

We hope that the inclusion of RED in Opus will improve the sound quality, although in some cases it can make it worse. Emil Ivov volunteered to conduct a couple of listening tests using the POLQA-MOS method. This has been done in the past for Opus, so we have a baseline for comparison.
If the initial tests show promising results, then we will conduct a large-scale experiment on the main Jitsi Meet roll using the percentage loss metrics we used above.

Note that for media servers and SFUs, enabling RED is a little more difficult as the server may need to manage RED relaying to select clients, as if not all clients support RED conferences. Also, some clients may be on a limited bandwidth channel where RED is not required. If the endpoint doesn’t support RED, the SFU can remove unnecessary encoding and send Opus without a wrapper. Likewise, it can implement RED itself and use it when re-sending packets from an endpoint transmitting Opus to an endpoint that supports RED.

Many thanks to Jitsi / 8 × 8 Inc for sponsoring this adventure and the folks at Google who reviewed / provided feedback on the required changes.

And without Natalie Silvanovich, I would have been stuck looking at the encrypted bytes!

RED: Improving sound quality with redundancy

Opus FEC

RED

Adding Voice Activity Detection (VAD) Support

Finding the right distance

Measuring performance in the real world

Now I know how colleagues python look, or How our team maintains relations at a distance during the distance

Four programming mistakes that I only realized when I became a CTO

Once again about the performance of streams in Java

Flutter for Apple TV

“Immediately after the holidays”: seminars, master classes and technology competitions at ITMO University

I classify client-server interaction from A to Kafka

Leave a Reply Cancel reply

Opus FEC

RED

Adding Voice Activity Detection (VAD) Support

Finding the right distance

Measuring performance in the real world

Similar Posts

Leave a Reply Cancel reply