How video conferencing works

In this article I will explain how video conferencing works. I will cover a wide range of topics, but will not go deep into detail, so that the article is understandable to those who have not worked with video communications before. My goal is to give a systematic picture of video conferencing.

The article will be useful to engineers who are just starting to work with video conferencing or are integrating a third-party solution into their company, as well as to technical managers and analysts choosing a video communication system for their organization. At the end of the article there is a checklist with criteria for comparing video conferencing systems. This is the first article in the series; in the next one I will talk about our approach to implementing video conferencing.

Video calls in T-Bank

We actively use video calls to identify clients in sensitive operations. For example, if you want to change the phone number linked to your account or decide to transfer three salaries to a crypto exchange, you will most likely be invited to a video call. It is important for us to make sure that you really are you and that you are acting of your own free will, not sitting in a basement with a bag over your head.

Identification via video call is faster than traditional identification through clarifying questions and is more reliable, which means it saves employees' time and makes the bank safer for customers. There is a well-known case in which the very existence of video calls at T-Bank saved a client's money.

This is what the client sees during identification

How video transmission over a network works

The basis of most modern video conferencing is WebRTC, a protocol for transmitting media over a network that Google released as open source in 2011. WebRTC is supported by all popular browsers, and there are SDKs for Android and iOS. There are other protocols as well – Zoom, for example, wrote its own.

If you do not have the resources to develop your own protocol, WebRTC is essentially the only option.

Suppose there are two devices on a network and we need to display video from the camera of one device on the screen of the other in real time. Because the traffic must be transmitted in real time, we will almost always choose UDP as the transport.

This UDP traffic travels over a heterogeneous and unstable network, which creates problems we will have to solve.

If we were not trying to minimize latency, the problem would be solved differently. Streaming platforms, which value video quality over low latency, use other protocols (for example, HLS or DASH); I will not cover them in this article – that is a topic for a separate university course.

To send traffic over UDP, we need to know the IP address and port of the receiving party, and the receiving party must actually be able to accept UDP packets at that address.

You can find out your IP and port by sending a request to some server on the Internet and asking it to return them. In WebRTC terminology, such a server, together with the protocol used for these requests, is called STUN. If your NAT settings allow it, the IP and port returned in the response will already be able to receive UDP traffic from an arbitrary address on the network.

With symmetric NAT, however, this scheme will not work: you will need a proxy server that receives traffic from the sender and forwards it to the final recipient behind the NAT. In WebRTC such servers are called TURN. The whole mechanism of finding a route for transmitting traffic is called ICE, and an IP-port pair is an ICE candidate. Exchanging ICE candidates between participants of a WebRTC session is left to the application: usually a signaling service is written for this (or a ready-made one is used), most often on top of WebSockets.
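To make this concrete, here is a minimal sketch in TypeScript using the standard browser WebRTC API. The STUN/TURN URLs, credentials and the WebSocket signaling endpoint are placeholders, and the SDP offer/answer exchange is omitted for brevity:

```ts
// Sketch: configuring ICE servers and exchanging ICE candidates via signaling.
// The server URLs and credentials below are placeholders, not a real deployment.
const signaling = new WebSocket("wss://signaling.example.com");

const pc = new RTCPeerConnection({
  iceServers: [
    // STUN: lets the client discover its public IP and port
    { urls: "stun:stun.example.com:3478" },
    // TURN: relays traffic when a direct path (e.g. behind symmetric NAT) is impossible
    { urls: "turn:turn.example.com:3478", username: "user", credential: "secret" },
  ],
});

// Every IP:port pair (ICE candidate) found locally is sent to the remote peer
// through our own signaling channel.
pc.onicecandidate = (event) => {
  if (event.candidate) {
    signaling.send(JSON.stringify({ type: "candidate", candidate: event.candidate }));
  }
};

// Candidates received from the remote peer are added to the connection,
// and ICE picks the best working route.
signaling.onmessage = (msg) => {
  const data = JSON.parse(msg.data);
  if (data.type === "candidate") {
    pc.addIceCandidate(data.candidate);
  }
};
```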

WebRTC operation scheme

After exchanging addresses for receiving traffic, you can transmit video and audio. The quality of the network during real-time communication has a decisive influence on the perceived quality of the picture and sound, so WebRTC has a built-in set of mechanisms to compensate for network problems.

The first network problem is delays in data transfer. In WebRTC they are usually measured via RTT – round trip time. The second problem is packet loss, measured as the percentage of packets that never arrive.

During a call, both problems overlap each other and lead to a jitter effect: packets uniformly sent from the source arrive at the receiving end with an uneven delay and out of order. To compensate for jitter, a jitter buffer is used: packets accumulate in memory for some time before playback, waiting for lagging ones.

Packet loss can also be fought. There is forward error correction (FEC), a redundant-coding approach that improves reliability at the cost of a higher bitrate. The receiving side can re-request missing packets by sending NACKs (negative acknowledgements). If lost packets still cannot be recovered, the receiving side can try to hide the loss using packet loss concealment (PLC) mechanisms.
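All of these metrics can be observed directly from WebRTC. As a rough sketch, using the standard getStats() API and assuming pc is an established RTCPeerConnection (for example the one from the previous sketch):

```ts
// Sketch: reading RTT, jitter and packet loss from WebRTC statistics.
async function logNetworkQuality(pc: RTCPeerConnection): Promise<void> {
  const report = await pc.getStats();
  report.forEach((stat) => {
    // RTT is reported on the active ICE candidate pair.
    if (stat.type === "candidate-pair" && stat.state === "succeeded") {
      console.log("RTT, s:", stat.currentRoundTripTime);
    }
    // Jitter and loss are reported per incoming RTP stream.
    if (stat.type === "inbound-rtp" && stat.kind === "video") {
      console.log("jitter, s:", stat.jitter);
      console.log("packets lost:", stat.packetsLost, "received:", stat.packetsReceived);
    }
  });
}

// Poll every few seconds during the call, e.g.:
// setInterval(() => logNetworkQuality(pc), 5000);
```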

Visual demonstration of the jitter effect

There are also compromises in the world of codecs (algorithms for converting a media stream into bytes and back). On the one hand, you can properly warm up the processor and get a tightly compressed video that can be transmitted without problems even over a limited communication channel; on the other hand, you can spend less effort on encoding and accept a higher bitrate. For example, the modern video codecs VP9 and AV1 compress video better, but spend more processor resources doing so than the more classic H.264. You also need to keep in mind that not every user's device will support the codec you need and that excessive CPU load can lead to freezes on the client. The article on MDN will help you dive deeper into the tradeoffs between codecs.
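If you want to steer this tradeoff yourself, browsers let you reorder the codecs offered during negotiation. A minimal sketch with the standard setCodecPreferences() API, preferring VP9 purely as an example:

```ts
// Sketch: asking the browser to prefer VP9 on a video transceiver,
// while keeping the other codecs as a fallback for peers without VP9 support.
// Must be called before creating the offer/answer.
function preferVp9(pc: RTCPeerConnection): void {
  const transceiver = pc.getTransceivers().find((t) => t.receiver.track.kind === "video");
  const capabilities = RTCRtpReceiver.getCapabilities("video");
  if (!transceiver || !capabilities) return;

  const codecs = [...capabilities.codecs].sort(
    (a, b) => Number(b.mimeType === "video/VP9") - Number(a.mimeType === "video/VP9"),
  );
  transceiver.setCodecPreferences(codecs);
}
```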

Server topologies

WebRTC gives the developer everything necessary to transmit a media stream in a point-to-point mode, but does not offer any single mechanism for building large video conferences. If there are more than two participants in a call, there are several topology options:

  • A mesh, in which every participant is directly connected to every other. Its only advantage is low latency; the main disadvantage is that the traffic sent and received by each client grows linearly with the number of participants, which is why in practice call quality starts to degrade somewhere around the eighth connected user (a quick traffic estimate is sketched after this list). Use of this topology is usually limited to one-on-one calls.

  • MCU, in which each participant sends their streams to the server, where the audio streams of all participants are mixed into one and all the video streams are composited into a single grid, so that each participant receives one audio stream and one video stream.

    The tradeoff here is simple: at the cost of intensive load on the server CPU, every call participant saves traffic and processor time. Remember that each stream must not only be received but also decoded ☝️

  • SFU, where the server simply forwards streams: a participant sends their streams to the server and subscribes to the streams of those other participants they need.
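To make the mesh-versus-server tradeoff concrete, here is the back-of-the-envelope estimate promised above (illustrative only) of how many streams each client handles in each topology:

```ts
// Back-of-the-envelope stream counts per client for n participants.
// Every stream must also be encoded/decoded, so these numbers translate
// directly into bandwidth and CPU on the client.
function streamsPerClient(n: number) {
  return {
    // Mesh: send your stream to everyone, receive from everyone.
    mesh: { upload: n - 1, download: n - 1 },
    // SFU: send one stream up, receive one per other participant.
    sfu: { upload: 1, download: n - 1 },
    // MCU: one stream up, one mixed/composited stream down.
    mcu: { upload: 1, download: 1 },
  };
}

console.log(streamsPerClient(8));
// mesh: { upload: 7, download: 7 } – why quality degrades around the 8th user
// sfu:  { upload: 1, download: 7 }
// mcu:  { upload: 1, download: 1 }
```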

Server topologies

In practice, the best opportunities for horizontal scaling come from a symbiosis of SFU and MCU: SFU for video, MCU for audio. It follows from the definition of SFU that this topology is fundamentally suited to horizontal scaling: if all the streams in a call do not fit on one server, some participants start publishing their streams to a neighboring one, and nothing changes from the point of view of receiving streams on the client. The client still requests the videos of the participants it needs, but some of them arrive from another server.

SFU can be used for audio as well, but here video conferencing developers usually prefer to save traffic on the client: audio mixing on the server is a relatively cheap process and does not require the colossal CPU resources that video mixing does (which, incidentally, is why video MCUs are not widely used).

Because each tile in a video conference is a separate video stream, video conferencing developers use every trick they can: they limit the maximum number of participants with cameras on that are shown in the interface, and the more participants are displayed, the lower the quality of the video received from the server. It also helps that a large number of video tiles usually simply don't fit on the screen anyway.

Speaking of video quality: so that a client can dynamically switch between high- and low-quality versions of another participant's video depending on the current state of its interface, the server usually offers the same stream in several qualities. There are three main approaches: simulcast, in which each user sends video to the server in several qualities at once as separate streams; SVC (scalable video coding), where video is encoded in several quality layers within a single stream; and server-side transcoding, where the client sends video to the server in the maximum available quality and the server converts it into several streams of different quality. The tradeoff is clear: the overhead of encoding video into several resolutions at once is borne either by the client (simulcast and SVC) or by the server.
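In browsers, simulcast is typically requested at the moment a track is published. A minimal sketch using the standard sendEncodings option (the rid names, scaling factors and bitrates are arbitrary examples):

```ts
// Sketch: publishing the camera as three simulcast layers so the SFU
// can forward the most appropriate one to each viewer.
async function publishSimulcast(pc: RTCPeerConnection): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ video: true });
  const [videoTrack] = stream.getVideoTracks();

  pc.addTransceiver(videoTrack, {
    direction: "sendonly",
    sendEncodings: [
      { rid: "q", scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // low quality
      { rid: "h", scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // medium quality
      { rid: "f", maxBitrate: 1_500_000 },                          // full resolution
    ],
  });
}
```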

Screen sharing

It's easy to make screen sharing work over WebRTC: you get a stream of the screen from the browser or OS API and insert it into the WebRTC connection as just another video stream. For the other participants, and for the media server, it will be indistinguishable in its properties from camera video. Sounds simple. But such a scheme will not work very well.

If you broadcast the screen as just another video, the picture turns out blurry, and small text on the shared page can become completely unreadable. When sharing a screen, it is important to preserve picture quality. Fortunately, a high FPS is desirable but not essential, because the image on the screen is mostly static. So all solutions for high-quality screen sharing are built on this tradeoff: image quality is raised at the expense of frame rate. In its basic form this can be done with WebRTC's own tools, for example via the data channels API (with video transmitted over SCTP), but the industry giants usually go further and implement custom stream encoding on top of data channels.
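At the level of the standard browser APIs, this tradeoff can at least be hinted at directly. A minimal sketch using getDisplayMedia, the track contentHint, and a frame-rate constraint (the specific values are just examples):

```ts
// Sketch: capturing the screen and biasing the encoder toward sharpness over smoothness.
async function shareScreen(pc: RTCPeerConnection): Promise<void> {
  const screenStream = await navigator.mediaDevices.getDisplayMedia({
    video: { frameRate: { max: 5 } }, // mostly static content: a few FPS is enough
  });
  const [screenTrack] = screenStream.getVideoTracks();

  // Tell the encoder this is text/detail, so under constrained bandwidth it
  // keeps resolution and drops frames rather than the other way around.
  screenTrack.contentHint = "detail";

  pc.addTrack(screenTrack, screenStream);
}
```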

Virtual backgrounds

The story of virtual backgrounds repeats that of screen sharing: it is easy to make them work at all, but you will have to try hard to bring the quality to an acceptable level.

The process of generating a virtual background consists of two stages:

  1. Segment the person's silhouette in a frame from the video stream.

  2. Draw the resulting silhouette on a different background.

The simplest implementation of this algorithm in a client browser looks something like this:

  1. Split the original stream from the camera into frames.

  2. Take a ready-made neural network for silhouette segmentation, for example Google MediaPipe, and run it on each frame.

  3. Draw a silhouette on top of the background using canvas.

  4. Combine the received frames into a new stream, which is transmitted to the server.
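A rough sketch of these steps in the browser might look like the following. The segmentPerson() helper is a hypothetical stand-in for the segmentation model (MediaPipe or similar), since the exact model API is not the point here:

```ts
// Sketch of a per-frame virtual background pipeline on a canvas.
// segmentPerson() is a hypothetical wrapper around a segmentation model that
// returns a mask image: opaque where the person is, transparent elsewhere.
declare function segmentPerson(frame: HTMLVideoElement): Promise<CanvasImageSource>;

async function withVirtualBackground(
  camera: MediaStream,
  background: HTMLImageElement,
): Promise<MediaStream> {
  const video = document.createElement("video");
  video.srcObject = camera;
  await video.play();

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;

  async function renderFrame(): Promise<void> {
    const mask = await segmentPerson(video);                      // step 2: segment the silhouette
    ctx.save();
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.drawImage(mask, 0, 0, canvas.width, canvas.height);
    ctx.globalCompositeOperation = "source-in";
    ctx.drawImage(video, 0, 0, canvas.width, canvas.height);      // keep only the person's pixels
    ctx.globalCompositeOperation = "destination-over";
    ctx.drawImage(background, 0, 0, canvas.width, canvas.height); // step 3: fill the rest with the background
    ctx.restore();
    requestAnimationFrame(renderFrame);
  }
  renderFrame();

  // Step 4: the composited canvas becomes the stream published instead of the raw camera.
  return canvas.captureStream(25);
}
```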

It's quick to code: writing and debugging took me a couple of hours. The catch is the low performance of this solution. For a standard video at 25 FPS we have only 40 ms to process each frame. That is enough for a developer's brand-new MacBook, but a budget smartphone may not cope.

Remember that in a video conference several tracks are already being encoded and decoded in parallel; add the neural network on top, and the application starts shuttling frames between the GPU and main memory and back.

So why not move the whole computational load to the server? It's a fair question, but for some reason none of the common video conferencing systems I know of do this. Perhaps server capacity is too precious, or other non-obvious problems arise.

Instead of moving the load to the server, video conferencing developers optimize on the client. The first place to optimize is model performance: you can move away from Google MediaPipe toward models that are less feature-rich but faster at the specific task of silhouette segmentation. The second is to use the WebGL API in the browser to draw directly on the GPU.

A little about noise reduction

Noise reduction is a mandatory feature for video conferencing. Traditionally, it can be implemented in one of two ways: use off-the-shelf tools and get mediocre results, or invest development resources and get high quality.

Browsers, for example, have noise suppression built in; to enable it, you specify the noiseSuppression flag when calling getUserMedia(). The flag is supported everywhere except Safari. It will work, but the quality will be unpredictable. That is why people have trained various neural networks for noise reduction, for example RNNoise and DTLN, which can be embedded on the client.

Another useful tool is VAD, voice activity detection: it not only reduces the traffic sent to the server but also improves sound quality, because if a chunk of audio contains no human voice, it is simply dropped and needs no noise reduction at all. Here, too, you can either stick with the out-of-the-box VAD implementation or add your own.
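Real VADs are usually model- or DSP-based, but as an illustration of the idea, here is a crude energy-gate sketch on top of the Web Audio API (the 0.01 threshold is an arbitrary example):

```ts
// Crude illustration of voice activity detection: gate on signal energy.
// Production VADs (e.g. those shipped with WebRTC, or RNNoise) are far smarter.
const audioCtx = new AudioContext();

function createEnergyVad(micStream: MediaStream, threshold = 0.01): () => boolean {
  const source = audioCtx.createMediaStreamSource(micStream);
  const analyser = audioCtx.createAnalyser();
  analyser.fftSize = 2048;
  source.connect(analyser);

  const samples = new Float32Array(analyser.fftSize);
  return () => {
    analyser.getFloatTimeDomainData(samples);
    let energy = 0;
    for (const s of samples) energy += s * s;
    const rms = Math.sqrt(energy / samples.length);
    return rms > threshold; // "voice present" if the signal is loud enough
  };
}
```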

To complete the picture, I will note that a high-quality audio processing pipeline should also include AEC (acoustic echo cancellation), which removes from the microphone signal whatever the speakers were playing a moment ago, and AGC (automatic gain control), which equalizes the volume of call participants so that it does not have to be adjusted manually.
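The browser's built-in versions of all three (noise suppression, AEC, AGC) are switched on via the standard getUserMedia constraints mentioned above; a minimal sketch:

```ts
// Sketch: requesting the browser's built-in audio processing.
// Actual behavior and quality vary between browsers and platforms.
async function getProcessedMicrophone(): Promise<MediaStream> {
  return navigator.mediaDevices.getUserMedia({
    audio: {
      noiseSuppression: true, // built-in noise reduction
      echoCancellation: true, // AEC: remove what the speakers just played
      autoGainControl: true,  // AGC: keep participants at a comparable volume
    },
  });
}
```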

Call recording

There are two important facts you need to know about working with video conference recordings:

  • Combining multiple tracks into one file is always a CPU (GPU) intensive operation.

  • Storing recordings requires a lot of space. First, during a call, you need to record raw files with tracks, then you need to put the final converted file somewhere.

Some video conferencing systems solve the problem by recording on the client – Zoom, for example. Anyone who has used its desktop client has seen the window with a progress bar for converting the sources into the final file that appears after a call ends.

Server-side recording can work in different ways. For example, the Janus WebRTC server dumps the raw tracks to disk during the call, and you can then convert them into a final file yourself using ffmpeg. This approach lets you move the computationally expensive conversion from the server hosting the call to a separate machine.

There are other approaches to server-side recording. For example, LiveKit mirrors all participant tracks to a separate server where the conversion happens. The drawback is that all the traffic in the call is effectively doubled, so you have to watch network channel utilization closely – for an SFU it is usually the bottleneck.

And, of course, video conversion on the GPU is much more efficient than on the CPU: in our tests, the speed increase was 11 times.

Running ffmpeg to convert video on the CPU will be a stress test for a data center cooling system

Videoconferencing implementation schemes

There are three options to implement video calls.

Buy a license for Zoom or one of its analogues. (Russian domestic solutions can be found under the query "VCS for business".) Usually such systems can be used not only in the cloud but also installed on-premise. The main advantage is that you immediately get a high-quality, ready-made solution that, in addition to video conferencing itself, includes other useful features: broadcasts, chat integration, the ability to draw on the screen, and so on.

In my opinion, there are two main disadvantages: video conferencing licenses are expensive, and you are limited to the out-of-the-box functionality of the chosen system. You won't be able to seamlessly integrate a video call into your own interface or add an arbitrary feature. Yes, some vendors provide an SDK for embedding into your interfaces, but even that flexibility may not be enough for everyone. Also, with an on-premise installation you need to budget for hardware, because running calls and especially converting and storing recordings are resource-intensive processes.

It is important to keep in mind that media servers work best on bare-metal hardware, because each additional layer of virtualization (a VM hypervisor, a Docker container) adds potential delays in processing real-time media streams. And when deploying on a local network you will have to tinker with NAT and firewall settings, debug UDP drops and so on – this can keep a team of network engineers busy for a long time.

A common scheme for introducing market video conferencing into your application

Install an open-source media server, such as LiveKit or Jitsi, in your infrastructure and use the ready-made client SDKs that come with it. As an inevitable bonus, you get the overhead of setting up and maintaining that infrastructure.

The solution will have its own limits of flexibility: open-source projects are tailored to the scenario of a conference call with a couple of dozen participants, and doing anything outside that scenario is not easy. For example, it will be difficult to implement a call with a prompter efficiently – a participant who sees and hears everyone in the call but is heard only by selected participants.

The client SDK will still show the prompter to all participants, so the logic has to be implemented on the client side, which is risky for such a sensitive feature and slows development if you have several clients for different platforms.

It is also important that open-source projects may have problems with the quality and performance of virtual backgrounds, screen sharing and noise reduction. The SDK may lack these functions altogether, and where they exist they are usually implemented in a basic, out-of-the-box way and are therefore inferior to paid analogues. In addition, the capacity of a single conference in such solutions is limited by the resources of one server; you will not be able to hold a conference with 500 participants.

In general, an open-source media server is a good choice if you need to integrate small video conferences into an application quickly and cheaply, without pretensions to high quality or deep customization.

Usually open-source media servers assume this use case

Bite the bullet and write everything yourself. Well, not quite everything: write your own backend layer around an open-source media server and build your own client SDKs. Before embarking on this slippery slope, think three times about what exactly you need that the likes of Jitsi cannot provide.

In my opinion, three main factors must come together: the need for full control over the quality of the solution, the need to integrate arbitrary features, and a budget for a development team that will grow in proportion to the required functionality and quality. This is the path we took.

How we implemented it

Conclusion

A small list of characteristics that you need to pay attention to when choosing a video conferencing system:

  • Video quality. It makes sense to test a call under poor Internet conditions and on weak devices, and to watch the quality and audio/video synchronization in a call with 100+ participants, if the system allows such calls. You can ask SaaS vendors about their presence in the regions: if users in Khabarovsk have their calls routed through Moscow, they will notice, and they won't thank you for it.

    There is no point in demanding Full HD quality from a video conferencing system: nobody can deliver it due to the physical limits of users' bandwidth. But 640×480 video should work stably.

  • Audio quality. Here we test noise reduction. You can do an A/B test with Zoom: hold two calls in some noisy place and then listen to the result on the recording.

  • Reliability. If the client's network is temporarily unavailable or changes, such as after switching from Wi-Fi to LTE or riding in an elevator, the client should be able to reconnect to the call without any problems.

  • Screen sharing quality. Here you can run a benchmark by sharing a text document with a small font: scroll it back and forth and compare the quality with Zoom. It's also worth testing screen sharing on mobile clients.

  • Quality of virtual backgrounds. Again, you can compare with Zoom: try different lighting and different distances from the camera. How does the background behave if there are two people in the frame? How much does CPU and GPU consumption grow in the browser and in the desktop/native client?

  • Other features. Market video conferencing systems usually offer a number of useful options around the core call functionality: drawing on the screen or on a shared whiteboard, controlling a remote computer, splitting one room into several breakout rooms, launching polls. For corporate use, integration with SSO and Outlook is useful, as is the ability to embed calls into chats.

    There are more exotic features as well, such as joining a conference via a GSM call or automatic transcription of recordings.
