Video conferencing – how to deal with high CPU utilization?
My name is Alexey Doilnitsyn, I am an architect at DINS. Our company participates in the development of the RingCentral UCaaS (Unified Communications as a Service) platform, which is used by more than 400 thousand companies around the world.
I work in a team that is responsible for the development of a video conferencing service – RingCentral Video, or RCV. Video conferences with many participants in the gallery view are often too much for older laptops. We solved this problem using the theory of automatic control systems (ACS).
RCV became especially relevant in early 2020 due to the massive shift to working from home. At the same time, we faced an interesting problem: some users complained about constant fan noise, rapid battery drain and sluggish devices. It turned out that conferences with a large number of participants in the gallery can overwhelm the outdated laptops that people often have at home. We set out to solve this problem with the help of the theory of automatic control systems.
Problem and approach to solution
In our experience, dual-core Intel processors more than 5 years old often cannot cope with typical corporate conferences of 8-16 participants.
A modern quad-core Core i5 is probably the minimum configuration for such tasks. Weaker computers suffer from rapid battery drain and freezes.
To solve this problem, I decided to use the theory of automatic control systems (ACS). It is commonly used for tasks such as stabilizing a tank turret or maintaining a constant temperature in a melting furnace. ACS provide reliable and stable control.
Instead of forcing the user to select key system parameters (resolution and number of video streams, bit rate, etc.), the software can adapt itself to external conditions (CPU load, network bandwidth) and choose suitable parameters automatically.
Building a control system
In our case, the control object is the set of video streams rendered on the user's screen. CPU utilization is primarily affected by the number of streams and the video resolution (width x height).
(Screenshot: 9 video streams at 640×360 resolution.)
We select a target corridor of CPU utilization, for example 40-50%.
As the error, we use the difference between the current CPU load and the nearest corridor boundary. For example, if the current load is 80%, the error is 80 - 50 = 30 in absolute terms, or 30/50 = 0.6 (60%) relative to the corridor boundary.
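The error calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the RCV code: the function name `corridor_error` and the default boundaries are assumptions, and dividing by the lower boundary when the load is below the corridor is my own symmetric extension of the 30/50 example.

```python
def corridor_error(cpu_load, low=40.0, high=50.0):
    """Signed relative error of cpu_load with respect to the target corridor.

    Positive: load is above the upper boundary.
    Negative: load is below the lower boundary (assumed symmetric rule).
    Zero: load is inside the corridor.
    """
    if cpu_load > high:
        return (cpu_load - high) / high
    if cpu_load < low:
        return (cpu_load - low) / low
    return 0.0


# The example from the text: 80% load against a 40-50% corridor.
print(corridor_error(80.0))  # (80 - 50) / 50
```

A signed result is convenient because the sign tells the controller which direction to move in, while the magnitude says how hard.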
The basic logic of the control system is simple. The system responds to incoming events that carry the current CPU load. Events arrive at a fairly long interval of 10 seconds, to account for the inertia with which the system responds to control actions.
If the CPU load is above the target corridor, we first reduce the resolution of the streams, then start turning streams off.
If the CPU load is below the target corridor, we first turn the disabled streams back on, then increase their resolution.
If the load is inside the corridor, we do nothing, to avoid oscillation.
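The three rules above could be expressed as a small decision function. This is a sketch under stated assumptions: the `Action` enum, the function name `choose_action` and the boolean flags describing the client state are all hypothetical names introduced for illustration.

```python
from enum import Enum


class Action(Enum):
    LOWER_RESOLUTION = "lower_resolution"
    DISABLE_STREAMS = "disable_streams"
    ENABLE_STREAMS = "enable_streams"
    RAISE_RESOLUTION = "raise_resolution"
    NONE = "none"


def choose_action(cpu_load, *, can_lower_resolution, can_disable_streams,
                  has_disabled_streams, can_raise_resolution,
                  low=40.0, high=50.0):
    """Pick the next control action from the current CPU load (percent)."""
    if cpu_load > high:
        # Overload: degrade quality first, drop streams only after that.
        if can_lower_resolution:
            return Action.LOWER_RESOLUTION
        if can_disable_streams:
            return Action.DISABLE_STREAMS
        return Action.NONE
    if cpu_load < low:
        # Headroom: restore streams first, then restore quality.
        if has_disabled_streams:
            return Action.ENABLE_STREAMS
        if can_raise_resolution:
            return Action.RAISE_RESOLUTION
        return Action.NONE
    # Inside the corridor: do nothing, to avoid oscillation.
    return Action.NONE
```

When no further degradation is possible (lowest resolution, minimum number of streams), the function returns `Action.NONE` even under overload, which matches a controller that has run out of control authority.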
The number of streams affected in one step is calculated from the magnitude of the error. For example, the client renders 16 video streams at an average resolution of 640×360. If the CPU load is 80% and the error is therefore 0.6, then in one step we reduce the resolution of 16 * 0.6 ≈ 10 streams. This makes the system respond more quickly to large disturbances. We also take into account that rendering one stream accounts for anywhere from a fraction of a percent to several percent of the total system load.
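The proportional step size can be sketched as follows. The function name `streams_to_adjust` is hypothetical, and both the rounding rule and the "at least one stream for any non-zero error" floor are assumptions; the text only gives the 16 * 0.6 example, which comes out at about 10 streams.

```python
def streams_to_adjust(total_streams, error):
    """Number of streams to change in one control step.

    The count is proportional to the magnitude of the relative error,
    clamped to the range [0, total_streams].
    """
    magnitude = abs(error)
    if magnitude == 0 or total_streams <= 0:
        return 0
    count = round(total_streams * magnitude)
    # Assumption: any non-zero error should move at least one stream.
    count = max(1, count)
    return min(total_streams, count)


# The example from the text: 16 streams, error 0.6.
print(streams_to_adjust(16, 0.6))  # 10
```

Clamping to `total_streams` matters for very large errors: with a 50% upper boundary, a load of 100% gives an error of 1.0, so every stream is adjusted in a single step.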
Problems encountered in operation
Inside the corridor
During testing, we encountered a situation in which the system, while inside the corridor, is hit by an external disturbance such as an antivirus scan, which significantly increases the CPU load for some time. Since our control system does not distinguish between external and internal influences, it responds to the increased load as designed, by disabling video streams. After the antivirus finishes, the load returns to the corridor, but the video streams remain turned off: inside the corridor the system does nothing, in order to avoid sustained oscillations. This behavior looks strange to the user.
Solution: still re-enable video streams inside the corridor, but slowly, for example one stream on every third event.
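A minimal sketch of this slow recovery, assuming a simple counter of consecutive events received inside the corridor. The class name `CorridorRecovery` and its parameters are hypothetical; resetting the counter whenever the load leaves the corridor is also an assumption.

```python
class CorridorRecovery:
    """Re-enable one disabled stream on every N-th event inside the corridor."""

    def __init__(self, every_n_events=3):
        self.every_n_events = every_n_events
        self._events_in_corridor = 0

    def on_event(self, in_corridor, disabled_streams):
        """Return how many streams to re-enable for this event (0 or 1)."""
        if not in_corridor:
            # Outside the corridor the main control rules apply instead.
            self._events_in_corridor = 0
            return 0
        self._events_in_corridor += 1
        if disabled_streams > 0 and self._events_in_corridor % self.every_n_events == 0:
            return 1
        return 0


recovery = CorridorRecovery()
print([recovery.on_event(True, disabled_streams=2) for _ in range(6)])
# [0, 0, 1, 0, 0, 1]
```

With events arriving every 10 seconds, this restores one stream roughly every 30 seconds, slow enough that the main rules can still react if the load climbs out of the corridor again.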
On a powerful computer, video stream processing takes up a fraction of a percent of the total CPU load, so disabling video streams is pointless: they have practically no effect on the system load.
Solution: measure how much CPU our own process consumes as a share of total system consumption. If the share of our process is less than 50%, turn the control system off.
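The gating check can be sketched as a pure function of two measurements. The name `control_enabled` is hypothetical, and how the two values are obtained is left out: the sketch assumes both are expressed in the same units (for example, percent of total CPU capacity).

```python
def control_enabled(process_cpu, system_cpu, min_share=0.5):
    """Decide whether the control system should act at all.

    process_cpu: CPU consumed by our own process.
    system_cpu:  total CPU consumed by the whole system (same units).
    The controller stays off while our process is responsible for less
    than min_share of the total load.
    """
    if system_cpu <= 0:
        return False
    return (process_cpu / system_cpu) >= min_share


# Antivirus-style disturbance: system at 80%, our process only 10%.
print(control_enabled(10.0, 80.0))  # False
# Our process dominates the load: 60 of 80.
print(control_enabled(60.0, 80.0))  # True
```

One side effect of this check is that it also filters out most external disturbances: if something else is causing the load, disabling our own streams would not help much anyway.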
The final test was run on an old laptop with an i5-3210M processor (2 cores, 4 threads, 2.5 GHz base frequency) and involved 8 participants (bots) at 320×180 resolution. The starting CPU load was about 80%.
After 30 seconds, the video streams of four participants are turned off and the load drops to 65%.
The system cannot go any lower, since the minimum number of participants on the screen is 4.
As a result, the control system passed the tests and was put into production.
If you have any questions or would like to know something more about RCV – write in the comments.