Virtual camera for WebRTC

A short introduction

Hardly anyone has managed to avoid video conferencing: at the appointed time we check our on-camera look, sit down comfortably and talk to colleagues.

Strictly speaking, for all the variety of platforms, there are few options: either you turn on the webcam, or your colleagues stare at your initials on the screen for an hour. Turning on the camera is not always convenient, and embarrassments do happen; the second option looks somewhat unfriendly…

Are there other options? Plenty!

  • If the frontend can be patched, we can run a lightweight model to blur or replace the background. True, this does not save you from a sleep-deprived face.

  • You can approach the task like an adult: under Linux, for example, run v4l2loopback and draw whatever you want into a virtual webcam, limited only by your imagination and hardware. If the GPU allows it, you can patch, say, RAD-NeRF and put your speech into Obama’s mouth, or even secretly record your boss on video and train the model on it in a few hours. Rendering, however, adds a fair amount of delay, so you will either have to put up with lip desync, or additionally route the delayed audio through virtual audio devices, or simplify the phonemic animation by driving it from mel spectrograms. And it would be good if your laptop’s RTX 4060 did not overheat during the call.

  • But what if the laptop’s GPU is weak? No matter: we loop a pre-recorded video of ourselves, and Wav2Lip adds lip movements on top of it, generously smearing over the artifacts. Beauty!

True, such approaches have problems with scalability. And, in general, all this is somehow a dead end.

And we will go the other way!

So, we would like to tell you about our own technology: a virtual personalized camera for WebRTC.

Let’s start with the last word: WebRTC. Unless you only use absolute dinosaurs like Skype, then with probability tending to one you encounter WebRTC on a weekly basis.

  1. Strictly speaking, WebRTC is a whole host of technologies: working with local devices, traversing all sorts of NAT devices, monitoring the connection, plus the various issues of audio/video encoding and encryption.

  2. But, in the context of this task, what matters to us is that:

  • the JS code in the browser captures the user’s audio device (getUserMedia);

  • the PeerConnection carries audio and video tracks that are synchronized with each other.

And in between lies all the magic; that is what we will talk about, but first a minimal sketch of these two points.
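
A minimal sketch of the capture side, assuming nothing beyond the standard WebRTC API; the STUN server and function name are illustrative, not the project’s actual code:

```js
// Minimal sketch of the capture side (the STUN server and function name are
// illustrative, this is not the project's actual code).
async function prepareConnection() {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
  });

  // The JS code captures the user's audio device (getUserMedia)...
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });

  // ...and the PeerConnection gets the audio track; the video track is added
  // later, once there is a canvas to render into.
  pc.addTrack(micStream.getAudioTracks()[0], micStream);

  return { pc, micStream };
}
```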

Our camera

It is virtual because we do not use your device’s video camera: we supply the video ourselves. Where will the video track come from? Silly question: from a canvas, of course! For WebRTC, by and large, it does not matter where a video track comes from: a camera, a canvas, or a screen. So we need to animate the 3D model and render its frames into a canvas.
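
Continuing the sketch above, a canvas-sourced video track is added exactly like a camera track; the element id and frame rate here are illustrative:

```js
// Sketch: a video track taken from a canvas is added just like a camera track
// (element id and frame rate are illustrative).
const canvas = document.getElementById("avatar");  // the canvas we render the model into
const canvasStream = canvas.captureStream(25);     // a 25 fps MediaStream from the canvas
const videoTrack = canvasStream.getVideoTracks()[0];

// pc is the RTCPeerConnection from the previous sketch; the far side sees this
// track as ordinary video, no different from a webcam.
pc.addTrack(videoTrack, canvasStream);
```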

A few words about animation

  1. Roughly speaking, animation can be divided into two parts: “background” animation, unrelated to the spoken sounds (micro-movements and blinking of the eyes, head turns, movement of the forehead and the bridge of the nose, the Adam’s apple, etc.), and phonemic animation, corresponding to the sound being pronounced. Obviously, for the first part pre-prepared patterns can be used; for the second, some parameters have to be computed in real time (or close to it), which we do with lightweight models.

  2. So, at the frame rate we must synthesize pictures from the 3D model and input vectors that describe the state of the animation patterns at a given moment (a schematic render loop is sketched below).
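
Schematically, the loop looks something like this; every helper name here is hypothetical and only illustrates the split between background patterns and real-time phonemic weights:

```js
// Schematic render loop; all helper names are hypothetical, not the real Flexatar API.
function renderLoop(now) {
  const background = backgroundPattern(now);      // pre-baked blinking, micro-movements, head turns
  const phonemes = phonemeModel.latestWeights();  // weights computed from recent audio by a lightweight model
  const state = combine(background, phonemes);    // one vector describing the current facial state

  drawFrame(gl, state);                           // mix the images and draw into the canvas with WebGL
  requestAnimationFrame(renderLoop);
}
requestAnimationFrame(renderLoop);
```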

Meshes aside, there are two main ways to render a 3D model:

  • based on a parametric model (a rig);

  • morphing, with weights determined by the parameters above.

The second is exactly what we do:

  • identify the user (you don’t want some troll using your model, do you?);

  • download a flexatar from the Flexatar cloud – roughly speaking, a file with a data set for image-based rendering. You can picture it as a set of images of a 3D model from different angles and with different snapshots of facial expressions. (Just in case: there is no such word in English. Yes, autocorrect keeps trying to correct the word Flexatar, but we will do everything we can so that, together with our technology, it becomes widespread.);

  • send the audio from the microphone for processing, add animation to the resulting vector, “mix” the images and draw them with WebGL;

  • send the audio track, delayed by part of the sliding window used for the calculations, along with the video from the canvas (a Web Audio sketch of such a delay is shown below).
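
A minimal Web Audio sketch of that delay, under the assumption that `micStream` and `pc` come from the earlier sketches; the 150 ms value is purely illustrative and the real pipeline may do this differently:

```js
// Delay the microphone audio so it stays in sync with the rendered video.
const audioCtx = new AudioContext();
const source = audioCtx.createMediaStreamSource(micStream);
const delay = audioCtx.createDelay(1.0);  // allow up to 1 s of delay
delay.delayTime.value = 0.15;             // e.g. part of the sliding analysis window
const sink = audioCtx.createMediaStreamDestination();

source.connect(delay).connect(sink);
// Send the delayed audio track instead of the raw microphone track.
pc.addTrack(sink.stream.getAudioTracks()[0], sink.stream);
```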

After this, the unsuspecting SFU server, which distributes tracks to participants inside the room, will send your media to colleagues, and they will see Krosh or Nyusha from Smeshariki speaking in your voice.

What about the delay? It’s fine, your colleagues will hardly notice anything: after all, we manage to talk on the phone with people on the other side of the planet, even though the delay there is higher, especially over satellite channels and with some codecs.

Is delay the same as echo? No again: echo is a reflection from the “far end,” that is, an acoustic loop of your voice from your colleagues’ speakers back into their microphones, and WebRTC fights it in the standard way.
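
A small illustration of that standard way on the capture side; this is ordinary WebRTC API, nothing specific to our project:

```js
// Sketch: the standard defence is simply requested from the browser
// when opening the microphone.
const constraints = {
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true },
};
navigator.mediaDevices.getUserMedia(constraints).then((stream) => {
  // this stream already has acoustic echo cancellation applied by the browser
});
```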

As for the processing-time delay, it is always a trade-off. At one extreme, the entire phrase is processed: speech recognition runs with correction by a language model, then forced alignment ties the correct phonemes to time, and the quality of the animation is limited only by the skill of the algorithm developer. For example, for lips not to slap, they must “stick together” and stretch as they come apart. But we are not making cinema; the task is to create a comfortable illusion of a live interlocutor in real time, and we cannot afford such correction. The quality suffers somewhat, and this is the price paid for time: “There is no perfection in the world!” sighed the Fox.

Creation of 3D models

We don’t always want to talk to colleagues in the image of the cheerful Kopatych, so let’s move on to the most interesting part: personalization.

Over the course of 5 years we developed our own algorithm for building a 3D model, on the basis of which a flexatar is created. Yes, if needed, we can export a 3D object as a mesh of the required polygon count with textures, but for real-time communication this is not necessary.

The calculations require a GPU, but on an RTX 3080 they take about 15 seconds at a laughably low load. To create a flexatar you need 5 photos: one frontal and 4 with a slight (this is important) turn of the head. This can be done directly from the browser, and at the end of the article there is a link to a Git repo showing how to make the model speak. For better quality, our TestFlight build on iOS can also build a model of the jaws and fit it into the right place on the head, which noticeably improves visual perception. But from the web it also turns out quite decent, in our opinion.
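
A hypothetical sketch of how such an upload from the browser might look; the endpoint and field names are invented for illustration and are not the real Flexatar API:

```js
// Hypothetical upload of the five photos; endpoint and field names are made up.
async function uploadPhotos(files) {
  // files: 5 images, one frontal and four with a slight turn of the head
  const form = new FormData();
  [...files].forEach((file, i) => form.append(`photo_${i}`, file, file.name));

  const response = await fetch("https://example.com/flexatar/create", {
    method: "POST",
    body: form,
  });
  return response.json(); // e.g. the id of the freshly built flexatar
}
```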

Since this is 3D, we can not only create flexatars but also mix them, adding subtle associations. But that is another story.

Git repos and invitations

Sources for animating audio recordings in the browser: https://github.com/dmisol/sfu2/blob/main/static/webgl.html

Git repo with the virtual webcam: https://github.com/dmisol/flexatar-virtual-webcam

In addition, we have a test server running at https://flexatar.com with a server-side version of the rendering, where an SFU built on the Pion library (Golang) decodes Opus audio for processing and assembles the freshly rendered frames into an H.264 video track. On this server you can create your own flexatar, chat with other people (if they are online), and also watch the browser animation on WebGL. (Naturally, while it is alive and not being used for experiments.)

SFU sources: https://github.com/dmisol/sfu2

All work was carried out over 5 years by a very small team, without funding and, for the most part, in time free from our main jobs (successfully avoiding conflicts of interest). Currently, in addition to improving quality, we are busy porting the server part to AWS and, partially, to Yandex Cloud. We are planning integrations with various SFUs, starting with Janus and LiveKit.
