How to remove objects from video using ML


A new study from China reports strong results and impressive performance gains for a video inpainting system that makes it easy to remove objects from videos.


Check out the demo:

The paraglider's suspension lines are painted out by the new technique. See the video at the end of the article for higher resolution and more examples.

The end-to-end video inpainting system E2FGVI can also remove various occlusions and watermarks from video content.

E2FGVI predicts the content hidden behind occlusions, removing even prominent watermarks that cannot be removed any other way. Source: https://github.com/MCG-NKU/E2FGVI

The model from the published work was trained on 432 × 240 pixel video (small input sizes are common, constrained by available GPU memory relative to batch sizes and other factors), but the authors have since released E2FGVI-HQ, capable of handling video of arbitrary resolution.

The code for the current version of the solution is available on GitHub, and the HQ model can be downloaded from Google Drive and Baidu Disk.
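Since the published model expects 432 × 240 input, frames usually need to be resized before inference. Below is a minimal, hypothetical preprocessing sketch using OpenCV and PyTorch; it does not reproduce the repository's actual test script, and the paths and function name are illustrative only.

```python
import glob
import cv2
import numpy as np
import torch

def load_frames(frame_dir, size=(432, 240)):
    """Read a folder of frames, resize to the model's training resolution,
    and stack them into a (1, T, C, H, W) float tensor in [-1, 1]."""
    frames = []
    for path in sorted(glob.glob(f"{frame_dir}/*.jpg")):
        img = cv2.imread(path)                      # BGR, uint8
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)  # convert to RGB
        img = cv2.resize(img, size)                 # cv2 expects (width, height)
        frames.append(img.astype(np.float32) / 127.5 - 1.0)
    video = torch.from_numpy(np.stack(frames))      # (T, H, W, C)
    return video.permute(0, 3, 1, 2).unsqueeze(0)   # (1, T, C, H, W)

# Usage (paths are placeholders):
# video = load_frames("examples/tennis")
# object masks would be loaded similarly, but kept as binary {0, 1} maps
```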

The child remains in the frame

E2FGVI is capable of processing 432 × 240 video at 0.12 s per frame on a Titan XP GPU (12 GB VRAM). According to the authors, the system is 15 times faster than previous state-of-the-art methods based on optical flow.

Tennis player suddenly disappears

The new method has been tested on standard datasets for this type of image synthesis research, and it outperformed its competitors both qualitatively and quantitatively.

Comparison with previous approaches. Source: https://arxiv.org/pdf/2204.02663.pdf

The paper is titled Towards An End-to-End Framework for Flow-Guided Video Inpainting. Its authors are four researchers from Nankai University and one from Hisilicon Technologies.

What is missing from this picture?

In addition to its obvious application in visual effects, high-quality video inpainting is set to become a core, defining capability of new AI-based image synthesis and editing technologies.

This is especially true for fashion applications involving body reshaping and other frameworks that slim subjects or otherwise alter scenes in images and video. In such cases, the additional background exposed during synthesis must be "filled in" so that the result looks plausible.

In one recent work, a body-reshaping algorithm had to inpaint the background exposed when the subject was resized. Here it is shown in red as part of the figure of the real subject in the image on the left. Based on the original source: https://arxiv.org/pdf/2203.10496.pdf

Coherent optical flow

Optical flow has become the core technology in work on removing objects from video. Like an atlas, it provides a frame-by-frame map of motion across the temporal sequence. Optical flow is widely used to estimate motion in computer vision projects, and it enables temporally consistent inpainting, where the task can be handled in a single pass over the whole clip rather than with attention to each frame in isolation, an approach that inevitably leads to temporal inconsistency.
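As an illustration of the kind of dense optical flow these methods rely on, here is a generic OpenCV sketch (Farnebäck's method, not the flow network E2FGVI actually uses; the frame filenames are placeholders):

```python
import cv2

# Two consecutive frames, converted to grayscale for flow estimation.
prev = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_001.jpg"), cv2.COLOR_BGR2GRAY)

# Dense optical flow: parameters are (pyr_scale, levels, winsize,
# iterations, poly_n, poly_sigma, flags). Result has shape (H, W, 2),
# giving a per-pixel (dx, dy) displacement towards the next frame.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# A pixel at (y, x) in the previous frame is expected near
# (y + flow[y, x, 1], x + flow[y, x, 0]) in the current frame,
# which is exactly what propagation-based inpainting exploits.
```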

Video inpainting methods today are based on a three-step process (a simplified sketch follows the list below):

  1. Flow completion, where the optical flow in the corrupted regions is estimated, so that motion across the whole clip can be treated as a single, accessible entity;

  2. Pixel propagation, where holes in the "corrupted" video are filled by propagating pixels bidirectionally along the completed flow;

  3. Content hallucination (the "invention" of pixels, familiar to most of us from deepfakes and text-to-image frameworks such as the DALL-E series), where the remaining "missing" content is invented and inserted into the video.
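A highly simplified sketch of how these three stages fit together in a traditional (non end-to-end) pipeline; the function bodies are placeholders standing in for separate models or hand-tuned steps, not E2FGVI's actual implementation:

```python
def complete_flow(flows, masks):
    """Stage 1 - flow completion: estimate optical flow inside the masked
    (corrupted) regions so motion is defined everywhere.
    Placeholder: returns the input flow unchanged."""
    return flows

def propagate_pixels(frames, flows, masks):
    """Stage 2 - pixel propagation: push known pixels forward and backward
    along the completed flow to fill as much of each hole as possible.
    Placeholder: returns frames and masks unchanged."""
    return frames, masks

def hallucinate_content(frames, masks):
    """Stage 3 - content hallucination: synthesise whatever remains unfilled
    with a generative inpainting model. Placeholder: returns frames unchanged."""
    return frames

def inpaint_video(frames, masks, flows):
    # Traditional pipelines run these stages separately, often with manual
    # intervention in between; E2FGVI's contribution is training all three
    # jointly, end to end.
    flows = complete_flow(flows, masks)
    frames, masks = propagate_pixels(frames, flows, masks)
    return hallucinate_content(frames, masks)
```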

The central innovation of E2FGVI is that it combines these three stages into a single end-to-end system, eliminating the need for manual operations on the content or the process.

The paper notes that the need for manual intervention means these older pipelines cannot take advantage of the GPU, which makes them rather time-consuming.

Quoting the paper:

"Taking DFVI as an example, completing one 432 × 240 video from DAVIS, which contains about 70 frames, takes about four minutes, which is unacceptable in most real applications. In addition to the above drawbacks, using a pretrained image inpainting network only at the content hallucination stage ignores the content relationships between temporal neighbours, which leads to the generation of inconsistent content."
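A back-of-the-envelope comparison of the figures quoted above with the 0.12 s/frame reported earlier for E2FGVI (an illustrative calculation, not taken from the paper; the rough speed-up on this single clip differs from the paper's headline 15× figure, so treat it only as an order-of-magnitude check):

```python
frames = 70                    # approximate length of the DAVIS clip in the quote
dfvi_total = 4 * 60            # DFVI: about four minutes, in seconds
e2fgvi_total = frames * 0.12   # E2FGVI at the reported 0.12 s per frame

print(f"DFVI:   ~{dfvi_total} s")         # ~240 s
print(f"E2FGVI: ~{e2fgvi_total:.1f} s")   # ~8.4 s
print(f"speed-up on this clip: ~{dfvi_total / e2fgvi_total:.0f}x")
```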

By combining the three stages of video inpainting, E2FGVI is able to replace the second stage, pixel propagation, with feature propagation. In the more fragmented pipelines of previous work, such features are not as readily available, because each stage is relatively sealed off and the workflow is only semi-automated.

In addition, for the content hallucination stage the researchers created a temporal focal transformer, which takes into account not only the nearest pixels in the current frame (that is, what is happening in that part of the frame in the previous or next image), but also pixels many frames away that still affect the overall result of any operation performed across the whole video.
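A conceptual sketch of the idea behind such temporal focal attention: each query attends to fine-grained tokens from a local spatio-temporal window plus coarse, pooled tokens drawn from every frame. This is only an illustration of the principle under those assumptions, not the authors' temporal focal transformer; the function and parameter names are invented for the example.

```python
import torch
import torch.nn.functional as F

def temporal_focal_attention(feats, t, y, x, win=4, pool=2):
    """feats: (T, H, W, C) feature volume.
    Returns an attended feature for the query at frame t, position (y, x),
    mixing fine local tokens with coarse pooled tokens from all frames."""
    T, H, W, C = feats.shape
    q = feats[t, y, x]                                    # (C,) query token

    # Fine tokens: a local window around (y, x) in the neighbouring frames.
    t0, t1 = max(0, t - 1), min(T, t + 2)
    y0, y1 = max(0, y - win), min(H, y + win + 1)
    x0, x1 = max(0, x - win), min(W, x + win + 1)
    fine = feats[t0:t1, y0:y1, x0:x1].reshape(-1, C)

    # Coarse tokens: every frame, spatially average-pooled, so even distant
    # frames contribute context at low resolution.
    pooled = F.avg_pool2d(feats.permute(0, 3, 1, 2), pool)  # (T, C, H/p, W/p)
    coarse = pooled.permute(0, 2, 3, 1).reshape(-1, C)

    tokens = torch.cat([fine, coarse], dim=0)             # (N, C) keys/values
    attn = torch.softmax(tokens @ q / C ** 0.5, dim=0)    # (N,) attention weights
    return attn @ tokens                                   # (C,) attended output

# Toy usage: 8 frames of 16x16 feature maps with 32 channels.
out = temporal_focal_attention(torch.randn(8, 16, 16, 32), t=3, y=8, x=8)
```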

E2FGVI architecture

The feature-based propagation core is able to take advantage of feature-level processing and learnable sampling offsets, and in the project's new focal transformer the focal windows are extended, according to the authors, "from 2D to 3D."
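A minimal sketch of feature-level propagation with learnable sampling offsets, using torchvision's deformable convolution. It illustrates the general mechanism of flow-guided, learnable sampling rather than E2FGVI's actual propagation module; the class and parameter names are purely illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class FlowGuidedPropagation(nn.Module):
    """Warp features from a neighbouring frame towards the current one using
    flow-initialised, learnable per-pixel offsets (illustrative sketch)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Predict a residual offset (2 values per kernel tap) from the
        # concatenated features of the two frames.
        self.offset_head = nn.Conv2d(2 * channels, 2 * kernel_size * kernel_size,
                                     kernel_size=3, padding=1)
        self.weight = nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)

    def forward(self, feat_cur, feat_nbr, flow):
        # flow: (B, 2, H, W) optical flow from the current to the neighbour frame.
        offsets = self.offset_head(torch.cat([feat_cur, feat_nbr], dim=1))
        # Initialise every kernel tap's offset with the flow; a real
        # implementation must match deform_conv2d's (dy, dx) channel order.
        offsets = offsets + flow.repeat(1, self.kernel_size ** 2, 1, 1)
        # Sample the neighbouring features at flow-guided, learned locations.
        return deform_conv2d(feat_nbr, offsets, self.weight, padding=1)

# Toy usage: propagate 64-channel features between two 32x32 frames.
prop = FlowGuidedPropagation(64)
out = prop(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
           torch.randn(1, 2, 32, 32))
```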

Tests and Data

To test E2FGVI, the researchers evaluated the system on two popular video object segmentation datasets: YouTube-VOS and DAVIS. YouTube-VOS contains 3,741 training videos, 474 validation clips and 508 test clips, while DAVIS contains 60 training videos and 90 test clips.

E2FGVI was trained on YouTube-VOS and evaluated on both datasets. To simulate video completion, object masks were generated during training (the green areas in the images above and in the video at the end of the article).

As metrics, the researchers used Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Video-based Fréchet Inception Distance (VFID) and flow warping error, the latter measuring the temporal stability of the video.
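The first two metrics can be computed per frame with scikit-image and averaged over the clip; a minimal sketch is below (VFID and the flow warping error need a pretrained video feature extractor and a flow network, so they are omitted here; `channel_axis` requires scikit-image 0.19 or newer).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def clip_metrics(gt_frames, pred_frames):
    """gt_frames, pred_frames: equal-length lists of HxWx3 uint8 arrays.
    Returns mean PSNR and mean SSIM over the whole clip."""
    psnrs, ssims = [], []
    for gt, pred in zip(gt_frames, pred_frames):
        psnrs.append(peak_signal_noise_ratio(gt, pred, data_range=255))
        ssims.append(structural_similarity(gt, pred,
                                           channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```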

Here are the previous architectures against which the system was tested: VINet, DFVI, LGTSM, CAP, FGVC, STTN and FuseFormer.

From the quantitative results section of the paper. The up and down arrows indicate whether higher or lower numbers are better, respectively; E2FGVI achieves the best results across the board. Metrics are measured following FuseFormer; DFVI, VINet and FGVC are not end-to-end systems, so their floating point operation counts cannot be estimated.

Having obtained the best results against all competing systems, the researchers conducted a qualitative user study. Videos processed with five of the methods were shown to each of 20 volunteers, who were asked to rate their visual quality.

The vertical axis shows the percentage of participants who preferred the E2FGVI outcome

The authors note that, although respondents preferred their method, one of the competing results (FGVC) does not align with the quantitative picture. In their view, this suggests that E2FGVI can generate more "visually pleasing results" than the metrics alone indicate.

In terms of efficiency, the authors note that their system significantly reduces the number of floating point operations (FLOPs) and the inference time on a single Titan GPU for the DAVIS dataset. They also note that E2FGVI runs roughly 15 times faster than flow-based methods.

Here is their comment:

"[E2FGVI] requires the fewest floating point operations compared with all the other methods. This indicates that the proposed method is highly efficient for video inpainting."
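FLOPs comparisons of this kind are usually obtained with a profiling tool such as thop or fvcore. Below is a hedged sketch using thop; the network is a small stand-in, not E2FGVI itself, and in the paper's comparison the profiled model would be the actual inpainting network with its video and mask inputs.

```python
import torch
import torch.nn as nn
from thop import profile  # pip install thop

# Stand-in network purely to demonstrate the profiling call.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 3, 3, padding=1))
dummy = torch.randn(1, 3, 240, 432)  # one 432x240 frame

# thop reports multiply-accumulate operations (MACs) and parameter count;
# FLOPs are commonly approximated as 2x MACs.
macs, params = profile(model, inputs=(dummy,))
print(f"~{macs / 1e9:.2f} GMACs (~{2 * macs / 1e9:.2f} GFLOPs), "
      f"{params / 1e6:.2f}M parameters")
```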
