How to remove objects from video using ML
A new study from China reports strong results and impressive performance gains for a new retouching system that makes it easy to remove objects from videos.
Check out the demo:
The paraglider’s suspension lines are painted out using the new technique. See the video at the end of the article for higher resolution and more examples.
This multipurpose video retouching technology, E2FGVI, can also remove occlusions and watermarks from video content.
Although the model in the published work was trained on video of 432 × 240 pixels (small input sizes are typical, constrained by available GPU memory relative to desired batch sizes, among other factors), the authors have since released E2FGVI-HQ, capable of handling video of arbitrary resolution.
Code for the current version of the solution is available on GitHub, and the HQ version can be downloaded from Google Drive and Baidu Disk.
E2FGVI can process 432 × 240 video at 0.12 seconds per frame on a Titan XP GPU (12 GB VRAM). According to the authors, the system is 15 times faster than the previous state of the art based on optical flow.
The new method was tested on standard datasets for this type of image synthesis research and outperformed its competitors both qualitatively and quantitatively.
The work is called Towards An End-to-End Framework for Flow-Guided Video Inpainting. Its authors are four researchers from Nankai University and one from HiSilicon Technologies.
What is missing from this picture?
In addition to its obvious application in visual effects, high-quality video retouching is set to become a key, defining capability of new AI-based image synthesis and modification technologies.
This is especially true for fashion applications involving body reshaping, and for other frameworks aimed at slimming the subject or otherwise altering scenes in images and video. In such cases, the background revealed during synthesis must be “filled in” so that the result looks plausible.
Coherent optical flow
Optical flow has become the central technology in work on removing objects from video. Like an atlas, it provides a frame-by-frame map of the temporal sequence, and it is widely used in computer vision projects to estimate motion. Crucially, it enables temporally consistent retouching, where the overall task can be handled in a single pass rather than with attention to each individual frame (as in the Disney-style approach), which inevitably leads to temporal inconsistency.
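As a rough illustration of how a flow field guides pixel movement between frames, here is a minimal numpy sketch of backward warping. The function name and the nearest-neighbor sampling are my own simplifications for illustration, not the paper’s implementation:

```python
import numpy as np

def warp_frame(frame, flow):
    """Backward-warp a frame by an optical flow field.

    frame: (H, W) grayscale image; flow: (H, W, 2) per-pixel (dx, dy)
    displacements. Nearest-neighbor sampling keeps the sketch short.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

# A frame with a bright square, and a uniform flow of (+2, 0):
frame = np.zeros((8, 8))
frame[2:4, 2:4] = 1.0
flow = np.zeros((8, 8, 2))
flow[..., 0] = 2.0  # every pixel samples from 2 columns to its right
warped = warp_frame(frame, flow)
# the square at columns 2:4 now appears shifted to columns 0:2
```

Real flow-guided inpainting uses estimated (and completed) flow rather than a constant field, and sub-pixel interpolation rather than rounding, but the mechanism of sampling pixels along flow vectors is the same.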
Video retouching methods today are based on a three-step process:
Flow completion, where the optical flow missing from the “corrupted” regions of the video is first recovered;
Pixel propagation, where holes in the “corrupted” video are filled with pixels propagated bidirectionally along the completed flow;
Content hallucination (the “invention” of pixels, familiar to most of us from deepfakes and text-to-image frameworks such as the DALL-E series), where content that is still missing is invented and inserted into the video.
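The three stages above can be sketched as a toy pipeline. Everything here — the function names, the zero-flow fill, the previous-frame copy, and the mean-value hallucination — is a deliberately naive stand-in for illustration, not the method from the paper:

```python
import numpy as np

def complete_flow(flow, mask):
    """Stage 1: fill in missing flow inside the mask (here: zero motion)."""
    out = flow.copy()
    out[mask] = 0.0
    return out

def propagate_pixels(frames, masks):
    """Stage 2: fill holes by copying known pixels from the previous frame
    (real methods follow the completed flow instead of copying in place)."""
    out = [f.copy() for f in frames]
    for t in range(1, len(out)):
        hole = masks[t]
        out[t][hole] = out[t - 1][hole]
    return out

def hallucinate(frame, mask):
    """Stage 3: invent content no temporal neighbor could supply
    (here: the mean of the known pixels)."""
    out = frame.copy()
    out[mask] = frame[~mask].mean()
    return out

# Two 4x4 frames; the second has one missing pixel at (0, 0).
frames = [np.ones((4, 4)), np.zeros((4, 4))]
masks = [np.zeros((4, 4), dtype=bool), np.zeros((4, 4), dtype=bool)]
masks[1][0, 0] = True
filled = propagate_pixels(frames, masks)
# filled[1][0, 0] is now 1.0, copied from the first frame
```

In earlier systems each stage was a separate, largely independent tool; E2FGVI’s contribution, discussed below, is to train all three jointly.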
E2FGVI’s central innovation is the combination of these three steps into a single multipurpose system, eliminating the need for manual intervention in the content or the process.
Quote from work:
“Take DFVI as an example. Completing one 432 × 240 video from DAVIS, which contains about 70 frames, takes about four minutes; in most real applications this is unacceptable. Beyond these drawbacks, using a pre-trained image inpainting network only at the content hallucination stage ignores the content relationships between temporal neighbors, which leads to the generation of inconsistent content.”
By combining the three steps of video retouching, E2FGVI is able to replace the second stage, pixel propagation, with feature propagation. In the more segmented pipelines of previous work, features are not as readily available, because each step is relatively sealed off and the workflow is only semi-automated.
In addition, for the content hallucination stage the researchers created a temporal focal transformer, which takes into account not only nearby pixels in the current frame (that is, what happens in that part of the frame in the previous or next image), but also pixels many frames away that still affect the overall result of any operation performed across the video.
The feature-based workflow core can take advantage of feature-level processing and learnable sampling offsets, and in the project’s new focal transformer the dimensionality of the focal windows is extended, according to the authors, “from 2D to 3D.”
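Assuming standard non-overlapping window partitioning (as in focal- or Swin-style transformers), extending the windows “from 2D to 3D” amounts to adding a temporal axis, so each attention window spans several frames as well as a spatial patch. A minimal sketch, with a hypothetical helper name:

```python
import numpy as np

def partition_3d_windows(video, wt, wh, ww):
    """Split a (T, H, W, C) video into non-overlapping (wt, wh, ww) windows.

    Assumes T, H, W are divisible by the window sizes; attention would then
    be computed within each window across both space and time.
    """
    t, h, w, c = video.shape
    v = video.reshape(t // wt, wt, h // wh, wh, w // ww, ww, c)
    # bring the three window axes together: (num_windows, wt, wh, ww, c)
    return v.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, wt, wh, ww, c)

video = np.arange(2 * 4 * 4 * 1).reshape(2, 4, 4, 1).astype(float)
windows = partition_3d_windows(video, wt=2, wh=2, ww=2)
# 1 temporal group x 2 x 2 spatial groups = 4 windows of shape (2, 2, 2, 1)
```

The point of the 3D windows is that tokens from the previous and next frames land in the same attention window as the current frame’s tokens, which is what lets the transformer enforce temporal consistency.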
Tests and Data
To test E2FGVI, the researchers evaluated the system on two popular datasets for video object segmentation: YouTube-VOS and DAVIS. YouTube-VOS contains 3,741 training videos, 474 validation clips, and 508 test clips, while DAVIS contains 60 training videos and 90 test clips.
E2FGVI was trained on YouTube-VOS and evaluated on both datasets. To simulate video completion, object masks were generated during training (the green areas in the images above and in the video at the end of the article).
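Training-time mask generation can be illustrated with a toy stand-in: a single random rectangle of missing pixels per frame. The paper uses object-shaped masks; this rectangular simplification, and the helper name, are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rect_mask(h, w, max_frac=0.5):
    """One random rectangular hole (True = missing pixel), covering at most
    max_frac of each dimension -- a crude proxy for an object mask."""
    mh = int(rng.integers(1, int(h * max_frac) + 1))
    mw = int(rng.integers(1, int(w * max_frac) + 1))
    y = int(rng.integers(0, h - mh + 1))
    x = int(rng.integers(0, w - mw + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[y:y + mh, x:x + mw] = True
    return mask

mask = random_rect_mask(240, 432)  # the training resolution from the paper
```

During training, the masked region is hidden from the network and the original pixels serve as ground truth, so no manual annotation is needed.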
As metrics, the researchers used Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), Video Fréchet Inception Distance (VFID), and flow warping error, the last of which measures temporal stability.
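Of these metrics, PSNR is the simplest to compute; here is a minimal numpy version (SSIM and VFID require considerably more machinery and are omitted):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio between two images, in dB (higher is better)."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((4, 4), 100.0)
b = a + 10.0  # constant error of 10 -> MSE = 100
# psnr(a, b) = 10 * log10(255**2 / 100), about 28.13 dB
```

For inpainting evaluation, the reconstructed region is compared against the original, unmasked pixels.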
Here are the previous architectures against which the system was tested: VINet, DFVI, LGTSM, CAP, FGVC, STTN, and FuseFormer.
Having obtained the best results against all competing systems, the researchers also conducted a qualitative user study. Videos converted using five of the presented methods were shown to each of 20 volunteers, who were asked to rate their visual quality.
The authors note that, despite the respondents’ unanimous preference for their method, one competing result (FGVC) stands out quantitatively from the overall picture. In their view, this suggests that E2FGVI nevertheless generates more “visually pleasing results.”
In terms of efficiency, the authors note that their system significantly reduces the number of floating-point operations (FLOPs) and the inference time on a single Titan GPU on the DAVIS dataset. They also note that E2FGVI is 15 times faster than flow-based methods.
Here is their comment:
“[E2FGVI] requires the fewest floating-point operations compared with all other methods. This indicates that the proposed method is highly efficient for video retouching.”