Dissecting SAM2 with a Knee to the Head, or a Revolution in Video Annotation

Recently, a new version of the video segmentation model was released – SAM2. It is not just "faster, higher, stronger" than its predecessor: it aims to change the entire process of video annotation, just as the first version of the model did for images.

We use the original SAM for annotation on a fairly industrial scale (including for video), so there was no way I could pass up dissecting SAM2. The model has already been covered in broad strokes by Telegram channels, the paper is good, and the fact that the model is phenomenal goes without saying, so I will try to dig deeper into the dataset preparation/annotation and the model itself, using complex examples with my own comments.

SAM2 knocks out other models

Light reading and lots of gifs – just the thing for a lively start to Monday!

Dataset preparation

The model was trained on a video dataset that was labeled using the original SAM in the first step.

How is segmentation annotation for video usually done?

The video is broken down into individual frames, which the annotators then work with. An object is selected on the first relevant frame (with boxes, polygons, or a bit mask – a brush). Then the annotator moves on to the next frame, where the selection of the object is adjusted.

In real life, the annotator does not physically move to the very next frame, but jumps ahead some +N frames, since annotating every frame is very labor-intensive and expensive. The frames on which adjustments were made are called "key" frames; all frames between key frames are "intermediate" (for them, an interpolation between the two key frames is computed, if required).

Here is a visual example from our platform with bboxes, for better understanding: first I click on several key frames (the box is already drawn there), then I play the video with interpolation. With only three key frames it is not very accurate, but that is the whole point – to see where the computed intermediate frame diverges from reality and correct it, turning it into a key frame.
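
In the simplest case, the intermediate frames are just a linear interpolation of the box coordinates between the two nearest key frames. A minimal sketch of the idea (illustrative code, not our platform's actual implementation):

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # top-left corner
    y: float
    w: float  # width and height
    h: float

def interpolate_box(key_a: Box, key_b: Box, frame: int,
                    frame_a: int, frame_b: int) -> Box:
    """Linearly interpolate a bbox for an intermediate frame
    lying between two key frames frame_a < frame < frame_b."""
    t = (frame - frame_a) / (frame_b - frame_a)
    return Box(
        x=key_a.x + t * (key_b.x - key_a.x),
        y=key_a.y + t * (key_b.y - key_a.y),
        w=key_a.w + t * (key_b.w - key_a.w),
        h=key_a.h + t * (key_b.h - key_a.h),
    )

# Example: key frames at 0 and 30, ask for the box at frame 15
print(interpolate_box(Box(10, 10, 50, 80), Box(40, 20, 60, 90), 15, 0, 30))
```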

Screenshot from our humble platform

The authors of the model created the dataset as follows:

The videos were collected by crowd workers and had to be as diverse as possible: by country, scene, object, and so on. The average video length was about 14 seconds. A funny coincidence with the average video length on the subsidiary Instagram, by the way.

Annotation of the collected videos

Once the videos were collected, annotation began.

Phase 1: The video was split into frames, each frame was run through SAM as a still image, and the annotators then made adjustments where necessary. They selected everything that could be selected – both objects and their important parts. In this phase, 16K masks were collected across 1.4K videos.

An example of automatic masks from SAM using my photo somewhere on the streets of Buenos Aires

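For anyone who wants to reproduce this "segment everything on the frame" mode from the first SAM, it looks roughly like this (the checkpoint file name is from the official segment-anything release; the photo path is a placeholder):

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (vit_h is the largest variant)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects an RGB uint8 image of shape (H, W, 3)
image = cv2.imread("buenos_aires.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Each entry holds a binary mask plus metadata (area, bbox, predicted IoU, ...)
masks = mask_generator.generate(image)
print(f"{len(masks)} masks, largest area: {max(m['area'] for m in masks)}")
```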

Phase 2: At the next step, this labeling was used to train an intermediate model (SAM 2 Mask, S2M) that predicts new annotations. The original SAM was used only for the first frame; after that, the masks for all other frames were predicted by S2M, and adjustments were again made until the mask became perfect. In this phase the S2M model was iteratively retrained twice, and a total of 63.5K new masks were collected.

Phase 3: In phase three, the full SAM 2 model with a memory block was used for annotation, and the annotators simply clicked on all the necessary masks. The model was interactively retrained 5 times, and 197K new masks were collected along the way.

Validation: At each step, a separate group of annotators performed validation. First they checked the correctness of the mask selection, then the tracking itself, that is, how well the mask was maintained along the entire length of the video. Anything selected poorly was re-annotated; anything too ambiguous was thrown out.

As a result, the approach boils down to this: first collect a small number of masks manually and slowly, validate everything, train a lightweight model on them and use it to collect more, validate again, and then keep improving the model and collecting even more in parallel.
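
Schematically, the whole data engine is a loop like the one below. This is a rough sketch of the process described above, not the authors' actual pipeline; the human and training steps are passed in as stand-in callables:

```python
from typing import Callable, Iterable, List

def data_engine(
    videos: Iterable,
    model: Callable,                      # starts as the original SAM, per frame
    correct: Callable,                    # annotators fixing the predicted masks
    validate: Callable,                   # separate validation team
    retrain: Callable,                    # produces the next model from the data so far
    rounds: int = 3,                      # roughly phases 1-3 from the paper
) -> List:
    dataset: List = []
    for _ in range(rounds):
        for video in videos:
            proposals = model(video)      # automatic pre-annotation
            masks = correct(proposals)    # manual adjustments where needed
            if validate(masks):           # bad tracking is re-annotated or dropped
                dataset.extend(masks)
        model = retrain(dataset)          # the model accelerates its own labeling
    return dataset
```

Across phases 1–3, the model in the loop goes from per-frame SAM to S2M to the full SAM 2 with memory, which is exactly where the speed-up comes from.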

As a result, annotation time was cut from 37 seconds to 4.5, measured as follows: the annotations from phase 3 and phase 1 were compared, and the time spent was counted on masks with a high mutual IoU (>0.75).
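
For reference, mask IoU here is the ordinary intersection-over-union of two binary masks; a plain numpy version:

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two binary masks of the same shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are considered identical
    return np.logical_and(a, b).sum() / union

# Only masks agreeing at least this well were used for the timing comparison
AGREEMENT_THRESHOLD = 0.75
```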

Building an intermediate model along the way that accelerates its own annotation and thereby improves itself is not a new approach, but in real life it is used surprisingly rarely, alas.

Bravo, well done.

Now let's see how this works in practice.

Test drive of the SAM2 model and tracking

Simple examples work so well that they are almost boring. The original SAM was already very good (I wrote an article about it and talked about our practical experience at ODS on VK), and SAM2 really did turn out faster and better. In fact, the model only starts having problems where deep domain specifics begin, or in situations where people themselves strongly disagree about what should count as an object.

But in this version, what interests us most is the tracking. Spoiler: it's very good.

Initially, I wanted to collect the most complex examples from a wide variety of areas, ones that would test the tracker on both complex objects and complex surroundings.

Somewhere in the middle of compiling that list I noticed one domain outweighing all the others: mixed martial arts.

Look: the objects move very fast, they can move in non-standard ways, they can change very abruptly (one moment there are two upright standing figures, the next they are horizontal and one is blocked by the other), and different objects often have complicated IoU ツ.

Detailed tracking that survives all the challenges and hardships of mixed martial arts basically doesn't need simpler examples.

So I decided to go deep on one example and just add a few completely different ones to make sure it works across domains. As a benchmark I took the fastest knockout in martial arts history – because it has everything, and it's simply beautiful.

So.

Let me remind you that this was done in two clicks (two, because the first click selected a fighter who for some reason had no head). Phenomenal!

To give a sense of what states the object passed through during its 7 seconds of tracking:

Different states of the tracked object

For a two-click tracker, everything here is very good. But since the task was to get to the bottom of things, note that here (and in the model as a whole), when two objects intersect, the boundary between one and the other is not always determined correctly.

Two adjacent frames: in the second, the knee has completely merged with the head (although I'm not sure that at that moment they weren't actually one whole).

The stuck-together objects, colored in

The example is (very) extreme, but the problem really does exist and is hard in itself.

Another example of objects sticking together, this time when one fighter steps on the other's foot.

Stepping on the foot caused the objects to stick together

Is this really a problem? Of course not: the result is good in itself, and considering that it took only two clicks, it's damn good.

Maybe we can try tracking a component part of an object? In this example I decided to highlight the shorts – and they worked well too.

It worked almost too well: the gloves were also red, but the model successfully stuck to the shorts we were after and handled the moments when a hand covered them.

This is phenomenal!

Examples from other domains

The GIF shows a humorous "test for Japaneseness": a person has to crawl through a special hole in a temple pillar. Entry point: a single click on the protruding arms, after which the tracking happily expands the object as more of it appears in the frame. There are a couple of frames where the head could already have been included, but that is easily fixed by adding a key frame with the correct selection.

10/10!

Africa broke SAM2. I chose a very difficult object – a road, and one without lane markings at that. And although there were enough key frames, at some point the tracking started losing it in the video.

The example is arguably too extreme: in practice you usually annotate objects, not complex volumetric geometry in a shot where the camera itself is moving.

I could not get it to work correctly, even after repeatedly adding key frames further along and re-selecting the road on them.

But the base of the power line (highlighted in orange) is perfectly tracked throughout the video, with only three key frames along the entire length.

Let's move on.

I wanted to see how tracking would handle food covered in thick sauce. But everything worked so well right away that it was almost disappointing.

The story of the tracking losing the African road gave me the idea to hunt for artifacts in the memory bank, and I found one – on another African video (not mine this time), while checking something else entirely.

I was interested in checking how the model handles “flying around” an object from a drone when the shooting angle and the object itself change significantly.

The fly-around itself was tracked with flying colors, but there is an artifact at the end.

Zebras in Uganda

There is a problem here. We are tracking two zebras, and everything is fine with them. But then the shot changes and five zebras are running; two of them are still highlighted, even though the scene has obviously changed.

Do the new zebras look like the old ones? Yes. But we are tracking specific objects, and here the model's phenomenal ability to generalize and remember – the same one that handles occlusions and objects leaving the frame so well – failed to notice the scene change and selected the wrong objects there.

By the way, how do you even select objects on video in the first place? It's very easy: you work with a static frame, there's SAM under the hood at maximum settings, and everything really does take just a few clicks.
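
Outside of a UI, the same click-and-propagate flow is available through the video predictor in the official sam2 repository. Based on its example notebook, it looks roughly like this (paths are placeholders, and exact method names may differ between releases, so treat this as a sketch):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Model config + checkpoint from the official release
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    # In the original release this points to a directory of extracted JPEG frames
    state = predictor.init_state(video_path="./knockout_frames")

    # One positive click (label=1) on the object in frame 0 - the "two clicks" from above
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[480, 300]], dtype=np.float32),  # (x, y) in pixels
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the mask through the whole clip via the memory bank
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # one binary mask per tracked object
```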

Since the first SAM did not perform very well on facial features in photos of less-than-ideal quality, it was very interesting to see how SAM2 would do.

Let's try tracking the objects “hair”, “mouth” and “eye”.

The SAM2 boss's intricate hair is tracked superbly! The mouth is crooked, and the eyes are somehow off.

Looks like he either has built-in AI protection, or the rumors about him aren't rumors at all!

Zuckerberg's mustache

Strange eye tracking

Summing up and conclusions

SAM2 is incredibly good, and all the examples where it doesn't work very well are more likely niche specifics or genuinely hard cases – things you wouldn't expect from a model that covers 70% of zero-shot cases.

So what is SAM2 insanely good for?

Firstly, there is a reason for the "segment" in the model's name. In my opinion, one of the most direct uses of this model is to significantly accelerate annotation. With SAM2 you can segment anything and then build a specific technology on top for your own needs. The areas of application here are, in general, everywhere that video segmentation is required.

SAM2 will definitely change the approach to video annotation, because, to be honest, after filing it down a bit for your own needs, it is not entirely clear why you would annotate video any other way. For now this is a hypothesis, but we will definitely test it, because the model is wonderfully good.

Secondly, the model's tracking is excellent, while requiring few resources and little time. The model is very good at separating even the most complex objects from the background. There is probably no point in dragging the model into real-time decision-making, but anything tolerant of processing delays (a few seconds, or even fully offline) will most likely work well even out of the box.

An approximate range of such tasks: post-processing video for entertainment (overlaying cool effects on an interesting replay in sports, for example), beautification of videos for social networks, and a bit of genAI.

This thing is cool.

Thank you!

If you liked the article, then please… well, I tried.
