VideoMAE, ViViT and TimeSformer Analysis

Every computer vision engineer faces detection, segmentation and the familiar “whatever the problem, YOLO is the answer” routine. Sooner or later, though, a new and harder task appears on the horizon: video analysis and classification. Some prefer to sidestep it, others try to solve it with traditional methods, but we will go further and learn how to tackle it with transformers. To get acquainted with the area, we will look at the most popular and effective approaches. Let's go!

ViViT (Video Vision Transformer)

ViViT (Video Vision Transformer) is one of the first models to use transformer architecture for video data analysis.

ViViT works similarly to the Vision Transformer (ViT), but the key difference is that the input video is split into a sequence of 3D patches. This allows the temporal information in the video to be taken into account.

General architecture of ViViT

Structure of the model

1. Patch Embedding:

  • The video is broken down into small 3D patches (e.g. 16×16 pixels in size with a time depth of several frames).

  • Each patch is then transformed into an embedding vector using a linear projection.

  • To account for the order of the patches, positional embeddings are added. Positional embeddings are a way for the neural network to “remember” the order of patches; this way the model retains information about each patch's position in the original video.

Important addition: In ViViT, positional embeddings are learnable parameters. This means that the network itself “learns” the optimal way to encode information about the position of patches.
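
To make the patch-embedding step above concrete, here is a minimal sketch of tubelet embedding with learnable positional embeddings. The class name, hyperparameters and tensor shapes are illustrative assumptions, not the exact ViViT implementation:

import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Illustrative 3D patch (tubelet) embedding: a Conv3d cuts the video into
    2x16x16 blocks and projects each block to an embedding vector."""
    def __init__(self, embed_dim=768, patch_size=16, tubelet_size=2,
                 num_frames=32, image_size=224):
        super().__init__()
        self.proj = nn.Conv3d(
            in_channels=3, out_channels=embed_dim,
            kernel_size=(tubelet_size, patch_size, patch_size),
            stride=(tubelet_size, patch_size, patch_size),
        )
        num_patches = (num_frames // tubelet_size) * (image_size // patch_size) ** 2
        # Learnable positional embeddings: the network itself learns
        # how to encode the position of each tubelet.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, video):                 # (B, 3, T, H, W)
        x = self.proj(video)                  # (B, D, T', H', W')
        x = x.flatten(2).transpose(1, 2)      # (B, num_patches, D)
        return x + self.pos_embed

x = torch.randn(1, 3, 32, 224, 224)           # one 32-frame clip
print(TubeletEmbedding()(x).shape)            # torch.Size([1, 3136, 768])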

2. Transformer Encoder:

  • Each patch (its embedding plus the positional embedding) is fed to the transformer input.

  • The transformer consists of several layers with an attention mechanism that helps the model learn the dependencies between patches.

3. Classification Head:

  • After passing through the encoder, the model outputs a feature vector for classification.

  • This vector is fed into a classification layer, which outputs probabilities for each class.
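
A hypothetical sketch of such a classification head: the encoder output (a sequence of token features) is pooled into one vector and mapped to class probabilities. Mean pooling over tokens is my assumption here; the real model may use a dedicated [CLS] token instead:

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool the encoder's token features into one vector, then classify."""
    def __init__(self, embed_dim=768, num_classes=400):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                     # (B, num_patches, D)
        pooled = self.norm(tokens.mean(dim=1))     # average over all tokens
        logits = self.fc(pooled)                   # (B, num_classes)
        return logits.softmax(dim=-1)              # class probabilities

probs = ClassificationHead()(torch.randn(1, 3136, 768))
print(probs.shape)    # torch.Size([1, 400])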

Now let's move on to the code:

Code

First, you need to install all the necessary libraries:

pip install transformers torch pillow opencv-python

Thanks to the transformers library, our tests will be as simple as possible and with a minimum amount of code:

# Import the required libraries
from transformers import VivitForVideoClassification, VivitImageProcessor
import torch
import cv2

# Load the model and the frame processor
model_name = "google/vivit-b-16x2-kinetics400"
processor = VivitImageProcessor.from_pretrained(model_name)
model = VivitForVideoClassification.from_pretrained(model_name)

# Function for loading and preprocessing a video
def load_video(video_path, num_frames=32, frame_height=480, frame_width=480):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < num_frames:
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, (frame_width, frame_height))
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(frame)
    cap.release()
    if len(frames) < num_frames:
        raise ValueError(f"Video is too short. Needed: {num_frames}, Found: {len(frames)}")
    return frames

# Function for loading the labels
def load_labels(label_path):
    with open(label_path, 'r') as f:
        labels = f.read().splitlines()
    return labels


# Function for getting predictions
def predict_label(video_path, label_map_path):
    frames = load_video(video_path)
    inputs = processor(images=frames, return_tensors="pt")
    outputs = model(**inputs)
    # Load the labels
    kinetics_labels = load_labels(label_map_path)
    # Also print out the probabilities
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    top_5_probs, top_5_indices = torch.topk(probabilities[0], 5)
    top_5_predictions = [kinetics_labels[idx.item()] for idx in top_5_indices]

    print("Топ 5 предсказаний:")
    for i, (label, prob) in enumerate(zip(top_5_predictions, top_5_probs), 1):
        print(f"{i}. {label}: {prob.item() * 100:.2f}%")

# Path to the video file
video_path = "snow.mp4"

# label_map.txt can be downloaded from: https://github.com/google-deepmind/kinetics-i3d/blob/master/data/label_map.txt
# Get predictions
predict_label(video_path, "label_map.txt")

A video fragment of snowboarding was taken as an example:

Top 5 Predictions ViViT:

  1. snowboarding: 90.55%

  2. Skiing (not slalom or crosscountry): 7.75%

  3. Ski jumping: 0.98%

  4. faceplanting: 0.28%

  5. tobogganing: 0.25%

Not bad, right?

Model running time on CPU: 5.85 seconds

Model running time on GPU: 0.84 seconds
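
For reference, timings like these can be reproduced with a simple helper along the lines shown below (a measurement sketch of my own; torch.cuda.synchronize() is needed so asynchronous GPU kernels are included, and the exact numbers will of course depend on your hardware):

import time
import torch

def time_inference(model, inputs, device="cpu"):
    """Rough single-run inference timing on CPU or GPU."""
    model = model.to(device).eval()
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model(**inputs)
        if device == "cuda":
            torch.cuda.synchronize()
    return time.perf_counter() - start

# inputs = processor(images=frames, return_tensors="pt")
# print(f"CPU: {time_inference(model, inputs, 'cpu'):.2f} s")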

TimeSformer

Overall, the approach is similar to ViViT; the main difference lies in how the patches are processed and how the temporal and spatial dependencies are organized.

Structure of the model

Let's also briefly go over the structure:

1. Patch Embedding:

  • Each frame is divided into small patches (similar to Vision Transformer, ViT).

  • Patches from each frame are aligned along the time axis, forming a multi-layered sequence of patches.

  • Patches are transformed into fixed-dimensional vectors via linear projection.

  • Positional encoding is added to these vector representations to preserve information about the position of patches in space and time.

2. Transformer Encoder:

  • The encoder consists of several layers of attention mechanisms and feed-forward networks.

  • TimeSformer uses separate (divided) attention for the spatial and temporal dimensions.

  • Space Attention (spatial attention) is applied to all patches in each time frame independently.

  • Time Attention (temporal attention) is applied to all patches from all temporal frames for each spatial patch.

3. Classification Head:

  • After passing through the encoder, the model outputs a feature vector for classification.

  • This vector is fed into a classification layer, which outputs probabilities for each class.

TimeSformer alternates between attention over spatial dependencies (within a single frame) and attention over temporal dependencies (across frames); as a result, unlike ViViT, it processes spatial and temporal attention separately.

By computing temporal and spatial self-attention separately, TimeSformer can be more efficient on long videos and computationally cheaper, since it never has to attend over all patches of all frames at once.

Spatiotemporal attention in TimeSformer
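
To illustrate the idea of divided attention, here is a heavily simplified sketch: temporal attention is computed across frames for each spatial position, then spatial attention within each frame. The real TimeSformer block also has residual connections, layer norms, a CLS token and an MLP, all omitted here:

import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Simplified divided space-time attention: time first, then space."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                          # (B, T, N, D): frames x patches
        B, T, N, D = x.shape
        # Temporal attention: each spatial patch attends over the T frames
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = self.time_attn(xt, xt, xt)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: the N patches of each frame attend to each other
        xs = x.reshape(B * T, N, D)
        xs = self.space_attn(xs, xs, xs)[0]
        return xs.reshape(B, T, N, D)

x = torch.randn(2, 8, 196, 768)                    # 8 frames, 14x14 patches
print(DividedSpaceTimeBlock()(x).shape)            # torch.Size([2, 8, 196, 768])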

Let's move on to the code:

Code

The code is identical to that given in the ViViT section; the only difference is the model:

from transformers import AutoImageProcessor, TimesformerForVideoClassification

model_name = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(model_name)
model = TimesformerForVideoClassification.from_pretrained(model_name)

Top 5 Predictions TimeSformer:

  1. snowboarding: 53.77%

  2. Skiing (not slalom or crosscountry): 40.71%

  3. somersaulting: 2.59%

  4. snowkiting: 0.90%

  5. Ski jumping: 0.74%

Model running time on CPU: 1.25 seconds

Model running time on GPU: 0.63 seconds

Here the model already hesitates noticeably between snowboarding and skiing, but my goal is to show the differences in technique, speed and results between the models.

Video Masked Autoencoders

First, let's figure out what autoencoders are. In short, an autoencoder is a type of neural network used to learn efficient encodings of data. It consists of two parts:

1. Encoder: compresses the input data into a compact representation (code).

2. Decoder: recovers the original data from this code.

The main goal is to learn a useful representation of the data by minimizing the difference between the original and reconstructed data.
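
As a quick illustration, a minimal autoencoder might look like this (the layer sizes are arbitrary):

import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder: compress the input into a compact code
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder: reconstruct the input from the code
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyAutoencoder()
x = torch.rand(16, 784)                      # e.g. flattened 28x28 images
loss = nn.functional.mse_loss(model(x), x)   # minimize the reconstruction error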

There is a detailed article about autoencoders on Habr: https://habr.com/ru/companies/skillfactory/articles/671864/

Masked Autoencoders

Masked Autoencoder Operation

The main idea is to restore the original data from their partial or distorted versions. To do this, the model is trained to understand the context and structure of the data by hiding part of the input data using masking.

The process can be divided into three stages:
1) Masking:

  • The input image is split into patches (small blocks).

  • A random or fixed mask is used to “hide” certain parts of the data.

  • The visible (unmasked) patches are then passed to the encoder.

2) Encoder:

The important part is that the encoder does not see the masked parts of the data: it learns to capture the structure and context based only on the data available to it.

3) Decoder:

  • The decoder receives the encoder's hidden representations of the visible patches together with mask tokens for the hidden positions. Its goal is to reconstruct the full data, including the masked parts.

  • The reconstruction is then scored by how much it differs from the original content at the masked positions; the most commonly used loss is the mean squared error (MSE).
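
Putting the three stages together, one training step of a masked autoencoder can be sketched roughly as follows (a simplification: here the masked patches are simply zeroed out before the encoder, whereas MAE-style models drop them entirely and feed only the visible patches to the encoder):

import torch
import torch.nn as nn

def masked_reconstruction_loss(patches, encoder, decoder, mask_ratio=0.75):
    """patches: (B, N, D) flattened patches. Hide a random fraction of them
    and score the reconstruction only on the hidden ones (MSE)."""
    B, N, D = patches.shape
    num_masked = int(N * mask_ratio)

    # Random mask: True = the patch is hidden from the encoder
    ids = torch.rand(B, N).argsort(dim=1)[:, :num_masked]
    mask = torch.zeros(B, N, dtype=torch.bool).scatter_(1, ids, True)

    visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)   # zero out hidden patches
    reconstruction = decoder(encoder(visible))               # predict the full sequence

    # MSE is computed only over the masked positions
    return ((reconstruction - patches) ** 2)[mask].mean()

# Toy stand-ins for the encoder and decoder, just to show the shapes
enc, dec = nn.Linear(768, 256), nn.Linear(256, 768)
loss = masked_reconstruction_loss(torch.randn(2, 196, 768), enc, dec)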

VideoMAE

We have already established that MAE is a model that restores masked parts of the data. In the context of video, this also means working with the temporal dimension: some fragments of frames, or even entire frames, are selectively hidden, and the model then has to restore them. This forces the model to learn both local and global features of each frame, improving its overall understanding of the video content, and lets it discover patterns and structure in the video on its own, without relying on pre-labeled data.

Thus, VideoMAE is an autoencoder that acts as a data-efficient tool for self-supervised learning. This technology was developed to improve the efficiency of training models on video, minimizing the high costs of data and computation.

The architecture of VideoMAE is similar to MAE, in short:

  1. Masking.

  2. Encoder.

  3. Decoder with masked patches.

  4. Video reconstruction.

VideoMAE Architecture
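
The inference code below uses a model already fine-tuned for classification, but the self-supervised part is also available in transformers as VideoMAEForPreTraining, which takes a boolean mask over the tubelet sequence and returns the reconstruction loss. A sketch based on the library's documented usage (the random mask here is purely illustrative; the paper uses tube masking with a high masking ratio, around 90%):

import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

# 16 random frames stand in for a real clip here
num_frames = 16
video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))
pixel_values = processor(video, return_tensors="pt").pixel_values

# Number of tubelets = (frames / tubelet_size) * (H / patch) * (W / patch)
num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame

# True = this tubelet is masked and must be reconstructed
bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
print(outputs.loss)   # reconstruction (MSE) loss on the masked tubelets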

Now let's move on to the code itself:

Code

The code is similar to the snippets shown earlier, with minor changes:

from transformers import VideoMAEForVideoClassification, VideoMAEImageProcessor


model_name = "MCG-NJU/videomae-base-finetuned-kinetics"
model = VideoMAEForVideoClassification.from_pretrained(model_name)
feature_extractor = VideoMAEImageProcessor.from_pretrained(model_name)

# Note that num_frames changes as well (16 frames for this model)
def load_video(video_path, num_frames=16, frame_height=480, frame_width=480):
    cap = cv2.VideoCapture(video_path)
    frames = []
    try:
        while len(frames) < num_frames:
            ret, frame = cap.read()
            if not ret:
                break
            frame = cv2.resize(frame, (frame_width, frame_height))
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(frame)
    except Exception as e:
        print(f"Error reading video: {e}")

    cap.release()

    if len(frames) < num_frames:
        raise ValueError(f"Video is too short. Needed: {num_frames}, Found: {len(frames)}")

    return frames

Top 5 Predictions VideoMAE:

  1. snowboarding: 41.77%

  2. Skiing (not slalom or crosscountry): 13.58%

  3. tobogganing: 2.46%

  4. biking through snow: 1.07%

  5. motorcycling: 0.33%

Model running time on CPU: 0.77 seconds

Model running time on GPU: 0.17 seconds

VideoMAE is less confident on this video, but its processing speed is excellent.

Training

Now let's talk about building a video classifier on our own data.

I will briefly talk about two approaches:

  1. Fine-tuning the entire model directly on new classes (end-to-end): works if you have enough computing resources and training data.

  2. Using a pretrained model only as a feature extractor in combination with a separate classifier: convenient if your set of classes is updated frequently.

Approach number 1

Here everything is absolutely identical to standard transfer learning:

# Freeze all layers except the classification head
for param in model.parameters():
    param.requires_grad = False

for param in model.classifier.parameters():
    param.requires_grad = True

# Create a dataset class for loading the data
class VideoDataset(torch.utils.data.Dataset):
    def __init__(self, video_paths, labels, num_frames=16):
        self.video_paths = video_paths
        self.labels = labels
        self.num_frames = num_frames
        self.processor = VideoMAEImageProcessor.from_pretrained(model_name)

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        video_path = self.video_paths[idx]
        label = self.labels[idx]
        frames = load_video(video_path, num_frames=self.num_frames)
        inputs = self.processor(frames, return_tensors="pt")
        return inputs['pixel_values'].squeeze(0), label

      
'''
Next, set up the paths to our data,
choose a suitable loss function and run a standard training loop
'''
dataset = VideoDataset(video_paths, labels)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True)

for epoch in range(num_epochs):
  ...
  

The output is weights that we can use for inference on new video data.
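
For completeness, the training loop elided above could look roughly like this; the optimizer, learning rate, loss function and number of epochs are my assumptions rather than prescribed values:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

criterion = torch.nn.CrossEntropyLoss()
# Only the unfrozen classifier parameters are optimized
optimizer = torch.optim.AdamW(model.classifier.parameters(), lr=1e-4)

num_epochs = 5
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for pixel_values, labels in dataloader:
        pixel_values = pixel_values.to(device)
        labels = labels.to(device)            # assumes integer class indices

        optimizer.zero_grad()
        outputs = model(pixel_values=pixel_values)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

    print(f"Epoch {epoch + 1}: loss = {running_loss / len(dataloader):.4f}")

Note that if your classes differ from the Kinetics-400 ones, the model should be loaded with a new head of the right size, e.g. from_pretrained(model_name, num_labels=..., ignore_mismatched_sizes=True), so that the classifier matches your number of classes.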

Approach number 2

The main idea is to use the transformer as a feature extractor, save the obtained features and then train the classifier model.

The advantage of this approach is that there is no need to constantly retrain the transformer models: you can simply extract features for new data and train a new lightweight classifier, which reduces computational and time costs.

The obvious downside is that the feature extractor must suit your task, i.e. it has to be able to extract features that are informative enough for further analysis.

Let's very roughly depict the approach in code:

import os
import torch

# device was not defined earlier in this sketch, so set it here
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Turn prediction into feature extraction
def extract_features(video_path):
    frames = load_video(video_path)
    inputs = processor(frames, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    features = outputs.hidden_states[-1].mean(dim=1).squeeze(0)  # average over all tokens -> (hidden_size,)
    return features.cpu()

# Save the extracted features
def save_features(features, save_path):
    torch.save(features, save_path)

# Paths to our videos (this format is just for the example)
video_paths = {
    'class1': ['video1.mp4'],
    'class2': ['video2.mp4'],
}

# Extract and save the features to separate files (again, just an example)
save_dir = "features"  # example output directory
for class_name, paths in video_paths.items():
    class_save_dir = os.path.join(save_dir, class_name)
    os.makedirs(class_save_dir, exist_ok=True)
    for video_path in paths:
        features = extract_features(video_path)
        video_name = os.path.basename(video_path).split('.')[0]
        save_path = os.path.join(class_save_dir, f'{video_name}_features.pt')
        save_features(features, save_path)

Then we can do whatever we want with the obtained features. One option is to train a classifier with triplet loss:

from pytorch_metric_learning import losses
from pytorch_metric_learning.miners import TripletMarginMiner
from torch.utils.data import Dataset
import torch.nn as nn

# Dataset class for loading the previously saved features
class FeaturesLoad(Dataset):
    def __init__(self, features_dir):
      ...
    def __getitem__(self, idx):
      ...
      return feature, label

# Define the classifier model or use a ready-made one
class SimpleCls(nn.Module):
  ...

model = SimpleCls(...).to(device)
criterion = losses.TripletMarginLoss(margin=0.1)
triplet_miner = TripletMarginMiner(margin=0.2, type_of_triplets="hard")

# Training
for epoch in range(num_epochs):
    for features, labels in dataloader:
        features, labels = features.to(device), labels.to(device)
        
        optimizer.zero_grad()
        embeds = model(features)
        triplets = triplet_miner(embeds, labels)
        loss = criterion(embeds, labels, triplets)
        loss.backward()
        optimizer.step()
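
At inference time such a metric-learning classifier is usually applied by comparing embeddings. A hypothetical nearest-centroid variant (where class centroids are the mean embeddings of each class's training videos) could look like this:

import torch
import torch.nn.functional as F

@torch.no_grad()
def predict_class(model, feature, class_centroids):
    """class_centroids: dict {class_name: mean training embedding}.
    Returns the class whose centroid has the highest cosine similarity."""
    model.eval()
    emb = model(feature.unsqueeze(0).to(device)).squeeze(0)
    scores = {name: F.cosine_similarity(emb, centroid.to(device), dim=0).item()
              for name, centroid in class_centroids.items()}
    return max(scores, key=scores.get)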

Comparison of models

Let's briefly talk about which model to choose in which situations:

When to choose ViViT:

  • When high accuracy is required.

  • If you have large computing resources (GPU/TPU).

  • For working with large video datasets where good model scalability is critical.

  • When inference and training speed are not critical.

When to choose TimeSformer:

  • When you need to work with long video fragments.

  • If your video classes contain fragments of different lengths.

  • If tasks require taking into account both temporal and spatial dependencies (e.g., object tracking and scene understanding tasks).

When to choose VideoMAE:

  • If performance and saving computing resources are important.

  • If a large number of videos contain noise or distortion.

  • When robustness to incomplete data is required.

  • When the goal is to reduce computational costs without significant loss of accuracy.

Conclusion

I hope this article was useful and will become a starting point for learning video analytics and video classification! Thank you for your attention!

Links to articles:

  1. “ViViT: A Video Vision Transformer”

  2. “Is Space-Time Attention All You Need for Video Understanding?”

  3. “VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training”
