How to teach LLMs to understand videos? Overview of approaches

Hi all! Today we’ll talk about the task of analyzing and understanding videos, and about how approaches to training multimodal large language models for it have evolved.

Release timeline of vision models

Video Understanding task

Video understanding is an area at the intersection of computer vision (CV) and natural language processing (NLP) that covers many different tasks of perceiving and interpreting video: from basic recognition of objects and actions in a video sequence, localization of objects in space or time, and counting of objects and people, to generating short or detailed video descriptions and reasoning about why things happen in a video, which requires a deep understanding of the world – from human psychology to the physical properties of objects.

The rapid development of Vision LLMs (VLLMs) – LLMs with support for visual modalities – in 2023-2024 brought neural networks’ understanding of video much closer to the way humans understand it. VLLMs can answer a wide variety of questions about videos in natural language. Instruction tuning allows a single model to be trained to solve multiple video understanding tasks, and the LLM’s large body of knowledge and understanding of diverse contexts lets a VLLM analyze video content and draw complex inferences.

Flamingo

The first significant work on VLLMs was Flamingo (NeurIPS 2022); it contained many interesting ideas that were reused many times in subsequent papers.

For example:

  • Interleaved training data, where image tokens can appear in any part of the text prompt, not just at the beginning or the end.

  • The vision encoder is frozen during the entire training.

  • Video frames are encoded independently (like images), but with trainable temporal embeddings added to account for the temporal dimension.

  • A Perceiver Resampler turns the variable-length visual embeddings produced by the visual encoder into a compact fixed-length representation (64 tokens) before the visual information is fed to the LLM (see the sketch after this list).

  • Next-token prediction as the training objective.
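To make the resampler idea concrete, here is a minimal PyTorch-style sketch of a Perceiver-Resampler-like module: a set of learnable latent queries cross-attends to a variable-length sequence of visual features and always returns 64 tokens. The sizes, layer count, and the absence of feed-forward sublayers are simplifications for illustration, not Flamingo’s exact configuration.

```python
# A minimal sketch of a Perceiver-Resampler-style module (illustrative, not Flamingo's exact design).
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8, num_layers=2):
        super().__init__()
        # Learnable latent queries: the fixed-length output (64 tokens).
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):
        # visual_feats: (batch, seq_len, dim) with an arbitrary seq_len
        b = visual_feats.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            # Latents attend to the variable-length visual features.
            out, _ = attn(latents, visual_feats, visual_feats)
            latents = latents + out
        return self.norm(latents)  # (batch, 64, dim) regardless of the input length

resampler = PerceiverResampler()
frames = torch.randn(2, 3 * 257, 1024)   # e.g. 3 frames x 257 patch tokens each
print(resampler(frames).shape)           # torch.Size([2, 64, 1024])
```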

Among the things rarely seen in subsequent works:

  • A Normalizer-Free ResNet (NFNet) is used as the visual encoder.

  • Video frames are sampled very frequently (1 FPS).

  • The language model is frozen throughout training. Instead, new trainable cross-attention layers are inserted between the LLM blocks to pass visual information into the LLM (a gated cross-attention sketch follows this list).

  • Instead of generalizing the model to a large number of different image- and video-understanding tasks, the emphasis is on few-shot in-context learning, where all the information about a new task and examples of solving it are given to the model directly in the text-visual prompt.
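Below is a rough sketch of such a gated cross-attention block wired between frozen LLM layers. It assumes the visual tokens have already been projected to the LLM hidden size; the real Flamingo block also contains a gated feed-forward sublayer, which is omitted here.

```python
# A rough sketch of a gated cross-attention block between frozen LLM layers (simplified).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # tanh gate initialized at zero: at the start of training the block acts as
        # an identity mapping, so the frozen LLM behaves exactly as before.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (batch, text_len, dim); visual_tokens: (batch, vis_len, dim)
        attended, _ = self.attn(self.norm(text_hidden), visual_tokens, visual_tokens)
        return text_hidden + torch.tanh(self.gate) * attended
```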

LLaVA

LLaVA (April 2023) and LLaVA 1.5 (October 2023) are VLLMs for image understanding, without video support. These papers laid the architectural and code foundation for many subsequent works on VLLMs for video understanding:

  • Architecturally, the model consists of a CLIP-based visual encoder, an MLP adapter (a single linear layer in the first version) between the CLIP visual embedding space and the LLM, and a large language model.

  • Two-stage training: first only the adapter is trained, then both the adapter and the LLM (see the sketch after this list).

  • At the first stage, the training data consists of simple tasks with images and brief descriptions of them (image captioning). At the second stage, more complex tasks are used that require detailed perception and analysis of images (various kinds of VQA, visual question answering).
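A minimal sketch of this setup is shown below: a projector between the CLIP embedding space and the LLM (a single linear layer for LLaVA v1, a two-layer MLP for LLaVA 1.5) plus a helper that freezes or unfreezes components depending on the training stage. Dimensions and helper names are illustrative assumptions.

```python
# A minimal sketch of the LLaVA-style projector and the two training stages.
import torch.nn as nn

def build_projector(vision_dim=1024, llm_dim=4096, use_mlp=True):
    if not use_mlp:                       # LLaVA v1: a single linear layer
        return nn.Linear(vision_dim, llm_dim)
    return nn.Sequential(                 # LLaVA 1.5: a two-layer MLP with GELU
        nn.Linear(vision_dim, llm_dim),
        nn.GELU(),
        nn.Linear(llm_dim, llm_dim),
    )

def set_stage(vision_encoder, projector, llm, stage):
    # Stage 1: only the projector is trained; stage 2: projector + LLM.
    for p in vision_encoder.parameters():
        p.requires_grad = False           # the CLIP encoder stays frozen in both stages
    for p in projector.parameters():
        p.requires_grad = True
    for p in llm.parameters():
        p.requires_grad = (stage == 2)
```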

VideoChat, Video-ChatGPT, Valley

In May-June 2023, a whole series of works was published – VideoChat, Video-ChatGPT, Valley – whose authors sought to “humanize” VLLMs, that is, to improve how they work as chatbots or assistants. Each paper proposes an approach to generating instructional dialogues for training that teach the LLM to hold dialogues about videos and answer a variety of questions. In all three cases, the dialogues are generated with ChatGPT from video descriptions.

For example, Valley generates three types of tasks: “detailed description”, “dialogue”, and “complex reasoning”. In addition, Valley uses:

  • two-stage training and the basic architecture from LLaVA;

  • a shared encoder for images and videos – CLIP with a ViT-L/14 architecture;

  • a temporal module based on a single-layer transformer encoder and average pooling – used to model the temporal variability of frames and to aggregate the video tokens that are sent to the LLM (frames are initially sampled from the video at 0.5 FPS); a sketch of such a module follows this list.
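As a rough illustration of such a temporal module (with made-up dimensions, not the paper’s exact configuration), the sketch below runs per-frame features through one transformer-encoder layer and average-pools along the time axis:

```python
# A hedged sketch of a Valley-style temporal module: one transformer-encoder layer
# over per-frame features followed by average pooling along the time axis.
import torch
import torch.nn as nn

class TemporalModule(nn.Module):
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim), one pooled embedding per frame
        temporal = self.encoder(frame_feats)      # model interactions between frames
        return temporal.mean(dim=1)               # average-pool over time -> (batch, dim)

module = TemporalModule()
video = torch.randn(2, 16, 1024)                  # e.g. 16 frames sampled from a clip
print(module(video).shape)                        # torch.Size([2, 1024])
```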

Unlike Valley, in Video-ChatGPT:

  • the LLaVA model that has already been pretrained on images is used as the starting point;

  • average pooling is applied in two ways: across all spatial tokens within each frame (yielding per-frame features that capture temporal variability) and, separately, along the temporal dimension across the corresponding tokens of different frames (yielding averaged spatial features of the entire video); see the sketch after this list.
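A minimal sketch of this two-way pooling, assuming per-frame encoder features of shape (T, N, d), i.e. T frames with N patch tokens of dimension d:

```python
# A minimal sketch of Video-ChatGPT-style pooling over per-frame features.
import torch

def video_chatgpt_pooling(frame_feats):
    # frame_feats: (T, N, d)
    temporal_feats = frame_feats.mean(dim=1)   # (T, d): one token per frame
    spatial_feats = frame_feats.mean(dim=0)    # (N, d): patches averaged over time
    # Concatenate along the token axis -> (T + N, d) tokens fed to the adapter/LLM.
    return torch.cat([temporal_feats, spatial_feats], dim=0)

feats = torch.randn(100, 256, 1024)
print(video_chatgpt_pooling(feats).shape)      # torch.Size([356, 1024])
```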

An important innovation of Video-ChatGPT is the method proposed by the authors for evaluating VLLMs on open-ended questions using GPT-3.5: a text-only LLM judges how well the model’s prediction matches the reference against a number of criteria and gives a score from 0 to 5. Later, in Video-LLaVA, this approach was extended to simplify the computation of accuracy: the judge now returns not only a score from 1 to 5 but also a binary verdict, “yes” (correct answer) or “no” (incorrect). A judge-prompt sketch is given below.
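The sketch below shows one possible shape of such an LLM-as-judge evaluation; the prompt wording and the JSON format are illustrative rather than the exact prompts used in Video-ChatGPT or Video-LLaVA, and any judge-LLM client can be plugged in to send the prompt and collect the reply.

```python
# A rough sketch of an LLM-as-judge evaluation of this kind (illustrative prompt and format).
import json

def build_judge_prompt(question, reference, prediction):
    return (
        "You are evaluating the answer of a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Return JSON with two fields: 'pred' ('yes' if the prediction is correct, "
        "'no' otherwise) and 'score' (an integer from 1 to 5)."
    )

def parse_judge_reply(reply_text):
    # Expects a reply like {"pred": "yes", "score": 4}
    result = json.loads(reply_text)
    return result["pred"] == "yes", int(result["score"])
```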

Video-LLaMA

Video-LLaMA (June 2023) is the first VLLM that supports understanding both video and audio in video clips. A combination of a Q-Former (from BLIP-2) and a linear layer is used as the adapter between the video modality and the LLM and between the audio modality and the LLM. Interestingly, audio and video are processed independently at all stages except the LLM itself.

Video-LLaVA

Video-LLaVA (November 2023) is perhaps the most significant of the early works on VLLMs with video understanding, thanks to its good generalization to a variety of tasks and its metrics, which significantly surpass previous approaches.

  • Video and image encoders from LanguageBind, trained with contrastive learning to encode images and videos into a shared embedding space (which makes it possible to use a single MLP adapter for both modalities).

  • A fixed number of frames is sampled uniformly from every video, regardless of its length, which removes the need for a Perceiver Resampler or Q-Former to obtain a small fixed number of video tokens (see the sampling sketch after this list).

  • Reuse of data from LLaVA (558 thousand and 349 thousand images, 558 thousand and 624 thousand dialogues for the first and second training stages, respectively), Valley (229 thousand videos and 703 thousand dialogues based on them for the first stage), and Video-ChatGPT (13 thousand videos and 100 thousand dialogues for the second stage).
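Uniform sampling itself is a one-liner; the sketch below assumes the decord library for frame decoding, but any video reader works the same way.

```python
# A small sketch of uniform frame sampling: a fixed number of frames regardless of
# video length, so every clip maps to the same number of visual tokens.
import numpy as np
from decord import VideoReader

def sample_frames(video_path, num_frames=8):
    vr = VideoReader(video_path)
    # Evenly spaced indices from the first to the last frame.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    return vr.get_batch(indices).asnumpy()   # (num_frames, H, W, 3)
```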

LITA

LITA (Language Instructed Temporal-Localization Assistant), a paper published in March 2024, is devoted to improving the ability of VLLMs to understand the temporal aspect of video, especially in the task of localizing objects, people, events, etc. in time (temporal localization). To achieve this, the authors introduced a number of innovations:

  • Time tokens that encode the relative positions of frames within a video. Each video is divided into T equal intervals (T = 100), and the relative positions of the intervals are encoded with time tokens from <1> to <T>.

  • “Slow” and “fast” frame tokens (SlowFast tokens). First, T frames are uniformly sampled from the video and each is run independently through a visual encoder (CLIP-L/14), so the video is represented by 100 × 256 visual tokens. To obtain the “fast” frame tokens, all 256 tokens of each frame are averaged, giving 100 “fast” tokens per video. “Slow” tokens come from far fewer frames – only 4 per video – and within each frame the number of tokens is reduced by a factor of 4 with 2 × 2 spatial average pooling, giving 256 “slow” tokens per video. All 356 tokens are then concatenated and fed to the LLM (see the sketch after this list).

  • A new dataset for training and evaluating temporal-localization skills, ActivityNet-RTL. It introduces a new type of task, RTL (Reasoning Temporal Localization), i.e. localization in time that requires reasoning: the questions are designed so that, to answer correctly, the model must combine an understanding of the temporal structure of the video with knowledge about the world. For example, to answer the question “When does the dance become more energetic?”, the model needs not only to navigate the timeline but also to understand what an “energetic dance” is.
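Here is a hedged sketch of both ideas: mapping a timestamp to a time token and building the 100 “fast” plus 256 “slow” tokens. Shapes follow the description above; the actual LITA implementation may differ in details, and the function names are made up for illustration.

```python
# A hedged sketch of LITA-style time tokens and SlowFast visual tokens.
import torch

def timestamp_to_time_token(t_sec, duration_sec, num_chunks=100):
    # Relative position of a timestamp -> a token index in <1> ... <num_chunks>
    idx = min(int(t_sec / duration_sec * num_chunks) + 1, num_chunks)
    return f"<{idx}>"

def slowfast_tokens(frame_feats, num_slow_frames=4):
    # frame_feats: (T, 16, 16, d), i.e. T=100 frames with 16x16=256 patch tokens each
    T, H, W, d = frame_feats.shape
    fast = frame_feats.reshape(T, H * W, d).mean(dim=1)           # (100, d)
    slow_idx = torch.linspace(0, T - 1, num_slow_frames).long()   # 4 evenly spaced frames
    slow = frame_feats[slow_idx]                                  # (4, 16, 16, d)
    # 2x2 spatial average pooling: 256 -> 64 tokens per slow frame
    slow = slow.reshape(num_slow_frames, H // 2, 2, W // 2, 2, d).mean(dim=(2, 4))
    slow = slow.reshape(num_slow_frames * (H // 2) * (W // 2), d) # (4 * 64 = 256, d)
    return torch.cat([fast, slow], dim=0)                         # (356, d)

feats = torch.randn(100, 16, 16, 1024)
print(slowfast_tokens(feats).shape)          # torch.Size([356, 1024])
print(timestamp_to_time_token(12.0, 60.0))   # <21>
```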

PLLaVA

PLLaVA (Pooling LLaVA) was released in April 2024. The main feature of the approach, as the name suggests, is a simple and effective strategy for pooling visual features.

Main features:

  • As in Video-ChatGPT, it builds on a model that has already been trained on the image modality (and not just on text) – here it is LLaVA-Next.

  • It is shown experimentally that previous approaches to aggregating visual tokens (simple concatenation, as in Video-LLaVA, or average pooling over all dimensions, as in Video-ChatGPT) lead to degraded metrics when the dataset grows or when text prompts unseen during training are used.

  • An alternative dimensionality-reduction strategy is proposed: only 16 frames are sampled from the video, and adaptive average pooling is applied only over the spatial resolution (averaging along the time axis led to worse results). The LLM receives a tensor of temporal representations of size 16 × 12 × 12 × d, where 16 is the number of frames, 12 × 12 is the number of tokens per frame after pooling, and d depends on the hidden size of the particular LLM (see the pooling sketch after this list).

  • During additional training on video data, the adapter is fully unfrozen, while the LLM is trained with LoRA.

  • The final model showed SOTA results on a number of video benchmarks and performed especially well at generating detailed video descriptions.
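A minimal sketch of this spatial-only pooling is shown below; the 24 × 24 input grid is an assumed example size, and the essential part is that adaptive average pooling touches only the spatial dimensions and leaves the 16 frames intact.

```python
# A minimal sketch of PLLaVA-style spatial-only pooling of per-frame token grids.
import torch
import torch.nn.functional as F

def spatial_pool(frame_feats, out_hw=(12, 12)):
    # frame_feats: (T, H, W, d), e.g. (16, 24, 24, d) for an assumed per-frame grid
    T, H, W, d = frame_feats.shape
    x = frame_feats.permute(0, 3, 1, 2)                 # (T, d, H, W)
    x = F.adaptive_avg_pool2d(x, out_hw)                # (T, d, 12, 12), time axis untouched
    return x.permute(0, 2, 3, 1)                        # (T, 12, 12, d)

feats = torch.randn(16, 24, 24, 1024)
print(spatial_pool(feats).shape)                        # torch.Size([16, 12, 12, 1024])
```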

Video-SALMONN

Video-SALMONN (June 2024) is the first approach in which the LLM can synthesize information from the video sequence together not only with accompanying sounds but also with speech. Main features of the work:

  • InstructBLIP is used to encode video frames; BEATs and Whisper are used for audio and speech, respectively. The audio signal is sampled at 50 Hz and the video at 2 Hz (2 FPS). All features are then synchronized in time every 0.5 seconds and concatenated before being fed into the adapter: the embedding of each frame is combined with two sets of 25 audio embeddings (speech and non-speech). If one of the modalities is missing in a particular sample, a vector of zeros is used instead of its embeddings (a sketch of this alignment is given below).

  • An MRC Q-Former (multi-resolution causal Q-Former) is used as the adapter. Its output is a fixed-size shared embedding for all three modalities. The adapter operates at two levels of temporal resolution, high and low.

  • The LLM weights are not fully unfrozen: additional training is done with LoRA (low-rank adaptation).

Visual encoder and combination of speech and audio encoders
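One possible reading of this alignment scheme is sketched below (the exact concatenation layout in video-SALMONN may differ, and the function and argument names are made up for illustration): each frame embedding at 2 FPS is paired with the 25 audio and 25 speech feature vectors that fall into the same 0.5-second window, with zeros standing in for a missing modality.

```python
# A hedged sketch of 0.5-second synchronization of video (2 FPS) with audio/speech (50 Hz).
import torch

def synchronize(video_feats, audio_feats, speech_feats, d_audio, d_speech):
    # video_feats: (T, d_v), one frame per 0.5 s;
    # audio_feats / speech_feats: (25 * T, d_a) / (25 * T, d_s) at 50 Hz, or None if absent.
    T = video_feats.size(0)
    if audio_feats is None:                      # missing modality -> zero vectors
        audio_feats = torch.zeros(25 * T, d_audio)
    if speech_feats is None:
        speech_feats = torch.zeros(25 * T, d_speech)
    chunks = []
    for t in range(T):                           # one chunk per 0.5-second window
        a = audio_feats[25 * t: 25 * (t + 1)].flatten()
        s = speech_feats[25 * t: 25 * (t + 1)].flatten()
        chunks.append(torch.cat([video_feats[t], a, s]))
    return torch.stack(chunks)                   # (T, d_v + 25 * d_a + 25 * d_s)
```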

LLaVA-OneVision

LLaVA-OneVision (August 2024) is a SOTA approach to understanding three visual modalities at once: single images, multiple images, and video. Key features:

  • Classic architecture (visual encoder, MLP adapter, LLM).

    Model training stages

  • Interleaved data: visual embeddings can appear in any part of the prompt.

  • A SigLIP visual encoder is used (each image is encoded into 729 tokens). It is shared by all modalities and is unfrozen at most stages of training (except the first).

  • A lot of high-quality data (including previously published datasets from other authors, filtered and relabeled as part of this work).

  • The Higher AnyRes image-processing method: before visual encoding, the image is cut into several crops (up to 36) depending on its resolution, each of which is encoded and passed through the adapter independently. The resulting embeddings are concatenated and, if needed, reduced to at most 729 × 9 tokens using bilinear interpolation (see the sketch after this list).

  • An effective approach to adapting to video and multi-image inputs: the “heavier” modalities – video and multiple images – are added only at the last stage of training. Before being fed to the LLM, the embeddings of all modalities are brought to roughly the same size (6–8 thousand visual tokens per dialogue).

  • As in Video-LLaVA, video frames are sampled uniformly, but thanks to compression to 196 tokens per frame it was possible to efficiently train the LLM on 32 frames instead of 8.
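The token-reduction step can be sketched as follows; the grid arrangement and the exact budget check are assumptions for illustration, and the key point is the bilinear downscaling of the concatenated token grid.

```python
# A rough sketch of reducing the concatenated AnyRes token grid with bilinear interpolation.
import torch
import torch.nn.functional as F

def reduce_tokens(token_grid, max_tokens=729 * 9):
    # token_grid: (H, W, d), tokens from all crops arranged on a 2-D grid
    H, W, d = token_grid.shape
    if H * W <= max_tokens:
        return token_grid
    scale = (max_tokens / (H * W)) ** 0.5        # shrink both sides by the same factor
    new_h, new_w = int(H * scale), int(W * scale)
    x = token_grid.permute(2, 0, 1).unsqueeze(0)              # (1, d, H, W)
    x = F.interpolate(x, size=(new_h, new_w), mode="bilinear", align_corners=False)
    return x.squeeze(0).permute(1, 2, 0)                      # (new_h, new_w, d)
```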

Conclusion

Today we talked about the key works on training large language models to understand video and how approaches to this task have evolved over the past few years. This is not an exhaustive list, but we tried to cover the most fundamental and interesting works. Next time we’ll tell you how we teach LLMs to understand videos and hold a dialogue about them in Russian, and how we evaluate this skill in order to compare different models with each other.

Thanks to everyone who participated in research in this area. And special thanks to the author of the article, Marina Yaroslavtseva @magoli, leader of the video track in multimodality on the RnD XR team.
