How to teach LLMs to understand videos? Part 2

As a result of training and evaluating only on English-language data, we ended up with a model that showed good metrics on standard English-language benchmarks but made glaring mistakes when communicating in Russian.

An example of a model response in which Russian and English are mixed

The most common problem is answers given entirely in English, or switching to English partway through the answer. To build a model evaluation system that reflects quality in real product scenarios, we decided to collect and annotate our own benchmark in Russian.

How we made a Russian-language video benchmark for Video-LLM

To assess an LLM's ability to understand the video modality, the task category “video question answering” (VideoQA) is usually used: the model receives a video and a natural-language instruction (question) as input and must answer strictly in accordance with the content of the video and the instruction. For our benchmark, we chose two VideoQA question formats: multiple-choice (MC) questions, where the model selects the correct answer from several options, and open-ended (OE) questions, where the model generates a free-form answer.

The first format tests the ability to choose the correct answer from those proposed, while the second evaluates the quality of generation in natural (Russian) language. These formats make it fairly easy to evaluate models across different skills: each question tests a specific skill, such as Action Recognition, Spatial Reasoning, Temporal Perception, and others.
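For illustration, a benchmark entry in each format might look roughly like the sketch below. The field names and file names here are hypothetical and serve only to show the two formats; the actual storage schema may differ. The multiple-choice example reuses the question about the turning car given later in this article.

    # Hypothetical record structure; field names are assumptions, not the real schema.

    # Multiple-choice (MC) question: the model must return the letter of the correct option.
    mc_example = {
        "video": "dashcam_clip.mp4",  # placeholder file name
        "question": "In which direction does the car from which the video is being filmed turn at the beginning of the video?",
        "options": {"A": "don't know", "B": "sideways", "C": "right", "D": "left", "E": "backwards"},
        "answer": "D",
        "skill": "Spatial Perception",
    }

    # Open-ended (OE) question: the model must generate a free-form answer in Russian.
    oe_example = {
        "video": "cooking_clip.mp4",  # placeholder file name
        "question": "What does the person cook?",
        "answer": "The person is cooking scrambled eggs.",
        "skill": "Object Recognition",
    }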

Translation of academic benchmarks

The Russian-language benchmark was based on academic English-language benchmarks. The selection took into account the variety of topics (everyday situations, activities, humorous videos, etc.), different levels of complexity, and the range of skills tested. The selected datasets were manually translated into Russian by a team of professional editors. The table below shows the benchmarks, their sizes, and topics:

| Benchmark | Topics | Number of questions | Skills tested |
| --- | --- | --- | --- |
| ActivityNet | Daily activities | 1162 | Action Recognition, Attribute Perception, Object Recognition |
| FunQA | Humor, magic, creativity | 175 | Action Reasoning, Object Reasoning, Spatial Reasoning, Temporal Reasoning |
| MSRVTT | Music, people, games, sports, news, education, TV shows, animation, transport, how-to, travel, science, animals, children, documentaries, food, cooking, beauty, advertising | 1524 | Action Recognition, Attribute Perception, Object Recognition |
| MSVD | Daily activities | 1797 | Action Recognition, Counting Problems, Object Recognition |
| SQA3D | 3D scenes | 1656 | Spatial Perception, Spatial Reasoning |
| STAR | Everyday situations | 420 | Action Recognition, Object Recognition |
| TGIF | Everyday situations | 1785 | Counting Problems, Temporal Perception, Temporal Reasoning |
| NExT-QA | Everyday situations | 1895 | Action Reasoning, Action Recognition, Temporal Perception, Temporal Reasoning |
| PAXION | Daily activities, walking outdoors, entertainment, work, sports, household chores, handicrafts | 159 | Action Reasoning, Action Recognition, Temporal Perception, Temporal Reasoning |

Collecting Additional Data

Final benchmark structure

After analyzing the topics and skills covered by the academic benchmarks, it became clear that the Russian-language benchmark lacked data reflecting the Russian cultural code. We also noticed that academic benchmarks contain almost no videos focused on practical, applied tasks (how-to and DIY videos).

For the selected applied topics, material was collected automatically from open sources. The collected videos were then filtered (video validation) on the ABC Elementary annotation platform according to the following criteria: the video is of good quality, does not contain 18+ content, and contains noticeable motion.

At the next stage, annotators composed questions of varying complexity for the filtered set of videos: one simple, one medium, and two complex. In the same task, the annotators also filled in the answers.

Example:

Simple question: What is in the hands of a person on a bicycle?

Medium question: What crime did the person in the video commit?

Complex question: What unexpected action makes the video funny?

It was also required that the questions created be based on the specifics of the topic of the video, for example:

video: “In the video, a man is cooking scrambled eggs”;

specific question: “What does a person cook?”;

general question: “What room is the person in?”

When composing questions for a video, annotators had to rely only on the visual component (the audio modality was not considered in this annotation).

At the final annotation stage, the correctness of the questions and of the answers to the videos was checked. If a question and its answer were correct, incorrect answer options were then filled in (with the requirement that at least two of the incorrect options be similar to the correct one). The answer choices for the question were then formed from these incorrect options and the correct answer.

Example

Question: “In which direction does the car from which the video is being filmed turn at the beginning of the video?”;

Correct answer: “To the left”;

Answer options: A: don't know, B: sideways, C: right, D: left, E: backwards.

Below is general information about the data collected in Russian from scratch. The listed skills correspond to the most frequent question types.

| Category | Topics | Number of questions | Skills |
| --- | --- | --- | --- |
| Cultural code | Films (Soviet and Russian) | 1069 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Object Localization, Object Recognition, Spatial Perception |
| Cultural code | Dances and celebrations (folk dances, Maslenitsa, Ivan Kupala) | 698 | Action Recognition, Counting Problems, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| Cultural code | History, historical reconstructions | 121 | Action Reasoning, Action Recognition, Attribute Perception, Object Recognition, Special QA, Temporal Perception, Temporal Reasoning |
| Cultural code | Cartoons (Soviet and Russian) | 415 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Temporal Perception |
| Cultural code | Theater | 166 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Spatial Perception, Special QA |
| Cultural code | Humor | 845 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Spatial Perception, Special QA, Temporal Action, Temporal Perception |
| Cultural code | Board games (backgammon, checkers, chess, etc.) | 346 | Action Recognition, Attribute Perception, Counting Problems, Special QA |
| Cultural code | Sports games (lapta, gorodki, fist fighting, etc.) | 460 | Action Reasoning, Counting Problems, Object Localization, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| How-to | Traffic rules (lessons, exams) | 3289 | Action Recognition, Attribute Perception, Counting Problems, Object Localization, Spatial Reasoning, Special QA, Temporal Perception |
| How-to | Sports and fitness (yoga, CrossFit, Pilates, etc.) | 233 | Action Reasoning, Counting Problems, Object Localization, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| How-to | Food recipes | 483 | Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Special QA |
| How-to | Home renovation | 2220 | Action Reasoning, Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Spatial Perception, Special QA, Temporal Perception |
| How-to | Device repair | 2367 | Action Reasoning, Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Spatial Perception, Special QA |

Analysis of collected data

The resulting benchmark contains 23,253 questions testing various skills. We checked that the distribution of correct answers across the option letters is close to uniform: A: 5263, B: 5310, C: 4876, D: 4719, E: 3085. Most questions have 5 answer options.
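As a sanity check, the class balance can be computed with a few lines of Python. This is only a sketch: load_correct_answers() is a hypothetical helper returning the list of correct-answer letters.

    from collections import Counter

    answer_letters = load_correct_answers()  # hypothetical helper, returns e.g. ["A", "D", ...]
    distribution = Counter(answer_letters)
    print(distribution)  # Counter({'B': 5310, 'A': 5263, 'C': 4876, 'D': 4719, 'E': 3085})

    # Share of each answer class among all questions.
    total = sum(distribution.values())
    for letter, count in sorted(distribution.items()):
        print(f"{letter}: {count / total:.1%}")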

In total, the benchmark contains 8,865 unique videos. Below are statistics on video length. The distribution of the number of examples by duration in seconds is as follows:

| Duration, seconds | Number of examples |
| --- | --- |
| <=10 | 2420 |
| 10-30 | 2364 |
| 30-60 | 1027 |
| 60-240 | 3010 |
| >240 | 44 |

Most of the videos in the benchmark are shorter than 4 minutes. Video duration statistics by dataset (the extreme values of average length are highlighted in bold):

| Benchmark | Average video length | Minimum video length | Maximum video length |
| --- | --- | --- | --- |
| ActivityNet | 0:01:51 | 0:00:05 | 0:04:46 |
| FunQA | 0:00:24 | 0:00:02 | 0:01:56 |
| MSRVTT | 0:00:15 | 0:00:10 | 0:00:30 |
| MSVD | 0:00:10 | 0:00:02 | 0:00:47 |
| SQA3D | 0:00:05 | 0:00:02 | 0:00:14 |
| STAR | 0:00:31 | 0:00:08 | 0:01:13 |
| TGIF | **0:00:04** | 0:00:02 | 0:03:20 |
| NExT-QA | 0:00:40 | 0:00:10 | 0:03:00 |
| PAXION | **0:00:04** | 0:00:01 | 0:00:06 |
| Films | 0:00:21 | 0:00:05 | 0:08:07 |
| Cartoons | 0:00:09 | 0:00:05 | 0:00:37 |
| Dances and celebrations | 0:01:36 | 0:00:07 | 0:03:00 |
| History | 0:00:50 | 0:00:08 | 0:02:55 |
| Humor | 0:02:08 | 0:00:09 | 0:03:00 |
| Food recipes | 0:00:47 | 0:00:11 | 0:02:58 |
| Sports games | 0:00:55 | 0:00:06 | 0:03:00 |
| Board games | 0:01:10 | 0:00:06 | 0:02:56 |
| Theater | 0:01:26 | 0:00:08 | 0:02:51 |
| Traffic rules | 0:03:28 | 0:00:02 | 0:04:04 |
| Sports and fitness | 0:00:45 | 0:00:31 | 0:03:59 |
| Home renovation | 0:02:18 | 0:00:05 | 0:04:02 |
| Device repair | **0:03:33** | 0:01:19 | 0:04:02 |

Academic sets mostly contain short videos. With the newly collected data, we aimed not only to broaden the range of topics but also to increase the average video length.

The most common questions in the benchmark

The diagram shows the most common questions in the benchmark. Questions starting with “I'm standing”/“I'm sitting” come from the SQA3D benchmark, where a question consists of two parts: a description of the scene plus a question testing spatial orientation skills. Action recognition stands out as the most popular skill. Note also the variety of question wordings, which makes the evaluation more robust to a model's “memorization” of particular question templates during training.

Generation of instructional Russian-language data

Now we can evaluate the quality of models on a Russian-language benchmark and compute metrics on it. The next step is to increase the amount and variety of Russian-language data in model training. To solve this efficiently, we decided to generate an instructional Russian-language set with an LLM, based on text descriptions of video recordings.

Basic approach

To generate the instruction set, we adapted the prompt described in the paper JourneyDB: A Benchmark for Generative Image Understanding. The main idea is to provide text descriptions of a video and, based on them, create a dialogue between the LLM and a person, as if the neural network were watching the video itself rather than reading the text. For greater variety, we used topics such as films, cartoons, graphic design, and animation. Below are example descriptions.

Movies:
A woman in a dark dress walks in a dark room and says something, she approaches a man, straightens her hair and says something. The walls are made of red brick, smeared with cement.

Cartoons:
In the Soviet cartoon “Barankin, be a man!” Five white butterflies fly across the blue sky. They land on the brown roof of a house. Antennas are visible on the roof of the house.

Animation:
Beams of energy move from screen to screen, from one panel to another, lighting them up in different colors.

Next, we generated Russian-language dialogues based on the descriptions, following this plan:

  1. We empirically adapted the prompt for extracting context and questions in order to generate Russian-language instructional data. The prompt requires that the questions relate to the visual content of the video. Example result:

    {
     "Style": ["туристический", "информативный", "спокойный"],
     "Content": [
       "достопримечательности",
       "гид рассказывает о здании",
       "архитектурные элементы"
     ],
     "Atmosphere": ["спокойная", "солнечная", "барокко", "интерес"],
     "Оther": ["музыка на фоне", "шум людей"],
     "Questions": [
       {
         "Question": "Сколько человек в туристической группе?",
         "Answer": "Десять"
       },
       {
         "Question": "Какой архитектурный стиль описывал гид?",
         "Answer": "Барокко"
       },
       {
         "Question": "0 чём рассказывает гид?",
         "Answer": "Здание"
       },
       {
         "Question": "Какая погода была на экскурсии?",
         "Answer": "Солнечно"
       }
     ]
    }
  2. After receiving the dictionary with questions, we checked it for correctness using the following algorithm.

  • Initialize the correction prompt.

    # The Russian string translates as: "Example of a bad answer, because".
    clarification_prompts = ['\n Пример плохого ответа, так как']
  • We check the structure of the generated dictionary (the presence of the key fields, the questions, and the answers) and append the corresponding remarks to the correction prompt.

    # If the 'Style' field is missing, add the remark "it does not contain the 'Style' field,".
    if questions_json.get('Style') is None:
        clarification_prompts.append("он не содержит поля 'Style',")
  • If the correction prompt has accumulated remarks, we append it to the original prompt and regenerate the result. In this way, we show the model an example of a bad answer so that it takes previous mistakes into account and does not repeat them.

  • We retry up to 15 times or until we obtain a correct dictionary with questions; a minimal sketch of this loop is shown below. In total, about 10% of the dictionaries were discarded at this stage.
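  A rough sketch of the validation-and-retry loop (an assumption of how the pieces fit together, not the exact production code): generate_questions(prompt) stands for the LLM call, and validate(questions_json) stands for the structure checks above, returning the list of remarks (empty if the dictionary is correct).

    import json

    MAX_RETRIES = 15

    def generate_valid_questions(base_prompt: str):
        # "Пример плохого ответа, так как" = "Example of a bad answer, because".
        clarification_prompts = ['\n Пример плохого ответа, так как']
        prompt = base_prompt
        for _ in range(MAX_RETRIES):
            raw_answer = generate_questions(prompt)  # hypothetical LLM call
            try:
                questions_json = json.loads(raw_answer)
            except json.JSONDecodeError:
                questions_json = {}
            remarks = validate(questions_json)  # hypothetical structure checks (see above)
            if not remarks:
                return questions_json  # the dictionary is correct
            # Show the model its bad answer and the reasons it was rejected, then retry.
            prompt = base_prompt + '\n' + raw_answer + ''.join(clarification_prompts + remarks)
        return None  # discarded (about 10% of dictionaries ended up here)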

  3. Based on the result of the previous step, a dictionary with the final dialogue was generated. No new reasoning was added at this point: we simply brought everything to the desired structure and slightly expanded the answers.

    [
     {
       "role": "user",
       "question": "Сколько человек в туристической группе?"
     },
     {
       "role": "assistant",
       "answer": "В туристической группе десять человек."
     },
     {
       "role": "user",
       "question": "О чём рассказывает гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид рассказывает о здании и его архитектурных особенностях."
     },
     {
       "role": "user",
       "question": "Какой архитектурный стиль описывал гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид описывал стиль барокко."
     },
     {
       "role": "user",
       "question": "Какая погода была на экскурсии?"
     },
     {
       "role": "assistant",
       "answer": "На экскурсии была солнечная погода."
     }
    ]
  4. The next step was to slightly modify the data-generation prompt: instead of general questions about what is happening (perception) and answers to them, the model now had to create dialogues involving reasoning about and analysis of the cause-and-effect relationships in the video (reasoning). All steps of the approach described above were left unchanged; only the request text was slightly modified. Example of a final dialogue:

    [
     {
       "role": "user",
       "question": "Почему в группе именно десять человек?"
     },
     {
       "role": "assistant",
       "answer": "Это количество позволяет создать комфортные условия для общения и взаимодействия с гидом, при этом не создавая больших толп."
     },
     {
       "role": "user",
       "question": "Какие особенности здания выделил гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид акцентировал внимание на уникальной симметрии и сложных декоративных элементах в стиле барокко."
     },
     {
       "role": "user",
       "question": "Зачем гид упоминал барочный стиль?"
     },
     {
       "role": "assistant",
       "answer": "Цель заключалась в том, чтобы подчеркнуть величественность и историческую значимость здания, используя архитектурный контекст."
     }
    ]
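For reference, turning the validated dictionary with questions into the dialogue structure of step 3 can be sketched roughly as follows; expand_answer is a hypothetical helper that asks the LLM to rewrite a short answer as a complete Russian sentence (for example, "Десять" becomes "В туристической группе десять человек.").

    def questions_to_dialogue(questions_json: dict) -> list:
        # Convert {"Questions": [{"Question": ..., "Answer": ...}, ...]} into user/assistant turns.
        dialogue = []
        for item in questions_json.get("Questions", []):
            dialogue.append({"role": "user", "question": item["Question"]})
            dialogue.append({
                "role": "assistant",
                # Hypothetical helper: expands the short answer into a full sentence.
                "answer": expand_answer(item["Question"], item["Answer"]),
            })
        return dialogue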

In total, the following numbers of dialogues were generated:

| Subject | Number of questions (perception) | Number of questions (reasoning) |
| --- | --- | --- |
| Movies | 18,000 | 23,000 |
| Cartoons | 5,000 | 25,000 |
| Animation | 2,000 | 2,000 |
| Total | 25,000 | 50,000 |

This approach allowed us to save time and money on manual video annotation, and it also improved the metrics on the Russian-language benchmark.

Experiments and metrics

Experiments

For experiments with Russian-language instructional data, we chose the Video-LLaVA pipeline as the baseline and LLaMA 3.1 8B Instruct as the LLM.

We conducted two experiments:

  1. baseline experiment: on the Video-LLaVA set (most dialogues are in English, with some Chinese and Korean) – 890 thousand image-text pairs, 800 thousand video-text pairs, and 90 thousand dialogues without a visual component;

  2. experiment with Russian-language instructional data: the set from the previous section was added to the basic Video-LLaVA set at the second stage of model training (when both the adapter and LLM are trained) – another 75 thousand video-text pairs.

We tried adding Russian-language video-text pairs at both stages of model training (including the first, when we trained only the adapter), but did not notice any significant changes compared to the second experiment. Let's see how this affected the results on our benchmark.

Multiple Choice Questions

Both models answered closed-ended multiple-choice questions almost identically. This is the expected result: in such tasks the model only has to return the letter of the correct answer, so they do not test the skill of generating detailed answers about videos in Russian. The metric in the “Cultural code” category increased slightly (by 1.7 p.p.) thanks to its “Films” and “Cartoons” subcategories, which is understandable: data from these categories was present in the Russian-language training set.
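For reference, the MC metric here reduces to matching the letter returned by the model against the answer key. A minimal sketch (the exact answer-extraction rules are not described in the article):

    def mc_accuracy(predictions: list[str], answers: list[str]) -> float:
        # Take the first A-E letter in the model's reply as its chosen option.
        def extract_letter(text: str):
            for ch in text.strip().upper():
                if ch in "ABCDE":
                    return ch
            return None

        correct = sum(extract_letter(p) == a for p, a in zip(predictions, answers))
        return correct / len(answers)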

Open-ended questions

Here the results were more interesting. As a basis, we took the approach to evaluating answers to open-ended questions from Video-ChatGPT and Video-LLaVA, discussed in the previous part: the correspondence of an answer to the reference is assessed by a GPT assistant. The Accuracy metric calculated this way increased noticeably for almost all benchmark topics (that is, the answers match the references better).
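As a rough illustration of this evaluation scheme (not the exact prompt from Video-ChatGPT), the check can be sketched as follows; ask_gpt is a hypothetical function that sends a prompt to the GPT assistant and returns its text reply.

    def gpt_accuracy_judgement(question: str, reference: str, prediction: str) -> bool:
        # Ask the assistant whether the predicted answer matches the reference in meaning.
        judge_prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Model answer: {prediction}\n"
            "Does the model answer match the reference in meaning? Reply 'yes' or 'no'."
        )
        verdict = ask_gpt(judge_prompt)  # hypothetical call to the GPT assistant
        return verdict.strip().lower().startswith("yes")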

In addition, we measured how often English words appeared in the responses of each model. In most cases, the presence of English words meant that the model had mistakenly switched to English at an arbitrary point in the answer. Sometimes, however, Latin-script words were justified: for example, when the answer mentioned a foreign brand of car or household appliance. Therefore, we counted only “false” occurrences of English words: cases where they were absent from the reference but appeared in the model's response.
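A minimal sketch of how such false occurrences can be counted; the exact word lists and filtering rules are not specified in the article, so this is only an assumption.

    import re

    LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

    def has_false_english(reference: str, prediction: str) -> bool:
        # Latin-script words present in the reference (e.g. brand names) are considered legitimate.
        allowed = {w.lower() for w in LATIN_WORD.findall(reference)}
        found = {w.lower() for w in LATIN_WORD.findall(prediction)}
        return bool(found - allowed)

    # Share of tasks with at least one "false" English word:
    # rate = sum(has_false_english(ref, pred) for ref, pred in pairs) / len(pairs)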

The table below compares the two models; EN is the baseline model trained only on the original (mostly English) Video-LLaVA set, and EN+RU is the model additionally trained on the Russian-language instructional data.

| Video data | MC-VideoQA Accuracy, % (EN) | MC-VideoQA Accuracy, % (EN+RU) | OE-VideoQA GPT Accuracy, % (EN) | OE-VideoQA GPT Accuracy, % (EN+RU) | False occurrences of English words, % of tasks (EN) | False occurrences of English words, % of tasks (EN+RU) |
| --- | --- | --- | --- | --- | --- | --- |
| Academic | 55.4 | 55.3 | 30.3 | 42.20 | 8.52 | 0.1 |
| How-to | 44.0 | 44.0 | 36.5 | 51.55 | 9.2 | 0.2 |
| Cultural code | 44.2 | 45.9 | 31.2 | 47.68 | 6.1 | 0.1 |
| Total | 49.0 | 49.0 | 33.0 | 47.0 | 8.3 | 0.2 |

As the table shows, GPT-judged accuracy on open-ended questions increased by 14 p.p. overall, and by 12 to 16.5 p.p. in individual categories. In addition, the model not trained on the Russian-language video set gave answers at least partially in English (where the reference did not call for it) in 8.3% of benchmark tasks, while the model additionally trained on Russian-language instructional data did so in only 0.2% of tasks.

Conclusion

We have described a practically important aspect of Video-LLM evaluation: measuring the quality of understanding questions and generating answers in natural Russian. The proposed approach to selecting videos, composing questions, and choosing metrics produced a benchmark suitable for assessing Video-LLMs both in terms of generation quality in Russian and in terms of understanding topics relevant to our country – specific cultural phenomena and current practical domains. In addition, we examined a simple and effective way to improve the quality of answers in Russian by automatically generating Russian-language instructional data.

Thanks to everyone who took part in the research, as well as in the preparation of materials for publication: Roman Krainov @romacckka, Albina Akhmetgareeva @Colindonolwe, Alexander Kapitanov @hukenovs, Karina Kvanchiani @karinakvanchiani.
