How to teach LLMs to understand videos? Part 2

As a result of training and evaluating only on English-language data, we ended up with a model that showed good metrics on standard English-language benchmarks but made glaring mistakes when communicating in Russian.

An example of a model response in which Russian and English are mixed

The most common problem is answers given entirely in English, or switching to English partway through the answer. To build a model evaluation system that reflects quality in real product scenarios, we decided to collect and annotate our own benchmark in Russian.

How we made a Russian-language video benchmark for Video-LLM

To assess an LLM's ability to understand the video modality, the task category “video question answering” (VideoQA) is usually used: the model receives a video and a natural-language instruction (question) as input and must answer strictly in accordance with the content of the video and the instruction. For our benchmark, we chose two VideoQA question formats: multiple-choice (MC) questions, where the model selects the correct answer from several options, and open-ended (OE) questions, where the model generates a free-form answer.

The first format tests the ability to choose the correct answer from those proposed, while the second evaluates the quality of generation in natural (Russian) language. These formats make it fairly easy to evaluate models across different skills: each question tests a specific skill, such as Action Recognition, Spatial Reasoning, Temporal Perception, and others.
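For illustration, a benchmark entry in each format might look roughly like the sketch below. The field names and file names here are hypothetical and serve only to show the two formats; the actual storage schema may differ. The multiple-choice example reuses the question about the turning car given later in this article.

    # Hypothetical record structure; field names are assumptions, not the real schema.

    # Multiple-choice (MC) question: the model must return the letter of the correct option.
    mc_example = {
        "video": "dashcam_clip.mp4",  # placeholder file name
        "question": "In which direction does the car from which the video is being filmed turn at the beginning of the video?",
        "options": {"A": "don't know", "B": "sideways", "C": "right", "D": "left", "E": "backwards"},
        "answer": "D",
        "skill": "Spatial Perception",
    }

    # Open-ended (OE) question: the model must generate a free-form answer in Russian.
    oe_example = {
        "video": "cooking_clip.mp4",  # placeholder file name
        "question": "What does the person cook?",
        "answer": "The person is cooking scrambled eggs.",
        "skill": "Object Recognition",
    }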

Translation of academic benchmarks

The Russian-language benchmark was based on academic English-language benchmarks. The selection took into account the variety of topics (everyday situations, activities, humorous videos, etc.), different levels of complexity, and the range of skills tested. The selected datasets were manually translated into Russian by a team of professional editors. The table below shows the benchmarks, their sizes, and topics:

| Benchmark | Topics | Number of questions | Skills tested |
| --- | --- | --- | --- |
| ActivityNet | Daily activities | 1162 | Action Recognition, Attribute Perception, Object Recognition |
| FunQA | Humor, magic, creativity | 175 | Action Reasoning, Object Reasoning, Spatial Reasoning, Temporal Reasoning |
| MSRVTT | Music, people, games, sports, news, education, TV shows, animation, transport, how-to, travel, science, animals, children, documentaries, food, cooking, beauty, advertising | 1524 | Action Recognition, Attribute Perception, Object Recognition |
| MSVD | Daily activities | 1797 | Action Recognition, Counting Problems, Object Recognition |
| SQA3D | 3D scenes | 1656 | Spatial Perception, Spatial Reasoning |
| STAR | Everyday situations | 420 | Action Recognition, Object Recognition |
| TGIF | Everyday situations | 1785 | Counting Problems, Temporal Perception, Temporal Reasoning |
| NExT-QA | Everyday situations | 1895 | Action Reasoning, Action Recognition, Temporal Perception, Temporal Reasoning |
| PAXION | Daily activities, walking outdoors, entertainment, work, sports, household chores, handicrafts | 159 | Action Reasoning, Action Recognition, Temporal Perception, Temporal Reasoning |

Collecting Additional Data

Final benchmark structure

After analyzing the topics and skills covered by the academic benchmarks, it became clear that the Russian-language benchmark lacked data reflecting the Russian cultural code. We also noticed that academic benchmarks contain almost no videos focused on practical, applied tasks (how-to and DIY videos).

For the selected applied topics, material was collected automatically from open sources. The collected videos were then filtered (video validation) on the ABC Elementary annotation platform according to the following criteria: the video is of good quality, does not contain 18+ content, and contains noticeable motion.

At the next stage, annotators composed questions of varying complexity for the filtered set of videos: one simple, one medium, and two complex. In the same task, the annotators also filled in the answers.

Example:

Simple question: What is in the hands of a person on a bicycle?

Medium question: What crime did the person in the video commit?

Complex question: What unexpected action makes the video funny?

It was also required that the questions created be based on the specifics of the topic of the video, for example:

video: “In the video, a man is cooking scrambled eggs”;

specific question: “What does a person cook?”;

general question: “What room is the person in?”

When composing questions for a video, annotators had to rely only on the visual component (the audio modality was not considered in this annotation).

At the final annotation stage, the correctness of the questions and of the answers to the videos was checked. If a question and its answer were correct, incorrect answer options were then filled in (with the requirement that at least two of the incorrect options be similar to the correct one). The answer choices for the question were then formed from these incorrect options and the correct answer.

Example

Question: “In which direction does the car from which the video is being filmed turn at the beginning of the video?”;

Correct answer: “To the left”;

Answer options: A: don't know, B: sideways, C: right, D: left, E: backwards.

Below is general information about the data collected in Russian from scratch. The listed skills correspond to the most frequent question types.

| Category | Topics | Number of questions | Skills |
| --- | --- | --- | --- |
| Cultural code | Films (Soviet and Russian) | 1069 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Object Localization, Object Recognition, Spatial Perception |
| Cultural code | Dances and celebrations (folk dances, Maslenitsa, Ivan Kupala) | 698 | Action Recognition, Counting Problems, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| Cultural code | History, historical reconstructions | 121 | Action Reasoning, Action Recognition, Attribute Perception, Object Recognition, Special QA, Temporal Perception, Temporal Reasoning |
| Cultural code | Cartoons (Soviet and Russian) | 415 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Temporal Perception |
| Cultural code | Theater | 166 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Spatial Perception, Special QA |
| Cultural code | Humor | 845 | Action Recognition, Attribute Perception, Counting Problems, Emotion Recognition, Spatial Perception, Special QA, Temporal Action, Temporal Perception |
| Cultural code | Board games (backgammon, checkers, chess, etc.) | 346 | Action Recognition, Attribute Perception, Counting Problems, Special QA |
| Cultural code | Sports games (lapta, gorodki, fist fighting, etc.) | 460 | Action Reasoning, Counting Problems, Object Localization, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| How-to | Traffic rules (lessons, exams) | 3289 | Action Recognition, Attribute Perception, Counting Problems, Object Localization, Spatial Reasoning, Special QA, Temporal Perception |
| How-to | Sports and fitness (yoga, CrossFit, Pilates, etc.) | 233 | Action Reasoning, Counting Problems, Object Localization, Spatial Perception, Special QA, Temporal Perception, Temporal Reasoning |
| How-to | Food recipes | 483 | Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Special QA |
| How-to | Home renovation | 2220 | Action Reasoning, Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Spatial Perception, Special QA, Temporal Perception |
| How-to | Device repair | 2367 | Action Reasoning, Action Recognition, Attribute Perception, Counting Problems, Object Recognition, Spatial Perception, Special QA |

Analysis of collected data

The resulting benchmark contains 23,253 questions testing various skills. We checked that the distribution of correct answers across the option letters is close to uniform: A: 5263, B: 5310, C: 4876, D: 4719, E: 3085. Most questions have 5 answer options.
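As a sanity check, the class balance can be computed with a few lines of Python. This is only a sketch: load_correct_answers() is a hypothetical helper returning the list of correct-answer letters.

    from collections import Counter

    answer_letters = load_correct_answers()  # hypothetical helper, returns e.g. ["A", "D", ...]
    distribution = Counter(answer_letters)
    print(distribution)  # Counter({'B': 5310, 'A': 5263, 'C': 4876, 'D': 4719, 'E': 3085})

    # Share of each answer class among all questions.
    total = sum(distribution.values())
    for letter, count in sorted(distribution.items()):
        print(f"{letter}: {count / total:.1%}")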

In total, the benchmark contains 8,865 unique videos. Below are statistics on video length. The distribution of the number of examples by duration in seconds is as follows:

| Duration, seconds | Number of examples |
| --- | --- |
| <=10 | 2420 |
| 10-30 | 2364 |
| 30-60 | 1027 |
| 60-240 | 3010 |
| >240 | 44 |

Most of the videos in the benchmark are shorter than 4 minutes. Video duration statistics by dataset (the extreme values of average length are highlighted in bold):

| Benchmark | Average video length | Minimum video length | Maximum video length |
| --- | --- | --- | --- |
| ActivityNet | 0:01:51 | 0:00:05 | 0:04:46 |
| FunQA | 0:00:24 | 0:00:02 | 0:01:56 |
| MSRVTT | 0:00:15 | 0:00:10 | 0:00:30 |
| MSVD | 0:00:10 | 0:00:02 | 0:00:47 |
| SQA3D | 0:00:05 | 0:00:02 | 0:00:14 |
| STAR | 0:00:31 | 0:00:08 | 0:01:13 |
| TGIF | **0:00:04** | 0:00:02 | 0:03:20 |
| NExT-QA | 0:00:40 | 0:00:10 | 0:03:00 |
| PAXION | **0:00:04** | 0:00:01 | 0:00:06 |
| Films | 0:00:21 | 0:00:05 | 0:08:07 |
| Cartoons | 0:00:09 | 0:00:05 | 0:00:37 |
| Dances and celebrations | 0:01:36 | 0:00:07 | 0:03:00 |
| History | 0:00:50 | 0:00:08 | 0:02:55 |
| Humor | 0:02:08 | 0:00:09 | 0:03:00 |
| Food recipes | 0:00:47 | 0:00:11 | 0:02:58 |
| Sports games | 0:00:55 | 0:00:06 | 0:03:00 |
| Board games | 0:01:10 | 0:00:06 | 0:02:56 |
| Theater | 0:01:26 | 0:00:08 | 0:02:51 |
| Traffic rules | 0:03:28 | 0:00:02 | 0:04:04 |
| Sports and fitness | 0:00:45 | 0:00:31 | 0:03:59 |
| Home renovation | 0:02:18 | 0:00:05 | 0:04:02 |
| Device repair | **0:03:33** | 0:01:19 | 0:04:02 |

Academic sets mostly contain short videos. With the newly collected data, we aimed not only to broaden the range of topics but also to increase the average video length.

The most common questions in the benchmark

The diagram shows the most common questions in the benchmark. Questions starting with “I'm standing”/“I'm sitting” come from the SQA3D benchmark, where a question consists of two parts: a description of the scene plus a question testing spatial orientation skills. Action recognition stands out as the most popular skill. Note also the variety of question wordings, which makes the evaluation more robust to a model's “memorization” of particular question templates during training.

Generation of instructional Russian-language data

Now we can evaluate the quality of models on a Russian-language benchmark and compute metrics on it. The next step is to increase the amount and variety of Russian-language data in model training. To solve this efficiently, we decided to generate an instructional Russian-language set with an LLM, based on text descriptions of video recordings.

Basic approach

To generate the instruction set, we adapted the prompt described in the paper JourneyDB: A Benchmark for Generative Image Understanding. The main idea is to provide text descriptions of a video and, based on them, create a dialogue between the LLM and a person, as if the neural network were watching the video itself rather than reading the text. For greater variety, we used topics such as films, cartoons, graphic design, and animation. Below are example descriptions.

Movies:
A woman in a dark dress walks in a dark room and says something, she approaches a man, straightens her hair and says something. The walls are made of red brick, smeared with cement.

Cartoons:
In the Soviet cartoon “Barankin, be a man!” Five white butterflies fly across the blue sky. They land on the brown roof of a house. Antennas are visible on the roof of the house.

Animation:
Beams of energy move from screen to screen, from one panel to another, lighting them up in different colors.

Next, we generated Russian-language dialogues based on the descriptions, following this plan:

  1. We empirically adapted the prompt for extracting context and questions in order to generate Russian-language instructional data. The prompt requires that the questions relate to the visual content of the video. Example result:

    {
     "Style": ["туристический", "информативный", "спокойный"],
     "Content": [
       "достопримечательности",
       "гид рассказывает о здании",
       "архитектурные элементы"
     ],
     "Atmosphere": ["спокойная", "солнечная", "барокко", "интерес"],
     "Оther": ["музыка на фоне", "шум людей"],
     "Questions": [
       {
         "Question": "Сколько человек в туристической группе?",
         "Answer": "Десять"
       },
       {
         "Question": "Какой архитектурный стиль описывал гид?",
         "Answer": "Барокко"
       },
       {
         "Question": "0 чём рассказывает гид?",
         "Answer": "Здание"
       },
       {
         "Question": "Какая погода была на экскурсии?",
         "Answer": "Солнечно"
       }
     ]
    }
  2. After receiving the dictionary with questions, we checked it for correctness using the following algorithm.

  • Initialize the correction prompt.

    # The Russian string translates as: "Example of a bad answer, because".
    clarification_prompts = ['\n Пример плохого ответа, так как']
  • We check the structure of the generated dictionary (the presence of the key fields, the questions, and the answers) and append the corresponding remarks to the correction prompt.

    # If the 'Style' field is missing, add the remark "it does not contain the 'Style' field,".
    if questions_json.get('Style') is None:
        clarification_prompts.append("он не содержит поля 'Style',")
  • If the correction prompt has accumulated remarks, we append it to the original prompt and regenerate the result. In this way, we show the model an example of a bad answer so that it takes previous mistakes into account and does not repeat them.

  • We retry up to 15 times or until we obtain a correct dictionary with questions; a minimal sketch of this loop is shown below. In total, about 10% of the dictionaries were discarded at this stage.
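  A rough sketch of the validation-and-retry loop (an assumption of how the pieces fit together, not the exact production code): generate_questions(prompt) stands for the LLM call, and validate(questions_json) stands for the structure checks above, returning the list of remarks (empty if the dictionary is correct).

    import json

    MAX_RETRIES = 15

    def generate_valid_questions(base_prompt: str):
        # "Пример плохого ответа, так как" = "Example of a bad answer, because".
        clarification_prompts = ['\n Пример плохого ответа, так как']
        prompt = base_prompt
        for _ in range(MAX_RETRIES):
            raw_answer = generate_questions(prompt)  # hypothetical LLM call
            try:
                questions_json = json.loads(raw_answer)
            except json.JSONDecodeError:
                questions_json = {}
            remarks = validate(questions_json)  # hypothetical structure checks (see above)
            if not remarks:
                return questions_json  # the dictionary is correct
            # Show the model its bad answer and the reasons it was rejected, then retry.
            prompt = base_prompt + '\n' + raw_answer + ''.join(clarification_prompts + remarks)
        return None  # discarded (about 10% of dictionaries ended up here)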

  3. Based on the result of the previous step, a dictionary with the final dialogue was generated. No new reasoning was added at this point: we simply brought everything to the desired structure and slightly expanded the answers.

    [
     {
       "role": "user",
       "question": "Сколько человек в туристической группе?"
     },
     {
       "role": "assistant",
       "answer": "В туристической группе десять человек."
     },
     {
       "role": "user",
       "question": "О чём рассказывает гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид рассказывает о здании и его архитектурных особенностях."
     },
     {
       "role": "user",
       "question": "Какой архитектурный стиль описывал гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид описывал стиль барокко."
     },
     {
       "role": "user",
       "question": "Какая погода была на экскурсии?"
     },
     {
       "role": "assistant",
       "answer": "На экскурсии была солнечная погода."
     }
    ]
  4. The next step was to slightly modify the data-generation prompt: instead of general questions about what is happening (perception) and answers to them, the model now had to create dialogues involving reasoning about and analysis of the cause-and-effect relationships in the video (reasoning). All steps of the approach described above were left unchanged; only the request text was slightly modified. Example of a final dialogue:

    [
     {
       "role": "user",
       "question": "Почему в группе именно десять человек?"
     },
     {
       "role": "assistant",
       "answer": "Это количество позволяет создать комфортные условия для общения и взаимодействия с гидом, при этом не создавая больших толп."
     },
     {
       "role": "user",
       "question": "Какие особенности здания выделил гид?"
     },
     {
       "role": "assistant",
       "answer": "Гид акцентировал внимание на уникальной симметрии и сложных декоративных элементах в стиле барокко."
     },
     {
       "role": "user",
       "question": "Зачем гид упоминал барочный стиль?"
     },
     {
       "role": "assistant",
       "answer": "Цель заключалась в том, чтобы подчеркнуть величественность и историческую значимость здания, используя архитектурный контекст."
     }
    ]
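For reference, turning the validated dictionary with questions into the dialogue structure of step 3 can be sketched roughly as follows; expand_answer is a hypothetical helper that asks the LLM to rewrite a short answer as a complete Russian sentence (for example, "Десять" becomes "В туристической группе десять человек.").

    def questions_to_dialogue(questions_json: dict) -> list:
        # Convert {"Questions": [{"Question": ..., "Answer": ...}, ...]} into user/assistant turns.
        dialogue = []
        for item in questions_json.get("Questions", []):
            dialogue.append({"role": "user", "question": item["Question"]})
            dialogue.append({
                "role": "assistant",
                # Hypothetical helper: expands the short answer into a full sentence.
                "answer": expand_answer(item["Question"], item["Answer"]),
            })
        return dialogue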

In total, the following numbers of dialogues were generated:

| Subject | Number of questions (perception) | Number of questions (reasoning) |
| --- | --- | --- |
| Movies | 18,000 | 23,000 |
| Cartoons | 5,000 | 25,000 |
| Animation | 2,000 | 2,000 |
| Total | 25,000 | 50,000 |

This approach allowed us to save time and money on manual video annotation, and it also improved the metrics on the Russian-language benchmark.

Experiments and metrics

Experiments

For experiments with Russian-language instructional data, we chose the Video-LLaVA pipeline as the baseline and LLaMA 3.1 8B Instruct as the LLM.

We conducted two experiments:

  1. baseline experiment: on the Video-LLaVA set (most dialogues are in English, with some Chinese and Korean) – 890 thousand image-text pairs, 800 thousand video-text pairs, and 90 thousand dialogues without a visual component;

  2. experiment with Russian-language instructional data: the set from the previous section was added to the basic Video-LLaVA set at the second stage of model training (when both the adapter and LLM are trained) – another 75 thousand video-text pairs.

We tried adding Russian-language video-text pairs at both stages of model training (including the first, when we trained only the adapter), but did not notice any significant changes compared to the second experiment. Let's see how this affected the results on our benchmark.

Multiple Choice Questions

Both models answered closed-ended multiple-choice questions almost identically. This is the expected result: in such tasks the model only has to return the letter of the correct answer, so they do not test the skill of generating detailed answers about videos in Russian. The metric in the “Cultural code” category increased slightly (by 1.7 p.p.) thanks to its “Films” and “Cartoons” subcategories, which is understandable: data from these categories was present in the Russian-language training set.
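For reference, the MC metric here reduces to matching the letter returned by the model against the answer key. A minimal sketch (the exact answer-extraction rules are not described in the article):

    def mc_accuracy(predictions: list[str], answers: list[str]) -> float:
        # Take the first A-E letter in the model's reply as its chosen option.
        def extract_letter(text: str):
            for ch in text.strip().upper():
                if ch in "ABCDE":
                    return ch
            return None

        correct = sum(extract_letter(p) == a for p, a in zip(predictions, answers))
        return correct / len(answers)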

Open-ended questions

Here the results were more interesting. As a basis, we took the approach to evaluating answers to open-ended questions from Video-ChatGPT and Video-LLaVA, discussed in the previous part: the correspondence of an answer to the reference is assessed by a GPT assistant. The Accuracy metric calculated this way increased noticeably for almost all benchmark topics (that is, the answers match the references better).
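As a rough illustration of this evaluation scheme (not the exact prompt from Video-ChatGPT), the check can be sketched as follows; ask_gpt is a hypothetical function that sends a prompt to the GPT assistant and returns its text reply.

    def gpt_accuracy_judgement(question: str, reference: str, prediction: str) -> bool:
        # Ask the assistant whether the predicted answer matches the reference in meaning.
        judge_prompt = (
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Model answer: {prediction}\n"
            "Does the model answer match the reference in meaning? Reply 'yes' or 'no'."
        )
        verdict = ask_gpt(judge_prompt)  # hypothetical call to the GPT assistant
        return verdict.strip().lower().startswith("yes")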

In addition, we measured how often English words appeared in the responses of each model. In most cases, the presence of English words meant that the model had mistakenly switched to English at an arbitrary point in the answer. Sometimes, however, Latin-script words were justified: for example, when the answer mentioned a foreign brand of car or household appliance. Therefore, we counted only “false” occurrences of English words: cases where they were absent from the reference but appeared in the model's response.
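A minimal sketch of how such false occurrences can be counted; the exact word lists and filtering rules are not specified in the article, so this is only an assumption.

    import re

    LATIN_WORD = re.compile(r"[A-Za-z]{2,}")

    def has_false_english(reference: str, prediction: str) -> bool:
        # Latin-script words present in the reference (e.g. brand names) are considered legitimate.
        allowed = {w.lower() for w in LATIN_WORD.findall(reference)}
        found = {w.lower() for w in LATIN_WORD.findall(prediction)}
        return bool(found - allowed)

    # Share of tasks with at least one "false" English word:
    # rate = sum(has_false_english(ref, pred) for ref, pred in pairs) / len(pairs)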

The table below compares the two models; EN is the baseline model trained only on the original (mostly English) Video-LLaVA set, and EN+RU is the model additionally trained on the Russian-language instructional data.

| Video data | MC-VideoQA Accuracy, % (EN) | MC-VideoQA Accuracy, % (EN+RU) | OE-VideoQA GPT Accuracy, % (EN) | OE-VideoQA GPT Accuracy, % (EN+RU) | False occurrences of English words, % of tasks (EN) | False occurrences of English words, % of tasks (EN+RU) |
| --- | --- | --- | --- | --- | --- | --- |
| Academic | 55.4 | 55.3 | 30.3 | 42.20 | 8.52 | 0.1 |
| How-to | 44.0 | 44.0 | 36.5 | 51.55 | 9.2 | 0.2 |
| Cultural code | 44.2 | 45.9 | 31.2 | 47.68 | 6.1 | 0.1 |
| Total | 49.0 | 49.0 | 33.0 | 47.0 | 8.3 | 0.2 |

As the table shows, GPT-judged accuracy on open-ended questions increased by 14 p.p. overall, and by 12 to 16.5 p.p. in individual categories. In addition, the model not trained on the Russian-language video set gave answers at least partially in English (where the reference did not call for it) in 8.3% of benchmark tasks, while the model additionally trained on Russian-language instructional data did so in only 0.2% of tasks.

Conclusion

We have described a practically important aspect of Video-LLM evaluation: measuring the quality of understanding questions and generating answers in natural Russian. The proposed approach to selecting videos, composing questions, and choosing metrics produced a benchmark suitable for assessing Video-LLMs both in terms of generation quality in Russian and in terms of understanding topics relevant to our country – specific cultural phenomena and current practical domains. In addition, we examined a simple and effective way to improve the quality of answers in Russian by automatically generating Russian-language instructional data.

Thanks to everyone who took part in the research, as well as in the preparation of materials for publication: Roman Krainov @romacckka, Albina Akhmetgareeva @Colindonolwe, Alexander Kapitanov @hukenovs, Karina Kvanchiani @karinakvanchiani.
