Emotional and Artificial: Teaching Neural Networks to Understand Human Social Interactions at the AIJ Contest

This year we decided to go further and shift the focus to analyzing video and audio of human interaction. We named the track accordingly: Emotional Fusion Brain 4.0. The details of the competition are below.

Our Emotional Fusion Brain Challenge 4.0 track at the AI Journey Contest takes a step towards developing the emotional artificial intelligence of AI assistants. Participants will have to develop a universal multimodal model for understanding video recordings of human social interactions, thereby improving the perception of outward emotional expressions and human behavior.

The participant's model must be able to work with three input modalities: video, audio, and text. Each video recording is accompanied by a series of questions in English about the development of the plot and the events occurring in the recording. The answers the multimodal model generates determine how successfully it has understood the content, and in how much detail.

To evaluate the capabilities of the developed model from several angles, we collected a test dataset that covers three main tasks:

  1. Video QA — a task that requires clear and unambiguous answers to questions about a video recording. The model must extract important visual information from a set of frames, build a coherent story from it, and link it to knowledge about the world in order to correctly interpret people's behavior and emotions.

  2. Video-Audio QA — a harder variant of the standard video question-answering task that adds the audio modality as an important source of information. To fully understand a person, a multimodal model must be able to analyze the tone, pitch, and volume of their voice, since these are keys to our emotional state.

  3. Video Captioning — a task that tests an AI model's basic understanding of visual storytelling in a video. We expect the model to reliably identify important details and properties of objects throughout the entire video sequence.

In both Question Answering (QA) tasks, we provide the model with answer options for each question and expect the number of the preferred option as the result; in the Captioning task, the final answer must be presented as free-form text.
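In practice, answering a multiple-choice question usually comes down to scoring each option and returning the index of the most confident one. The sketch below is only an illustration of this idea: the model object and its score() method are hypothetical, and participants are free to structure inference however they like.

```python
# A minimal sketch of confidence-based option selection. The model object and
# its score() method are hypothetical; each participant defines their own
# inference interface.
def choose_option(model, video, audio, question, options):
    # Score each answer option, e.g. by the model's log-likelihood of the
    # option text given the video, audio, and question.
    scores = [model.score(video, audio, question, opt) for opt in options]
    # Return the index of the most confident option, as the answer format expects.
    return max(range(len(options)), key=scores.__getitem__)
```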

The submission is a JSON file in a prescribed format, produced by the participant's solution. Two metrics will be calculated from it.
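Since the authoritative schema is published with the task materials on the DS Works platform, the snippet below is only a hypothetical illustration of how such a file might be assembled; all field names are assumptions.

```python
import json

# Hypothetical submission layout; the real schema is defined by the organizers.
submission = {
    "video_qa":       {"q_0001": 2},   # index of the chosen answer option
    "video_audio_qa": {"q_1001": 0},
    "captioning":     {"video_0001": "A man greets a friend at the door and ..."},
}

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```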

The quality of answers to questions (QA tasks) is assessed with the classification metric Accuracy (the proportion of correct answers), based on the model's internal confidence in the answer options for a question about a video recording. Generated responses in the detailed video description task (Captioning) are assessed with the well-known METEOR metric. The final evaluation of the multimodal model, the integral metric I, is formed by aggregating the quality metrics across all task types.
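As a rough illustration of how this scoring could work, the sketch below computes Accuracy by hand and METEOR via nltk; the equal-weight aggregation into I is an assumption, since the official formula is defined by the organizers.

```python
# Illustrative scoring sketch; the official evaluation code may differ.
# METEOR needs WordNet data: import nltk; nltk.download('wordnet')
from nltk.translate.meteor_score import meteor_score  # pip install nltk

def accuracy(predicted, gold):
    """Proportion of QA questions where the chosen option index is correct."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def caption_meteor(reference, hypothesis):
    """METEOR for a single caption; nltk expects pre-tokenized input."""
    return meteor_score([reference.split()], hypothesis.split())

def integral_metric(video_qa_acc, video_audio_qa_acc, avg_meteor):
    # Hypothetical equal-weight aggregation of the per-task metrics into I.
    return (video_qa_acc + video_audio_qa_acc + avg_meteor) / 3
```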

The winners will be determined by a private leaderboard ranked by the integral metric I: the higher the value of I, the higher the participant's position on the leaderboard.

The competition will also introduce two additional nominations for the top 10 teams on the private leaderboard:

  • In the first additional nomination, the developed multimodal model is applied to video recordings of a role-playing game. Based on the provided recordings, the model must answer questions about the course of the game, determine the participants' roles, and assess the plausibility of the claims the players voice and act out. The more correct answers the model gives, the higher its position in this nomination's rating.

  • The second additional nomination is by now traditional: "The Fastest Solution". Here the developed models are ranked by inference time, with the smallest value corresponding to the fastest solution.

You can join AI Journey Contest 2024 either on your own or as part of a team – the main condition is that all participants are over 18 years old.

You can get acquainted with the tasks today; solutions must be uploaded to the DS Works platform by October 28, 2024.

This year, the winners of the competition will share a prize fund of 2.5 million rubles. Information about the organizer and the full rules of the competition is available on the website.

Good luck!
