How to create a video search assistant

Hi all! My name is Georgiy, and I am a senior research developer at MTS AI. One of the areas I work on at the company is smart video analytics. It is a powerful tool, especially given modern artificial intelligence technologies, and it can be applied in many industries, from retail to customer service.

At the same time, today's video analytics systems have a significant limitation: they are tailored to narrow tasks and specific types of events, such as license plate recognition, boundary-crossing detection, or face detection. Of course, progress does not stand still, and over the past year many multimodal models have appeared that can answer a wide range of questions about video, but they only work on very short clips and require serious investments in hardware.

Now imagine that you could create a general video analytics system that is not pre-configured for specific events. It is flexible and learns to understand tasks by communicating with the user. Requests can vary, for example: “warn me if an emergency occurs in the frame, such as a fire or a fight” or “I want to find footage of yellow taxi cars.”

Is it possible to find an approach in which the system can answer a wide range of questions about video while still being able to process long videos and remaining undemanding in terms of hardware? In this article I will describe one way to build such a solution, using video search as the example.

How to make such a system?

The first thing that comes to mind when you want to build text-based search over video is to split the video into frames (or short clips), translate them into text using a large model, and then apply some standard RAG (Retrieval-Augmented Generation) code.

RAG, or Retrieval-Augmented Generation, is a technique used in artificial intelligence that combines information retrieval and text generation. The RAG system first searches for relevant information from a large amount of pre-indexed data (in simple terms, “Googles” your database), then it generates a response using the information it retrieves.

This path occurs to many people, and it was my first choice as well. To translate video into text, I went through a series of multimodal networks, starting with the top-tier, heavy ones built on 7B language models. The approach worked, but it was a dead end: inference for such models requires about 16 GB of video memory and takes several seconds per processed frame – too slow and expensive for mass indexing of long videos.

The next option was the lightweight BLIP model (not the famous BLIP-2, just BLIP). With the right settings, it produces a detailed description of a frame on the GPU in about 0.5 seconds and weighs several times less than modern multimodal models, while still giving a good text description of the frame. For a while, BLIP was enough for me.
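For illustration, here is a minimal frame-captioning sketch with the original BLIP checkpoint from Hugging Face; the checkpoint name and generation settings are my assumptions, not the exact configuration used in the project.

```python
# Minimal frame-captioning sketch with the original BLIP model (not BLIP-2).
# Checkpoint name and generation settings are illustrative assumptions.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

def caption_frame(image_path: str) -> str:
    """Return a short text description of a single video frame."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)

print(caption_frame("frame_0001.jpg"))
```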

But eventually even BLIP felt too heavy. I started looking for ways to optimize, and found one. Every multimodal language model consists of an image encoder and a language model. It is the image encoder that gives me what I need: it translates the image into a vector (or a set of vectors) in some semantic space. The language model is just extra baggage at the indexing stage. I removed the LLM from my code, leaving only the lightweight image encoder, which accelerated inference by one to two orders of magnitude. The diagram below shows the result of this approach.
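Before the diagram, here is a minimal sketch of the "image encoder only" idea. The checkpoint is the laion CLIP model mentioned later in the article; the helper itself is an illustrative assumption rather than the project's actual code.

```python
# Sketch: keep only the CLIP image encoder and map frames straight to embeddings,
# skipping the language model entirely.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K").to(device)
processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")

def embed_frame(image: Image.Image) -> torch.Tensor:
    """Project a single frame into the shared CLIP vector space (L2-normalized)."""
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    return features / features.norm(dim=-1, keepdim=True)

vector = embed_frame(Image.open("frame_0001.jpg").convert("RGB"))
print(vector.shape)  # torch.Size([1, 512]) for a ViT-B/32 backbone
```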

The chat assistant for video search can be built by analogy with RAG, known to many NLP specialists.  Green arrows show the video indexing process (done in advance).  Blue arrows – communication with the user after the data has been indexed.

Description of the diagram

This system consists of three key elements:

  1. For fast search, the data needs to be indexed. For this you need a vector database: given any query vector, it finds the k closest vectors in roughly constant time. One of the most popular tools for this kind of fast search is FAISS (Facebook AI Similarity Search).

  2. To be able to work with this vector database, frames from the video and the user's text questions must be projected into a single space. In the diagram, the Frames Embedder and Query Embedder are responsible for this. For this purpose you can use a single model from the CLIP (Contrastive Language-Image Pre-training) family: these are ViT-based transformer models that project images and text descriptions into a single vector space. You can read about them on the OpenAI website. A suitable checkpoint can be picked from laion or EVA-CLIP, which recently released a monster with as many as 18 billion parameters. In my code, I went the other way and took one of the lightest available CLIP models.

  3. Finally, we need some additional code to work with video so that we can extract the frames we need. I used ffmpeg and OpenCV in my code. A rough sketch tying all three pieces together is shown right after this list.
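Here is how these three pieces could fit together: frames are sampled with OpenCV, embedded with CLIP, stored in a FAISS index, and then retrieved by an English text query. The sampling step, index type, and helper names are my assumptions, not the project's exact code.

```python
# Sketch of the indexing and search loop: OpenCV for frame extraction,
# CLIP for embeddings, FAISS for nearest-neighbour search.
import cv2
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CHECKPOINT = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained(CHECKPOINT).to(device)
processor = CLIPProcessor.from_pretrained(CHECKPOINT)

def index_video(path: str, every_n: int = 25):
    """Sample every n-th frame, embed it with CLIP, and store it in a FAISS index."""
    capture = cv2.VideoCapture(path)
    vectors, timestamps = [], []
    frame_id = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_id % every_n == 0:
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(images=image, return_tensors="pt").to(device)
            with torch.no_grad():
                emb = model.get_image_features(**inputs)
            emb = emb / emb.norm(dim=-1, keepdim=True)
            vectors.append(emb.cpu().numpy()[0])
            timestamps.append(capture.get(cv2.CAP_PROP_POS_MSEC) / 1000.0)
        frame_id += 1
    capture.release()
    matrix = np.stack(vectors).astype("float32")
    index = faiss.IndexFlatIP(matrix.shape[1])  # inner product == cosine on unit vectors
    index.add(matrix)
    return index, timestamps

def search(index, timestamps, query: str, k: int = 5):
    """Embed an English text query and return timestamps of the k closest frames."""
    inputs = processor(text=[query], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        q = model.get_text_features(**inputs)
    q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype("float32")
    scores, ids = index.search(q, k)
    return [(timestamps[i], float(s)) for i, s in zip(ids[0], scores[0])]
```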

These three elements are already enough to build a system that searches for frames in a video by a text description. However, you can add a little more code to improve the user experience:

  1. A Telegram bot interface written with the telethon library. I switched from the usual telebot to telethon to get around the file-upload restriction: with telebot and the standard Telegram Bot API, a bot cannot work with files larger than 20 MB. A minimal sketch of such a bot is shown after this list.

  2. The CLIP family of models is trained on English data. An additional call to an LLM (any good model will do; the easiest way is via an API) lets you turn a Russian request like “find me a cat in this video” into its English equivalent (or, better yet, simply “a cat”). In my code, I call the API of our MTS AI Chat model. Translating the request into English is not an ideal solution: for example, the Russian word “ручка” has completely different English translations (a pen, a handle, a small arm), which can cause problems in some cases. The ideal solution would be a “Russian-language CLIP”.

  3. Found footage is great, but it is even better if the bot also comments on it. This is where a multimodal LLM comes in handy. First, it improves the quality of service: it writes an answer to the user's question based on the images found. Second, if the pipeline returns frames that do not actually answer the question (false positives), the multimodal LLM can remove them from the search results. To choose an offline multimodal model, you can start from the BradyFU benchmark, or simply use an API. Since I have not yet deployed my own multimodal API, for the Telegram demo I temporarily resorted to the GPT-4 Vision API in low-detail mode. I want to emphasize that these calls are just a cosmetic touch; the search itself works offline, without calling heavy and paid APIs =)
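A minimal telethon-based bot could look roughly like the sketch below. The credentials are placeholders, and search_frames() is a hypothetical stand-in for the indexing and search code sketched earlier.

```python
# Rough sketch of the Telegram interface on telethon (not the production bot).
# API_ID, API_HASH and BOT_TOKEN are placeholders; search_frames() stands in
# for the indexing/search code sketched above.
from telethon import TelegramClient, events

API_ID = 123456           # placeholder
API_HASH = "your_hash"    # placeholder
BOT_TOKEN = "your_token"  # placeholder

client = TelegramClient("video_search_bot", API_ID, API_HASH)

@client.on(events.NewMessage(incoming=True))
async def handle_query(event):
    query = event.raw_text.strip()
    if not query:
        return
    # search_frames() is assumed to return paths of the best-matching frames
    hits = search_frames(query, top_k=3)
    if not hits:
        await event.reply("Nothing found, try rephrasing the query.")
        return
    await event.reply(f"Found {len(hits)} candidate frames:")
    for frame_path in hits:
        await client.send_file(event.chat_id, frame_path)

client.start(bot_token=BOT_TOKEN)
client.run_until_disconnected()
```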

What happened?

You can try the resulting system yourself in Telegram. The bot is public: https://t.me/CamerOn_Video_Search_Bot

Does it find everything? No, not everything and not always. But it handles most requests quite well. Sometimes it helps to make the query more specific: for example, “people” may not be found in a video, while “three men at a festive table” is. Most likely this is due to the chosen image encoder model. If something is definitely in the video but the bot does not find it, try rephrasing the request.

The assembled pipeline was tested on our internal VQA (Visual Question Answering) dataset. It differs from conventional VQA datasets for multimodal LLMs in that the videos are tens of minutes long and the questions are close to real requests to video surveillance systems (situations such as conflicts, road accidents, or open fire). Despite using one of the lightest image encoders, recall on our test dataset reaches 80%. By the way, when frame descriptions are switched from CLIP embeddings to text, not only does the speed drop by 10-100 times, but the metrics suffer too: recall falls to 35-70%, depending on which LLM analyzes the text descriptions of the found footage. Even with the best LLMs the result is worse; most likely, information is lost when CLIP embeddings are translated into text, no matter how good the adapters are.

How resource intensive is this?
The pipeline does not use any heavy components (there are calls to heavy APIs, but they are optional!). The only essential neural network – laion/CLIP-ViT-B-32-laion2B-s34B-b79K – weighs 600 MB. This makes it easy to run the video-processing code on a laptop and, in theory, even on weaker hardware such as single-board computers (I have not checked this; the model should fit in RAM). Connecting language models, including multimodal ones, is optional (searching for frames by text description works without an LLM), and they can be deployed remotely. You can also get text descriptions without multimodal networks: CLIP embeddings can be projected onto a set of words (tags, classes) via a trained linear layer (or, better, via an intermediate transformer, as is done in BLIP-2).
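As a hedged illustration of the "linear layer over CLIP embeddings" idea, here is a tiny sketch. The tag vocabulary is invented, and a real head would of course have to be trained on labeled frames.

```python
# Sketch: projecting CLIP image embeddings onto a fixed tag vocabulary with a
# linear head. Tags and weights here are purely illustrative; a real head
# would be trained on labeled data.
import torch
import torch.nn as nn

TAGS = ["fire", "fight", "taxi", "crowd", "empty street"]  # assumed vocabulary

class TagHead(nn.Module):
    def __init__(self, embed_dim: int = 512, num_tags: int = len(TAGS)):
        super().__init__()
        self.linear = nn.Linear(embed_dim, num_tags)

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        # Map a CLIP embedding to a probability distribution over tags.
        return self.linear(clip_embedding).softmax(dim=-1)

head = TagHead()
frame_embedding = torch.randn(1, 512)  # stands in for a real CLIP vector
probs = head(frame_embedding)
print(TAGS[int(probs.argmax())], float(probs.max()))
```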

On a GPU, this Python code processes video about 10 times faster than real time. This is not the limit – the frame processing itself via CLIP takes milliseconds – but it is enough for a demonstration.

To demonstrate how the code works, it would be most convenient to use well-known films (the bot does an excellent job of searching long videos), but that would require resolving copyright issues. So I simply filmed the objects around me while writing this article and ran the bot on the resulting video.

Afterword: real-time configurable system

This approach to video analytics – working with general-purpose frame descriptors – theoretically allows the system to be configured for specific user requests through real-time communication.

User interaction with the system begins with communication through a chatbot. For example, the user could ask to receive alerts whenever a garbage truck and a street cleaner appear in the frame at the same time. The chatbot acts as the user interface: it receives requests, identifies the target situations, and then sends their descriptions for vectorization (via the CLIP model).

The CLIP model converts the phrases describing target situations into vectors. It also encodes frames sampled from the video stream at a fixed step (for example, every tenth frame) into vectors. The key property of the model is that both text descriptions and images are projected into a single vector space, so frame vectors can be compared directly with the vectors obtained from the user's text queries.

Based on a preset mean squared error (MSE) threshold, the system decides that an event matching the user's alert preferences has occurred. For a more accurate assessment, this step can be complemented with a multimodal model that filters out possible false positives; this step, however, is optional.

Finally, the user is notified that the requested event may have been detected, with a still frame attached as visual confirmation.
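A minimal sketch of such a comparator, assuming the MSE threshold described above; the threshold value, alert phrases, and helper names are illustrative assumptions.

```python
# Sketch of the real-time comparator: each incoming frame embedding is compared
# against the pre-computed embeddings of the user's alert phrases using an MSE
# threshold. Threshold value and alert phrases are assumptions.
import numpy as np

ALERTS = {
    "fire": None,                       # filled with CLIP text embeddings at setup
    "snow removal equipment": None,
    "people entering the territory": None,
}
MSE_THRESHOLD = 0.002                   # assumed value, tuned on real footage

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def check_frame(frame_embedding: np.ndarray, alerts: dict, threshold: float):
    """Return the alert phrases whose embeddings are close enough to the frame."""
    triggered = []
    for phrase, text_embedding in alerts.items():
        if text_embedding is None:
            continue
        if mse(frame_embedding, text_embedding) < threshold:
            triggered.append(phrase)
    return triggered
```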

Flexibly configurable video analytics: instead of the FAISS vector database, you can plug in a comparator that matches real-time embeddings against the embeddings of the user's request (“Fire”, “Snow removal equipment”, “People entering the territory”). This lets each user configure the video surveillance system for themselves with “just one SMS”. Green arrows show the setup process (done in advance). Blue arrows show communication with the user after the specified events have occurred in the frame.

Conclusion

Building video analysis systems on top of LLMs (multimodal or conventional) seems feasible, but it requires two important conditions:

  1. For indexing (or other processing), video should be translated not into text but into semantic descriptors (embeddings) of one kind or another, using models such as CLIP (if a shared space with text is important) or DINOv2 (if alignment with text matters less).

  2. Heavy language models (multimodal or conventional) should be connected only at key stages of the pipeline (setup and post-processing of results). This keeps the load on computing resources far below what frame-by-frame translation of video into text would require.

Meeting these conditions leaves room for a balanced solution that has almost the same capabilities as multimodal LLMs while requiring nearly as few computational resources as existing solutions.
