How we launched automatic video moderation in Avito ads

Hello! I'm Vladimir Morozov, a senior DS engineer on the Avito moderation team. I mainly work on auto-moderation of videos, but I also develop other projects.

It's me

In this article I describe the difficulties we ran into when moderating videos with little data available, and how we solved them. I think the material will be useful to anyone who deals with similar tasks at large product companies.

What's inside the article:

Why does Avito moderate videos and what general approaches to moderation are there?

What was our first video moderation system and what problems did we encounter?

How we collected data for automatic moderation

What technologies are currently used in moderation – a breakdown by data domain

How we analyze the work of the auto-moderation model

Instead of conclusions

Why does Avito moderate videos and what approaches to moderation are there?

Moderation at Avito means identifying problems in the ads that people publish on the site. Ads are checked for compliance with the service rules and may be rejected if something is wrong.

Now you can add videos to ads: this helps buyers choose products faster and helps sellers nudge interested people toward a deal. But videos also need to be moderated, because they may contain violations.

There are different approaches to moderation, and their choice depends on the task. For example:

Pre-moderation or post-moderation. If the risk of violations is high and moderation is fast, it is better to check ads before publishing. If the risk is low, post-moderation will do.

Manual or automatic moderation. Manual moderation is easier to implement: you just need to hire people and have them look for violations. But it comes with difficulties: for example, you need to make sure they work conscientiously.

Automatic moderation also has its downsides—I’ll talk about them in more detail below—and it’s more difficult to launch. But compared to the manual approach, it has many advantages that help it scale quickly: such moderation is cheaper, faster and gives a more predictable result.

What was our first video moderation system and what problems did we encounter?

Video was a new domain that Avito had not worked with before, and building automatic moderation quickly is hard. So at the first stage we built the following system:

The first approach to moderating video ads

The essence of the approach: if there are no violations in the ad itself, we publish it, but initially without the video. The video then goes through pre-moderation and manual review, and if everything is in order, it appears in the ad. We did this so that the option to add video would not increase the overall moderation time.

Now let's talk about the violations that we look for during moderation. The task can be divided into two components:

Since all videos previously only went through manual moderation, all the shortcomings of such a system became apparent. Namely:

Therefore, we decided to begin a gradual transition to automatic moderation.

How we collected data for automatic moderation

When the request to build automatic video moderation came to our department, we expected to have plenty of good-quality data with explicit labels, so that we could easily train the models. The reality turned out to be more complicated, and here's why:

So for a large number of tasks we had to use open datasets and scrape YouTube and various video stock sites.

Some datasets were funny – for example, a violence dataset where a couple of people in a room acted out different scenes. But it was hard to tell from their reactions whether it was violence or not: one pretended to hit the other with a stick while both of them laughed.

What technologies are currently used in moderation – a breakdown by data domain

In the end, we collected the videos and split the task into the domains I mentioned above: the video stream, audio, metadata, and information from the ad. I'll cover each one.

Video moderation. Here's how our model works:

At this stage we do not use any additional features, such as audio, because at the start of the project that would have significantly complicated development.

Scheme of how video content moderation works

But sometimes there is no data at all and no time to collect it – say, you want to quickly cover a lower-priority rejection reason, for example rejecting videos where a hookah appears.

In that case you can use, for example, a zero-shot approach with CLIP, since it was trained to match text and images. We encode the search prompts and the frames, and the class probability is calculated as the maximum similarity between a prompt and the embeddings of all the frames:

Zero-shot models in video moderation
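As a rough illustration, here is a minimal sketch of that zero-shot scoring with a public CLIP checkpoint from Hugging Face; the model name, the prompts, and the softmax-then-max aggregation are illustrative choices, not our production setup.

```python
# Minimal zero-shot sketch: score each frame against text prompts with CLIP
# and take the maximum over frames as the video-level violation score.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def zero_shot_violation_score(frames: list[Image.Image]) -> float:
    # Hypothetical prompts for the "hookah" example; index 0 is the violation class
    prompts = ["a photo of a hookah", "a photo of an ordinary product"]
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (num_frames, num_prompts)
    frame_probs = logits.softmax(dim=-1)[:, 0]      # violation probability per frame
    return frame_probs.max().item()                 # max over all frames
```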

In fact, the relationship between frames is not always important: it mainly matters for action recognition or for sophisticated tricks by violators. So if your company already has image classifiers for finding violations, you can try those too.

Another large class of violations involves logos – for example, unscrupulous users can use the video to lead people to competitors. We learned to find such problems with a detector followed by a vector search over the logo database:

How we look for logos in videos
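To give an idea of the second stage, here is a sketch of the vector search over a database of known logo embeddings, using FAISS as an example; the embedding dimension, index type, and threshold are assumptions for illustration.

```python
# Sketch: nearest-neighbour search over known logo embeddings with FAISS.
# Vectors are assumed to be L2-normalized, so inner product equals cosine similarity.
import faiss
import numpy as np

EMB_DIM = 512
index = faiss.IndexFlatIP(EMB_DIM)

def add_known_logos(embeddings: np.ndarray) -> None:
    """embeddings: (n, EMB_DIM) float32, L2-normalized logo vectors from our database."""
    index.add(embeddings)

def matches_known_logo(crop_embedding: np.ndarray, threshold: float = 0.8) -> bool:
    """crop_embedding: (1, EMB_DIM) L2-normalized vector of a detected logo crop."""
    scores, _ = index.search(crop_embedding.astype(np.float32), 1)
    return bool(scores[0, 0] >= threshold)
```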

Identifying faces is a little more complicated. Finding them is also important, because unscrupulous users can, for example, violate copyright by using other people's photos. Here we run the frames through a lightweight classifier that reliably tells whether there is a face in the frame. We keep only the frames where there definitely is one and compare those images with the faces we have in the database:

How we look for faces in video
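Schematically, this two-stage check could look like the snippet below; the detector, the face encoder, and both thresholds are placeholders for our internal components.

```python
# Sketch of the two-stage face check: keep only confident face detections,
# then compare their embeddings with the face database.
import numpy as np

def frame_has_known_face(frame, detect_faces, encode_face,
                         known_faces: np.ndarray, sim_threshold: float = 0.6) -> bool:
    """known_faces: (n, d) matrix of L2-normalized embeddings from our database."""
    for detection in detect_faces(frame):        # e.g. [{"crop": ..., "score": ...}, ...]
        if detection["score"] < 0.9:             # drop low-confidence detections
            continue
        embedding = encode_face(detection["crop"])   # L2-normalized (d,) vector
        if (known_faces @ embedding).max() >= sim_threshold:
            return True
    return False
```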

Moderation of text in videos. A large class of violations lies in the text shown in a video – for example, insults or attempts to divert users to competitors' websites.

This is where OCR comes to the rescue: it gives us all the text from the video, and then we need to find violations in that text. As a baseline, especially when you have no data but know what you want to find, you can use regular expressions:

How we find text with potential violations among frames
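As an example of such a baseline, a couple of regular expressions over the OCR output can already catch contact details and external links; the patterns below are illustrative, and in production the text also goes through trained classifiers.

```python
# Regex baseline over OCR output: flag lines that look like phone numbers or links.
import re

PHONE_PATTERN = re.compile(r"(?:\+7|8)[\s\-]?\(?\d{3}\)?[\s\-]?\d{3}[\s\-]?\d{2}[\s\-]?\d{2}")
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def find_text_violations(ocr_lines: list[str]) -> list[str]:
    """Return the OCR lines that contain contact info or external links."""
    return [
        line for line in ocr_lines
        if PHONE_PATTERN.search(line) or URL_PATTERN.search(line)
    ]
```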

Audio moderation. The work here is similar to OCR, only we use a speech transcriber. As a baseline we took Whisper – it is multilingual and works very well – and fine-tuned it on Avito data. The pipeline is as follows: audio arrives at the input, we transcribe it with Whisper, and run the transcript through various text classifiers to look for violations:

Audio moderation scheme
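A minimal sketch of that pipeline with the open-source whisper package might look like this; the model size is arbitrary, and the text classifier is a placeholder for our internal models (the production Whisper is additionally fine-tuned on Avito data).

```python
# Sketch: transcribe the ad's audio with Whisper and pass the transcript to a
# violation classifier (placeholder).
import whisper

model = whisper.load_model("small")  # production uses a model fine-tuned on Avito data

def moderate_audio(audio_path: str, text_classifier) -> bool:
    transcript = model.transcribe(audio_path)["text"]
    return text_classifier(transcript)  # True if the classifier finds a violation
```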

Another class of audio violations is found not in speech but in sounds – for example, loud extraneous noises. To cover such cases we use the Audio Spectrogram Transformer audio classifier, which is very good at finding various anomalies in sound.

The Audio Spectrogram Transformer audio classifier
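For illustration, a public Audio Spectrogram Transformer checkpoint can be used through the Hugging Face pipeline; the checkpoint name, the label list, and the threshold are assumptions, not our production configuration.

```python
# Sketch: classify non-speech sounds with an Audio Spectrogram Transformer checkpoint.
from transformers import pipeline

audio_classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",
)

def has_suspicious_sound(audio_path: str, threshold: float = 0.5) -> bool:
    suspicious_labels = {"Gunshot, gunfire", "Screaming"}  # hypothetical label set
    predictions = audio_classifier(audio_path, top_k=5)    # [{"label": ..., "score": ...}]
    return any(p["label"] in suspicious_labels and p["score"] >= threshold
               for p in predictions)
```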

Video quality moderation. To filter out videos of very poor quality, you can analyze the video metadata: look at the bitrate or FPS (frames per second). Bitrate shows how much data is used to encode each second of video, and FPS shows how many frames are displayed per second. A basic classifier can be trained on this metadata to predict video quality.
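For instance, bitrate and FPS can be pulled from the container with ffprobe and fed into whatever simple classifier you prefer; this sketch assumes ffprobe is installed and relies on its standard JSON output fields.

```python
# Sketch: extract bitrate and FPS from video metadata via ffprobe's JSON output.
import json
import subprocess

def video_metadata(path: str) -> dict:
    raw = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", path],
        capture_output=True, text=True, check=True,
    ).stdout
    info = json.loads(raw)
    video_stream = next(s for s in info["streams"] if s["codec_type"] == "video")
    num, den = map(int, video_stream["avg_frame_rate"].split("/"))  # e.g. "30000/1001"
    return {
        "bitrate": int(info["format"]["bit_rate"]),  # bits per second of the container
        "fps": num / den if den else 0.0,
    }
```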

Checking whether the video matches the ad. Here we also decided not to reinvent the wheel and reuse CLIP: we get embeddings for all the frames and do the same for the ad's images and text. Then we aggregate all these embeddings and run them through a classification head that tells us whether there is a violation of this type:

How we check if a video is relevant to an ad
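A minimal sketch of the aggregation and the classification head over those CLIP embeddings is shown below; the mean pooling, layer sizes, and output interpretation are illustrative choices rather than the production architecture.

```python
# Sketch: aggregate CLIP embeddings of video frames, ad images and ad text,
# then run them through a small classification head.
import torch
import torch.nn as nn

class RelevanceHead(nn.Module):
    def __init__(self, emb_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim * 3, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, frame_emb: torch.Tensor, image_emb: torch.Tensor,
                text_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (num_frames, d), image_emb: (num_images, d), text_emb: (d,)
        pooled = torch.cat([frame_emb.mean(dim=0), image_emb.mean(dim=0), text_emb], dim=-1)
        return torch.sigmoid(self.mlp(pooled))  # probability that the video does not match the ad
```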

The result: a general auto-moderation scheme. It is not one big model, but many models of varying complexity that interact and together decide whether there is a violation in the video.

The general moderation scheme looks like this

Today most videos go through automatic moderation and are published on the site much faster than before, when everything was done manually. Only the videos where automatic moderation has detected violations are checked manually, and that is a small percentage of all videos.

The first system vs the current video moderation system

How we analyze the work of the auto-moderation model

We usually look at two types of metrics:

How we analyze the effectiveness of moderation

To evaluate ML metrics, we introduced two main changes: we updated the video labeling instructions and added a small sampler that sends a portion of the videos where we found no violations to manual moderation.

If auto-moderation finds a violation, we can determine True Positives and False Positives. If auto-moderation finds nothing, most of the videos are published on the site, and a sample is sent for manual moderation, from which we get the False Negatives and True Negatives needed to evaluate ML metrics.
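As an illustration of how these counts combine into metrics, the sketch below estimates precision and recall, assuming the sampler sends a fixed share of "clean" videos to manual review; scaling the sampled misses by the sampling rate is an assumption of this sketch, not a description of our exact formula.

```python
# Hypothetical estimate of precision/recall from moderation outcomes.
def moderation_metrics(tp: int, fp: int, fn_sampled: int, sample_rate: float) -> dict:
    fn_estimated = fn_sampled / sample_rate            # scale sampled misses to all traffic
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn_estimated) if (tp + fn_estimated) else 0.0
    return {"precision": precision, "recall": recall}
```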

Instead of conclusions

As a result, thanks to careful decomposition of the task and the use of existing solutions, we were able to cover all the key violation reasons in video without reinventing the wheel. This helped streamline video moderation, increasing its efficiency and reducing costs.

Beyond moderating complex domains like video, our team has other ambitious projects: automatically correcting violations for the user, few-shot methods for texts and images, blocking duplicates – all of this in a high-load system with a large impact on the whole company.

Thank you for taking the time to read this article! I'll be happy to answer any questions about our experience with video moderation in the comments, or you can reach out to me directly.

And if such tasks interest you, we invite you to join our team – you definitely won't be bored! Follow the link to find out how.
