Calculating user tastes for the recommendation feed using the item2vec approach

OK's monthly audience in Russia alone exceeds 36 million people. Moreover, these are active users who engage readily with our content: they give Classes (OK's equivalent of likes), comment, and repost. Much of this engagement comes from building a news feed that accounts for the preferences of each specific user.

My name is Dmitry Reshetnikov, and I lead the recommendations team for the OK feed (Lenta). In this article I will describe what our recommendation pipeline for the news feed looks like, where item2vec fits into it, and the results of rolling out this approach.

Pipeline for preparing recommendations for user feeds

At OK we build two main types of news feed:

  • Subscription – a feed of content the user has subscribed to: posts from the user's groups and friends.

  • Recommendation – a feed formed from content the user may like but that is produced by people and groups the user has not yet subscribed to.

Both user feeds are built in a similar way – a classic two-stage recommendation model is used:

  • at the first stage, from the set of all posts available to the user, we select a subset;

  • at the second stage, the XGBoost ranker forms the news feed, placing the content from that subset most interesting to the user in the first positions.
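The two stages above can be sketched roughly as follows. This is a simplified illustration, not our production code: the candidate source is a trivial "unseen posts" filter, and a hand-weighted linear scorer stands in for the real XGBoost ranker.

```python
# Minimal sketch of a two-stage feed pipeline:
# stage 1 selects an unseen subset, stage 2 ranks it best-first.

def select_candidates(all_posts, seen_ids, limit=100):
    """Stage 1: keep posts the user has not seen yet."""
    return [p for p in all_posts if p["id"] not in seen_ids][:limit]

def rank(candidates, weights):
    """Stage 2: score each candidate (stand-in for XGBoost) and sort."""
    def score(post):
        return sum(weights.get(f, 0.0) * v for f, v in post["features"].items())
    return sorted(candidates, key=score, reverse=True)

posts = [
    {"id": 1, "features": {"ctr": 0.02, "freshness": 0.9}},
    {"id": 2, "features": {"ctr": 0.05, "freshness": 0.5}},
    {"id": 3, "features": {"ctr": 0.04, "freshness": 0.7}},
]
feed = rank(select_candidates(posts, seen_ids={1}), {"ctr": 10.0, "freshness": 1.0})
print([p["id"] for p in feed])  # [3, 2] – post 1 was already seen
```

In production, stage 1 is a union of several candidate sources (described below), and the ranker consumes far richer features than this toy example.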

A two-stage subscription content recommendation model.


In the case of a subscription feed, the task of the first stage comes down to searching for content from the user’s groups and friends that the user has not yet seen.

In the case of a non-subscription, or recommendation, feed the approach to the first stage changes slightly: the initial set of candidate posts is all content ever published on OK, not just content from a small set of groups.

A two-stage recommendation model for the recommendation feed.


There are many ways to search for posts that interest a user. Let's walk through some of them with an example.

When a new user registers on the portal, we initially know only their age and gender. With this alone we can already offer the most popular posts by CTR (click-through rate) for their demographic group. This gives us one of the sources for selecting posts at the first stage of our pipeline: the source of the most popular posts. It works for "cold" users, but it is not personalized enough.
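A cold-start source of this kind can be sketched as follows; the demographic groups and the flat CTR-statistics format are illustrative assumptions, not our actual data model.

```python
# Sketch of a cold-start source: top posts by CTR within the
# new user's demographic group (age bucket + gender).
from collections import defaultdict

def top_posts_by_group(stats, group, k=3):
    """stats: list of (group, post_id, clicks, impressions) tuples."""
    by_group = defaultdict(list)
    for g, post_id, clicks, impressions in stats:
        by_group[g].append((clicks / impressions, post_id))
    # highest CTR first
    return [pid for ctr, pid in sorted(by_group[group], reverse=True)[:k]]

stats = [
    (("18-24", "f"), 101, 50, 1000),   # CTR 0.05
    (("18-24", "f"), 102, 90, 1000),   # CTR 0.09
    (("18-24", "f"), 103, 20, 1000),   # CTR 0.02
    (("25-34", "m"), 201, 70, 1000),
]
print(top_posts_by_group(stats, ("18-24", "f"), k=2))  # [102, 101]
```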

Let's look at our user some time later: he has already viewed posts from the popular-content source – liked some, rated or commented on others, maybe lingered on a longread. Now we know his preferences and are ready to build new sources with more personalized recommendations so that the user returns to us more often.

To obtain such recommendations, we at OK have developed sources of short-term and long-term tastes.

Short-term and long-term tastes

To determine short-term tastes, we remember the last few posts that a user interacted with. Based on this information, we find similar posts and transfer them to the second stage of the pipeline for ranking.

To identify long-term tastes, we analyze user activity over 6 months. Since there is significantly more data about user actions in this case, we first cluster posts for which there is feedback from the user using DBSCAN, and then select medoids – the central posts in each cluster. We consider such medoids to be long-term tastes and, based on them, we look for similar content for subsequent ranking and formation of a recommendation feed.
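The medoid step for long-term tastes can be sketched as follows. The cluster labels are assumed to come from DBSCAN (e.g. `sklearn.cluster.DBSCAN`); here they are precomputed so the sketch stays self-contained, and only the medoid selection is shown.

```python
# Sketch of the medoid step for long-term tastes: given post
# embeddings and DBSCAN cluster labels, pick the post in each
# cluster that minimizes total distance to its cluster-mates.
import numpy as np

def cluster_medoids(embeddings, labels):
    labels = np.asarray(labels)
    medoids = {}
    for label in set(labels.tolist()):
        if label == -1:          # DBSCAN noise points carry no taste
            continue
        idx = np.flatnonzero(labels == label)
        points = embeddings[idx]
        # pairwise Euclidean distances within the cluster
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        medoids[label] = int(idx[dists.sum(axis=1).argmin()])
    return medoids

emb = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # cluster 0
                [5.0, 5.0], [5.1, 5.0],               # cluster 1
                [9.0, 0.0]])                          # noise
labels = [0, 0, 0, 1, 1, -1]
print(cluster_medoids(emb, labels))  # {0: 0, 1: 3}
```

Each medoid index identifies the central post of one taste cluster; those posts are then used as queries for similar-content search.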

Algorithm for calculating short-term and long-term tastes.


To cluster posts and to search for similar content, we developed the item2vec approach. Let's figure out how it works.

Implementation of the item2vec approach

In our implementation, the Item2vec pipeline performs the following steps:

  • all available content types (video, text, and image) are extracted from each post;

  • for each type of content, a separate embedding is obtained;

  • the output is a single vector in an n-dimensional space.

We implemented item2vec as a two-level pipeline:

  • at the first level, content models vectorize each type of content separately;

  • at the second level, the resulting vectors are concatenated and passed through a fully connected neural network to obtain a vector of the required dimension.

At the first level we use the following configuration of models:

  • for text – a RoBERTa model additionally trained on a large corpus of Russian-language text;

  • for images – the CLIP model developed by OpenAI;

  • for video – a transformer on top of EfficientNet embeddings of 16 evenly spaced frames, trained by us on storyboards extracted from videos.

At the second level, item2vec is a fully connected neural network.
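A minimal numpy sketch of this second level – concatenating the per-modality embeddings and passing them through one dense layer. The dimensions, the tanh nonlinearity, and the random weights here are illustrative assumptions, not our production configuration.

```python
# Sketch of the second level: concatenate per-modality embeddings
# and map them to a single fixed-dimension post vector.
import numpy as np

rng = np.random.default_rng(0)
TEXT_DIM, IMAGE_DIM, VIDEO_DIM, OUT_DIM = 768, 512, 256, 128

# Weights of the fully connected layer (random here; trained in reality).
W = rng.normal(size=(OUT_DIM, TEXT_DIM + IMAGE_DIM + VIDEO_DIM))
b = np.zeros(OUT_DIM)

def item2vec(text_emb, image_emb, video_emb):
    concat = np.concatenate([text_emb, image_emb, video_emb])
    vec = np.tanh(W @ concat + b)          # one dense layer with nonlinearity
    return vec / np.linalg.norm(vec)       # unit-normalize for cosine search

post_vec = item2vec(rng.normal(size=TEXT_DIM),
                    rng.normal(size=IMAGE_DIM),
                    rng.normal(size=VIDEO_DIM))
print(post_vec.shape)  # (128,)
```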

Dataset collection and validation

To train the second-level model, we collect a dataset recording which posts a user interacts with consecutively within one session. Our assumption is that if the user likes both posts, they can be considered somewhat similar, so we want such posts to end up close to each other in our n-dimensional space.
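Collecting such pairs can be sketched as follows; the session log format is a made-up illustration of the idea, not our actual event schema.

```python
# Sketch of training-pair collection: within one session, every pair
# of posts the user reacted positively to becomes a "similar" pair.
from itertools import combinations

def positive_pairs(sessions):
    pairs = []
    for session in sessions:
        liked = [e["post"] for e in session if e["liked"]]
        pairs.extend(combinations(liked, 2))
    return pairs

sessions = [
    [{"post": "a", "liked": True}, {"post": "b", "liked": False},
     {"post": "c", "liked": True}],
    [{"post": "d", "liked": True}],   # a single like yields no pair
]
print(positive_pairs(sessions))  # [('a', 'c')]
```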

Having collected pairs of such positive interactions, we train our second-level model and can then compute embeddings for posts.

Next, we proceed to check the quality of the model.

Offline validation

First, we perform offline validation – testing straight away online with real users is painful and expensive.

For offline validation, we collected a reference dataset and labeled it with 23 categories and 60 subcategories.


Distribution of categories in the reference dataset.

For each post, we look for the closest posts in the resulting n-dimensional space and calculate the MAP@K (mean average precision at K) metric on them. If the metric improves, the model is considered good enough to pass on to online validation through A/B testing.
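The metric computation can be sketched as follows: a retrieved neighbor counts as relevant when it shares the reference category label with the query post. Note this uses one common variant of the AP@K normalization; details may differ in our production evaluator.

```python
# Sketch of offline validation with MAP@K over category-matched neighbors.
def average_precision_at_k(relevant_flags, k):
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(relevant_flags[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / i
    return precision_sum / min(k, len(relevant_flags)) if hits else 0.0

def map_at_k(neighbors, categories, k):
    scores = []
    for post, neigh in neighbors.items():
        flags = [categories[n] == categories[post] for n in neigh]
        scores.append(average_precision_at_k(flags, k))
    return sum(scores) / len(scores)

categories = {"p1": "cats", "p2": "cats", "p3": "cars", "p4": "cats"}
neighbors = {"p1": ["p2", "p3", "p4"],   # relevant at ranks 1 and 3
             "p3": ["p1", "p2", "p4"]}   # no relevant neighbors
print(round(map_at_k(neighbors, categories, k=3), 3))  # 0.278
```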

Online validation

We run online validation on the OK A/B platform. It automates most of the work needed to build a correct experiment: selecting an audience and computing the results. So, to battle-test a model, a data scientist only has to look at the experiment results and decide whether the model can be released to production or needs further changes.

You can read more about our A/B platform in our earlier posts.

Item2vec in production

To understand the functioning of the item2vec pipeline in production, let's look at two services necessary for its operation: Vector Index and Feature store.

Vector Index — a service for fast approximate nearest-neighbor search in high-dimensional vector spaces. It consumes objects streamed from Kafka topics and stores them in a FAISS index. We use it to store post vectors and to search for posts with similar content.
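What the index computes can be illustrated with a small stand-in: exact brute-force cosine search in numpy in place of FAISS's approximate search (in production the whole point is that FAISS does this approximately and fast over billions of vectors).

```python
# Illustration of the nearest-neighbor query that Vector Index serves:
# exact cosine search as a stand-in for FAISS's approximate search.
import numpy as np

def nearest_posts(index_vecs, post_ids, query, k=2):
    index = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = index @ q                 # cosine similarity to every post
    top = np.argsort(-sims)[:k]      # k most similar posts
    return [post_ids[i] for i in top]

vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ids = ["sport", "football", "cooking"]
print(nearest_posts(vecs, ids, np.array([1.0, 0.05]), k=2))  # ['sport', 'football']
```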

You can learn more about this tool from Andrey Kuznetsov's talk at CodeFest last year.

Vector Index architecture.


Feature store — a service for delivering features in batch and streaming modes; it supports flexible feature storage formats and handles large volumes of data (billions of objects, 25 TB). In this service we store, for each user, the vectors of his long-term and short-term tastes.

My colleague Andrey Kuznetsov has described our feature store in more detail in a separate post.

Feature store architecture.


Final pipeline

As a result, the architecture of our service in production looks like this:

  • Groups generate posts for the recommendation feed – typically more than 400 thousand posts per day.

  • New posts arrive in the Kafka topic and sequentially pass through all content models.

  • After processing, the content embeddings are sent to item2vec, which converts them into lower-dimensional vectors and stores the result in the Vector Index.

Loading embeddings into Vector Index.


Next, when the user application is loaded, the following happens:

  • the backend receives a request from a client, for example, from an Android application;

  • the backend contacts the Feature store and, using the user ID, requests his short-term and long-term tastes;

  • with the short-term and long-term taste vectors in hand, the backend queries the Vector Index and receives candidate posts;

  • these posts are sent to the ranker, where the XGBoost model ranks them and returns a sorted list of posts to the backend;

  • The backend sends relevant content to the user's application.

Processing a request from a client.


All of this logic runs in a fraction of a second, and the user immediately sees posts in the news feed that match his tastes.

Example of recommendations.


Conclusions

Vectorization of posts and the recommendation pipeline described above gave us high-quality sources of long-term and short-term tastes for the recommendation feed – at the time of release, they were the most-clicked of all the sources in the non-subscription feed.

Based on the results of A/B tests, new sources improved the recommendations feed metrics:

  • the number of users reading more than 40 posts increased by 4%;

  • user time in the feed increased by 3%;

  • the amount of positive feedback increased by 2.5%.

At the same time, we are not resting on our laurels and continue to develop the tool: trying new models for content embeddings, experimenting with the structure of the second-level model, and exploring alternative applications.
