Multimodal transformer for content-based recommendations

At first glance, it may seem that nothing interesting is happening in RecSys and that everything was settled long ago: collect interactions between users and products, feed them into some library that implements collaborative filtering, and the recommendations are ready. Meanwhile, almost every other area of machine learning has either switched to transformer-based neural networks (NLP, CV, Speech) or is experimenting with them (time series, tabular ML). In fact, recommender systems are no exception, and research on applying transformers to them has been going on for quite some time.

My name is Dima, I am a Data Scientist at Cyan on the ranking and recommendations team, where we try to keep up with the latest advances in RecSys. Today I want to share our experience of using multimodal transformers for content-based recommendations.


Most recommendation approaches rely on information about interactions between users and products. Although these approaches perform well in various domains, most of them have serious drawbacks. The main one is the need for unique user/product identifiers, and the resulting need to re-train the model whenever new products, users, or interactions appear.

In the era of victorious transformers, it seems strange that collaborative methods such as matrix factorization are still used instead of transformer encoders, which could take as input the descriptions of the products the user interacted with and output the embedding of the product that should come next.

In theory, such a model would avoid the disadvantages of ID-based methods and could work just as well on products and users that were not in the training dataset.

After a little searching, we found the paper Text Is All You Need: Learning Language Representations for Sequential Recommendation. The authors propose representing a product as a set of its attributes, and describing a user as the attributes of the products they interacted with. The same encoder is used to obtain embeddings for both products and users.

Using contrastive learning, the model learns to generate a user embedding such that the embedding of the next product the user interacted with is closer to it in cosine distance than the embeddings of other products. Recommendations are then simply the k products closest to the user's embedding.
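To make the retrieval step concrete, here is a minimal sketch (our illustration, not code from the paper) of ranking products by cosine similarity to a user embedding; the embedding dimension and k are arbitrary:

```python
import torch
import torch.nn.functional as F

def recommend_top_k(user_emb: torch.Tensor, item_embs: torch.Tensor, k: int = 30) -> torch.Tensor:
    """Return the indices of the k items closest to the user embedding by cosine similarity.

    user_emb:  (d,) embedding produced by the shared encoder from the user's interaction history.
    item_embs: (num_items, d) embeddings produced by the same encoder from item attributes.
    """
    user_emb = F.normalize(user_emb, dim=-1)
    item_embs = F.normalize(item_embs, dim=-1)
    scores = item_embs @ user_emb          # cosine similarity with every item
    return scores.topk(k).indices

# Toy usage with random embeddings
recs = recommend_top_k(torch.randn(768), torch.randn(10_000, 768), k=30)
```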

Let us describe in more detail what is proposed in the article:

Training takes place in 2 stages:

  • Pre-training

  • Two-stage fine-tuning

Pre-training

The goal of pre-training is to obtain a good weight initialization for fine-tuning.

The model is trained on two tasks. The first is Masked Language Modeling (MLM): part of the input tokens is replaced with a [MASK] token, and the model has to infer from context which tokens were masked.

The goal of the second task, item-item contrastive (IIC), is to minimize the distance between a user's embedding and the embedding of the next product in their interaction history, while maximizing the distance to the embeddings of the next products of the other users in the batch.

L_{\text{InfoNCE}} = -\log \frac{\exp(\text{sim}(q, p))}{\exp(\text{sim}(q, p)) + \sum_{n \in N} \exp(\text{sim}(q,n))}
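A minimal PyTorch sketch of this item-item contrastive loss with in-batch negatives (our illustration; sim(·,·) is taken to be cosine similarity and the temperature value is an assumption):

```python
import torch
import torch.nn.functional as F

def item_item_contrastive_loss(user_embs: torch.Tensor,
                               next_item_embs: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: for each user the positive is their next item's embedding,
    and the negatives are the next items of the other users in the batch.

    user_embs, next_item_embs: (batch, d) tensors produced by the shared encoder.
    """
    q = F.normalize(user_embs, dim=-1)
    p = F.normalize(next_item_embs, dim=-1)
    logits = q @ p.T / temperature                       # (batch, batch) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)   # the diagonal holds the positives
    return F.cross_entropy(logits, targets)
```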

Two-stage fine-tuning

Fine-tuning occurs in 2 stages.

At the first stage, product embeddings are recalculated once per epoch, and the InfoNCE loss is used again, but this time the negatives are not just the other ground-truth items in the batch: they are all the items that appear in any user's history.

At the second stage, product embeddings are frozen and fine-tuning continues.

Next, the authors compare their approach with others (ID-only, ID-text, text-only) on the Amazon Product Reviews dataset and beat everyone (or almost everyone). They also assess the zero-shot capabilities of the model, pre-training on some datasets and testing on others. There the comparison is against text-only methods; their approach wins again, and the authors conclude that pre-training produces knowledge that transfers to downstream datasets.

Our modifications and application experience

Changes during training

In our experiments we did not see a significant effect from pre-training, and eventually abandoned it. Also, the second stage of the two-stage fine-tuning reduces metrics when testing on new products and new users, which is not surprising: we freeze the product embeddings, so changes made during training no longer affect them.

We use only the first stage of fine-tuning, and re-encode products once every N steps.
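A sketch of the resulting loss in our setup: the negatives come from the full catalog of item embeddings, which we re-encode every N steps and keep detached from the graph between refreshes (the temperature is again an assumption):

```python
import torch
import torch.nn.functional as F

def info_nce_full_catalog(user_embs: torch.Tensor,
                          positive_ids: torch.Tensor,
                          item_bank: torch.Tensor,
                          temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE against the whole item bank instead of in-batch negatives.

    user_embs:    (batch, d) user embeddings from the encoder.
    positive_ids: (batch,) index of each user's ground-truth next item in the bank.
    item_bank:    (num_items, d) item embeddings, refreshed every N steps; no gradient
                  flows into the bank between refreshes.
    """
    q = F.normalize(user_embs, dim=-1)
    bank = F.normalize(item_bank.detach(), dim=-1)
    logits = q @ bank.T / temperature        # (batch, num_items)
    return F.cross_entropy(logits, positive_ids)
```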

Changes to the loss

Instead of predicting just the next item, we can try counting multiple items as positives. This can be done in two ways:

Log-in: the sum over positive items is placed inside the logarithm

L_{\text{InfoNCE}} = -\log \frac{\sum_{p \in P}\exp(\text{sim}(q, p))}{\sum_{p \in P}\exp(\text{sim}(q, p)) + \sum_{n \in N} \exp(\text{sim}(q,n))}

Log-out: the sum over positive items is placed outside the logarithm

L_{\text{InfoNCE}} = -\sum_{p \in P}\log \frac{\exp(\text{sim}(q, p))}{\exp(\text{sim}(q, p)) + \sum_{n \in N} \exp(\text{sim}(q,n))}

We implemented the Log-in option. In our tests, recall grew faster than with the original loss, but an ablation study is still needed to assess more precisely how the different loss functions affect the metrics.
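Below is a sketch of both variants for a single user (our illustration; P positives, N negatives, cosine similarity with an assumed temperature):

```python
import torch
import torch.nn.functional as F

def multi_positive_info_nce(q: torch.Tensor,
                            pos_embs: torch.Tensor,
                            neg_embs: torch.Tensor,
                            variant: str = "log_in",
                            temperature: float = 0.05) -> torch.Tensor:
    """Multi-positive InfoNCE for one user embedding q (d,), positives (P, d), negatives (N, d)."""
    q = F.normalize(q, dim=-1)
    pos = F.normalize(pos_embs, dim=-1) @ q / temperature   # (P,) similarities to positives
    neg = F.normalize(neg_embs, dim=-1) @ q / temperature   # (N,) similarities to negatives

    if variant == "log_in":
        # Sum over positives inside the logarithm (a single fraction)
        num = torch.logsumexp(pos, dim=0)
        den = torch.logsumexp(torch.cat([pos, neg]), dim=0)
        return den - num                      # equals -log(num/den) in log space
    else:
        # "log_out": one InfoNCE term per positive, summed outside the logarithm
        neg_lse = torch.logsumexp(neg, dim=0)
        den = torch.logaddexp(pos, neg_lse)   # per-positive: log(exp(s_p) + sum_n exp(s_n))
        return (den - pos).sum()
```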

Changes in data

Our data differ significantly from the data used by the authors of the paper. The original article deals mostly with text: product names, textual descriptions of their attributes, and attribute keys written in human-readable language.

Our data look much more like tabular data: numbers, boolean values, and categorical text values, with opaque numeric keys. Example of product data:

{
	1: "dailyFlatRent",
	2: "11",
	3: "45.4",
	5: "True",
	6: "False"
}
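One simple way to feed such attributes into a text encoder is to render them as key-value pairs, in the spirit of the paper's attribute flattening. The sketch below is only an illustration: the key names, separator, and exact format are hypothetical, not our production scheme.

```python
def item_to_text(item: dict, key_names: dict) -> str:
    """Render tabular-style attributes as "key value" pairs that the tokenizer can consume."""
    parts = []
    for key_id, value in item.items():
        name = key_names.get(key_id, f"attr_{key_id}")  # fall back to a synthetic key name
        parts.append(f"{name} {value}")
    return " ; ".join(parts)

item = {1: "dailyFlatRent", 2: "11", 3: "45.4", 5: "True", 6: "False"}
# Hypothetical mapping from numeric keys to readable names
key_names = {1: "deal_type", 2: "floor", 3: "area", 5: "has_balcony", 6: "is_studio"}

print(item_to_text(item, key_names))
# deal_type dailyFlatRent ; floor 11 ; area 45.4 ; has_balcony True ; is_studio False
```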

Images

As they say, a picture is worth about 256 words, so it was interesting to see what the model would learn if products were described only by their pictures. To do this, we extracted image embeddings with EVA-02 ViT B/16 and fed them to the Longformer through a single trainable linear layer, initialized as an identity matrix so that it introduces no changes at the start of training.
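A minimal sketch of that projection layer (assuming both the image embedding and the Longformer hidden size are 768-dimensional; the exact dimensions are an assumption here):

```python
import torch
import torch.nn as nn

class ImageProjection(nn.Module):
    """Trainable linear layer between the image encoder and the Longformer input.
    Initialized as the identity so that image embeddings pass through unchanged
    at the start of training."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=True)
        with torch.no_grad():
            self.proj.weight.copy_(torch.eye(dim))  # identity weight matrix
            self.proj.bias.zero_()                  # no shift at initialization

    def forward(self, image_embs: torch.Tensor) -> torch.Tensor:
        return self.proj(image_embs)
```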

As a result of training, the model learned to approximately determine the price, area, number of rooms and “vibe” of the renovation. We discovered this by analyzing the closest embeddings for different products.

Example of products with the same “vibe” of photos

Subsequently, we fed the images in together with the other attributes and again saw a noticeable improvement.

Meta information

Next, we tried to enrich the context with additional information about the interaction between the user and products. We added the action type and the quantized click duration as separate attributes that are fed in alongside the product attributes. This introduces a slight mismatch between product and user representations, because the metadata is not present when we encode products on their own. Nevertheless, we again saw an increase in metrics.
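A sketch of how click duration can be quantized into a discrete attribute that is fed in alongside the product attributes (the bucket boundaries and attribute names here are illustrative, not our production values):

```python
import bisect

DURATION_BUCKETS = [5, 15, 30, 60, 120, 300]  # seconds; illustrative boundaries

def interaction_attributes(action_type: str, click_duration_s: float) -> dict:
    """Turn raw interaction metadata into categorical attributes for the encoder."""
    bucket = bisect.bisect_right(DURATION_BUCKETS, click_duration_s)
    return {
        "action_type": action_type,              # e.g. "view", "call", "favorite"
        "duration_bucket": f"bucket_{bucket}",   # categorical token instead of a raw number
    }

print(interaction_attributes("view", 47.0))
# {'action_type': 'view', 'duration_bucket': 'bucket_3'}
```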

Results

We compared the results on the offline recall@30 metric:

First-level product model (no ranking): 0.28

LongFormer: 0.31

LongFormer + meta_data: 0.32

LongFormer + meta_data + images: 0.34

We received an increase in metrics, as well as pleasant bonuses, such as:

  • The model does not need to be retrained; it shows approximately the same performance over different periods of time, with new products and users.

  • We can recommend a product that has not yet been interacted with to any user who has at least 1 product in their history.

We are currently working on implementing the model in production.

First, we will run experiments on item-item recommendations (the “Similar ads” block). This requires the least effort, since the recommendations can be computed offline and do not depend on user history. Next, we will experiment with user recommendations delivered via push notifications; these can also be computed offline, and here we can test user-item recommendations.

If successful, we will try to serve recommendations online, starting by recalculating them every N minutes.

On one 2080Ti you can generate recommendations for ~45k users in 35 minutes. Performance can be improved with quantization, optimization frameworks such as TensorRT, or by compiling the model with torch.compile.
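For example, a rough sketch of batched user encoding with torch.compile (the checkpoint name and first-token pooling are illustrative assumptions; our production setup differs in details):

```python
import torch
from transformers import LongformerModel

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LongformerModel.from_pretrained("allenai/longformer-base-4096").to(device).eval()
model = torch.compile(model)  # PyTorch 2.x graph compilation

@torch.inference_mode()
def encode_users(input_ids: torch.Tensor, attention_mask: torch.Tensor, batch_size: int = 64):
    """Encode user histories in batches and return one embedding per user."""
    outs = []
    for i in range(0, input_ids.size(0), batch_size):
        out = model(input_ids[i:i + batch_size].to(device),
                    attention_mask=attention_mask[i:i + batch_size].to(device))
        outs.append(out.last_hidden_state[:, 0])  # first-token ("CLS"-style) embedding
    return torch.cat(outs)
```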

Potential for further development

The flexibility of the transformer architecture allows further development of this method:

  • You can use the Large rather than the base Longformer; this will significantly slow down training and inference, but may give better results.

  • You might get more out of the images by extracting embeddings differently: train a VAE/VQ-VAE/VQGAN on your own dataset, or attach a small network such as EfficientNet and train it end-to-end with the transformer so that it extracts exactly the image features needed for recommendations. Or simply use CLIP encoders with more layers and higher input resolution.

  • You can add not only pictures but also extra custom text. There is a lot of it, and inserting raw text tokens directly will most likely not fit into the context, but you can use pre-trained models, or pre-train BERT/Longformer on your own product descriptions, and insert the resulting text embeddings into the context.

Conclusion

We experimented with a promising technology: a transformer that does not need IDs to make recommendations. This approach not only shows good recommendation quality, but also lets us recommend products sooner: there is no need to wait for interaction data about a product to accumulate; as soon as the product appears, we already know who might like it.

But most importantly, we no longer have restrictions on the time period over which we collect the training dataset: we can use data from the past to improve the quality of predictions in the future, better understanding what interests the user and what they want to do next.
