Summary review of papers on RecSys + Transformers

The task of a recommender system is to anticipate user needs. Simple models cannot capture the hidden patterns of user behavior, but the problem can be approached by modeling the interaction history as a sequence (Sequential Recommendation). Transformer-style architectures have recently been particularly successful at sequence modeling. Below is a brief overview of important papers at the intersection of RecSys and Transformers.

SASRec: Self-Attentive Sequential Recommendation

https://arxiv.org/pdf/1808.09781

The authors position their approach as a trade-off between simple models (Markov Chains) and models that capture long-term dependencies (RNNs). An MC cannot model complex dependencies, since it predicts from only one or a few previous items, while an RNN consumes the entire interaction history, which contains a lot of noise. Self-attention models can process long context when analyzing sequences, yet still base their predictions on a small number of relevant items.

The recommendation task is formalized in the Transformer setting by replacing text tokens with items.
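To make the analogy concrete, here is a minimal sketch (my own illustration, not the authors' code) of how a user's interaction history becomes a next-item-prediction training pair, exactly as a text sequence would for a language model. The padding id 0 and `max_len` are assumptions.

```python
def make_training_pair(history, max_len=5):
    """Turn one user history into an (input, target) pair of item ids.

    Items play the role of text tokens; sequences are truncated to the
    most recent items and left-padded with 0 (a reserved padding id).
    At every position the model must predict the next item.
    """
    seq = history[-(max_len + 1):]          # keep the most recent items
    inputs, targets = seq[:-1], seq[1:]     # shift by one position
    pad = [0] * (max_len - len(inputs))
    return pad + inputs, pad + targets

x, y = make_training_pair([3, 7, 2, 9], max_len=5)
print(x)  # [0, 0, 3, 7, 2]
print(y)  # [0, 0, 7, 2, 9]
```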

[Figure: SASRec learning process]

In terms of architecture, everything is standard: MultiHeadAttention, Position-wise Feed-Forward Network, Layer Normalization, Residual Connection, Dropout, Positional Encoding.

To obtain the final relevance score of an item given the sequence, the prediction layer takes the dot product of the last Transformer layer's hidden state with the item's embedding vector. A notable design choice is using a single shared embedding matrix both at the model input and in the prediction layer. Earlier models with shared embeddings struggled to capture asymmetric relationships between items, but SASRec avoids this problem thanks to its nonlinear architecture.
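A minimal numpy sketch of these two pieces (my own simplification, not the authors' implementation: a single head with Q = K = V, no learned projections, positional encoding omitted). It shows the causal mask that restricts each position to earlier items, and the prediction layer scoring every item with the same embedding matrix used at the input.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, seq_len = 10, 4, 3

E = rng.normal(size=(n_items, d))        # shared item embedding matrix
seq = np.array([2, 5, 7])                # a user's item sequence
H = E[seq]                               # input states for the sequence

# Single-head causal self-attention: a lower-triangular mask ensures
# position t attends only to items at positions <= t.
scores = H @ H.T / np.sqrt(d)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
scores = np.where(mask, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ H

# Prediction layer: relevance of every catalog item to the sequence is
# the dot product of the last hidden state with the SAME matrix E.
relevance = out[-1] @ E.T                # shape: (n_items,)
top_items = np.argsort(-relevance)[:3]   # ranked recommendations
```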

How is the model trained? Each original item sequence is expanded into subsequences, and the loss is computed as the sum of the loss for the positive example (the actual next item) and the loss for one randomly sampled negative example.

[Figure: SASRec loss formula]
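For reference, the binary cross-entropy loss from the SASRec paper can be written as follows (reconstructed here from the original, with $r_{i,t}$ the predicted relevance of item $i$ at step $t$, $o_t$ the ground-truth next item, and $j$ a sampled negative):

```latex
\mathcal{L} = -\sum_{\mathcal{S}^u \in \mathcal{S}} \sum_{t \in [1,\dots,n]}
\Big[ \log \sigma(r_{o_t,t})
    + \sum_{j \notin \mathcal{S}^u} \log\big(1 - \sigma(r_{j,t})\big) \Big]
```

In practice the inner sum runs over a single sampled negative item per step, as described above.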

The paper shows that on sparse datasets the model tends to rely on just a few recent items, while on dense datasets it adapts to take long-term dependencies into account.

Metrics were computed with the leave-one-out technique: for each user, the last item goes to the test set, the second-to-last to the validation set, and the rest to the train set. Negative sampling was also used for evaluation: for each ground-truth interaction, 100 negative items were sampled according to popularity, so the metrics evaluate a ranked list of 100 known-false interactions plus the single correct one.
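The evaluation protocol above can be sketched in a few lines (my own illustration; the true item's rank among the 101 candidates is assumed to come from sorting the model's scores):

```python
import math

def leave_one_out(history):
    """Last item -> test, second-to-last -> validation, rest -> train."""
    return history[:-2], history[-2], history[-1]

def hr_ndcg_at_k(rank, k=10):
    """Metrics for one user, given the 0-based rank of the true item
    among 1 positive + 100 sampled negatives (lower rank = better).
    HR@k is 1 if the item lands in the top k; NDCG@k rewards higher
    positions logarithmically."""
    hr = 1.0 if rank < k else 0.0
    ndcg = 1.0 / math.log2(rank + 2) if rank < k else 0.0
    return hr, ndcg

train, val, test = leave_one_out([1, 4, 6, 8, 9])
print(train, val, test)   # [1, 4, 6] 8 9

# Suppose the true item was ranked 3rd (rank index 2) of 101 candidates:
hr, ndcg = hr_ndcg_at_k(2)
print(hr, ndcg)           # 1.0 0.5
```

Final HR@10 and NDCG@10 are these per-user values averaged over all users.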

SASRec outperformed all previous SOTA models in HR and NDCG across the datasets tested (Amazon Beauty, Steam, MovieLens-1M).

BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer

https://arxiv.org/pdf/1904.06690

In 2019, researchers from Alibaba first brought BERT and RecSys together, introducing the BERT4Rec architecture. The idea follows logically from SASRec: if unidirectional models handle RecSys benchmarks well, why not experiment with bidirectional ones?

In this way it is possible to:

  • Condition each item on its right-hand context as well, obtaining better representations, while substantially enlarging the training sample (a combinatorial explosion in the number of masked/unmasked permutations).

  • Relax the assumption that items follow a strict order, which does not always hold in real user behavior.

The model is trained on the MLM (Masked Language Modeling), or Cloze, task: a certain fraction of randomly chosen items in the sequence is masked and must be recovered. At inference time, it suffices to append a [mask] token at the end of the sequence, thereby telling the model exactly what behavior we expect from it.
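A sketch of both regimes (my own illustration; the mask id 0 and the masking probability are assumptions, any reserved id and tuned fraction work):

```python
import random

MASK = 0  # reserved [mask] token id (an assumption; any unused id works)

def cloze_mask(seq, p=0.2, rng=random.Random(42)):
    """Training: randomly replace a fraction p of items with [mask];
    the model is trained to recover the original ids at exactly the
    masked positions (no loss elsewhere)."""
    masked, labels = [], []
    for item in seq:
        if rng.random() < p:
            masked.append(MASK)
            labels.append(item)     # predict the original item here
        else:
            masked.append(item)
            labels.append(None)     # no loss at unmasked positions
    return masked, labels

def inference_input(seq):
    """Inference: append [mask] at the end -- 'predict the next item'."""
    return seq + [MASK]

print(inference_input([3, 7, 2]))  # [3, 7, 2, 0]
```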

[Figure: BERT4Rec architecture]

BERT4Rec outperformed SASRec on all metrics (HR, NDCG, MRR) on all datasets (Amazon Beauty, Steam, MovieLens-1M, MovieLens-20M).

In the ablation study, ignoring item order hurt BERT4Rec far more than it hurts SASRec: on MovieLens-1M and MovieLens-20M the metrics degrade by more than 50%, while on datasets with short sequences the drop is small.

Later, a follow-up paper (https://arxiv.org/pdf/2207.07483) addressed inconsistencies that other researchers had found in the original. Its authors surveyed 40 publications comparing BERT4Rec and SASRec across various metrics and datasets: SASRec won in 36% of cases, supporting the hypothesis that BERT4Rec is not consistently better than SASRec and that the original results are hard to reproduce.

Four different implementations of BERT4Rec were compared: original, RecBole, BERT4Rec-VAE, and the one proposed by the authors.

Although negative sampling is arguably a poor choice for computing evaluation metrics, the main metrics were still computed with it, to keep the study comparable with the original paper.

Key takeaways from the paper:

  • BERT4Rec shows different results for different implementations.

  • When trained long enough, BERT4Rec is indeed the stronger architecture for sequence modeling, but reproducing the reported quality takes far longer to train (up to 30x longer!) than the original paper suggests.

  • Swapping in other BERT-like backbones for the one used in BERT4Rec can significantly improve the metrics.

gSASRec: Reducing Overconfidence in Sequential Recommendation Trained with Negative Sampling

https://arxiv.org/pdf/2308.07192

This paper argues that the main reason SASRec loses to BERT4Rec on metrics is overconfidence caused by negative sampling: positives make up a far larger share of the training signal relative to negatives than they do in real life. The model thus learns to distinguish good items from bad ones, while the task of ranking the best items at the top is neglected. By this account, BERT4Rec's advantage comes not from bidirectional attention but from avoiding negative sampling. However, that choice makes the model impractical for many real-world problems due to computational cost, since the full Softmax used by BERT4Rec requires storing N*S scores per sequence, where S is the sequence length and N is the number of items in the catalog. The authors also propose a new model that outperforms BERT4Rec itself: gSASRec, trained with a new loss function, gBCE.

[Figure: gBCE formula]
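Up to notation, the gBCE loss applies a power $\beta$ to the sigmoid of the positive score only (here $s^{+}$ is the positive logit, $s_i^{-}$ the $k$ sampled negative logits, and $|I|$ the catalog size):

```latex
\mathcal{L}^{\mathrm{gBCE}} = -\log\big(\sigma^{\beta}(s^{+})\big)
  - \sum_{i=1}^{k} \log\big(1 - \sigma(s_i^{-})\big),
\qquad
\beta = \alpha\Big(t\big(1 - \tfrac{1}{\alpha}\big) + \tfrac{1}{\alpha}\Big),
\quad
\alpha = \frac{k}{|I| - 1}
```

With calibration parameter $t = 0$ we get $\beta = 1$, i.e., plain BCE with sampled negatives; $t = 1$ gives $\beta = \alpha$, the negative sampling rate.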

In essence, gBCE is a generalization of BCE: it introduces a parameter β that controls the shape of the sigmoid. The paper shows analytically that with a very large item catalog, BCE pushes the predicted probabilities of top items to nearly 1. Experiments show that with the right combination of the hyperparameters k (the number of negatives) and t (a calibration parameter that effectively determines β), SASRec matches BERT4Rec and even beats it on some datasets. Notably, gBCE benefits SASRec more than BERT4Rec, since the former is trained to predict the next item, which is closer in spirit to sequence modeling than MLM. A final advantage of gSASRec is that it trains much faster.
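A runnable numpy sketch of gBCE, up to notation (an assumption of mine, not the authors' code), showing how the calibration parameter t maps to β and how t = 0 recovers plain BCE:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gbce_loss(pos_score, neg_scores, t, n_items):
    """gBCE for one prediction step: sigma^beta is applied only to the
    positive logit; beta is derived from the calibration parameter t
    and the negative sampling rate alpha = k / (n_items - 1)."""
    k = len(neg_scores)
    alpha = k / (n_items - 1)
    beta = alpha * (t * (1 - 1 / alpha) + 1 / alpha)
    pos_term = beta * np.log(sigmoid(pos_score))   # log(sigma^beta) = beta*log(sigma)
    neg_term = np.sum(np.log1p(-sigmoid(np.asarray(neg_scores))))
    return -(pos_term + neg_term)

# t = 0 gives beta = 1 and recovers plain BCE with sampled negatives;
# larger t dampens the positive term, curbing overconfidence.
loss_bce = gbce_loss(2.0, [-1.0, 0.5], t=0.0, n_items=1000)
loss_cal = gbce_loss(2.0, [-1.0, 0.5], t=1.0, n_items=1000)
```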

TRON: Scaling Session-Based Transformer Recommendations using Optimized Negative Sampling and Loss Functions

https://arxiv.org/pdf/2307.14906

Session-based recommender systems predict the next item from the current session, i.e., the sequence of items the user has interacted with so far. There are several ways to sample negative examples:

  1. From a uniform distribution over all items

  2. From a popularity-based distribution, i.e., proportional to how frequently each item appears in interactions

  3. Combination of 1 and 2.

Negatives can also be drawn at different granularities: per positive interaction (elementwise), per session (sessionwise), or per batch (batchwise). The elementwise approach is noticeably more computationally expensive; the other approaches are faster.
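The two cheaper granularities can be sketched as follows (my own toy illustration; item ids, session contents, and the number of negatives are made up):

```python
import random

rng = random.Random(7)
n_items = 1000

batch = [[12, 55, 7], [301, 12, 99]]   # two sessions of item ids

def uniform_negatives(k):
    """Sessionwise: one set of uniform negatives shared by every
    position in a session, instead of a fresh draw per interaction."""
    return [rng.randrange(1, n_items) for _ in range(k)]

sessionwise = {i: uniform_negatives(4) for i, _ in enumerate(batch)}

# Batchwise: positives from the OTHER sessions in the batch double as
# (implicitly popularity-biased) negatives, at almost no sampling cost.
# Real implementations also filter out accidental collisions with the
# session's own positives (e.g., item 12 here).
batchwise = {
    i: [it for j, s in enumerate(batch) if j != i for it in s]
    for i in range(len(batch))
}
print(batchwise[0])  # [301, 12, 99]
```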

TRON combines the sessionwise and batchwise approaches and adds top-k sampling of negatives, all on top of a SASRec backbone trained with a Sampled Softmax loss.

Experiments covered two combinations of sampling techniques (uniform+sessionwise and batchwise+sessionwise) and showed that, given a sufficiently large number of negative examples, TRON beats every other model in the comparison in most cases.
