How machine learning improves product catalog recommendations by 80%. Making collaborative filtering more effective

Are only 20% of the product recommendations shown during catalog searches based on behavior? Let’s find out!

Product recommendations have become an integral sales tool for e-commerce sites. These recommendation systems typically use collaborative filtering, a common approach that builds recommendations from user behavior. Collaborative filtering works when there is sufficient historical data about user interactions with interface elements; it is ineffective when interaction data is sparse or is collected for only some actions. According to the Pareto principle, roughly 20% of a site’s catalog typically receives 80% of the traffic, and the rest of the catalog does not have enough user interaction data. This is precisely the challenge in implementing behavior-based recommendations.

When collaborative filtering cannot be applied, you can use content-based recommendations, that is, find products with a similar appearance, characteristics, or description. However, with machine learning we can make the collaborative filtering approach effective even for products with minimal customer interaction data. Let’s discuss how to train an ML model to predict collaborative filtering embeddings from product characteristics, so that behavior-based recommendations are available even for products with incomplete data.

Modeling user behavior using an interaction matrix

Let’s first take a closer look at the collaborative filtering approach. We model the interaction between users and products as an “interaction matrix,” where the value of a cell is the probability that a given user will choose a given product. We only have partial information about the contents of this matrix based on explicit or observable customer behavior. The explicit behavior may be a rating or other evaluation of the product provided by the customer. Observable behavior shows how much a customer likes a product through implicit signals, such as when a customer views a product, adds it to a cart, or purchases it. The matrix cells remain empty for users and products that have not yet interacted. Our goal is to build a model that can predict the values of the empty cells of this interaction matrix.
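As a rough illustration of how such a matrix might be assembled from implicit signals, here is a minimal sketch. The event weights and the helper name `build_interaction_matrix` are assumptions made for this example, not values from the article, and the weighted sum is only a relevance score standing in for the probability described above.

```python
import numpy as np

# Hypothetical weights for implicit signals; the article does not specify any.
EVENT_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def build_interaction_matrix(events, n_users, n_products):
    """Accumulate weighted events into a dense user x product matrix.

    Cells with no events stay NaN so that "unknown" is not confused with zero.
    """
    matrix = np.full((n_users, n_products), np.nan)
    for user, product, event in events:
        current = matrix[user, product]
        base = 0.0 if np.isnan(current) else current
        matrix[user, product] = base + EVENT_WEIGHTS[event]
    return matrix

# Toy event log: (user index, product index, event type).
events = [(0, 0, "view"), (0, 0, "purchase"), (1, 2, "add_to_cart")]
R = build_interaction_matrix(events, n_users=2, n_products=3)
```

A real catalog would use a sparse matrix instead of a dense one, since most cells are empty.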

A classic and still popular approach to populating the interaction matrix is matrix factorization. We approximate the interaction matrix as the product of two lower-dimensional matrices: a matrix of user attributes, U, and a matrix of product attributes, V.

The dot product of a row of matrix U and a column of matrix V predicts the relevance of the recommendation for the corresponding user and product. Predictions for known user/product pairs should be as close as possible to the known relevance scores. We fit the two matrices to the known scores using optimization algorithms.
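The prediction step can be sketched in a few lines. The matrix sizes and random values below are purely illustrative; in practice U and V come out of training.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_products, k = 4, 6, 3   # k is the embedding size (illustrative)

U = rng.normal(size=(n_users, k))      # one row per user
V = rng.normal(size=(k, n_products))   # one column per product

def predict(user, product):
    """Predicted relevance: dot product of a user row and a product column."""
    return float(U[user] @ V[:, product])

# The full matrix of predictions is simply the matrix product.
full = U @ V
```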

After training, the user and product matrices contain hidden factors called embeddings, which represent user preferences and product features. Each user or product embedding is a vector of numbers. Each element of an embedding captures a specific aspect of user preferences or product characteristics that is useful for predicting user behavior. In a well-trained model, similar users and products are grouped together in the latent embedding space, which gives the model its predictive power.


Application of the ALS method

One of the classic approaches to matrix factorization is the alternating least squares (ALS) method.

In this approach, we define a loss function L and iteratively modify the user and product vectors to minimize it. The first part of the loss function is the squared deviation of the predicted rating from the true value. The second part is regularization, which helps prevent overfitting by reducing the norm of the embedding vectors.
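A common textbook form of such a loss is shown here as a sketch; the article does not give the exact formula, so the symbols are assumptions: K is the set of known user/product pairs, r_ui is the known relevance score, u_u and v_i are the user and product embeddings, and λ is the regularization strength.

```latex
L = \sum_{(u,i) \in K} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2
    + \lambda \left( \sum_{u} \lVert \mathbf{u}_u \rVert^2 + \sum_{i} \lVert \mathbf{v}_i \rVert^2 \right)
```

The first sum is the squared deviation over known cells, and the second term is the regularization penalty on the embedding norms.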

The process is iterative. At each step, a local optimization problem is solved for one matrix while the values of the other matrix are held fixed as constants. Then the optimization problem is solved for the other matrix, and the process is repeated. With one matrix fixed, the loss function at each step is convex in the other: it is a regularized least-squares problem whose only minimum is the global one, so each step can be solved exactly and the process usually converges quickly. The overall problem is not convex in both matrices at once, so the result is a good solution rather than a guaranteed global optimum.
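The alternating procedure can be sketched with NumPy as follows. This is a minimal, unoptimized illustration and not the implementation used in the article; the function name `als`, the toy ratings, and the hyperparameters are assumptions. Product embeddings are stored one per row here, so predictions are `U @ V.T`.

```python
import numpy as np

def als(R, mask, k=2, lam=0.1, n_iters=20, seed=0):
    """Factorize R ~ U @ V.T using only the cells where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    reg = lam * np.eye(k)
    losses = []
    for _ in range(n_iters):
        # Fix V and solve a small ridge regression for every user.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, mask[u]])
        # Fix U and solve a small ridge regression for every product.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[mask[:, i], i])
        # Squared error on known cells plus the regularization penalty.
        err = (R - U @ V.T)[mask]
        losses.append(float(err @ err + lam * ((U ** 2).sum() + (V ** 2).sum())))
    return U, V, losses

# Toy ratings; zeros mark unknown cells and are excluded by the mask.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0
U, V, losses = als(R, mask)
```

Because each half-step minimizes the loss exactly for one matrix, the recorded loss never increases from one iteration to the next.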

Matrix factorization produces meaningful product embeddings when relevance scores are available for many users and products. For rarely purchased products and new users, however, the method does not work well, because the embeddings it produces are filled mostly with random values.

Visualizations of embedding clusters that appeared after ALS training


In the center of the graph is a region of products with few user interactions (fewer than 300). Green dots are products with a large number of interactions, whose embeddings show a clear cluster structure after ALS training.

ALS embeddings give us more than product recommendations for users based on their interactions. Collaborative filtering methods also let you measure similarity between products, which makes it possible to recommend similar offers or alternatives to a recommended product. This is done with a nearest neighbor search in the latent space of product embeddings. However, the quality of such recommendations also suffers greatly when interaction data is scarce.
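A nearest neighbor search over product embeddings might look like the sketch below. Cosine similarity is an assumption here, since the article does not say which distance it uses, and the tiny embedding table is made up for illustration.

```python
import numpy as np

def nearest_products(embeddings, product_idx, top_n=3):
    """Return indices of the top_n products most similar by cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    scores = unit @ unit[product_idx]
    scores[product_idx] = -np.inf   # exclude the query product itself
    return np.argsort(-scores)[:top_n]

# Four toy product embeddings: products 0 and 1 point one way, 2 and 3 another.
emb = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [0.1, 0.9]])
neighbors = nearest_products(emb, product_idx=0, top_n=2)
```

For a large catalog, a brute-force scan like this would normally be replaced by an approximate nearest neighbor index.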

The chart below shows that as the number of interactions decreases, the percentage of irrelevant products in recommendations increases. This means that for rare products or new users who do not yet have interaction data, the ALS method cannot offer suitable recommendations.

Hybrid solutions for selection of recommendations

We need another approach that preserves the structure of ALS, with embeddings for products and users that let us solve the problem flexibly, but that also works effectively for “long tail” products.

We turn to deep learning (DL) methods to solve the recommendation problem. Because DL models are universal function approximators, they can predict ALS embeddings for all products, including those with few interactions, using descriptive information about products and users. For example, product attributes, title and description text, and the product image can be used as inputs for predicting ALS product embeddings.

We train the deep neural network on products that have enough interactions for ALS to work well, so that it learns to predict embeddings that are meaningful for ALS recommendations. As a result, the network learns to predict the behavioral attributes encoded in ALS from the available information: images, text, and attributes. This lets you obtain embeddings that are effective for recommendations even when real interaction data is missing or insufficient.

The neural network architecture is shown below:

To vectorize text features, we use a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model. For image embeddings, we use several attribute classifiers based on ResNeXt-50 CNN and then concatenate the resulting embedding values. We also include other features such as price, brand, and category in the final vector.
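The concatenation step can be sketched as follows. Random vectors stand in for the BERT and ResNeXt-50 outputs, and the embedding sizes, the log transform of price, and the one-hot encoding of brand and category are all assumptions made for this example.

```python
import numpy as np

def one_hot(index, size):
    """Encode a categorical id as a one-hot vector."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def build_product_features(text_emb, image_embs, price, brand_id, category_id,
                           n_brands, n_categories):
    """Concatenate all descriptive signals into a single input vector."""
    return np.concatenate([
        text_emb,                         # text embedding (stand-in for BERT)
        np.concatenate(image_embs),       # outputs of several image classifiers
        [np.log1p(price)],                # price, log-scaled (assumption)
        one_hot(brand_id, n_brands),
        one_hot(category_id, n_categories),
    ])

rng = np.random.default_rng(0)
text_emb = rng.normal(size=768)                         # placeholder size
image_embs = [rng.normal(size=128) for _ in range(3)]   # placeholder sizes
features = build_product_features(text_emb, image_embs, price=49.9,
                                  brand_id=2, category_id=5,
                                  n_brands=10, n_categories=20)
```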

To minimize the distance between predicted embeddings based on product and user descriptive data and ALS embeddings, we use the MSE (Mean Squared Error) loss function.
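To make the training objective concrete, here is a toy sketch in which a single linear layer stands in for the deep network and synthetic data stands in for real features and ALS embeddings. The model is fitted with gradient descent on the MSE loss using only the "popular" products, and then predicts embeddings for the held-out "long tail" products. All sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_products, n_features, emb_dim = 200, 16, 8

# Synthetic stand-ins: content features and the ALS embeddings to imitate.
X = rng.normal(size=(n_products, n_features))
true_W = rng.normal(size=(n_features, emb_dim))
als_emb = X @ true_W + 0.01 * rng.normal(size=(n_products, emb_dim))

# Pretend the first 150 products have enough interactions for reliable ALS.
popular = np.arange(n_products) < 150

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

W = np.zeros((n_features, emb_dim))   # the "network": one linear layer
lr = 0.5
history = []
for _ in range(500):
    pred = X[popular] @ W
    history.append(mse(pred, als_emb[popular]))
    # Gradient of the MSE loss with respect to W.
    grad = 2.0 * X[popular].T @ (pred - als_emb[popular]) / pred.size
    W -= lr * grad

# Predict embeddings for long tail products from content features alone.
tail_pred = X[~popular] @ W
tail_mse = mse(tail_pred, als_emb[~popular])
```

In this synthetic setup the mapping is linear by construction, so the tail error ends up small; with real data, the quality depends on how much behavioral signal the content features actually carry.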

After several rounds of training, we visualize the ALS latent space again.

The new product space is divided entirely into separate clusters. Products with few customer interactions are now mixed in with products that have a high number of interactions.

Let’s try again to find the nearest neighbors for the same low-popularity product, ID 8757339 (top left). Even though users clicked on the target dress just three times, the results are now more relevant:

Similar products for recommended products now look much better.

Formal evaluation metrics calculated on low-popularity products also show significant improvement: normalized discounted cumulative gain (NDCG@10) rose from 0.021 to 0.127, and average precision rose from 0.014 to 0.092.
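For readers unfamiliar with the metric, NDCG@10 can be computed as in the sketch below. This is one common formulation with a logarithmic position discount; the article does not state which variant was used, so treat the details as assumptions.

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel * discounts))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (sorted) ordering of the same items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of recommended items in ranked order (1 = relevant, 0 = not).
perfect = ndcg_at_k([1, 0, 0])
delayed = ndcg_at_k([0, 1])
```

A relevant item in first position gives a score of 1, while the same item in second position is discounted by the logarithm of its rank.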


By using deep learning to fix the weakness of classic collaborative filtering on low-popularity products, we gain a powerful tool for improving the quality of recommender systems. To improve recommendations for incomplete data, we trained a neural network to approximate ALS embeddings for popular products. Using product characteristics such as image and text, we can then predict ALS embeddings for long tail products.

We can now show high-quality recommendations for the entire product catalog, both popular and less popular products.
