How we implemented 1-to-1 personalization in the catalog and search

There are different ways to organize such ranking. We developed ours in stages: over several years, we moved from heuristics to ML, steadily improving the ranking pipeline.

In this article, we will walk through our approach. We will discuss:

  1. What stages personal catalog ranking consists of.

  2. How the first stage of personalization works: candidate selection.

  3. How the second stage works: online re-ranking of the top of the search results.

Lamoda ranking history

Catalog ranking is one of the largest and most important products for Lamoda.

It has gone through three stages of improvement:

  1. At first, all ranking was based on statistics and heuristic rules. We divided users into segments and, for each segment, precomputed an optimal ordering of products in the catalog based on product popularity.

  2. In 2020, we introduced ML into segment ranking: an ML model now predicts the probability of adding to cart for each product and segment.

  3. In 2023, we introduced per-user ML personalization, which re-ranks the top products in the search results at request time.

The historical path of ranking development in Lamoda

We talked about the first two stages in a previous article. Now we want to share the path we have taken in personalizing the catalog for every user, the difficulties we encountered, and how we plan to develop further.

Two-stage personal ranking

So, we have catalog-wide ranking for large user segments, and it works like this:

How are segments defined in segment ranking?

But how do we move towards personalization?

The main difficulty is that personalizing the entire catalog for each user is a resource-intensive task: Lamoda has more than 600,000 products in stock, and their number keeps growing along with the business. So we began to think about how to limit the list of products to be ranked individually.

Of course, we are not the first to face such a task. The proven approach in the industry, which has already become classic for recommender systems, consists of two stages:

  1. Selection of candidates for ranking from different sources. At this stage, we form a list of several hundred products that are most relevant to the user.

  2. Re-ranking. To order these products in the best possible way, we apply heavy models with a full set of product and user features, and produce the final sorting.

Taking this scheme as a basis, we set out to implement it with the tools we already had.

The first stage is boosting candidate products using ElasticSearch

ElasticSearch is an efficient solution for querying large search indexes. Our catalog service actively uses it as the main search and ranking engine: it stores information on products and their attributes, as well as the product relevance scores calculated for user segments in the first ranking stage.

So we decided to try using this tool to take users' personal preferences into account right away. This lets us combine segment ranking and candidate selection into a single Elastic query.

To implement this idea, we used the function_score functionality in ElasticSearch, which lets you compute an additional relevance score for a product based on its attributes and the user's query. This function can be complex, taking different product attributes into account with different weights. Here is how you can embed this function in a query, using the relevance of a specific user attribute to the product as weights:

{
  "query": {
    "bool": {
      "filter": [
        ... // Simple catalog filters
      ],
      "must": [
        {
          "function_score": {
            "functions": [
              // Personal preferences
              {
                "filter": { "term": { "brand": "Zarina" } },
                "weight": user_weight_brand_Zarina
              },
              {
                "filter": { "term": { "color": "black" } },
                "weight": user_weight_color_black
              }
            ],
            "score_mode": "sum",
            "boost_mode": "sum",
            "query": {
              "function_score": {
                "boost_mode": "replace",
                "functions": [
                  {
                    "script_score": { ... } // Популярность + релевантность из сегментного ранжирования
                  }
                ],
                "query": { ... } // Фильтрующий поисковый запрос
              }
            }
          }
        }
      ]
    }
  }
}

The idea is to add a score reflecting the user's preferences for product attributes on top of the relevance score from the segment ranking. To do this, we use the main product characteristics as the score function's filters, and the user's affinity for these attributes, scaled by their importance, as weights. This way, we add weight to (raise higher in the search results) products whose attributes the user "likes".

Calculating user preferences

For products in different categories, we know more than 30 attributes: what brand the product belongs to, its color, size, whether it has a pattern, and so on. We also know which products each user interacted with: clicked on the product card, added to the cart or favorites, bought. Each type of interaction has its own weight: orders get a slightly higher coefficient, additions to the cart and favorites a smaller one.

Using this data, we evaluate the user's preferences by attribute. For example, how often they added the Lime brand to the cart, bought things in size XS, or how many times they added pink things to their favorites. As the output, we get vectors of user preferences over attribute values (for example, over different brands, colors, etc.), calculated per category.

Influence of facets

We call these user attributes facet scales or simply facets.
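As a rough illustration, here is a minimal Python sketch of how such facet vectors could be aggregated from an interaction log. The event weights and field names are illustrative assumptions, not our production values:

from collections import defaultdict

# Illustrative event weights: orders count a bit more than additions
# to the cart or favorites, and clicks the least (real values are tuned).
EVENT_WEIGHTS = {"click": 0.1, "favorite": 0.5, "add_to_cart": 0.5, "order": 1.0}

def facet_vectors(interactions):
    """Aggregate a user's interactions into preference scores over
    attribute values (brand, color, size, ...), kept per category.

    Each interaction is a dict like:
      {"category": "clothing", "event": "order",
       "attributes": {"brand": "Zarina", "color": "black", "size": "M"}}
    """
    prefs = defaultdict(float)  # (category, attribute, value) -> score
    for it in interactions:
        weight = EVENT_WEIGHTS.get(it["event"], 0.0)
        for attr, value in it["attributes"].items():
            prefs[(it["category"], attr, value)] += weight
    return prefs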

We calculate the weights of attributes and form the final score

Okay, we've calculated preferences for users. How do we now modify the ElasticSearch ranking score to take preferences into account organically along with the relevance from the first ranking stage?

To do this, we need to understand with what weight to take the user's affinity for each attribute into account, because attributes are not equally important. After all, the user's love for a brand may matter more than their preference for sleeve length.

To determine the weights for the different attributes, we train a logistic regression to predict the probability of the user adding a product to the cart, using their affinity for the product's attributes as features. To account for the relevance score from the first ranking stage, we add it as a feature as well. We train separate models for clothing, footwear, accessories, home goods, and cosmetics, so that, for example, the brand weight differs between Abibas boots and an Abibas T-shirt.

Having trained the model, we obtain coefficients for the attributes that reflect their importance. Next, we divide each coefficient by the absolute value of the coefficient of the relevance-score feature from the first ranking stage (sfPos in the picture below) to obtain the final coefficients. This normalization is necessary so that the attribute coefficients combine organically with the relevance score from the first ranking stage at query time. We also discard negative coefficients, simply zeroing them out, so that products only ever receive a positive additional boost.

Normalization of logistic regression coefficients

In this way, we obtain coefficients by which we multiply the personal facet weights in the function score, giving an additional boost to products relevant to the user.
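Here is a minimal scikit-learn sketch of this normalization step, run on synthetic stand-in data; the feature names (including sf_pos for the first-stage relevance score) are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one row per product impression, with the
# user's attribute affinities plus the first-stage relevance score;
# the label is whether the product was added to the cart.
feature_names = ["aff_brand", "aff_size", "aff_color", "sf_pos"]
rng = np.random.default_rng(0)
X = rng.random((1000, len(feature_names)))
y = (rng.random(1000) < 0.3).astype(int)

model = LogisticRegression().fit(X, y)
coefs = dict(zip(feature_names, model.coef_[0]))

# Divide each coefficient by |coef of the first-stage relevance feature|
# so the facet boosts live on the same scale as the segment-ranking
# score, and zero out negatives so the boost is strictly additive.
denom = abs(coefs["sf_pos"])
weights = {name: max(c / denom, 0.0)
           for name, c in coefs.items() if name != "sf_pos"}
print(weights)  # per-attribute weights for the function_score query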

Let's see what function_score would now look like in an ElasticSearch query with the weights obtained in the example above. For the user who submitted the query, we multiply their brand, size, and color preferences by the normalized coefficients from the logistic regression model (w_brand = 0.313, w_size = 0.137, w_color = 0.175).

{
  "function_score": {
    "boost_mode": "sum",
    "score_mode": "sum",
    "functions": [
      { 
        "filter": { "term": { "brand": "Zarina" } }, 
        "weight": 0.313 * 3 
      },
      { 
        "filter": { "term": { "brand": "Mango" } }, 
        "weight": 0.313 * 2.5 
      },
      { 
        "filter": { "term": { "brand": "Abibas" } }, 
        "weight": 0.313 * 0.1 
      },
      { 
        "filter": { "term": { "size": "L" } }, 
        "weight": 0.137 * 1.5 
      },
      { 
        "filter": { "term": { "size": "M" } }, 
        "weight": 0.137 * 3 
      },
      { 
        "filter": { "term": { "color": "black" } }, 
        "weight": 0.175 * 5 
      },
      { 
        "filter": { "term": { "color": "pink" } }, 
        "weight": 0.175 * 0.1 
      }
    ]
  }
}

User preferences for product attributes are recalculated daily. During a cold start, only segment ranking is used.

Under the hood, Elastic sums the resulting values for all attributes with the segment-ranking score, and this gives us a personal boost for products.
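To make the summation concrete: with the weights above, a black Zarina item in size M would receive a personal boost of 0.313 * 3 + 0.137 * 3 + 0.175 * 5 = 2.225 on top of its segment-ranking score, while a pink Abibas item in size L would get only 0.313 * 0.1 + 0.137 * 1.5 + 0.175 * 0.1 ≈ 0.25 and end up lower in the results.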

We test the result

After adding the personal candidate boost, the results for users changed dramatically. Let's look at the example of a girl who prefers certain brands and colors.

The first picture shows the segment ranking output; the second shows segment ranking with facet weights. The facet weights and the output show that the girl prefers black clothes, long skirts and dresses, and the Zarina brand. Without facets, most of these items would not even make it into the top 500.

We really liked the result! We immediately rolled it into A/B testing.

However, we did not achieve the desired results: behavioral click metrics grew, but the target metrics showed no significant uplift, with no increase in purchases and orders. We were a little upset, but our main hopes rested on the second stage, whose test was still ahead.

The second stage is re-ranking

The first idea for implementing a second-level model for re-ranking was to connect the LTR plugin to ElasticSearch: among other things, it can use gradient boosting. That way, all of our ranking would live inside Elastic: first, segment ranking and boosting of personal candidates from the first stage, then re-ranking of a small top. However, this solution has two disadvantages. First, using it properly would require working with a Java stack that is not typical for us. Second, we would always be limited by the functionality implemented in the plugin.

That's why we decided to go the route of creating a separate service in Golang – a fast, compiled language in which we have experience and development resources.

Its task is to run a regularly retrained model, at the moment the user requests a catalog page, on features collected from dedicated databases. And since the service is independent, we are not limited in how we build features or run the model.

Data, features, model training

To train the model, we use a dataset of user sessions corresponding to product views in a specific catalog category or search. To each view, we attribute whether the user added the product to the cart, so within the resulting group of interactions, negative examples are product views without the target event and positive ones are views with it.

To train the model efficiently for all types of devices, taking into account uneven traffic, we sample a fixed number of groups from one day of logs for each platform (mobile site, desktop, Android, iOS).

As features, we use product attributes from the segment ranking model, and also add the user's faceted attributes and counters of cart additions, views, and orders over time windows.

On our data, the best quality was shown by a CatBoost model with the YetiRank loss. Interestingly, according to our experiments, negative sampling of only 20% of impressions without a target action does not worsen model quality, but allows using more days of logs, which improves quality. Thanks to this, we now retrain daily on the last 17 days of logs.

Our main metric for offline evaluation is NDCG@60, but we also monitor a number of additional quality metrics: diversity of the search results, prices in the top, etc.
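For reference, here is a compact sketch of what this training step could look like with the CatBoost Python API; the dataset path, column names, and hyperparameters are assumptions for illustration:

import pandas as pd
from catboost import CatBoostRanker, Pool

# One row per impression: a session/query group id, product and user
# features, and a binary add-to-cart label (column names are illustrative).
df = pd.read_parquet("train_17_days.parquet")  # hypothetical dataset

# Keep all positives but only a 20% sample of impressions without the
# target action; per our experiments this does not hurt quality and
# lets us fit more days of logs into the same training budget.
pos = df[df["added_to_cart"] == 1]
neg = df[df["added_to_cart"] == 0].sample(frac=0.2, random_state=42)
train = pd.concat([pos, neg]).sort_values("group_id")  # groups must be contiguous

feature_cols = [c for c in train.columns if c not in ("group_id", "added_to_cart")]
pool = Pool(train[feature_cols],
            label=train["added_to_cart"],
            group_id=train["group_id"])

model = CatBoostRanker(loss_function="YetiRank",
                       eval_metric="NDCG:top=60",  # our main offline metric
                       iterations=1000)
model.fit(pool)
model.save_model("ranker.cbm")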

The infrastructure of the resulting solution looks like this. The regular offline parts of the pipeline (feature collection, train dataset construction, and model training) run daily on our Hadoop cluster using Spark. After that, the model is uploaded to S3, its metadata to etcd, and the updated user and product features to Aerospike, a key-value database with fast online access. From these sources, the Go service loads the model and collects features for a specific request.

Model work pipeline

Let's now summarize and run through the entire personal ranking pipeline in production. A user opens a catalog category or runs a search; at this moment, a request is generated in the Service Catalog on the backend. This service enriches the request with the user's context and sends it to Elastic, whose task is to filter the products matching the request and perform the faceted ranking.

After that, the top 300 candidates by final Elastic score are sent to the Service Ranking re-ranking service, which pulls up all the features needed to apply the model for these products and the user. Finally, the CatBoost model is run, and the final ranking is returned to the Service Catalog, which in turn serves it to the user.

And some technical details: for Service Ranking we aim to keep an SLA of 250 ms, with the 50th percentile of response time at 50 ms. Response time is critical for the quality of the online pipeline, as it directly affects business metrics.
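The production service is written in Go, but the core re-ranking logic is small; below is a minimal Python sketch of it. The fetch_features helper is a hypothetical stub: in production, user and product features come from Aerospike and the model artifact from S3:

from catboost import CatBoostRanker

# Model artifact is pulled from S3; its version metadata lives in etcd.
model = CatBoostRanker()
model.load_model("ranker.cbm")

def fetch_features(user_id: str, product_id: str) -> list[float]:
    # Hypothetical stub: in production, precomputed user and product
    # features are read from Aerospike by key.
    raise NotImplementedError

def rerank(user_id: str, candidate_ids: list[str]) -> list[str]:
    """Re-rank the top-300 Elastic candidates for a single request."""
    rows = [fetch_features(user_id, pid) for pid in candidate_ids]
    scores = model.predict(rows)
    # Sort candidates by model score, best first, and return their ids.
    ranked = sorted(zip(candidate_ids, scores), key=lambda p: p[1], reverse=True)
    return [pid for pid, _ in ranked]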

Importance of Features

SHAP importance

Let's now dig into the trained model and study the resulting feature importances. The top is made up of the user's preferences for brands, sizes, and colors; after them, features of product conversion, popularity, price, rating, etc. are distributed fairly evenly.

Launch history

The first experiment was an A/B test that added the re-ranking service for 150 candidates. The result was very successful: the entire funnel showed significant uplift. That is, users began to click, add to cart, and purchase more often. Both users and the business were happy with the result. According to the metrics, we got +2% NMV and +1.7% conversion to purchase.

The next successful launch was an A/B test in which we extended the re-ranking service to search queries and increased the number of candidates to 300. We also started retraining the model daily. In total, these improvements gave us an additional +1.3% NMV.

Results and Conclusions

Let's now return to that girl, the lover of black, long skirts and Zarina. Let's see how her results changed when we added a second-level model:

The style of the results is preserved, and there is even more black. There is slightly less of the Zarina brand and slightly more loose pants. Visually, this is not as noticeable as when using only facets at the first stage. But the business metrics increased significantly!

Let's sum it up:

  • Ranking personalization works: it lifts both behavioral and business metrics.

  • Selecting candidates through ElasticSearch gives enough flexibility and filtering out of the box.

  • A separate Golang service with a second-level CatBoost model: we recommend it! It lets you flexibly manage the candidates fed to the model, easily improve and test the model offline, and do fast inference online.

What's next?

Of course, we have many plans for further development; here is a short list:

  1. Learn how to add candidates to the top from other sources/models

Currently, the only source of candidates for the re-ranking model is segment ranking with faceted personalization. We want to expand the number of candidate sources for the second-level model to ensure greater diversity and relevance of products at the top.

  2. Increase the number of candidates

At the second stage, we can now re-rank only 300 candidates. Of course, we want more, so we need to scale up the current service.

  3. Eliminate the time lag

In the current implementation, personalization runs with a lag of one day. We cannot yet pick up user actions online, but we are moving in that direction by adding hot user features to the model.

  4. Eliminate data drift between training and model application

Currently, features for training the model are collected from historical clickstream data, while the model is applied on features that are uploaded daily to the database or taken from production systems (for example, stock availability). So we are working on getting rid of the shift between features at inference and at training time, which inevitably creeps in one way or another. We want to be sure that the features used to build the train dataset are exactly the same as those at the moment the model is applied. To do this, we plan to log features at model inference time and retrain the model on them.
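A minimal sketch of what such inference-time feature logging could look like; the log format here is an assumption:

import json
import time

def log_inference_features(user_id: str, product_id: str,
                           features: list[float], log_file) -> None:
    # Record the exact feature vector the model saw at inference time,
    # so the next training dataset can be assembled from these rows
    # instead of being re-derived from historical data.
    log_file.write(json.dumps({
        "ts": time.time(),
        "user_id": user_id,
        "product_id": product_id,
        "features": features,
    }) + "\n")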

As you can see, there are still many interesting tasks ahead, and we will definitely share the results here and in the Lamoda Tech Telegram channel. If you like what we do and want to be part of new solutions, now is a good time to join our ranking and search team. Contact us via PM: Sergey, Dana.

Thanks for reading, stay tuned!
