Practical aspects of ranking responses from the virtual assistant Salyut

Hi all! My name is Anya Maksimova, and I do NLP in the Neural Networks team of the Interlocutor product. On April 5 the Giga R&D Day conference was held, where my colleague Artem Snegirev talked about practical aspects of ranking responses from the virtual assistant Salyut.

In this article we describe in more detail how we rank answers, using the example of the Interlocutor, which is part of the Salyut assistants.

The assistant has three voices – Sber, Athena and Joy. The Interlocutor is responsible for chatting on various topics, answering factual questions and providing entertainment content. As a rule, the assistant answers with generative models, but there are scenarios where prepared replies are used, and there are quite a lot of them, so we use search – the classic retrieval-based approach.


What is an interlocutor?

How does our search work?

With the help of a context embedding we search our database, which holds about a million different replies, each with metadata that characterizes it. Using a faiss index we retrieve the 128 nearest candidates; after filtering, 32 remain, which we feed into the cross-encoder to get a relevance score. Next, we collect features – the SCU model score, the cross-encoder score, dialogue features and the meta features of each reply – and send them to the LightGBM ranker for the final ranking.
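To make this more concrete, below is a minimal sketch of the retrieve-filter-rerank loop described above. The helper functions (embed_context, cross_encoder_score, passes_filters, reply_features) are placeholders standing in for our production models and feature extractors; only the overall flow and the 128/32 candidate counts follow the text.

```python
import faiss
import numpy as np
import lightgbm as lgb

EMB_DIM = 312  # illustrative embedding size

# Placeholders standing in for the production models and feature extractors.
def embed_context(context: str) -> np.ndarray:
    return np.random.rand(EMB_DIM).astype("float32")   # SCU-style context embedding

def cross_encoder_score(context: str, reply: str) -> float:
    return float(np.random.rand())                      # cross-encoder relevance

def passes_filters(meta: dict) -> bool:
    return True                                         # metadata-based filtering

def reply_features(context: str, meta: dict) -> list:
    return [len(context), len(meta)]                    # dialogue + meta features

def rank_replies(context, index, replies, metas, booster: lgb.Booster,
                 n_retrieve: int = 128, n_keep: int = 32):
    # 1) retrieve the nearest candidates from the faiss index
    query = embed_context(context).reshape(1, -1)
    _, ids = index.search(query, n_retrieve)

    # 2) filter the candidates by their metadata, keep at most n_keep
    candidates = [i for i in ids[0] if i != -1 and passes_filters(metas[i])][:n_keep]

    # 3) score every (context, reply) pair with the cross-encoder
    ce_scores = [cross_encoder_score(context, replies[i]) for i in candidates]

    # 4) assemble features and let the LightGBM ranker produce the final order
    feats = np.array([[ce_scores[k]] + reply_features(context, metas[i])
                      for k, i in enumerate(candidates)])
    order = np.argsort(-booster.predict(feats))
    return [replies[candidates[k]] for k in order]
```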


Note that we do not use the cross-encoder directly for ranking; instead, its score becomes one of the features for the ranker, which has a positive effect on quality.

Using the QR code from the picture you can learn more about our search!

How does a classic cross-encoder work?

The classic cross-encoder is implemented similarly to the next-sentence-prediction task in BERT pre-training. We concatenate the dialogue turns with special tokens and feed them into the model, where the tokens interact with each other through the attention mechanism. From the last layer we take the embedding of the first token and pass it through a linear layer and a sigmoid to get a relevance score.


Classic cross-encoder
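For illustration, such a cross-encoder can be written in a few lines with PyTorch and Hugging Face transformers; the backbone name below is just an example, not our production model.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # illustrative backbone

class CrossEncoder(torch.nn.Module):
    def __init__(self, name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                   # first-token embedding
        return torch.sigmoid(self.head(cls)).squeeze(-1)    # relevance score in [0, 1]

# The tokenizer joins context and reply with special tokens ([CLS] ... [SEP] ... [SEP]).
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
batch = tokenizer(["hi, how are you?"], ["great, and you?"],
                  padding=True, truncation=True, return_tensors="pt")
score = CrossEncoder()(batch["input_ids"], batch["attention_mask"])
```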

We ran a series of experiments with the cross-encoder that improved the ranking quality; in this article we go over their results.

The more examples per context, the better the quality.

The classic implementation usually relies on binary classification, i.e. it splits replies into good/bad, which is not well suited to the ranking task. Moreover, good answers themselves differ in quality, which would require rating each example on a scale from 0 to 1. For the final ranking quality it is usually better to use a large number of negatives, and this leads to class imbalance. We reduced the problem to choosing the best answer out of N proposed ones, one of which is good. The model learns a relative relevance score instead of an absolute one, and the value of the loss function does not depend on the class balance. Now we do not have to split the answers into two groups; instead, we get groups in which one answer is better than the others.

So we took an idea from contrastive learning and replaced binary classification with InfoNCE, also known as MultipleNegativesRankingLoss, SimCSE loss and in-batch negatives loss.

InfoNCE loss: s – the model's score, c – context, r – response, t – temperature

-\frac{1}{n} \sum_i \log \frac{e^{s_\theta(c_i, r_i)/t}}{e^{s_\theta(c_i, r_i)/t} + \sum_j e^{s_\theta(c_i, r_{ij}^{-})/t}}

We have one positive example and several negatives, and we no longer require some absolute score for the positive example; we only want it to be at least a little higher than the scores of the negatives. How much higher is controlled by the temperature t, which can be either a constant or a parameter learned during training.
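In code the loss boils down to a cross-entropy over the score matrix in which column 0 holds the positive reply; a minimal sketch (the temperature value is illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce(scores: torch.Tensor, t: float = 0.05) -> torch.Tensor:
    """InfoNCE for response ranking.

    scores: [n, 1 + k] matrix of model scores s(c_i, r); column 0 is the
            positive reply, columns 1..k are the negatives.
    t:      temperature (a constant here; it can also be a learnable parameter).
    """
    targets = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    # cross_entropy(logits, 0) == -log softmax(logits)[0], i.e. the formula above
    return F.cross_entropy(scores / t, targets)

# toy usage: 4 contexts, 1 positive + 7 negatives each
loss = info_nce(torch.randn(4, 8))
```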

As negative examples we use hard, random and semi-hard negatives.

Random negatives help the model pick up the most obvious patterns, while semi-hard negatives gave us a good boost in metrics.

To obtain semi-hard negatives we use the NLU and SCU models: for each dialogue set we build a database of replies, pass the unique responses through the search to get the top candidates, and then sample from them by rank or by similarity. This ultimately increases the variety of examples for each context.


Mining negative examples: sample k elements by rank in [r1, r2] or by similarity in [s1, s2].
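A rough sketch of this mining step is below; the rank and similarity ranges are purely illustrative, and the similarity filter assumes an inner-product faiss index.

```python
import numpy as np
import faiss

def mine_semi_hard(query_emb: np.ndarray, index: faiss.Index, k: int,
                   rank_range=(8, 64), sim_range=None, top=128, rng=None):
    """Sample k semi-hard negatives for one context.

    Candidates are taken either by rank in [r1, r2] or, if sim_range is given,
    by similarity (inner-product score) in [s1, s2].
    """
    rng = rng or np.random.default_rng()
    sims, ids = index.search(query_emb.reshape(1, -1), top)
    sims, ids = sims[0], ids[0]

    if sim_range is not None:
        s1, s2 = sim_range
        pool = ids[(sims >= s1) & (sims <= s2)]
    else:
        r1, r2 = rank_range
        pool = ids[r1:r2]

    return rng.choice(pool, size=min(k, len(pool)), replace=False)
```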

Where to get the data?

Previously we translated datasets from English into Russian, because our model only knew Russian. But some datasets are very large, so we decided to train the model on English directly: we extended the ruRoberta-large tokenizer with English tokens, continued pre-training on a small amount of Russian and English text, and aligned the language pair. The alignment was done as follows: we fed in translation pairs, obtained their embeddings and pulled them together with an MSE loss, i.e. we taught the model to produce the same embedding for a sentence and its translation. The same procedure can be seen in the LaBSE training pipeline.

We teach the model to make embeddings in one space

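A minimal sketch of this alignment step, assuming mean pooling over the last hidden states (the pooling choice is an assumption made for the example, not necessarily what our pipeline uses):

```python
import torch
import torch.nn.functional as F

def alignment_loss(model, tokenizer, ru_texts, en_texts):
    """Pull the embeddings of a sentence and its translation together with MSE."""
    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state            # [b, seq, dim]
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling

    # same embedding space for both languages -> minimize the distance
    return F.mse_loss(embed(ru_texts), embed(en_texts))
```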

To test our extended bilingual model, we took the Blended Skill Talk (BST) dataset and its Russian translation, as well as 100 thousand examples from the English synthetic dialogue set SODA. Note that when training on BST, the values of the recall@1/8 metric for the English and translated versions of BST are almost the same. The small difference in the metric is the combined error of the translation and of how well the language pair was aligned in the model.

| train  | test   | recall@1/8 |
|--------|--------|------------|
| bst_ru | bst_ru | 0.414      |
| bst_en | bst_en | 0.422      |

We wanted to simulate the following scenario: take the main data in Russian and add some English data. As a result, on the target task (bst_ru) we gained 6 points.

| train       | test   | recall@1/8 |
|-------------|--------|------------|
| bst_ru      | bst_ru | 0.414      |
| soda+bst_ru | bst_ru | 0.478      |
| soda+bst_en | bst_en | 0.486      |
| soda+bst_en | bst_ru | 0.451      |

Also note rows 2 and 3 of the table: the difference between training on English data and on mixed data is only about 1 point, which shows that the English set does not “pull” the model toward itself when added to the Russian data.

In our case, knowledge transfer between languages works great!

How do we label data from production?

To label production data and obtain hard negatives, we check the following criteria:

  1. Safety – we do not talk about certain topics and follow the principle of “do no harm” with respect to ourselves, the user and others.

  2. Reliability – we verify established facts for which a reliable source can be found.

  3. Relevance – the main thing is to understand the user’s intent and the meaning of the dialogue and to answer “on topic”, regardless of the other criteria.

  4. Logic – we check whether the answer contradicts the overall context of the dialogue or general knowledge about the world.

Try to guess which criteria the examples in the picture do not fulfill!


If all criteria are met, then we consider the answer to be good, otherwise – bad.

Calculation Caching

The idea of caching keys and values during attention computation is already familiar.

KV caching means avoiding the re-computation of the key and value matrices of past tokens at every generation step by storing (“caching”) these tensors in GPU memory as they are computed during generation.

KV Caching in encoder-decoder models.

These models generate tokens one at a time and can be computationally expensive. To address this, the keys and values of previous tokens are cached so that they do not have to be recomputed for every new token, which shrinks the matrices being multiplied and speeds up attention. The trade-off is that storing keys and values requires extra memory. The full key and value matrices must be assembled before attention can be computed, so these two tensors are cached and reused at every generation step.
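Schematically, a single attention step with a KV cache looks like this (a toy sketch with explicit tensors, not tied to any particular library):

```python
import torch

def attention_step_with_cache(q_new, k_new, v_new, cache=None):
    """One attention step for a newly generated token.

    q_new, k_new, v_new: [batch, heads, 1, dim] projections of the new token.
    cache: dict with the previously computed "k" and "v" tensors, or None.
    """
    if cache is not None:
        # Reuse cached keys/values instead of recomputing them for past tokens.
        k = torch.cat([cache["k"], k_new], dim=2)
        v = torch.cat([cache["v"], v_new], dim=2)
    else:
        k, v = k_new, v_new

    attn = torch.softmax(q_new @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    out = attn @ v
    return out, {"k": k, "v": v}  # the updated cache is reused at the next step
```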


There are other similar approaches: cross-attention in the transformer decoder, where keys and values come from the encoder and queries from the decoder, and chunked cross-attention in RETRO models, which computes queries for the input tokens and keys and values for the retrieved neighbors.

Why not reuse this approach for our task?

Our approach:

In a typical cross-encoder implementation we feed the model a sequence of n + m tokens, where n is the length of the context and m is the length of the response. As the number of candidate answers grows, this becomes expensive: for k candidates we have to feed the model a sequence of n + m tokens k times. Given that the context does not change, why not feed it separately and only once?

So we settled on the following approach: first, only the context sequence is fed into the network and the key/value matrices from each layer are saved; then the k candidate sequences are fed together with the pre-computed key/value matrices, which are concatenated with the candidates' key/value matrices in self-attention at every layer. Thus we compute the representations of the context tokens once and reuse them to enrich the representations of the candidate tokens with respect to the context. The key/value concatenation is similar to KV caching in LLaMA and GPT.
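A sketch of such a self-attention layer, where candidate tokens additionally attend to the context keys/values saved during the single context pass (output projection and masking are omitted for brevity; this illustrates the idea rather than reproduces our production code):

```python
import torch

class CachedContextSelfAttention(torch.nn.Module):
    """Self-attention over candidate tokens that also attends to cached context K/V."""

    def __init__(self, hidden: int, heads: int):
        super().__init__()
        self.heads, self.dim = heads, hidden // heads
        self.q = torch.nn.Linear(hidden, hidden)
        self.k = torch.nn.Linear(hidden, hidden)
        self.v = torch.nn.Linear(hidden, hidden)

    def split(self, x):
        b, s, _ = x.shape
        return x.view(b, s, self.heads, self.dim).transpose(1, 2)  # [b, h, s, d]

    def forward(self, cand_hidden, context_k=None, context_v=None):
        # Queries come only from the candidate tokens.
        q = self.split(self.q(cand_hidden))
        k = self.split(self.k(cand_hidden))
        v = self.split(self.v(cand_hidden))
        if context_k is not None:
            # Prepend the context keys/values precomputed in the single context pass.
            k = torch.cat([context_k, k], dim=2)
            v = torch.cat([context_v, v], dim=2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dim ** 0.5, dim=-1)
        out = attn @ v                                   # [b, h, m, d]
        b, h, m, d = out.shape
        return out.transpose(1, 2).reshape(b, m, h * d)  # back to [b, m, hidden]
```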


This idea was expected to speed up the model and reduce memory usage, and that is exactly what happened, while the quality metrics did not drop.

The measurements used bert-large in fp32, the response length was fixed at 32 tokens (included in the sequence length), and the calculations were run on A100 cards.

Memory reduction (n times) depending on sequence length and number of candidates

Reduced latency (n times) depending on the length of the sequence and the number of candidates.

This allowed us to increase the number of candidates from 32 to 64 and the sequence length from 128 to 256 tokens without losing throughput, while reducing latency by a factor of 3.

Now we can increase the number of candidates, which improves quality, increase the context length to diversify the replies, or take a larger model.

Note that after the first pass of the model over the context we already have the embeddings of the context tokens on the last layer; by pooling them we can get an embedding of the whole context, which is exactly what the SCU model did. We can therefore remove the SCU model from the pipeline and simply use this context embedding for search, as in bi-encoders. The result is a model that does both database search and ranking.

We can move to the following architecture: the left tower encodes the response, the middle tower encodes the context, and the right tower does the ranking. The result is a retrieval system consisting of a single model that can both search the index quickly and rank the answers.
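Schematically, a single context pass in such a unified model could serve both purposes at once; the encoder interface below (the return_kv flag) is a hypothetical name used purely for illustration.

```python
import torch

def context_pass(encoder, context_ids, attention_mask):
    """One pass over the context that serves both search and ranking.

    `encoder` is assumed to be a transformer that can return its per-layer
    key/value tensors alongside the last hidden states (a hypothetical
    interface, named here only for illustration).
    """
    hidden, kv_cache = encoder(context_ids, attention_mask, return_kv=True)

    # Pooled context embedding -> used for the faiss index search (bi-encoder style).
    mask = attention_mask.unsqueeze(-1).float()
    context_emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    # Per-layer keys/values -> reused to rank candidate replies, as described above.
    return context_emb, kv_cache
```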


Summing up

Contrastive learning increased the metrics by 20%, and we were able to lower the requirements for labeling quality, because the new loss and semi-hard negatives added more variety to the negatives.

We can use datasets more effectively, without manually labeling negative examples.

We can add knowledge through a bilingual model and data in English.

We found a way to rank ≈7 times faster with ≈1.6 times less memory, which will be useful for increasing the number of tokens in the context.


That's all from the Neural Networks team of the Interlocutor product. Try our solutions! Our models: sbert_large_nlu_ru, ruElectra-medium (as well as small and large), sbert_large_mt_nlu_ru. You can try the Interlocutor on all our smart devices and in the Salute app!

See you in the next posts.

Thanks to Sasha Abramov aka @Andriljo for his help in preparing the material, and also to the authors Anya Maksimova @anpalmaki and Artem Snegirev @artemsnegirev.
