ruMTEB: a new benchmark for Russian-language embedders

At SberDevices, our teams are engaged both in creating benchmarks and in training models for vector representations of texts, or embedders. In this article, we will tell you about our new Russian-language benchmark for text embedders: ruMTEB.

Today there is a lot of news about large language models (LLMs). Every day we hear that generative models achieve one outstanding result after another in working with text and are gradually moving into a multimodal format. We take an active part in this area as well: for example, in one of our previous posts we wrote about MERA, a benchmark for evaluating foundation models that allows their capabilities to be explored across a wide range of domains and tasks.

However, in this post we would like to write about another, no less important direction in NLP which, while staying a little aside from the main news stream, has been and remains extremely relevant. We are talking about vector representations of text, also known as text embeddings, and about embedders, that is, models that construct such vector representations. Embeddings are used today to solve many text and multimodal tasks: information retrieval, ranking, semantic search, RAG, assessing the quality of paraphrases, and many others.

Today, text embeddings are most often built with transformer architectures. The embedders we train at SberDevices can be found in our repositories on Hugging Face.

As soon as a new class of models and a new class of objects appears (in our case, text embeddings and embedders), the question arises: how good are the embeddings, and how well does a particular embedder cope with a particular task? This is impossible without an evaluation procedure and a set of tasks on which models can be run and their quality assessed. While there are already a number of well-known benchmarks for English, headed by MTEB, for Russian until recently the only test suite was the Encodechka benchmark, which appeared several years ago and is still actively used. However, it has significantly fewer tasks (10 versus 56 in MTEB) and no tasks for assessing the retrieval capabilities of a model.

ruMTEB tasks

My colleagues and I decided to correct this omission in the field of benchmarking text embeddings and compiled a Russian-language set of 17 tasks, which formed the basis of ruMTEB and which we will discuss below.

In total, ruMTEB includes 23 tasks: 17 datasets mentioned above and 6 multilingual sets from the original MTEB (MassiveIntentClassification, MassiveScenarioClassification, MIRACLReranking, MIRACLRetrieval, STS22, RUParaphraserSTS), from which we took the Russian part. In this article, we will talk specifically about the new 17 tasks for evaluating embedders and show how you can evaluate your models on them.

All tasks were based on existing datasets that have been tested by the scientific community and have proven themselves well.

We filtered the datasets, removed duplicates, converted them to the format of the original MTEB benchmark, and published them under an open license in our ai-forever repository. The code for running the models has been integrated into the original mteb repository (an example of how to evaluate your models on ruMTEB can be found in the section “Evaluating your model on ruMTEB”).

Below you can find a list of tasks along with their titles on HuggingFace and the source of the data that formed the basis of the datasets.

All tasks can be divided into 7 categories: Classification (7 tasks), Pair Classification (1 task), Multi-Label Classification (2 tasks), Clustering (3 tasks), Semantic Textual Similarity (STS) (1 task), Retrieval (2 tasks) and Reranking (1 task). Below we briefly describe what the tasks of each category look like:

  • Classification – datasets for the task of classifying sentences or short texts. Each example contains a text fragment for which a label must be predicted. This group includes both binary classification tasks and multi-class tasks with 3 or more classes. For all datasets, we performed class balancing, equalizing the share of each label in both the train and test splits.

  • Pair Classification – datasets for the task of classifying pairs of sentences or short texts. Each example contains a pair of text fragments for which a label must be predicted. All tasks in this group are binary classification tasks.

  • Multi-Label Classification – datasets for the task of classifying sentences or short texts with several classes predicted per example. Each example is a text fragment for which a set of labels must be predicted.

  • Clustering – datasets for the task of text clustering, where it is necessary to divide texts into non-overlapping clusters.

  • Semantic Textual Similarity (STS) – datasets for the task of determining the semantic similarity between texts. Each example contains a pair of texts whose semantic similarity must be predicted as a numerical score.

  • Reranking – datasets for the reranking task. For each query in the dataset, a set of relevant and irrelevant texts is given. The task is to reorder the texts for a specific query in decreasing order of relevance.

  • Retrieval – datasets for the task of information retrieval. Two sets are given: a set of queries and a set of texts. For each query, it is necessary to find the most relevant documents in the shared pool and return them as a list sorted in descending order of relevance.

We evaluate models and text embeddings on ruMTEB in the MTEB benchmark format. We integrated the evaluation code for our sets into the original framework, so now it is enough to install the mteb package and run it on the required tasks (an example of running the code can be found below in the section “Evaluating your model on ruMTEB”).

Evaluating your model on ruMTEB

First of all, you need to install the mteb framework using the command:

pip install mteb

Then all you have to do is run your model on the ruMTEB task list. Below is an example for the sbert_large_mt_nlu_ru model.

import mteb

# Load a model from the hub (for a custom model implementation see
# https://github.com/embeddings-benchmark/mteb/blob/main/docs/reproducible_workflow.md)
model_name = "ai-forever/sbert_large_mt_nlu_ru"
model = mteb.get_model(model_name)

# The 17 new ruMTEB tasks
names = [
    'GeoreviewClassification', 'GeoreviewClusteringP2P', 'HeadlineClassification',
    'InappropriatenessClassification', 'KinopoiskClassification', 'RiaNewsRetrieval',
    'RuBQRetrieval', 'RuReviewsClassification', 'RuSciBenchGRNTIClassification',
    'RuSciBenchGRNTIClusteringP2P', 'RuSciBenchOECDClassification',
    'RuSciBenchOECDClusteringP2P', 'RuSTSBenchmarkSTS', 'TERRa',
    'RuBQReranking', 'CEDRClassification', 'SensitiveTopicsClassification',
]

tasks = mteb.get_tasks(languages=['rus'], tasks=names)
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")

As a result of running the model, a results folder will be generated, in which the results for each task are written to a separate JSON file.

Note: if you want to run the model on the entire ruMTEB, you need to add 6 more tasks from the original MTEB to the task list: MassiveIntentClassification, MassiveScenarioClassification, MIRACLReranking, MIRACLRetrieval, STS22, RUParaphraserSTS. A sketch of such a run is shown below.
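As a rough sketch (assuming these six identifiers match the task names registered in mteb, and reusing the model and names variables from the example above), the full-benchmark run could look like this:

full_names = names + [
    'MassiveIntentClassification', 'MassiveScenarioClassification',
    'MIRACLReranking', 'MIRACLRetrieval', 'STS22', 'RUParaphraserSTS',
]
# languages=['rus'] restricts the multilingual tasks to their Russian parts
full_tasks = mteb.get_tasks(languages=['rus'], tasks=full_names)
mteb.MTEB(tasks=full_tasks).run(model, output_folder="results_full")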

Evaluation Methodology

Let us briefly describe how, when running the code in this way, a standard evaluation of embeddings occurs for each type of task.

Classification

First of all, the train split is downsampled: a subset containing n (8–16) examples per label is selected and used as the train set. Next, the embedder is used to obtain embeddings for the reduced train set and for the test set. A logistic regression (no more than 100 iterations) is trained on the train embeddings and used to predict labels for the test examples, and the result is evaluated with Accuracy.
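For illustration, here is a minimal sketch of this logic with scikit-learn. The arrays are placeholders for embeddings produced by your embedder, and downsample_per_label is our own illustrative helper, not part of mteb:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def downsample_per_label(X, y, n_per_label=8):
    # keep at most n_per_label examples for every class label
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=min(n_per_label, int((y == c).sum())), replace=False)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

# placeholder embeddings: 200 train / 100 test texts, 312-dim vectors, 3 classes
X_train, y_train = rng.normal(size=(200, 312)), rng.integers(0, 3, 200)
X_test, y_test = rng.normal(size=(100, 312)), rng.integers(0, 3, 100)

X_small, y_small = downsample_per_label(X_train, y_train, n_per_label=8)
clf = LogisticRegression(max_iter=100).fit(X_small, y_small)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))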

Pair Classification

For pair classification tasks, we construct text embeddings for each of the two texts in a pair and compute their cosine similarity, for which the best binary cutoff threshold is then selected (it is no coincidence that the tasks in this category are binary classification tasks). Average Precision is used as the metric.
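A minimal sketch of this computation on placeholder embeddings (the exact threshold-selection details in mteb may differ):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

emb_a = rng.normal(size=(100, 312))   # embeddings of the first texts in the pairs
emb_b = rng.normal(size=(100, 312))   # embeddings of the second texts
labels = rng.integers(0, 2, 100)      # gold binary labels

# cosine similarity of each pair
sims = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
)

# Average Precision uses the similarities as scores directly
print("Average Precision:", average_precision_score(labels, sims))

# the best cutoff matters for threshold-based metrics such as accuracy
best_acc = max(((sims >= t) == labels).mean() for t in np.sort(sims))
print("Best accuracy over thresholds:", best_acc)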

Multi-Label Classification

In multi-label classification, the embedder is used to obtain vectors for the train and test sets. To evaluate the resulting embeddings, the bootstrap technique is used: the evaluation is repeated 10 times on different bootstrap subsamples. In each of the 10 experiments, a new training subsample containing 8 examples per label is drawn and used to train a kNN classifier with 5 neighbors. The result is evaluated using Accuracy averaged over all experiments.
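A minimal sketch of this procedure on placeholder data; note that we use scikit-learn's exact-match accuracy here, while the accuracy variant computed by mteb for multi-label tasks may be defined differently:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

n_labels, dim = 5, 312
X_train = rng.normal(size=(500, dim))                # placeholder train embeddings
Y_train = rng.integers(0, 2, size=(500, n_labels))   # multi-hot label matrix
X_test = rng.normal(size=(200, dim))
Y_test = rng.integers(0, 2, size=(200, n_labels))

scores = []
for _ in range(10):                                   # 10 bootstrap experiments
    # sample up to 8 positive examples for every label
    idx = np.unique(np.concatenate([
        rng.choice(np.where(Y_train[:, l] == 1)[0],
                   size=min(8, int((Y_train[:, l] == 1).sum())), replace=False)
        for l in range(n_labels)
    ]))
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train[idx], Y_train[idx])
    scores.append(accuracy_score(Y_test, knn.predict(X_test)))

print("Accuracy (mean over 10 runs):", np.mean(scores))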

Clustering

In clustering tasks, to speed up evaluation on large datasets, a subset of no more than 2048 examples is taken, and text embeddings are built for it with the embedding model. The evaluation itself, as in multi-label classification, uses the bootstrap technique: the procedure is repeated 10 times, and each time a training subsample of size N is drawn and used to fit the k-means algorithm, where the hyperparameter k equals the number of classes. The final metric is the V-measure averaged over all runs.
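A minimal sketch on placeholder embeddings; the subsample size of 512 is an illustrative stand-in for N:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)

X = rng.normal(size=(5000, 312))   # placeholder embeddings
y = rng.integers(0, 10, 5000)      # placeholder gold cluster labels

# restrict to at most 2048 examples for speed
keep = rng.choice(len(X), size=min(2048, len(X)), replace=False)
X, y = X[keep], y[keep]

scores = []
for _ in range(10):                # bootstrap: 10 repeated evaluations
    idx = rng.choice(len(X), size=512, replace=False)
    k = len(np.unique(y))          # k equals the number of classes
    preds = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X[idx])
    scores.append(v_measure_score(y[idx], preds))

print("V-measure (mean over 10 runs):", np.mean(scores))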

Semantic Textual Similarity (STS)

For tasks in this category, we construct embeddings for each text in a pair and then evaluate their similarity using cosine similarity. The final metric is the Pearson correlation between the cosine similarity values and the gold similarity scores.
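A minimal sketch on placeholder embeddings and gold scores:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

emb_a = rng.normal(size=(100, 312))   # embeddings of the first texts
emb_b = rng.normal(size=(100, 312))   # embeddings of the second texts
gold = rng.uniform(0, 5, 100)         # gold similarity scores

sims = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
)
print("Pearson correlation:", pearsonr(sims, gold)[0])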

Reranking

For the reranking task, text embeddings are computed for all queries and texts in the dataset. The set of texts for each query is then ordered in descending order of the cosine similarity between the query embedding and the text embedding. The result is evaluated using MAP@10 averaged across all queries.
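As an illustration, here is a sketch for a single query with placeholder embeddings; ap_at_k implements one common formulation of AP@k, which may differ in details from the one used inside mteb:

import numpy as np

rng = np.random.default_rng(0)

def ap_at_k(ranked_rel, total_relevant, k=10):
    # AP@k given binary relevance labels of the candidates in ranked order
    rel = np.asarray(ranked_rel[:k], dtype=float)
    if total_relevant == 0:
        return 0.0
    precisions = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precisions * rel).sum() / min(total_relevant, k))

# one query with a candidate pool of 30 texts: placeholder embeddings and labels
q = rng.normal(size=312)
docs = rng.normal(size=(30, 312))
rel_labels = rng.integers(0, 2, 30)

# rank candidates by cosine similarity to the query embedding
sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
order = np.argsort(-sims)
print("AP@10 for this query:", ap_at_k(rel_labels[order], int(rel_labels.sum())))
# MAP@10 for the task is the mean of AP@10 over all queries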

Retrieval

For retrieval tasks, embeddings are computed for all queries and documents in the dataset. After that, the documents for each query are ranked in descending order of the cosine similarity between the query vector and the document vector. The result is evaluated using NDCG@10 averaged across all queries.
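A minimal sketch on placeholder data; ndcg_at_k is a simple illustrative implementation that computes the ideal ranking over the same document pool:

import numpy as np

rng = np.random.default_rng(0)

def ndcg_at_k(ranked_gains, k=10):
    # NDCG@k for one query, given relevance gains of the pool in ranked order
    gains = np.asarray(ranked_gains, dtype=float)
    discounts = 1.0 / np.log2(np.arange(k) + 2)
    dcg = float((gains[:k] * discounts[:len(gains[:k])]).sum())
    ideal = np.sort(gains)[::-1][:k]
    idcg = float((ideal * discounts[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# placeholder query/corpus embeddings and binary relevance judgements
queries = rng.normal(size=(5, 312))
corpus = rng.normal(size=(1000, 312))
relevance = rng.integers(0, 2, size=(5, 1000))

q_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
c_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
sims = q_norm @ c_norm.T                 # cosine similarity matrix

scores = []
for qi in range(len(queries)):
    order = np.argsort(-sims[qi])        # descending similarity
    scores.append(ndcg_at_k(relevance[qi][order]))
print("NDCG@10 (mean over queries):", np.mean(scores))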

Note: above we described the standard evaluation pipeline and the main metric for each task type. In addition, a number of extra metrics are computed during evaluation for every task type, and the user can control individual evaluation steps (for example, kNN can be used instead of logistic regression for classification tasks). You can read more about this in the framework documentation.

Experiments

Armed with the toolkit we created for evaluating text embeddings, a set of datasets plus a convenient code base, we evaluated six popular embedders that support Russian on the benchmark:

| Task | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | rubert-tiny2 | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CEDRClassification | Accuracy | 0.368 | 0.358 | 0.369 | 0.401 | 0.4234 | 0.448 |
| GeoreviewClassification | Accuracy | 0.397 | 0.40 | 0.396 | 0.447 | 0.461 | 0.497 |
| GeoreviewClusteringP2P | V-measure | 0.584 | 0.590 | 0.442 | 0.586 | 0.545 | 0.605 |
| HeadlineClassification | Accuracy | 0.772 | 0.793 | 0.742 | 0.732 | 0.757 | 0.758 |
| InappropriatenessClassification | Accuracy | 0.646 | 0.625 | 0.586 | 0.592 | 0.588 | 0.616 |
| KinopoiskClassification | Accuracy | 0.503 | 0.495 | 0.491 | 0.50 | 0.509 | 0.566 |
| RiaNewsRetrieval | NDCG@10 | 0.214 | 0.111 | 0.140 | 0.70 | 0.702 | 0.807 |
| RuBQReranking | MAP@10 | 0.561 | 0.468 | 0.461 | 0.715 | 0.720 | 0.756 |
| RuBQRetrieval | NDCG@10 | 0.298 | 0.124 | 0.109 | 0.685 | 0.696 | 0.741 |
| RuReviewsClassification | Accuracy | 0.589 | 0.583 | 0.570 | 0.612 | 0.630 | 0.653 |
| RuSTSBenchmarkSTS | Pearson correlation | 0.712 | 0.588 | 0.694 | 0.781 | 0.796 | 0.831 |
| RuSciBenchGRNTIClassification | Accuracy | 0.542 | 0.539 | 0.456 | 0.550 | 0.563 | 0.582 |
| RuSciBenchGRNTIClusteringP2P | V-measure | 0.522 | 0.504 | 0.414 | 0.511 | 0.516 | 0.520 |
| RuSciBenchOECDClassification | Accuracy | 0.438 | 0.430 | 0.355 | 0.427 | 0.423 | 0.445 |
| RuSciBenchOECDClusteringP2P | V-measure | 0.473 | 0.464 | 0.381 | 0.443 | 0.448 | 0.450 |
| SensitiveTopicsClassification | Accuracy | 0.285 | 0.280 | 0.220 | 0.228 | 0.234 | 0.257 |
| TERRaClassification | Average Precision | 0.520 | 0.502 | 0.519 | 0.551 | 0.550 | 0.584 |

The results of the models are still far from 100%, which indicates that the datasets are complex enough to adequately evaluate the current generation of models and gives hope that ruMTEB will not be “completely solved” for several more years, despite the rapid progress in the field of NLP.

The strongest models are those of the E5 family, headed by multilingual-e5-large, and the sbert_large_mt_nlu_ru model, which outperforms the other models on a number of classification and clustering tasks.

For a more detailed analysis, let's also look at the average results by category.

| Category | Metric | sbert_large_mt_nlu_ru | sbert_large_nlu_ru | rubert-tiny2 | multilingual-e5-small | multilingual-e5-base | multilingual-e5-large |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Classification | Accuracy | 0.554 | 0.552 | 0.514 | 0.551 | 0.561 | 0.588 |
| Clustering | V-measure | 0.526 | 0.519 | 0.412 | 0.513 | 0.503 | 0.525 |
| MultiLabelClassification | Accuracy | 0.326 | 0.319 | 0.294 | 0.314 | 0.329 | 0.353 |
| Pair Classification | Average Precision | 0.520 | 0.502 | 0.519 | 0.551 | 0.550 | 0.584 |
| Reranking | MAP@10 | 0.561 | 0.468 | 0.461 | 0.715 | 0.720 | 0.756 |
| Retrieval | NDCG@10 | 0.256 | 0.118 | 0.124 | 0.697 | 0.699 | 0.774 |
| STS | Pearson correlation | 0.712 | 0.588 | 0.694 | 0.781 | 0.796 | 0.831 |
| Average | Average | 0.494 | 0.438 | 0.431 | 0.588 | 0.594 | 0.630 |

After aggregation, the advantage of multilingual E5 remains, and sbert_large_mt_nlu_ru is also among the strong models, while sbert_large_nlu_ru lags somewhat behind. rubert-tiny2, being a small and weaker model, sits at the bottom of the ranking. For all types of classification tasks, the results of the multilingual E5 and Sbert models are close, and on clustering tasks sbert_large_mt_nlu_ru is the best. At the same time, we see a huge gap between the E5 models and both Sbert models on the Reranking and Retrieval tasks, which is explained by the fact that the Sbert models are not tailored to these types of tasks: they were trained for paraphrase tasks. As for the STS category, the aggregated results here are not very indicative, since this group includes only one dataset, RuSTSBenchmarkSTS, and the numbers simply reflect the quality on it.

If we evaluate the complexity of the tasks themselves, the aggregated results confirm what we saw for individual tasks: the results of the models are far from 100%, which indicates that the tasks presented in the benchmark are quite complex for current-generation models and are not “on the verge of being solved.”

Instead of an afterword

In this article, we talked about the set of 17 datasets for evaluating text embeddings that formed the basis of ruMTEB, a new benchmark for evaluating Russian-language embedders.

We are open for cooperation and will be glad to see your submissions on ruMTEB!

A large project like the new ruMTEB benchmark is always the result of the joint work of many people and teams. I would like to thank the lead of the SberDevices AGI NLP team @alenusch and colleagues from the SberDevices Experimental Machine Learning Systems team @Andriljo, @anpalmak and @artemsnegirev for the idea and their invaluable contribution to the project, which came to life only thanks to this joint work!

This was Maria Tikhonova. Until next time on ruMTEB!
