Part 1 of this article covered the evolution of machine translation quality metrics and described the main traditional metrics. In Part 2 we move on to neural network metrics, starting with reference-based metrics, whose calculation requires a reference: a gold-standard translation, usually produced by a human.
We will look at reference-free neural network metrics in Part 3, where we will also compare the effectiveness of the various metrics.
Neural network metrics compare not words themselves but their embeddings, so let us first say a few words about this concept.
The concept of “embedding”
An embedding is a vector representation of a word: an encoding of a word as a set of numbers, produced by special models that analyze word usage in large collections of texts.
This approach was introduced by Mikolov et al. in the 2013 paper Distributed Representations of Words and Phrases and their Compositionality. The embedding algorithm they presented is known as Word2Vec.
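To make the idea concrete, here is a tiny sketch. The word vectors below are made up purely for illustration (real Word2Vec embeddings have hundreds of dimensions and are learned from large corpora), but they show how cosine similarity between embeddings captures semantic relatedness:

```python
from math import sqrt

# Toy 3-dimensional embeddings (values invented for illustration;
# real Word2Vec vectors are learned from text and much larger).
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related words end up with closer vectors.
print(cosine(emb["king"], emb["queen"]))  # high similarity
print(cosine(emb["king"], emb["apple"]))  # low similarity
```

This proximity of vectors for related words is exactly what the neural network metrics below exploit.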
Below are a couple of videos to help grasp the essence of embeddings.
Video about the concept of “embedding” from Andrew Ng:
Video about the properties of embeddings, also from him (Andrew Ng is great!):
Reference-based neural network metrics
After the advent of embeddings, metrics began to appear that evaluate not the lexical but the semantic similarity of a machine translation to the reference (roughly speaking, not the overlap of words but the overlap of meanings).
All metrics that compare embeddings can be considered neural network metrics, since the embeddings themselves are obtained as a result of training various models.
The general logic of the development of neural network metrics is outlined in Part 1 of this article. Here we consider the best-known reference-based neural network metrics.
WMD (Word Mover’s Distance) is a metric proposed by Kusner et al. in 2015 to assess the semantic similarity of texts. It is based on measuring the proximity of word embeddings obtained with the Word2Vec algorithm.
The proximity of two sentences, the model-generated one and the reference, is estimated using the Earth Mover’s Distance (EMD) between the embeddings of the words that make up these sentences. Computing the EMD amounts to solving an optimal transport problem.
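A minimal sketch of the idea, with made-up word vectors inspired by the example sentences in Kusner et al.'s paper. In the simplified case of two equal-length sentences with uniform word weights, the optimal transport plan reduces to an optimal one-to-one assignment of words, which a brute-force search can find on toy inputs (real implementations solve the general transport problem):

```python
from itertools import permutations
from math import sqrt

# Toy 4-dimensional embeddings (values invented for illustration).
emb = {
    "obama":     [0.9, 0.1, 0.0, 0.0],
    "president": [0.8, 0.2, 0.1, 0.0],
    "speaks":    [0.1, 0.9, 0.2, 0.0],
    "greets":    [0.2, 0.8, 0.3, 0.0],
    "media":     [0.1, 0.2, 0.9, 0.0],
    "press":     [0.2, 0.1, 0.8, 0.0],
    "monkey":    [0.0, 0.1, 0.0, 0.9],
    "eats":      [0.1, 0.0, 0.1, 0.8],
    "banana":    [0.0, 0.0, 0.2, 0.9],
}

def euclid(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def wmd_equal_length(sent_a, sent_b):
    """Simplified WMD: equal-length sentences, uniform word weights,
    so the transport plan is a one-to-one assignment. Brute force over
    all assignments; fine for toy inputs only."""
    n = len(sent_a)
    best = float("inf")
    for perm in permutations(range(n)):
        cost = sum(euclid(emb[sent_a[i]], emb[sent_b[j]])
                   for i, j in enumerate(perm)) / n
        best = min(best, cost)
    return best

# Paraphrases get a small distance, unrelated sentences a large one.
close = wmd_equal_length(["obama", "speaks", "media"],
                         ["president", "greets", "press"])
far = wmd_equal_length(["obama", "speaks", "media"],
                       ["monkey", "eats", "banana"])
print(close, far)
```

Note that no words overlap between the paraphrase pair, yet the distance stays small: the metric works on meanings, not surface forms.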
ReVal (Gupta et al., 2015) is considered the first neural network metric proposed specifically for assessing machine translation quality.
The metric is computed using a recurrent LSTM neural network model (hence the name), together with GloVe word vectors.
ReVal correlates substantially better with human translation quality scores than traditional metrics, but worse than more recent neural network metrics.
BERTScore is a metric proposed by Zhang et al. in 2019 for evaluating the quality of generated text. It is based on measuring the proximity of contextual embeddings obtained from a pre-trained BERT neural network model.
To calculate BERTScore, each token of one sentence is matched to the most similar token of the other by cosine similarity between their embeddings. Averaging these similarities over the generated sentence's tokens gives a precision, averaging over the reference's tokens gives a recall, and the two are combined into an F1 score.
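A simplified sketch of this matching with made-up token vectors (real BERTScore takes contextual embeddings from a BERT layer and optionally applies IDF weighting and score rescaling):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def bertscore_f1(cand, ref):
    """Simplified BERTScore: greedily match each token to the most
    similar token on the other side; average over candidate tokens for
    precision, over reference tokens for recall, combine into F1."""
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(c, r) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

# Made-up "contextual" token embeddings for illustration.
cand = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.2]]         # hypothesis tokens
ref_close = [[0.85, 0.15, 0.1], [0.2, 0.9, 0.1]]  # similar meaning
ref_far = [[0.1, 0.1, 0.9], [0.0, 0.3, 0.8]]      # different meaning
print(bertscore_f1(cand, ref_close), bertscore_f1(cand, ref_far))
```

A semantically close reference yields a score near 1, an unrelated one a much lower score.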
Below is a video about BERTScore.
Learn more about BERTScore
YiSi is a machine translation quality metric with a flexible architecture, proposed by Chi-Kiu Lo in 2019. YiSi can evaluate the basic semantic proximity of machine and human translations, and can optionally also take into account an analysis of shallow semantic structures.
As with BERTScore, the basic proximity calculation in YiSi relies on cosine similarity between embeddings derived from a BERT model.
BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) is another metric based on BERT embeddings, proposed by Sellam et al. in 2020 for assessing the quality of text generation.
To compute BLEURT, the BERT model was additionally trained on:
a synthetic dataset in the form of pairs of sentences from Wikipedia;
an open set of translations and human-assigned ratings from the WMT Metrics Shared Task.
The final BLEURT score is then produced directly by this fine-tuned neural model.
Prism (Probability is the metric) is a machine translation quality metric proposed by Thompson and Post in 2020, based on their own multilingual transformer model, also called Prism.
The authors noted the similarity between assessing how close a machine translation is to the reference and estimating the probability of a paraphrase. The score is the model-estimated probability that the human reference translation is a paraphrase of the machine translation output.
With this approach, no human quality judgments were needed to train the model.
COMET (Crosslingual Optimized Metric for Evaluation of Translation) is a metric proposed by Rei et al. in 2020, based on their own model and accompanied by a whole framework for training other translation quality evaluation models.
COMET uses the multilingual XLM-RoBERTa model as an encoder, with additional layers on top whose output is a translation quality score. The model takes as input not only the machine translation (hypothesis) and the gold standard (reference), but also the source text being translated (source).
The model is trained on hypothesis-source-reference triples together with human translation ratings (from WMT, as for BLEURT). Training minimizes the mean squared error (MSE) between the model's scores and the human scores.
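The training objective can be sketched as follows. Everything here is made up for illustration: the toy feature vectors stand in for pooled encoder representations of the hypothesis-source-reference triple, the "human scores" are invented, and a single linear layer plays the role of the regression head (real COMET trains additional layers on top of XLM-RoBERTa):

```python
# Toy training data: each feature vector stands in for the pooled
# encoder embedding of one (hypothesis, source, reference) triple;
# scores imitate human quality ratings in [0, 1]. All values invented.
features = [[0.9, 0.8], [0.7, 0.6], [0.3, 0.2], [0.1, 0.1]]
human_scores = [0.95, 0.75, 0.30, 0.10]

w = [0.0, 0.0]  # weights of a linear "regression head"
b = 0.0
lr = 0.1

def predict(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Plain gradient descent on the MSE between predicted and human scores.
for _ in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    for x, y in zip(features, human_scores):
        err = predict(x) - y
        for i in range(len(w)):
            grad_w[i] += 2 * err * x[i] / len(features)
        grad_b += 2 * err / len(features)
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b

mse = sum((predict(x) - y) ** 2
          for x, y in zip(features, human_scores)) / len(features)
print(w, b, mse)
```

After training, the head's predictions track the human ratings, which is exactly the property a learned quality metric needs.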
UniTE (Unified Translation Evaluation) is a metric proposed by Wan et al. in 2022, based on their own model. Like COMET, UniTE uses an XLM-RoBERTa encoder with additional layers.
The UniTE architecture accepts one of the following input combinations: 1) machine translation and reference translation, 2) machine translation and original source, 3) all three types of data.
Unlike COMET, where each input is encoded separately, UniTE encodes the reference, the hypothesis, and the source together, and this single joint representation is used to compute the translation quality score. The UniTE-MRA version additionally uses an attention mechanism (MRA, Monotonic Regional Attention).
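The difference in input handling can be sketched roughly as follows. The segment order and the `</s>` separator token are assumptions for illustration; the exact input formatting inside the real models may differ:

```python
# Illustrative only: how the two architectures see their inputs.
hypothesis = "The cat sits on the mat"
source = "Die Katze sitzt auf der Matte"
reference = "The cat is sitting on the mat"

# COMET-style: three independent encoder passes; the pooled
# embeddings are combined only afterwards, in the regression layers.
comet_inputs = [hypothesis, source, reference]

# UniTE-style: one joint sequence encoded in a single pass, so the
# encoder's attention can directly relate tokens across all segments.
# "</s>" is an XLM-RoBERTa-style separator, used here as an assumption.
unite_input = " </s> ".join([hypothesis, reference, source])
print(unite_input)
```

Joint encoding lets cross-segment token interactions inform the score, at the cost of a longer input sequence per evaluation.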
This family of metrics is not officially documented, but it was submitted for evaluation as part of the WMT Metrics Shared Task and performed quite well. These metrics were obtained by fine-tuning several versions of the large mT5 language model on various human translation quality assessment data (I will cover the DA, direct assessment, and MQM, Multidimensional Quality Metrics, data types in Part 3 of this article).
We have reviewed the best-known reference-based neural network metrics. They usually correlate better with human translation quality scores than traditional metrics do, but they have flaws of their own. First of all, their scores are hard to interpret, since neural network computations follow the "black box" principle. They also demand more computing resources than traditional metrics.
There is another class of neural network metrics that deserves separate consideration: reference-free metrics, whose calculation does not require a human reference translation. Part 3 of our article will be devoted to this class of metrics. There we will also finally see how the various traditional and neural network metrics compare in effectiveness.