The first free translation model from Russian to Chinese and vice versa


I present the first free offline model for translating from Russian into Chinese and vice versa.

Earlier, I wrote about how you can easily train your own machine translation model, using translation from English into Russian as an example.

This time I decided to build a translation model for Chinese, something I had long wanted to do and had promised in the comments to my previous article.

You can read the previous article here: https://habr.com/ru/post/689580/

Introduction

Machine translation has always been a relevant topic, and it remains so today. Training a machine translation model requires a large amount of parallel text. While parallel texts involving English are plentiful, for other languages such as Chinese the situation is much more complicated, and finding large amounts of text is quite difficult.

A significant breakthrough came with the release of the CCMatrix corpus: https://opus.nlpl.eu/CCMatrix.php

It covers 1197 bitexts mined from the web in 90 different languages.

Models such as facebook/nllb-200-distilled-600M have now appeared that can translate between 200 languages at once, but my analysis showed that such models translate rather poorly, and they are almost unusable for translating texts in real work. In addition, the model's CC-BY-NC-4.0 license does not allow commercial use.

Data for training

To train my machine translation model, I used a large number of parallel text corpora from https://opus.nlpl.eu/

As training data, I used the following text corpora: UNPC_v1_0_ru-zh, CCMatrix_v1_ru-zh, MultiUN_v1_ru-zh, LinguaTools-WikiTitles_v2014_ru-zh, News-Commentary_v16_ru-zh, WikiMatrix_v1_ru-zh, Tanzil_v1_ru-zh, MultiParaCrawl_v9b_ru-zh, bible-uedin_v1_ru-zh, TED2020_v1_ru-zh, infopankki_v1_ru-zh, tico-19_v2020-10-28_ru-zh, QED_v2.0a_ru-zh, NeuLab-TedTalks_v1_ru-zh, PHP_v1_ru-zh, wikimedia_v20210402_ru-zh, ELRC-wikipedia_health_v1_ru-zh, Ubuntu_v14.10_ru-zh, EUbookshop_v2_ru-zh.

In total, the combined corpus amounts to 35 million sentence pairs.
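For illustration, combining such corpora can be sketched as follows (a simplified sketch: it assumes each OPUS corpus has been downloaded in Moses format as a pair of aligned .ru/.zh files, with made-up file names, and it drops exact duplicate pairs):

# Sketch: merge several Moses-format OPUS corpora (pairs of aligned .ru/.zh files)
# into one deduplicated parallel training set. File names are illustrative.
corpora = ["CCMatrix.ru-zh", "UNPC.ru-zh", "MultiUN.ru-zh"]  # ... and so on

seen = set()
with open("train.ru", "w", encoding="utf-8") as out_ru, \
     open("train.zh", "w", encoding="utf-8") as out_zh:
    for name in corpora:
        with open(f"{name}.ru", encoding="utf-8") as f_ru, \
             open(f"{name}.zh", encoding="utf-8") as f_zh:
            for ru, zh in zip(f_ru, f_zh):
                ru, zh = ru.strip(), zh.strip()
                if not ru or not zh or (ru, zh) in seen:  # skip empty lines and duplicate pairs
                    continue
                seen.add((ru, zh))
                out_ru.write(ru + "\n")
                out_zh.write(zh + "\n")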

Training

Training was carried out entirely following the instructions from the previous article: https://habr.com/ru/post/689580/

As a result, I got two machine translation models that can be used in the Argos Translate translator.

The first model, translate-ru_zh-1_7.argosmodel, is for translating from Russian into Chinese.

The second model, translate-zh_ru-1_7.argosmodel, is for translating from Chinese into Russian.

Code for using the model in Python:

import pathlib
import argostranslate.package
import argostranslate.translate

# Install the downloaded .argosmodel package
package_path = pathlib.Path("translate-zh_ru-1_7.argosmodel")
argostranslate.package.install_from_path(package_path)

from_code = "zh"
to_code = "ru"

# Translate a Chinese sentence into Russian
translated_text = argostranslate.translate.translate("吃一些软的法国面包。", from_code, to_code)

print(translated_text)
# Съесть мягкий французский хлеб.

This time, I also decided to train a model in the same way on a more popular architecture that developers could use in their own projects. As a basis, I took the mBART-50 architecture with pre-trained weights.

This model was created specifically for machine translation tasks; its pre-trained weights were already trained on this task using texts in 50 different languages.

I did not train the mBART-50 model from scratch; instead, I took the already pre-trained weights and fine-tuned them.

Initially, the model contained tokens for 50 different languages, but I only needed the Russian-Chinese pair. So, before training, I compressed the model by removing unused tokens from the embedding layers and from the last layer of the neural network. As a result, the vocabulary shrank from 250 thousand tokens to 45 thousand, and the model size dropped from 2.3 GB to 1.6 GB, a reduction of a little over 30%, without any change in quality or speed. Training such a model also became somewhat easier.

I posted the code for this compression here.
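In rough outline, the compression boils down to something like the following sketch (simplified and illustrative, not the full script; the corpus file name is made up, and the SentencePiece tokenizer must be shrunk and re-indexed in exactly the same way):

# Simplified sketch of pruning the mBART-50 vocabulary to the ru/zh subset.
# Assumption: "ru_zh_corpus.txt" is an illustrative file with the training text.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(model_name)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)

# 1. Collect the ids of every token that actually occurs in the ru-zh corpus,
#    plus all special tokens and language codes.
kept_ids = set(tokenizer.all_special_ids)
kept_ids.update(tokenizer.lang_code_to_id.values())
with open("ru_zh_corpus.txt", encoding="utf-8") as f:
    for line in f:
        kept_ids.update(tokenizer(line.strip())["input_ids"])
kept_ids = sorted(kept_ids)

# 2. Build a smaller embedding matrix that keeps only those rows.
old_emb = model.get_input_embeddings()                       # ~250k x 1024
new_emb = torch.nn.Embedding(len(kept_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[kept_ids].clone()
model.set_input_embeddings(new_emb)

# 3. Re-tie the output projection to the new matrix and fix the config
#    (this also trims the final_logits_bias buffer to the new vocabulary size).
model.tie_weights()
model.resize_token_embeddings(len(kept_ids))

model.save_pretrained("mbart-ru-zh-pruned")
# The SentencePiece vocabulary must be shrunk and re-indexed the same way,
# so that new token id k corresponds to old id kept_ids[k].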

To fine-tune the model, I used the instructions from the Hugging Face documentation:

https://huggingface.co/docs/transformers/training
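For the Russian-to-Chinese direction, the fine-tuning loop following those instructions looks roughly like this (an illustrative sketch: it assumes a pruned checkpoint with a matching shrunken tokenizer saved at ./mbart-ru-zh-pruned, and uses a toy dataset and made-up hyperparameters rather than my actual settings):

# Sketch: fine-tune the pruned mBART checkpoint with the Hugging Face Trainer API.
# The dataset, checkpoint path and hyperparameters below are illustrative only.
from datasets import Dataset
from transformers import (MBartForConditionalGeneration, MBart50TokenizerFast,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

model = MBartForConditionalGeneration.from_pretrained("./mbart-ru-zh-pruned")
tokenizer = MBart50TokenizerFast.from_pretrained("./mbart-ru-zh-pruned")
tokenizer.src_lang = "ru_RU"
tokenizer.tgt_lang = "zh_CN"

# Tiny toy dataset for illustration; in reality this is the 35M-pair corpus.
ru_zh_dataset = Dataset.from_dict({
    "ru": ["Съешь ещё этих мягких французских булок."],
    "zh": ["再吃一些软的法国面包。"],
})

def preprocess(batch):
    # Tokenize the source text and build decoder labels from the target text.
    return tokenizer(batch["ru"], text_target=batch["zh"],
                     max_length=128, truncation=True)

train_dataset = ru_zh_dataset.map(preprocess, batched=True,
                                  remove_columns=["ru", "zh"])

args = Seq2SeqTrainingArguments(
    output_dir="mbart-ru-zh-finetuned",
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    num_train_epochs=1,
    save_steps=10000,
    logging_steps=500,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
# For the reverse direction, the same data is fed with src_lang/tgt_lang swapped,
# so that a single model learns both ru->zh and zh->ru.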

I trained the mBART model to translate both from Chinese into Russian and from Russian into Chinese at once, so a single model can be used for translation in both directions. This also makes it possible to use one model for paraphrasing: by translating a text first from Russian into Chinese and then back, we get a paraphrased version of the same text.

Paraphrase function code example:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("joefox/mbart-large-ru-zh-ru-many-to-many-mmt")

def text_paraphrase(src_text):
    
    # translate Russian to Chinese
    tokenizer.src_lang = "ru_RU"
    encoded_ru = tokenizer(src_text, return_tensors="pt")
    generated_tokens = model.generate(
        **encoded_ru,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"]
    )
    result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    # translate Chinese to Russian
    tokenizer.src_lang = "zh_CN"
    encoded_zh = tokenizer(result, return_tensors="pt")
    generated_tokens = model.generate(
        **encoded_zh,
        forced_bos_token_id=tokenizer.lang_code_to_id["ru_RU"]
    )
    tgt_text = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    # batch_decode returns a list; return the single paraphrased string
    return tgt_text[0]

result = text_paraphrase("Съешь ещё этих мягких французских булок.")

print(result)
#Ешьте французский хлеб.

Evaluation of results

To evaluate the results, I had to prepare a new corpus of Chinese-Russian translations myself. As a basis, I took the original corpus used to evaluate OPUS models, the newstest corpus. However, this corpus lacks the zh-ru language pair, so I assembled my own corpus from the available pairs: I took all the news items translated from Chinese into English and from English into Russian, and found the identical news items that occur in both. As a result, I got a good Russian-Chinese corpus on which the quality of translation from Russian into Chinese and vice versa can be evaluated. I decided to share this corpus with everyone, and you can now download it here: newstest-2017-2019-ru_zh
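The construction can be sketched roughly like this (an illustrative sketch of the pivoting idea, with made-up file names):

# Sketch: build a ru-zh test set by pivoting newstest zh-en and en-ru pairs
# through identical English sentences. File names are illustrative.
def read_pairs(path_a, path_b):
    with open(path_a, encoding="utf-8") as f_a, open(path_b, encoding="utf-8") as f_b:
        return list(zip((l.strip() for l in f_a), (l.strip() for l in f_b)))

en_to_zh = {en: zh for zh, en in read_pairs("newstest.zh-en.zh", "newstest.zh-en.en")}
en_to_ru = {en: ru for en, ru in read_pairs("newstest.en-ru.en", "newstest.en-ru.ru")}

with open("newstest.ru-zh.ru", "w", encoding="utf-8") as out_ru, \
     open("newstest.ru-zh.zh", "w", encoding="utf-8") as out_zh:
    for en, zh in en_to_zh.items():
        if en in en_to_ru:                  # the same English sentence occurs in both pairs
            out_ru.write(en_to_ru[en] + "\n")
            out_zh.write(zh + "\n")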

I evaluated the translation quality of several models on the published newstest-2017-2019-ru_zh dataset and compared their performance.

The CPU and GPU columns show the time, in milliseconds, each model takes to translate a single sentence without batching.

For completeness, I also took two other well-known multilingual machine translation models, m2m100_418M and nllb-200-distilled-600M, for comparison, and additionally ran the dataset through Google Translate using the Python library translators. However, I noticed that Google Translate in the browser translates better than through the translators library.

According to the resulting metrics, the fastest model on the CPU is the one trained for Argos Translate, and in terms of quality it significantly outperforms Google Translate as well as the m2m100_418M and nllb-200-distilled-600M models.

My model joefox/mbart-large-ru-zh-ru-many-to-many-mmt turned out to be the highest-quality model, achieving a sacreBLEU score of 12.23 on the Chinese-to-Russian machine translation task.
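For reference, such scores can be computed with the sacrebleu library roughly as follows (an illustrative sketch with made-up file names; when scoring Chinese output, tokenize="zh" should be passed):

# Sketch: compute sacreBLEU for the zh->ru direction on newstest-2017-2019-ru_zh.
# File names are illustrative: one sentence per line, hypotheses aligned with references.
import sacrebleu

with open("hypotheses.ru", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("references.ru", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])  # pass tokenize="zh" when scoring Chinese output
print(round(bleu.score, 2))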

Conclusions

Until the models are included in the official Argos Translate repository, you can download them from Yandex Disk and use them either in the Argos Translate application or in Python.

translate-ru_zh-1_7.argosmodel

translate-zh_ru-1_7.argosmodel

The mBART model is published on Hugging Face as joefox/mbart-large-ru-zh-ru-many-to-many-mmt, with examples of its use.

I am releasing these models publicly and hope they will be useful to you.

I would also like to continue improving the translation models between Russian and Chinese. So if you have any ideas on how to improve the quality of these models, or if you know where to find large Russian-Chinese parallel text corpora on the Internet, please send the information in a private message or in the comments. I will be glad to hear any ideas.

In addition, you can train your own model and use the newstest-2017-2019-ru_zh dataset to evaluate its quality and compare it with mine.
