My little help for small languages

It was a difficult year: taxes, disasters, banditry, elections, and the rapid disappearance of small languages. The last of these was impossible to put up with …

A large number of peoples live on the territory of Russia, speaking more than 270 languages. About 150 of these languages have fewer than 1,000 native speakers, and over the past 20 years, 7 languages have already disappeared.

This project is my “five kopecks” toward supporting linguistic diversity. Its goal is to help machine translation researchers, linguists, and enthusiasts who care about their native language. We will help by obtaining parallel corpora, a kind of “fuel” with which modern models are getting ever better at understanding human language.

Today’s languages are Bashkir and Chuvash, whose popularizers I have been in close contact with lately. First, I will show how, in principle, to extract a corpus from two texts in different languages. Then we will run into the fact that the pretrained model has not seen the languages in question, and we will try to fine-tune it.

We will experiment in the Colab environment, so that any researcher who wishes can repeat this approach for their own language.

I. Extracting a parallel corpus

To align two texts, I wrote a Python library, lingtrain_aligner… Its code is open source. It uses a number of pretrained models, and you can plug in your own. One of the most successful multilingual models right now is LaBSE, which was trained on 109 languages. Since the training data is skewed toward popular languages, the quality of the embeddings (an embedding is a vector of numbers that describes the data it represents) is better for them.
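
To make the notion of an embedding more concrete, here is a tiny illustration (it loads LaBSE through the sentence_transformers wrapper, the same way we will do in part II): translations of the same sentence end up as nearby vectors.

from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("LaBSE")

# two sentences that are translations of each other
vectors = labse.encode(["Boromir smiled.", "Боромир улыбнулся."], normalize_embeddings=True)

print(vectors.shape)                   # (2, 768) - one 768-dimensional vector per sentence
print(float(vectors[0] @ vectors[1]))  # cosine similarity, high for a good translation pair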

Colab

You can try extracting a corpus in the language you need in this Colab… Next, we will go through the steps in more detail.

Installation

Install the library with the command

pip install lingtrain_aligner

After that, we import the necessary modules:

from lingtrain_aligner import splitter, aligner, resolver, metrics

We will split our texts (let’s take a chapter from Harry Potter as an example) into sentences using the splitter module… Then we will create a file with the alignment data (an SQLite database) and load the resulting sentences into it. The aligner module is responsible for this.

lang_from = "en"
lang_to = "ru"
db_path = "alignment.db"

splitted_from = splitter.split_by_sentences(text1.split('\n'), lang_from)
splitted_to = splitter.split_by_sentences(text2.split('\n'), lang_to)

aligner.fill_db(db_path, lang_from, lang_to, splitted_from, splitted_to)

To take into account the peculiarities of a language’s grammar (for example, special kinds of quotation marks, the absence of spaces, and other linguistic exotica), the corresponding parameters must be passed to splitter. Let’s align the texts with the following command:

aligner.align_db(db_path,
                model_name="sentence_transformer_multilingual_labse",
                batch_size=200,
                window=50,
                batch_ids=[],
                save_pic=False,
                embed_batch_size=5,
                normalize_embeddings=True,
                show_progress_bar=True,
                shift=0)

After the initial alignment, the best match in Russian is found for each sentence in English. To support long texts, alignment is done in batches (segments), with an overlap between neighboring batches (the window parameter). The stream of the second text can also be shifted relative to the first (the shift parameter). You can read more about the alignment mechanism here.
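
As an illustration (the segment indices and the shift value below are made up), the same align_db call can be re-run for selected segments only, assuming batch_ids selects which segments to recompute, and with a nonzero shift if the translations are offset relative to each other:

# hypothetical re-run: recompute only segments 3 and 4,
# with the second text offset relative to the first (illustrative values)
aligner.align_db(db_path,
                model_name="sentence_transformer_multilingual_labse",
                batch_size=200,
                window=50,
                batch_ids=[3, 4],
                save_pic=False,
                embed_batch_size=5,
                normalize_embeddings=True,
                show_progress_bar=True,
                shift=10)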

Visualization

Let’s look at the result using the vis_helper module:

from lingtrain_aligner import vis_helper

vis_helper.visualize_alignment_by_db(db_path,
        output_path="alignment_vis.png",
        batch_size=500,
        size=(900,900),
        lang_name_from=lang_from,
        lang_name_to=lang_to,
        batch_ids=[],
        plt_show=True,
        show_info=False)

print("score:", metrics.chain_score(db_path))

Metrics

To assess the alignment, I came up with a metric whose logic lives in the metrics module… It evaluates how connected the alignment chain is. A chain without breaks has a score of 1, while a random scatter of points has a score close to 0.
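
To give a feel for the idea, here is my own illustrative sketch (not the actual code of the metrics module) of how such a chain-connectedness score could be computed:

# illustrative sketch only: measure how "monotonic" the chain of matches is
def chain_score_sketch(to_ids):
    # to_ids: indices of the matched sentences in the second text,
    # ordered by the source sentence they were matched to
    if len(to_ids) < 2:
        return 1.0
    good_links = sum(1 for a, b in zip(to_ids, to_ids[1:]) if 0 <= b - a <= 1)
    return good_links / (len(to_ids) - 1)

print(chain_score_sketch([1, 2, 3, 4, 5]))    # 1.0 - an unbroken chain
print(chain_score_sketch([1, 17, 3, 42, 5]))  # 0.0 - a random scatter of points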

Conflict resolution

The number of sentences in the two texts can differ considerably. This is due both to the style of a particular translator and to the characteristics of a particular language (for example, a complex Russian sentence tends to be translated into Chinese as several sentences). To overcome this, in certain places we need to glue together sentences of either the first text or the second. The resolver module does this… It resolves the conflicts it finds in several passes. The biggest conflicts have to be resolved manually; there is a UI for that, more on it below. In our case, the quality of the initial alignment suggests that everything should be fine. We will verify this by putting all the dropped lines back in their place.

model_name = "sentence_transformer_multilingual_labse"  # the same model we used for alignment
steps = 3

for i in range(steps):
    conflicts, rest = resolver.get_all_conflicts(db_path,
                        min_chain_length=2+i,
                        max_conflicts_len=6*(i+1),
                        batch_id=-1)

    resolver.resolve_all_conflicts(db_path, conflicts, model_name, show_logs=False)

    if len(rest) == 0:
        break

Let’s look at the visualization:

Result

The picture looks nice, but let’s check the result. From the database, the corpora can be exported as separate plain-text files or in TMX format.

import os

from lingtrain_aligner import saver

output_path="/content"

saver.save_plain_text(db_path, os.path.join(output_path, f"corpora_{lang_from}.txt"), direction="from", batch_ids=[])
saver.save_plain_text(db_path, os.path.join(output_path, f"corpora_{lang_to}.txt"), direction="to", batch_ids=[])

saver.save_tmx(db_path, os.path.join(output_path, f"corpora.tmx"), lang_from, lang_to)
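
As a quick sanity check (assuming, as the plain-text export implies, that the two files are line-aligned), the first few pairs can be printed side by side:

with open(os.path.join(output_path, f"corpora_{lang_from}.txt"), encoding="utf8") as f_from, \
     open(os.path.join(output_path, f"corpora_{lang_to}.txt"), encoding="utf8") as f_to:
    # line N of one file corresponds to line N of the other
    for line_from, line_to in list(zip(f_from, f_to))[:3]:
        print(line_from.strip(), "|||", line_to.strip())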

Excerpt from corpora.tmx:

Having resolved the conflicts, we obtained a parallel corpus of 332 lines out of 344 sentences in English and 372 in Russian. As mentioned earlier, entire books can be aligned in exactly the same way.

Since literary translation sometimes borders on art, some pairs still need additional validation. It all depends on the specific translation. In addition, the model can make mistakes on short sentences and on sentences with many titles and proper names.

Sometimes the translator is even inclined to “improve” the original. For example, in one of the translations of “The Lord of the Rings” you can find the following description:

Тень улыбки промелькнула на бледном, без кровинки, лице Боромира. (“A shadow of a smile flickered across Boromir’s pale, bloodless face.”)

And the original:

Boromir smiled.

II. Fine-tuning for a new language

Let’s get back to low-resource languages. Although the model is good and “understands” more than a hundred languages out of the box, it will not work satisfactorily on a new one. Let’s check.

Colab

You can look at the experiments and the code I wrote in this Colab.

Bashkir language

Let’s try to align the story “Father Yalaletdin” by Mustai Karim in Bashkir and Russian. We perform the same steps as in the first part and get the following:

We can see that the quality is noticeably worse, yet still reasonably good. Why is that? Because LaBSE was trained on a small corpus of Tatar. The two languages are related, and sometimes you can get a translation from one to the other just by replacing a few letters.

If we now run the conflict resolution mechanism, it will, of course, work. However, there will be a significant number of incorrect resolutions. Since this does not suit us, let’s figure out how we can fine-tune the model and improve the quality of the corpus.

Fine-tuning

First, let’s recall how Google originally trained its model. The task the model optimized was translation ranking: from a given set of candidate translations, it had to pick out the correct one (picture from the article):

The wrapper around the model that I used (the very popular and convenient sentence_transformers library) provides a set of losses that do roughly this.

First, let’s install the dependencies:

pip install transformers sentencepiece sentence_transformers

Let’s import and initialize the model:

from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample
from sentence_transformers.evaluation import SentenceEvaluator
from torch.utils.data import DataLoader

model = SentenceTransformer('LaBSE')

The fine-tuned model can then be passed as a parameter to the alignment methods, so we will do that a bit later.

After reading the documentation, I found several loss functions that suit us: MultipleNegativesRankingLoss, ContrastiveLoss, and OnlineContrastiveLoss… The last two require examples labeled 0 or 1: 1 if a pair of strings is a mutual translation and the corresponding vectors should be pulled together, 0 if they should be pushed apart. MultipleNegativesRankingLoss works in a similar way; from its code you can see that, within each batch, correct translation pairs are pulled together and all the other pairs are pushed apart. The library’s author recommended using it, and in my experiments it indeed turned out to be more effective than the others.
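
For comparison, this is roughly how the labeled data for ContrastiveLoss or OnlineContrastiveLoss would look, using the imports above (positive_pairs and negative_pairs are hypothetical lists of sentence tuples you would prepare yourself; we will not use this variant below):

# sketch only: labeled pairs for the contrastive losses
contrastive_examples = (
    [InputExample(texts=[ba, ru], label=1) for ba, ru in positive_pairs]    # true translations
    + [InputExample(texts=[ba, ru], label=0) for ba, ru in negative_pairs]  # random mismatches
)

contrastive_loss = losses.OnlineContrastiveLoss(model=model)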

For fine-tuning, you need to bring your dataset of translation pairs into the required form. Of course, before training, you should pay attention to the quality of the dataset and clean it up. For Bashkir, I used data kindly provided by the enthusiasts Aigiz Kunafin and Iskander Shakirov: an open Russian-Bashkir dataset.

# train_dataset is a list of dicts like {"ba": "...", "ru": "..."} with translation pairs
train_batch_size = 8  # adjust to fit the memory of the GPU you were given

train_examples = [InputExample(texts=[x['ba'], x['ru']], label=1) for x in train_dataset]

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)

train_loss = losses.MultipleNegativesRankingLoss(model=model)

After that, you can train the model. This is done simply:

import math

num_epochs = 3
model_save_path = "labse_bashkir_tuned"  # directory where the best checkpoint will be stored

warmup_steps = math.ceil(len(train_dataloader) * 0.1 * num_epochs)

# evaluator is the custom ChainScoreEvaluator described below
model.fit(train_objectives=[(train_dataloader, train_loss)],
        evaluator=evaluator,
        epochs=num_epochs,
        evaluation_steps=1000,
        output_path=model_save_path,
        save_best_model=True,
        use_amp=True,
        warmup_steps=warmup_steps)
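
Once training has finished, the best checkpoint saved by save_best_model can be loaded back like any other sentence_transformers model and used for alignment:

# load the fine-tuned checkpoint from the output directory
tuned_model = SentenceTransformer(model_save_path)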

You can also pass your own class as the evaluator. It will be called every evaluation_steps steps, compute your metrics, and draw graphs. I added a ChainScoreEvaluator class, which aligns and scores small passages of text in the languages in question.
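
The sentence_transformers evaluator interface is simply a class with a __call__ method that returns a single number; here is a bare-bones sketch in the spirit of ChainScoreEvaluator (not its actual code), with the alignment step left as a placeholder:

class ChainScoreEvaluatorSketch(SentenceEvaluator):
    # sketch only: score the current model by aligning a small held-out fragment
    def __init__(self, text_from, text_to, lang_from="ba", lang_to="ru"):
        self.text_from = text_from
        self.text_to = text_to
        self.lang_from = lang_from
        self.lang_to = lang_to

    def __call__(self, model, output_path=None, epoch=-1, steps=-1):
        # ... here one would align self.text_from / self.text_to with `model`
        # via lingtrain_aligner and compute metrics.chain_score on the result ...
        score = 0.0  # placeholder
        return score  # model.fit keeps the checkpoint with the best returned value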

It should also be noted that although Colab is free, it can hand out GPUs that are not powerful enough for training, which limits the batch size and training speed. In the end, I signed up for Colab Pro at $10 per month (about 750 rubles).

Improvement

After training the model in Colab for several days, I got the following result:

This quality is already enough to restore the dropped lines with much more confidence.


Chuvash language

Things were much more complicated with Chuvash, since the initial quality was many times worse. The language is more distant from its Turkic relatives that are present in the model.

Thanks to Alexander Antonov, a popularizer of the Chuvash language, for the dataset. The Russian-Chuvash parallel corpus can be found here… As a result of the experiments, the quality was improved significantly:

Result after automatic conflict resolution:

corpora.tmx

So that you can assess the quality of these models, I put together a Colab that uses them. The advantage of Colab is that it provides its own GPUs, so the computations run much faster. In this notebook you can also choose other languages, give it a try.

Validation

A few words, separately, about checking the resulting corpus. To improve its quality, you can use the same model to compute the distance between embeddings (recall that an embedding is just a vector of numbers corresponding to a sentence) and cut off the most distant pairs.
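
A minimal sketch of such a filter, where model is the (fine-tuned) SentenceTransformer from part II and src_lines / tgt_lines are assumed to hold the two sides of the exported corpus (the 0.5 threshold is arbitrary and should be picked by inspecting the data):

# encode both sides and drop pairs whose cosine similarity is too low
src_emb = model.encode(src_lines, normalize_embeddings=True)
tgt_emb = model.encode(tgt_lines, normalize_embeddings=True)

similarities = (src_emb * tgt_emb).sum(axis=1)  # cosine similarity, since the vectors are normalized

filtered = [(s, t) for s, t, sim in zip(src_lines, tgt_lines, similarities)
            if sim > 0.5]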

Better yet, involve native speakers. That is what my Bashkir colleagues did by writing a bot that offers pairs of sentences for evaluation. If you speak Bashkir, join in.

Both models can be tried here.

UI

To resolve large conflicts manually and to edit the corpus, I wrote a UI. I have described it in more detail here; this is what it looks like:

In it, you can not only align and edit corpora, but also turn them into parallel books.

Ideas

The experiments described here are probably not optimal. The quality can be improved by adding data to the training set in the same style as the documents that will later need to be aligned.

You can also exploit the fact that related languages have similar grammar and vocabulary, right down to the characters of the alphabet. It is possible that replacing, for example, Cyrillic letters with Latin ones would further improve the quality (for the same Chuvash). This also remains to be tried.
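
As a toy illustration of that idea (the mapping below is tiny and purely illustrative, not a real romanization scheme for Chuvash), such a character-level substitution is a one-liner in Python:

# replace a few Cyrillic letters with similar Latin ones before encoding
translit_table = str.maketrans({"а": "a", "е": "e", "к": "k", "н": "n", "о": "o", "р": "r", "с": "s", "т": "t"})

print("текст".translate(translit_table))  # -> "tekst"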

If you have any ideas on this, I will be glad if you share them.

And yes, I almost forgot: can anyone guess which languages are shown on the cover of the article?
