Karachay-Balkar translator

This is a short article, but it took a lot of work to write it. It briefly describes the language, how we collected data, and how we trained the models. It's not really a how-to guide, but rather a way to announce what we've done.

About the language and the people

Since the title says “translator”, there must be a language involved. It is spoken by the Karachay-Balkars (officially, the people are artificially divided into “Karachays” and “Balkars”), a Caucasian people living to the north, east and west of Mount Elbrus, mainly in the republics of Karachay-Cherkessia and Kabardino-Balkaria.

Karachay flag (tekmet) and Balkar flag (with Mount Elbrus)

Our people number about 435 thousand worldwide. We speak Karachay-Balkar, a Turkic language of the Polovtsian-Kipchak group, close to Kumyk and Crimean Tatar. It has approximately 310 thousand speakers, which is fewer than the number of people.

Our language has two dialects, with subdialects that differ only slightly from one another:

  1. Chokayushchiy dialect (example: chach – “hair”):

    1. Dzhokayushchiy (dzhol – “road”). Spoken by Karachays and by Balkars from the Baksan River valley.

    2. Zhokayushchiy (zhol). Spoken by Balkars from the Chegem River valley.

    3. Mixed Khulam-Bezengievsky (both zhol and zol are possible). Territory: the Cherek Khulamsky River valley.

  2. Tsokayushchiy-zokayushchiy Balkar dialect (tsats and zol). Territory: the Cherek Balkarsky River valley.

Language problems

Unfortunately, the language is under threat: fewer and fewer people know it. The main problems include:

  1. A small number of speakers.

  2. Children do not actively speak it.

  3. It is taught less and less in schools (especially since it was made optional rather than mandatory).

  4. A multilingual environment in which Russian is the language of communication.

Together, these factors do serious damage. But there is another very significant problem: our language is about the past. It is used to describe historical names, traditions and village life, but not the present and, especially, not the future: it is practically absent from science, technology and modern life. As a result, there seems to be no need for Karachay-Balkar when everyone understands Russian and can fulfil themselves in it.

Fortunately, there are activists among us who are popularizing it.

  1. There are translations of films, cartoons, music

  2. Applications and games are being created

  3. Adaptations and translations of applications are being made

  4. Books/poems are being written

  5. Blogs/channels, etc. are maintained.

So are we: Bogdan Teunaev and Ali Berberov. We created the first translator between the Karachay-Balkar and Russian languages.

Data collection and processing

Here we will move from the general to the more technical.

There is quite a lot of digital data in the Karachay-Balkar language, but training a translator requires parallel corpora. So far we have collected about 289 thousand parallel pairs of words, phrases and sentences.

This is the longest stage. We have been collecting the data for about 2.5 years and are still doing so. The sources are very diverse: religious texts, fiction, folklore, children's fairy tales, dictionaries, Soviet reports, cartoons.

We aligned about 95% of the data with a combination of code and manual work (sometimes we paid people a small fee to do the alignment). Along the way, we trained LaBSE to align texts automatically, but the quality was mediocre and we had to filter out most of its output. Even so, it sped up the process considerably and saved us some effort.
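The filtering of automatically aligned pairs can be sketched roughly as follows. This is only an illustration, not our actual pipeline: `embed` is a hypothetical callable standing in for a sentence encoder such as LaBSE, and the `threshold` value of 0.7 is an arbitrary assumption that would need tuning on real data.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_aligned_pairs(
    pairs: List[Pair],
    embed: Callable[[str], Vector],  # hypothetical: any sentence encoder
    threshold: float = 0.7,          # assumed cutoff, not taken from the article
) -> List[Pair]:
    """Keep only candidate pairs whose two sides have similar embeddings."""
    kept = []
    for source, target in pairs:
        if cosine(embed(source), embed(target)) >= threshold:
            kept.append((source, target))
    return kept
```

A stricter threshold discards more good pairs along with the bad ones, so pairs that survive the filter still benefit from a manual check.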

Since the dialects differ little, we brought the parallel corpus to a single base form. This allows us to train one model and then apply it to different dialects.

Model

It all started with a model written in TensorFlow in the R language and about 10 thousand pairs. Then we trained mBART-50, adding new tokens and a new language to the tokenizer.

In the end we settled on NLLB-200, retraining it several times as the corpus grew. In our view it is still state of the art among translation models. It already covered various Turkic languages, notably Crimean Tatar, which made it easy for the model to learn Karachay-Balkar. First, we “introduced” the model to our language (one epoch over the entire corpus), and then we trained it to translate (dropping the dictionaries and keeping only long sentences).
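The two training stages use two different views of the same corpus, which can be sketched like this. It is a minimal illustration under assumptions: the article does not say how “long” was defined, so the `min_words` cutoff of 4 is hypothetical, and counting whitespace-separated words is just one simple way to separate dictionary entries from sentences.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def select_stage_data(
    corpus: List[Pair],
    min_words: int = 4,  # assumed cutoff for a "long" sentence
) -> Tuple[List[Pair], List[Pair]]:
    """Return (stage 1 data, stage 2 data) for a two-stage training run.

    Stage 1 uses the whole corpus, dictionary entries included.
    Stage 2 keeps only pairs where both sides have at least `min_words` words.
    """
    stage1 = list(corpus)
    stage2 = [
        (source, target)
        for source, target in corpus
        if min(len(source.split()), len(target.split())) >= min_words
    ]
    return stage1, stage2
```

Stage 1 would then run for a single epoch over `stage1`, and the translation training would continue on `stage2` only.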

You can't realistically train such a model on a laptop or desktop computer (you can try, but it will take a long time), so you have to pay for server capacity. For example, on an A100 GPU, training on our data took about a day, and that still counts as fast.

As a result, we have a model that translates between Russian and Karachay-Balkar. It has a number of shortcomings: it tries to translate proper names, it often distorts the meaning, and it doesn't handle complex sentences well. All of this can be addressed by enlarging the corpus. It also translates properly only one sentence at a time, a peculiarity of NLLB-200 (it was trained on individual sentences).
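One common way to work around the one-sentence limitation is to split longer text into sentences, translate each separately, and join the results. The sketch below assumes a hypothetical `translate_sentence` callable that wraps the model; the regex-based splitter is deliberately naive and will break on abbreviations and initials.

```python
import re
from typing import Callable, List

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text: str) -> List[str]:
    """Naively split text into sentences, keeping the final punctuation."""
    return [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]


def translate_text(
    text: str,
    translate_sentence: Callable[[str], str],  # hypothetical model wrapper
) -> str:
    """Translate text sentence by sentence and join the results with spaces."""
    return " ".join(translate_sentence(s) for s in split_sentences(text))
```

Because each sentence is translated in isolation, context that crosses sentence boundaries (pronouns, for example) is lost with this approach.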

There is also a translator interface. In addition to the translator itself, it has a built-in dictionary and text-to-speech for Karachay-Balkar text. We did not train the text-to-speech model ourselves; it is an open-source model. Unfortunately, the site does not have its own domain at the moment.

Future plans

  1. Collect a monolingual corpus and train a language model for Karachay-Balkar.

  2. Collect parallel text and speech corpora to train text-to-speech and speech-to-text models. Perhaps this will help those who dub films and cartoons or translate videos.

  3. Add other languages to the translator so that they can be translated into Karachay-Balkar.

  4. Move to a proper domain of our own.

All this takes a huge amount of time (especially data collection) and requires funds. We do it with enthusiasm in our free time. We will always be glad to have people join us to develop our language faster in the digital world and make it about the present and the future.

Our other projects:

  1. Ali made a game called “Terekle”, in which you guess words. It is available on Android and in the browser.

  2. Bogdan is translating Telegram into Karachay-Balkar. You can install the translation by following the link.
