How to save millions of rubles, or translating documentation with neural engines

Hi all!

My name is Alexander Denisov. I work at Naumen and am responsible for documenting and localizing the Naumen Contact Center (NCC) software product.

In this article I will talk about how we automated the translation of our documentation with neural engines, without CAT systems or any other translation tools.

A little about the Naumen Contact Center product

The Naumen Contact Center product is a complex software suite for running contact centers.

We localize the product and translate its documentation into English and German.

How documents are most often translated

This is how a reasonably large company typically translates documentation:

  • It contacts a translation agency and gets in touch with a manager.

  • If the company already has some experience, it prepares a glossary.

  • It emails the manager the glossary and the texts that need to be translated.

  • Almost every translation agency uses a CAT system, into which the manager uploads the texts and the glossary.

  • Next, either a translator translates the texts sentence by sentence right away, or the texts are first pre-translated by a connected neural engine.

  • Then, depending on the quality requirements, proofreading may be performed.

  • Finally, the manager exports the translation results and sends them to the customer company along with the certificate of completion and the invoice, and the company pays for the work.

This process can be represented in a diagram as follows.

Scheme of working with a translation agency

The translator's interface in a CAT system (Computer-Aided Translation) looks like this:

CAT system interface

On the left you can see how the CAT system splits all texts into sentences: the left column contains the source text, the right column the translation. On the right you can see that for the highlighted line the glossary contains, for example, the term “Operator”, which must be translated as “Agent”, and the term “Project”, which must be translated as “Campaign”. And then there is a problematic term like “Dialing”, which has no good English equivalent. In the documentation of contact center vendors such as Cisco or Avaya, the combination “Outbound Dialing Campaign” appears everywhere. Interestingly, this combination contains the word “Campaign”, which complicates things even further. Thus, the translator faces the difficult task of fixing all this terminology manually.

Problem

At first, we translated the documentation into English, and partly into German, with the help of a translation agency, and spent a lot of money on it. But then we didn't update the translations for several years because sales didn't take off. As a result, the translated documentation became badly outdated.

Then a partner comes along and says that the documentation is outdated, but he still wants to sell the product. And here is the actual problem:

  • The average cost of translation is 2-3 rubles per word.

  • 250 words per page.

  • 3,000 – 4,000 pages of complex technical documentation.

  • 2 languages – English and German.

  • That adds up to 3-6 million rubles.

  • A turnaround of six months or more.

  • The quality still needs to be checked.

But we need everything done yesterday, nobody will give us the money, and there are no clients yet. Moreover, money was already given once, and there was no lasting result.

Solution

In this situation there was no way out except to raise the question: can we simply translate everything with a neural engine? Looking ahead, I'll say: yes, we can! Below I will try to explain why this became possible for us and what we have already achieved.

So why did translation with a neural engine become possible?

For many years we were told that the quality of machine translation was not up to par. So why is it good enough now?

  • Firstly, the quality of neural engines improves every year, and at the moment it is comparable to that of an average human translator: yes, in some places it's worse, but in others it's even better.

  • Secondly, we don't need superb quality for our purposes. We need to translate technical documentation, and the main thing is that readers can use it to solve their problems. Minor rough edges don't bother us much. And remember Microsoft: they have been machine-translating their documentation for a long time (the quality, I must say, is often terrible, but still); if Microsoft can do it, why can't we? At this stage we don't even know whether anyone will read our documentation at all or whether it will end up in a drawer.

  • And how do translation agencies actually translate? More and more of them also translate with neural engines, and independent proofreading is needed to verify that. So why not do this initial stage ourselves? That way we know exactly what quality we have and can then decide what to do with it next.

  • As I said above, we need a glossary, which is normally loaded into the CAT system. Recently, translation engines have learned to apply a glossary on the fly and produce a translation that respects it, while CAT systems either cannot pass the glossary to the neural engine yet or are only just learning to do so.

Thus, it was decided that we simply had to try it!

Of course, at first it was interesting to figure out which neural engine to choose and which one is better. So I started testing.

Choosing a neural engine

Initially, DeepL was known to be considered the best, but I wanted to see for myself which engine translates better and which one works better with a glossary.

At first, I simply tested the engines I managed to connect to the CAT system, and it quickly became clear that a glossary could not be used on the neural engine side there.

After that, I started testing engines that can work with a glossary. I chose three: PROMT, Yandex, and DeepL. Below is a small comparison table.

|                           | Yandex                           | DeepL                                            | PROMT                                              |
|---------------------------|----------------------------------|--------------------------------------------------|----------------------------------------------------|
| Type of use               | Cloud                            | Cloud                                            | Deployed on-premise; difficult to implement        |
| Working with the glossary | Fine                             | Great                                            | Okay, but the glossary is not applied to verbs     |
| Payment                   | Cheap; payment by card in rubles | Cheap; there are difficulties paying from Russia | Expensive to implement; free after implementation  |

The difficulty was that, at the time testing started, DeepL could not use a glossary when translating from Russian into German. So initially we decided to do a double translation: from Russian into English with Yandex, and then from English into German with DeepL. But soon DeepL added glossary support for translation from Russian into both English and German, and testing had to start all over again.

Why is a glossary important?

First of all, I was interested in the glossary because, as I said above, we have very specific terminology requirements, so it is very important for us.

In the image below you can see an example of a glossary being used by the neural engines. At the top is a translation in the CAT system without a glossary; below are translations made directly through the Yandex and DeepL APIs.

Thus, you can see that our combination “calling project” is translated as follows:

  • Without a glossary – “calling project”.

  • Yandex with a glossary – “outgoing outbound dialing campaign”.

  • DeepL with glossary – “outbound dialing campaign”.

As you can see, Yandex handles the glossary reasonably well, but in complex cases like ours it effectively duplicates words, which of course has to be corrected manually. DeepL, on the other hand, solves the problem perfectly! We saw the same result with the combination “outgoing calling”: DeepL handled it better too. Thus, we realized that DeepL suits us best.
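For anyone who wants to reproduce this, here is a minimal sketch of a glossary-aware request through the official deepl Python package (pip install deepl). The API key and the glossary entries below are placeholders for illustration, not our real glossary:

import deepl

translator = deepl.Translator("YOUR_DEEPL_API_KEY")  # placeholder key

# Create the glossary once; DeepL stores it server-side.
glossary = translator.create_glossary(
    "NCC glossary (RU-EN)",
    source_lang="RU",
    target_lang="EN",
    entries={
        "оператор": "agent",
        "проект": "campaign",
        "обзвон": "outbound dialing",
    },
)

# Every translate call can then reference the glossary.
result = translator.translate_text(
    "Оператор запускает проект обзвона.",
    source_lang="RU",
    target_lang="EN-US",
    glossary=glossary,
)
print(result.text)  # the translation respects the glossary terms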

Neural engine translation speed

Translation with a neural engine is very fast.

The cost of neural engine translation

Translation with a neural engine is also very cheap: 2-3 rubles per word turns into 0.003-0.004 rubles per word.

  • A full translation: 2-3 million rubles with a translation agency turns into about 2,000 rubles with a neural engine.

  • A two-week incremental update costs a couple of rubles. (A back-of-the-envelope check of these numbers is sketched below.)
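For reference, here is a rough check of the arithmetic in Python; the volumes and per-word rates are the estimates quoted above, and the result lands in the same order of magnitude as the figures given:

pages = 3500               # 3,000-4,000 pages of documentation, midpoint
words_per_page = 250
words = pages * words_per_page           # about 875,000 words per language

agency_rate = 2.5          # rubles per word, the 2-3 ruble average
engine_rate = 0.0035       # rubles per word via the neural engine API

print(f"Agency, one language: {words * agency_rate:,.0f} rubles")  # ~2,200,000
print(f"Engine, one language: {words * engine_rate:,.0f} rubles")  # ~3,000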

But what if you still need excellent quality?

Suppose we have translated everything simply through the neural engine API, but we still need to raise the quality:

  • Then we need to save all the translations and make sure we can load them into a CAT system to hand over to a translator for proofreading. For this, translations can be saved in an XLIFF file (a minimal sketch of this follows the list).

  • Quality can be improved gradually, for example as comments come in. For this, feedback needs to be organized; in my case, I process the comments from our partner.

  • You can also decide up front which of the translated materials must be high quality: build statistics on views of the documentation in the source language and polish only the most-viewed pages.
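A minimal sketch of such saving, assuming XLIFF 1.2 and Python's standard library; the file attributes below are illustrative, not our exact format:

import uuid
import xml.etree.ElementTree as ET

def save_xliff(pairs, path, src_lang="ru", tgt_lang="en"):
    """Write (source, target) paragraph pairs as XLIFF 1.2 trans-units."""
    xliff = ET.Element("xliff", version="1.2")
    file_el = ET.SubElement(xliff, "file", {
        "source-language": src_lang,
        "target-language": tgt_lang,
        "datatype": "plaintext",
        "original": "docs",
    })
    body = ET.SubElement(file_el, "body")
    for source_text, target_text in pairs:
        unit = ET.SubElement(body, "trans-unit", id=str(uuid.uuid1()))
        ET.SubElement(unit, "source").text = source_text
        ET.SubElement(unit, "target").text = target_text
    ET.ElementTree(xliff).write(path, encoding="utf-8", xml_declaration=True)

save_xliff([("Исходный абзац", "Source paragraph")], "translations.xliff")

A CAT system can then import this file, and the translator proofreads the target segments instead of translating from scratch.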

Automation of the translation process

Let's go back to the diagram I showed at the beginning. It turns out that we don't need any of the interaction with the translation agency or the CAT system.

All you need to do is send the text to the neural engine via the API, receive the translation, and save it. So, as an R&D exercise, a Python script was written that performs the following actions (a simplified sketch of its core loop follows the list):

  1. Clones the Git repository containing the project with the source documentation.

  2. Loads an XLIFF file with bilingual texts from the Git repository into memory (on the first translation run the file is empty).

  3. Downloads XLIFF files via the API from Weblate (the tool we use for localization), which can also be useful for the translation.

  4. Splits the entire source text into paragraphs. For each paragraph, the script:
    – Checks whether there is a translation in the XLIFF from Weblate. If there is, it substitutes it.
    – Checks whether there is a translation in the XLIFF from the neural engine. If there is, it substitutes it.
    – If nothing is found among the saved translations, it sends the text and the glossary to the neural engine for translation. The result is substituted into the translated file and saved into the XLIFF from the neural engine.

  5. Saves the translations to the Git repository.
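Here is a simplified sketch of the core loop from step 4. The helper names and the two translation-memory dictionaries are hypothetical stand-ins for the real script's data structures:

def translate_paragraphs(paragraphs, weblate_tm, engine_tm, translate):
    """Resolve each paragraph against the saved translations,
    calling the neural engine only for text never seen before."""
    translated = []
    for para in paragraphs:
        if para in weblate_tm:        # 4a: translation curated in Weblate
            translated.append(weblate_tm[para])
        elif para in engine_tm:       # 4b: cached machine translation
            translated.append(engine_tm[para])
        else:                         # 4c: ask the neural engine
            result = translate(para)  # e.g. a glossary-aware DeepL call
            engine_tm[para] = result  # cache for subsequent runs
            translated.append(result)
    return translated

On an incremental update, only new or changed paragraphs fall through to the paid engine call, which is exactly why a two-week update costs a couple of rubles.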

This can be represented in a diagram as follows.

Scheme of the script

After this, the documentation can be built automatically in all languages at once.

A little about the structure of the documentation project

The documentation is developed in MadCap Flare, and the project structure can be roughly represented as in the image below.

Structure of the documentation project

Each rectangle is a file that has a specific purpose:

  • Target – a file responsible for building the documentation. Each output format has its own Target file, but the same TOC file can be linked to them.

  • TOC (Table of Contents) – the table of contents. It can have a tree structure and include other TOC files; HTML files are attached to the table of contents like leaves on a tree.

  • HTML – topics containing text in HTML format. HTML files can contain Snippets, images (PNG files), and Variables.

  • Snippets – small pieces of text in HTML format. They can be embedded in larger HTML files and can themselves contain other snippets, variables, or images.

  • PNG – images (screenshots or diagrams).

  • Variables – XML files with strings (variables). A variable may contain only unformatted text and can reference another variable. Variables are meant for values that can change dynamically.

The translation script receives a TOC file as a parameter; it walks the whole chain of files, extracts all the texts, and translates them (a minimal sketch of the traversal is shown below).
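A minimal sketch of such a traversal, assuming that Flare's .fltoc files are XML with nested TocEntry elements whose Link attributes point to topics or other TOCs; extraction of snippets and variables is omitted here:

import xml.etree.ElementTree as ET

def collect_topics(toc_path):
    """Recursively collect topic links from a Flare TOC file."""
    topics = []
    root = ET.parse(toc_path).getroot()
    for entry in root.iter("TocEntry"):
        link = entry.get("Link")
        if not link:
            continue
        if link.endswith(".fltoc"):       # a nested TOC: recurse into it
            topics.extend(collect_topics(link.lstrip("/")))
        else:                             # an HTML topic to translate
            topics.append(link)
    return topics

print(collect_topics("Project/TOCs/Main.fltoc"))  # hypothetical path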

Using Variables with Interface Strings

A neural engine cannot guess the correct translation of interface elements. As you can see in the image below, the engine translated the form's name differently from what the screenshot shows.

To solve this problem, instead of interface texts we use variables, which we obtain by converting resource files from the development repository into the MadCap Flare variable format (a rough sketch of such a conversion is given at the end of this section).

The HTML code with the variable in this case looks like this:

<p>В открывшейся форме <MadCap:variable name="PMS/outcallproject.setTemplate.form-title" class="PMSHeaderBlock" /> выберите созданный шаблон.</p>

Thus, texts from the interface always match the interface and the screenshots (as long as those are not outdated).
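A rough sketch of the conversion itself; the flat JSON input and the .flvar layout below are assumptions for illustration, not our exact formats:

import json
import xml.etree.ElementTree as ET

def resources_to_flvar(resource_path, flvar_path):
    """Convert a flat JSON resource file into a Flare variable set."""
    with open(resource_path, encoding="utf-8") as f:
        strings = json.load(f)  # e.g. {"outcallproject.setTemplate.form-title": "Шаблон"}
    varset = ET.Element("CatapultVariableSet")
    for name, value in strings.items():
        ET.SubElement(varset, "Variable", Name=name).text = value
    ET.ElementTree(varset).write(flvar_path, encoding="utf-8",
                                 xml_declaration=True)

resources_to_flvar("interface_strings.json", "Project/VariableSets/PMS.flvar")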

Untranslatable parts of a string

During testing, a problem emerged: the names of services, variables, parameters, and other English words inside sentences can:

  • Influence the context. For example, two otherwise identical sentences containing different variable names could be translated differently, which looks bad, especially when the sentences follow one another, for example in a list or a table.

  • Get corrupted themselves and become invalid. For example, the case of a variable name may change after translation.

To solve this problem, it was decided to use regular expressions to replace everything that must not be translated with placeholders. Thus, text with placeholders is sent to the neural engine, the translation comes back with the placeholders intact, and the placeholders are then swapped back (a sketch of this follows the XLIFF example below):

<trans-unit id="61325d88-5203-11ee-acd9-6255c1ae6d7b">
        <source>
            В конфигурационном файле _plchldrid001_ для параметра _plchldrid003_ установлено значение по умолчанию _plchldrid005_ в семплах (см. раздел _plchldrid007_).
        </source>
        <target>
            In the _plchldrid001_ configuration file, the _plchldrid003_ parameter is set to the default value _plchldrid005_ in samples (see section _plchldrid007_).
        </target>
</trans-unit>
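A minimal sketch of this shielding in Python. The single regular expression below (dotted, code-style names) is a simplification of the real set of patterns, and the placeholder numbering is illustrative:

import re

# Dotted, code-style names such as "runtime.yml" or "history.depth";
# the real script uses a whole set of such patterns.
UNTRANSLATABLE = re.compile(r"[A-Za-z][A-Za-z0-9_]*(?:\.[A-Za-z0-9_]+)+")

def shield(text):
    """Replace untranslatable fragments with numbered placeholders."""
    mapping = {}
    def repl(match):
        key = f"_plchldrid{len(mapping) + 1:03d}_"
        mapping[key] = match.group(0)
        return key
    return UNTRANSLATABLE.sub(repl, text), mapping

def unshield(text, mapping):
    """Restore the original fragments after translation."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

shielded, mapping = shield("Параметр history.depth задан в файле runtime.yml.")
# send `shielded` to the neural engine, then:
# translated = unshield(engine_result, mapping)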

What else can you do when the translation is automated by a script?

Now that the translation is performed by a script, the question arises: what else can be automated? Anything at all! You can add any checks and transformations.

For example, as you know, English has special rules for writing headings: in title case, every significant word starts with a capital letter. It was not difficult to find a library and, after translation, convert the texts inside h1-h6 tags to title case (a sketch follows below). The image below shows how the translation looks in the CAT system and what the script produces automatically.
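A sketch of such post-processing using the third-party titlecase and beautifulsoup4 packages (pip install titlecase beautifulsoup4); the original script may use different libraries:

from bs4 import BeautifulSoup
from titlecase import titlecase

def titlecase_headings(html):
    """Convert the text of all h1-h6 tags to English title case."""
    soup = BeautifulSoup(html, "html.parser")
    for level in range(1, 7):
        for tag in soup.find_all(f"h{level}"):
            # Note: this flattens any inline markup inside the heading.
            tag.string = titlecase(tag.get_text())
    return str(soup)

print(titlecase_headings("<h1>creating an outbound dialing campaign</h1>"))
# -> <h1>Creating an Outbound Dialing Campaign</h1>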

Side benefits

During translation and while checking the results, we periodically find errors in the source materials:

  • Incomprehensible text in the translation may simply mean it was poorly written in the source. In that case we immediately create a task to edit the source text and don't touch the translation. After the source changes, we simply run the translation again.

  • Incorrect text in the translation may be caused by poorly chosen or incorrectly translated terms. In that case we refine the terminology in the source texts and expand the glossary.

  • Sometimes, while parsing the project structure, the script finds various other problems that we also fix, for example:
    – Broken links to topics, images, snippets, and variables.
    – Use of “prohibited” tags in the HTML.

Conclusions

To sum up, about 99% of the translation work has been automated. Yes, problems occasionally come up that have to be fixed manually, but in any case they are insignificant compared to the amount of work performed automatically.

At the moment, the vast majority of the documentation has been translated, and partners are already using it.

There are many more ideas for what else can be automated, some of which we are already trying out; I will try to cover them in future articles.
