Our experience in creating a contextual translator

Project selection

It all started in the fall of 2017, when we already had experience in developing web applications. We were looking for a project that satisfied the following conditions:

  1. Can be done with a small team.

  2. A proven idea with great growth potential.

  3. We understand how to do it better.

  4. Reasonable time to develop it and reach self-sufficiency.

  5. No problems with copyright holders or the law.

We settled on building a contextual translator, because at that moment it met all of these conditions:

  1. We already had a team then.

  2. We found context.reverso.net and linguee.com. The main idea, in our opinion, is to show examples of the searched word or phrase in use, together with their translations in the context of a sentence. A second advantage is that phrases themselves get translated, which is a big plus compared to most classical dictionaries. These sites were already doing well then. Reaching even a tenth of their traffic would be a success. Even if we don’t surpass our competitors, we’ll take some market share, and that’s enough for us.

  3. We will make it better by adding new content and more language pairs. We will extract new, unique content by parsing sites that have versions in several languages and PDF documents with the same text in different languages. There are many languages in the world, but competitors cover only the main ones, so there is room to expand. Data exists for many languages, at least in the form of subtitles. In the future, we could look into translating whole texts.

  4. We’ll do it quickly, because we already know how to find occurrences of words in a sentence, the data is publicly available, and we’ll learn how to highlight translations neatly.

  5. The data, at that time, was mainly the UN corpus, Europarl and subtitles. Since the sites mentioned above have no problems using them, we should be fine too.

As I already said, development began in the fall of 2017, while we were still finishing old projects. After three months it became clear that we could not do it quickly; or rather, what we had built did not satisfy us in terms of quality and speed. We launched about a year later. By the time of launch, six people were working on the project.

What we worked on

1. Volume and quality of parallel sentences

Parallel sentences are an original sentence and its translation. A collection of such sentence pairs is a parallel corpus.

While the quality of the UN and Europarl corpora can be trusted, subtitles and other sources have problems: incorrectly constructed sentences, poor-quality translations, obscene language, and even words spelled with characters from different alphabets, so that a word looks normal but cannot be found by search. Parsing new examples, filtering them and matching them up amounts, one might say, to creating your own parallel corpus. We also worked on aligning subtitles in different languages (alignment means matching phrases with their translations in another language). Later we abandoned parsing new data and aligning subtitles ourselves, because enough data began to appear in open sources in already processed form, and we simply did not have time to process everything. For some translation directions the database exceeds 300 million sentences. Such a large volume complicates and slows down changes. Given the large number of directions (78 as of today), we have to maintain a balance between the size of the corpora, the number of directions and the speed of rolling out updates.
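The mixed-alphabet problem mentioned above can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the filter used in the project: the function names are hypothetical, and the script of a character is approximated by the first word of its Unicode name (for example "LATIN" or "CYRILLIC").

```python
import unicodedata


def scripts_in_word(word: str) -> set:
    """Return approximate script names for the letters in `word`.

    Assumption: the first token of a character's Unicode name
    (e.g. "LATIN SMALL LETTER A" -> "LATIN") identifies its script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def has_mixed_script(word: str) -> bool:
    """True if the letters of `word` come from more than one script."""
    return len(scripts_in_word(word)) > 1


# "\u0441" is the Cyrillic letter "es", which looks like Latin "c".
suspicious = "\u0441hair"
print(has_mixed_script(suspicious))  # True
print(has_mixed_script("chair"))     # False
```

A word flagged this way looks fine on screen but will not match a query typed on a single keyboard layout, so such sentences can be dropped or normalized before indexing.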

2. The quality of matching words and phrases to each other in parallel sentences

Processing corpora of parallel sentences produces a table of correspondences between source words and their translations. With some extra effort, phrases can be matched to each other as well. In practice this took a lot of tinkering: we adapted the phrase-matching algorithm for each language, adding rules that take into account the characteristics of the language and its grammatical structure.

In the correspondence table we also store references to the sentences in which each correspondence occurs. When a user submits a query, this table lets us quickly display translation options and highlight them in the example sentences. Thus, the user can choose the desired translation option based on the context in which the phrase is used.

If the same translation of a phrase occurs several times, we can conclude that the translation is correct. The main caveat is not to rely on data scraped from sites with uniform content of the same type that was translated by an automatic translator.
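The correspondence table described above can be sketched in Python as follows. This is a simplified illustration, not the project's storage format: it assumes the phrase pairs have already been extracted by an aligner, and the names `build_table`, `translations` and `min_count` are hypothetical.

```python
from collections import defaultdict


def build_table(aligned):
    """Build source phrase -> translation -> list of sentence ids.

    `aligned` is an iterable of (sentence_id, source_phrase, target_phrase)
    tuples, assumed to come from an earlier alignment step.
    """
    table = defaultdict(lambda: defaultdict(list))
    for sentence_id, source, target in aligned:
        table[source][target].append(sentence_id)
    return table


def translations(table, query, min_count=2):
    """Return (translation, sentence_ids) pairs, most frequent first.

    Translations seen fewer than `min_count` times are dropped,
    on the idea that a repeated translation is more likely correct.
    """
    options = table.get(query, {})
    ranked = sorted(options.items(), key=lambda item: len(item[1]), reverse=True)
    return [(target, ids) for target, ids in ranked if len(ids) >= min_count]


aligned = [
    (1, "take off", "décoller"),
    (2, "take off", "décoller"),
    (3, "take off", "enlever"),
    (4, "take off", "décoller"),
]
table = build_table(aligned)
print(translations(table, "take off"))
# [('décoller', [1, 2, 4])]
```

The stored sentence ids are what make it possible to show each translation option together with the examples it came from. A frequency threshold alone does not protect against machine-translated sources, since those repeat the same output consistently.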

3. Search speed and layout of results

To increase speed, we optimized the storage format of the correspondence table (queries, translations and related information). We also configured caching.

When the user’s entire query cannot be found in the sentences we have for displaying translation examples, we break it into smaller fragments (subphrases). Depending on the query, in addition to the correspondence table, we use Solr to retrieve the most relevant examples. Unfortunately, Solr is not always stable, but we have learned to live with it.
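Breaking a query into subphrases could look roughly like the Python sketch below. It is an assumption-laden illustration: the real fallback logic is not described in detail, `known` stands in for whatever lookup the correspondence table provides, and the function names are hypothetical.

```python
def subphrases(query, min_words=1):
    """Yield contiguous word fragments of `query`, longest first.

    The full query itself is not yielded; it is assumed to have
    been tried already.
    """
    words = query.split()
    n = len(words)
    for length in range(n - 1, min_words - 1, -1):
        for start in range(n - length + 1):
            yield " ".join(words[start:start + length])


def longest_known_fragments(query, known):
    """Return the longest fragments of `query` present in `known`."""
    if query in known:
        return [query]
    found = []
    best_length = 0
    for fragment in subphrases(query):
        length = len(fragment.split())
        if length < best_length:
            break  # only shorter fragments remain
        if fragment in known:
            found.append(fragment)
            best_length = length
    return found


known = {"look forward", "forward to", "look"}
print(longest_known_fragments("look forward to seeing", known))
# ['look forward', 'forward to']
```

Trying longer fragments first keeps as much of the user's context as possible before falling back to single words.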

After receiving sentences with examples, we sort them by relevance, and for long sentences we make snippets (we hide the parts of the sentence that matter little for the current query).
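Snippet creation for long sentences can be illustrated with a minimal Python sketch. It assumes a simple word-window approach around the matched phrase; the function name `make_snippet` and the `context` parameter are hypothetical, and the actual rules for deciding which parts are insignificant may well be more elaborate.

```python
def make_snippet(sentence, phrase, context=3):
    """Keep `context` words on each side of `phrase`, hide the rest.

    Matching is case-insensitive and ignores trailing punctuation.
    If the phrase is not found, the sentence is returned unchanged.
    """
    words = sentence.split()
    target = [w.lower() for w in phrase.split()]
    if not target:
        return sentence
    normalized = [w.lower().strip(".,;:!?") for w in words]
    for i in range(len(words) - len(target) + 1):
        if normalized[i:i + len(target)] == target:
            start = max(0, i - context)
            end = min(len(words), i + len(target) + context)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return sentence


long_sentence = (
    "After a long delay on the runway the plane finally managed "
    "to take off in heavy rain and strong wind"
)
print(make_snippet(long_sentence, "take off", context=2))
# ... managed to take off in heavy ...
```

Short sentences pass through unchanged, so only the long ones are trimmed.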

4. Website: user friendliness, ease of indexing for search engine bots

We tried the display format used in classical dictionaries, where usage examples are listed one after another under each translation option. This worked well for individual words, but not for phrases. In the end, so as not to confuse users, we decided to keep the previous layout.

Most of the visitors to such sites come from search engines. Therefore, in addition to working on ease of use, a lot of time and effort was spent on thinking through, implementing and testing various internal linking schemes.

5. Copyright holders

Questions from copyright holders came up after all. Requests to remove content came from representatives of brands (interestingly, they also write to filmmakers asking that the word “jacuzzi” not be used in films) and from people who do not want their last names, or even first names, used anywhere. There were also some interesting cases. For example, a letter came with a request to show the source of a sentence: a man had found mentions of relatives he was looking for. We try to indicate the exact source of translation examples, for example a specific movie, PDF file or URL. In this case, however, the source was missing through our own oversight; we corrected it and sent him a link to the source. I hope the relatives were found.

Why not use Google Translate or ChatGPT?

We do not set ourselves the task of translating an entire sentence, as Google Translate does. With neural networks, you just have to trust the translation. Don't trust it? Then you will have to correct it yourself, if you have the expertise, or check it against real-life examples, which you will find on our website :).

In addition, a contextual translator can take into account the grammatical and syntactic features of the source and target languages, which allows for a more natural and accurate translation. By looking at the context in which the source word or phrase was translated, you can get a more accurate result, since you will understand in which situations a particular expression can be used.

Development path

Traffic on the project began to grow noticeably around 2020, accelerated in mid-2021, and peaked in July 2022 at about 600 thousand people per day. In August 2022 the trend reversed. As I said above, our project, the sites we took as a model, and others like them depend heavily on traffic from search engines, which can change dramatically in a short period of time. Of course, we asked ourselves why this could be happening, why some are luckier than others, and where the people who used context-based sites are now. There are no exact answers to these questions. Perhaps for many people ordinary dictionaries from reputable publishers, with translations and/or explanations of the word in the same language, are enough. Perhaps a translation from artificial intelligence is enough; search engines now often answer such queries themselves.

Since its creation, the site has acquired additional functionality: search for synonyms, conjugations and declensions, grammar exercises based on real-life sentences, and a Telegram bot. We also released an Android app. Several services were spun off from the project: a word finder for games and crosswords, a site for finding rhymes based on pronunciation, and a service for searching for words or phrases within sentences, where you can additionally specify the parts of speech used alongside the original query.

Where to go next

At the moment, focusing on the quality of sentences and adding new translation directions does not pay for itself. We have launched additional related services; let’s see whether they are in demand. Should we emphasize learning and add more exercises and tests based on sentences from different corpora?

I will be glad to receive feedback, ideas and proposals for cooperation.
