One small feature of the Russian language

For some reason, experienced colleagues don't like to talk about this “feature”, and your first encounter with it in a project guarantees sleepless nights, with foreheads and keyboards smashed against the wall. Read on and take care of your nerves; they say nerves don't recover.

Let's start abruptly, with this simple rhyme:

A nightingale and a fox are in the forest, building a house in the canopy.

We take this line and insert it entirely into the code in your favorite programming language, then try to find, for example, the word “forest”.

This time, the favorite language is Kotlin:

fun main() {
    val quote = "cоловей c лиcой в леcу, cтроят домик навеcу"
    println("найдено: ${quote.contains("лес")}")
}

Running the code above, we get a surprising result:

Suddenly. Sometimes your eyes lie to you.


But of course one example will not be enough for understanding, so here is another one:

Letter B – Bactrian camel,
He is big and very proud.
A camel has two humps,
And the letter B has two of them.

Let's check if the verse above contains the word “two”:

fun main() {
    val quote = "Буква B — Beрблюд двугopбый,\n" +
                "Он большой и очень гopдый.\n" +
                "У вeрблюдa двa горбa,\n" +
                "И у буквы B их двa."
    println("найдено: ${quote.contains("два")}")
}

And... another letdown:

Well, how is that possible?


Maybe it's a bug in the compiler or… in the Kotlin language itself (at the concept level, yeah)!? Who knows what your modern, clumsy “pogromists” have stuffed in there!

I thought so too (no, not really) and reached for the axe, good old C++:

#include <iostream>
#include <string>
int main(int argc, char **argv) {
    std::string quote = "Буква B — Beрблюд двугopбый,"
                        "Он большой и очень гopдый."
                        "У вeрблюдa двa горбa,"
                        "И у буквы B их двa. ";

    // search for the word typed entirely in Cyrillic
    if (quote.find("Верблюд") != std::string::npos) {
        std::cout << "Нашлось!" << std::endl;
    } else {
        std::cout << "неа" << std::endl;
    }
    return 0;
}

And... no, it does not work:

When even an axe didn't help.


Well, okay, apparently poetry is not my thing. I need something more scientific and meaningful.

For example, here is a quote from Wikipedia:

The All-Russian Classifier of Objects of Administrative-Territorial Division (abbr. OKATO, the all-Russian classifier of administrative-territorial entities) is a classifier of the objects of administrative-territorial division of the Russian Federation and is part of the “Unified System of Classification and Coding of Technical, Economic and Social Information of the Russian Federation” (ESKK). OKATO is designed to ensure the reliability, comparability and automated processing of information broken down by administrative-territorial division in areas such as statistics, economics and others.

This time we'll set aside all development environments and compilers (“they're lying to you anyway” (c)), simply open “developer mode” (the F12 key) in our favorite Chrome browser, and type into the JavaScript console:

let quote = "Общероссийский классификатор объектов административно-территориального деления (сокр. OKАТO — общероссийский классификатор административно-территориальных образований) — классификатор объектов административно-территориального деления Российской Федерации, входит в состав «Единой системы классификации и кодирования технико-экономической и социальной информации Российской Федерации» (ЕСКК).";

then we add the search condition:

quote.includes('ОКАТО')

And... it isn't found:

And my favorite browser didn't help either, oh horror!


Now we write the second part of the quote in the same console:

let quote2 = "ОКАТО предназначен для обеспечения достоверности, сопоставимости и автоматизированной обработки информации в разрезах административно-территориального деления в таких сферах, как статистика, экономика и другие.";

And one more check:

quote2.includes('ОКАТО')

And… it suddenly works:

"Well, how is that possible?" Part two.


So, do you still want to “get into IT” and become a programmer? Maybe driving a tractor and chopping wood isn't such a bad idea?

How does it work?

If you have small children, show them the two pictures below and ask them to find the similarities and differences. They will do it very quickly 😉

Here is the first one:

And the second:

As soon as the child pokes a finger at a couple of squares, the source of the problem will dawn on you in the most natural way. If there are no children at hand, I will explain it in my own words, although that will not be as effective:

Historically, some characters in the modern English and Russian alphabets ended up looking almost identical, but technically they are different characters.

In standard office fonts you will see no visual difference at all; only a deliberately stylized font reveals the differences:

Crooked 'c' - from the English alphabet


Now about the technical part.

Computers, as we know, operate on numbers, not strings. Each character in a string has its own numeric code, and every string comparison or search works on those codes.

Take a look:

99 is the code for the Latin character 'c', 1089 is the code for the Cyrillic character 'c'


The first character 'c' is Latin, the second 'c' is Cyrillic. Visually they are twin brothers, but the codes are different.
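The numbers above are easy to reproduce. Here is a minimal Kotlin sketch; the helper name `codeOf` is made up for illustration, and the Cyrillic letter is written as the escape `\u0441` so that it cannot be confused with its Latin twin in the listing:

```kotlin
// Returns the numeric (UTF-16) code of a character.
fun codeOf(ch: Char): Int = ch.code

fun main() {
    val latinC = 'c'          // Latin small letter c, U+0063
    val cyrillicEs = '\u0441' // Cyrillic small letter es, U+0441

    println("Latin '$latinC' code: ${codeOf(latinC)}")            // 99
    println("Cyrillic '$cyrillicEs' code: ${codeOf(cyrillicEs)}") // 1089
    println("equal: ${latinC == cyrillicEs}")                      // false
}
```

The comparison on the last line is exactly what `contains` does for every character of the string, which is why the search fails.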

This is precisely why a head-on comparison doesn't work:

Do you see the visual difference between 'c' and 'c'? There is one.


There is a very simple way to check – recoding suspicious text into pure ASCII:

Recoding will immediately highlight the problem, since only Latin characters will remain after it. But unfortunately, it is not always possible to use this approach:

Most text input occurs in form fields in a browser or application, so it is impossible to implement such a check there in some universal way.
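Where you do control the text, the ASCII recoding check described above might look roughly like this in Kotlin. This is a sketch under assumptions: it targets Kotlin on the JVM, where encoding to US-ASCII replaces every unrepresentable character with '?', and the function name `toAsciiView` is invented for the example.

```kotlin
// Re-encodes the text as US-ASCII and decodes it back.
// On the JVM every character that ASCII cannot represent becomes '?',
// so only genuine Latin/ASCII characters survive.
fun toAsciiView(text: String): String =
    String(text.toByteArray(Charsets.US_ASCII), Charsets.US_ASCII)

fun main() {
    // The first word of the rhyme: a Latin 'c' followed by six Cyrillic
    // letters, written as escapes to keep the two alphabets apart.
    val suspicious = "c\u043E\u043B\u043E\u0432\u0435\u0439"
    println(toAsciiView(suspicious)) // c??????
}
```

A letter that survives in the middle of a supposedly Russian word is the intruder.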

How serious is this?

All computers used in the Russian Federation have at least two input languages, Russian and English, and users switch between them as they work. As a result, data entry constantly happens in two languages.

Wherever forms are filled out and users type input, the look-alike character problem appears.

Most often, mistakes are made with the symbols 'c' and 'c', since the same key on the keyboard is responsible for them, much less often with all the others.

Not only users make mistakes, but also the developers themselves:

The Russian 'c' got into the name of the JPA entity field and went to the database in this form.


As you can see, the problem is widespread and serious. If text with a wrong 'c' gets into a search index, for example, it breaks your search: such a line simply never appears in the results, and nothing looks wrong either technically or visually.
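One way to catch such text before it reaches an index is to flag words that mix the two alphabets. The sketch below is not from the original article: the function name `findMixedScriptWords` is hypothetical, and it relies on the JVM's `Character.UnicodeScript`, so it assumes Kotlin on the JVM.

```kotlin
import java.lang.Character.UnicodeScript

// Returns the whitespace-separated words that contain both Latin and
// Cyrillic letters. Punctuation stays attached to the word it follows.
fun findMixedScriptWords(text: String): List<String> =
    text.split(Regex("\\s+")).filter { word ->
        val scripts = word.filter { it.isLetter() }
            .map { UnicodeScript.of(it.code) }
            .toSet()
        UnicodeScript.LATIN in scripts && UnicodeScript.CYRILLIC in scripts
    }

fun main() {
    // The first words of the rhyme, with every look-alike typed as a
    // Latin 'c' and the Cyrillic letters written as escapes.
    val line = "c\u043E\u043B\u043E\u0432\u0435\u0439 c " +
        "\u043B\u0438c\u043E\u0439 \u0432 \u043B\u0435c\u0443"
    // Prints the three mixed-script words; the lone Latin "c" is not flagged.
    println(findMixedScriptWords(line))
}
```

The limitation is visible in the usage example: a word typed entirely in the wrong alphabet, such as a single-letter preposition, contains only one script and passes unnoticed.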

Autocorrect

I wrote a small class in Kotlin to automatically replace similar characters — as a simple solution to the problem described. It can be easily adapted to your realities.

It is published as a gist on GitHub; the code looks like this:

package com.x0x08.yoba

/**
 * Finds visually similar Latin characters and replaces them with Cyrillic ones:
 * 'c' -> 'с' and others.
 */
class Matcher {
    /**
     * Finds look-alike Latin letters and replaces them with Cyrillic ones.
     * @param input
     *          the incoming string
     * @return
     *         the string with the characters replaced
     */
    fun replaceSimilarRuEnChars(input: String): String {
        val chars = input.toCharArray()
        for (i in chars.indices) {
            val c = chars[i]
            // the lowercase character is used as the dictionary key
            val cLow = c.lowercaseChar()
            // dictionary lookup
            if (RU_EN_MATCH.containsKey(cLow)) {
                // replacement
                chars[i] = RU_EN_MATCH[cLow]!!
                // if the original character was uppercase, make the replacement uppercase too
                if (Character.isUpperCase(c))
                    chars[i] = chars[i].uppercaseChar()
                // special case: Latin capital 'B' looks like Cyrillic capital ve (U+0412),
                // not like the capital soft sign that the lowercase mapping would produce
                if (c == 'B')
                    chars[i] = '\u0412'
                println("найден ASCII символ: '$c' , заменен на: '${chars[i]}'")
            }
        }
        return String(chars)
    }
    companion object {
        // dictionary of replaceable characters: Latin key -> Cyrillic look-alike
        private val RU_EN_MATCH: MutableMap<Char, Char> = HashMap()
        init {
            RU_EN_MATCH['c'] = 'с'
            RU_EN_MATCH['b'] = 'ь'
            RU_EN_MATCH['o'] = 'о'
            RU_EN_MATCH['p'] = 'р'
            RU_EN_MATCH['x'] = 'х'
            RU_EN_MATCH['m'] = 'м'
            RU_EN_MATCH['h'] = 'н'
            RU_EN_MATCH['e'] = 'е'
            RU_EN_MATCH['t'] = 'т'
            RU_EN_MATCH['k'] = 'к'
            RU_EN_MATCH['a'] = 'а'
        }
    }
}

fun main() {
    val m = Matcher()

    println("Тест 1")

    var quote = "cоловей c лиcой в леcу, cтроят домик навеcу"
    println("найдено: ${quote.contains("лес")}")

    quote = m.replaceSimilarRuEnChars(quote)
    println("теперь найдено: ${quote.contains("лес")}")

    println("Тест 2")

    quote = "Буква B — Beрблюд двугopбый,\n" +
            "Он большой и очень гopдый.\n" +
            "У вeрблюдa двa горбa,\n" +
            "И у буквы B их двa."
    println("найдено: ${quote.contains("два")}")

    quote = m.replaceSimilarRuEnChars(quote)
    println("теперь найдено: ${quote.contains("два")}")
}

The tests are exactly the same poems given at the beginning of the article.

Now everything works.


Enjoy it in good health.

Other languages

Surprisingly, other European languages don't have a similar problem: the letters of French, German or Spanish that look like Latin letters have the same codes as the Latin ones, so recoding to ASCII destroys only the small, language-specific part of the text:

Example with text in French.



And in German, only one symbol was killed.
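The same experiment is easy to repeat in code. In this sketch the two sample sentences are my own, not the ones from the original screenshots, the accented letters are written as escapes, and the helper `toAsciiView` is a made-up name; it again assumes Kotlin on the JVM, where unrepresentable characters become '?':

```kotlin
// Re-encodes the text as US-ASCII; characters outside ASCII become '?'.
fun toAsciiView(text: String): String =
    String(text.toByteArray(Charsets.US_ASCII), Charsets.US_ASCII)

fun main() {
    val french = "Le caf\u00E9 est ferm\u00E9" // two accented letters
    val german = "Die Stra\u00DFe ist lang"    // one sharp s

    println(toAsciiView(french)) // Le caf? est ferm?
    println(toAsciiView(german)) // Die Stra?e ist lang
}
```

Only the accented letters and the sharp s are lost; every other letter is plain ASCII and survives unchanged.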

This is a peculiarity of the “great and mighty” Russian language, one that arose from historical reasons and circumstances.

P.S.

This is a slightly edited version of the article; the trashy original is available on our blog.

0x08 Software

We are a small team of IT industry veterans. We build and refine a wide variety of software, and our software automates business processes on three continents, across a wide range of industries and conditions.

We bring the long dead back to life, fix what never worked and create the impossible, then talk about it in our articles.
