LangBar++: automatic correction of the keyboard layout of typed text using Hunspell dictionaries

In the previous article about LangBar++, a program for manually correcting the layout and converting typed text, I said that automatic layout correction was “no good” and that I was ideologically opposed to such dominance of the machine over the spirit. But the question remained and kept nagging at me. If the machine is going to win anyway, why shouldn’t I contribute to the best of my ability?

My first thought was how to do this as simply as possible and without unnecessary bloodshed. And if I was going to build something, it would not be a copy of Punto Switcher but something fairly universal, usable with more than just the English-Russian pair.

I had no idea how such programs work, but it gradually became clear that automatic layout correction rests on three things:

  1. A hook that intercepts keystrokes

  2. Parallel recording of the text in two languages: the current one and the alternative one, i.e. the text that would have been entered had the other layout been active

  3. Checking these two text streams against the corresponding dictionaries and, if there is no match in the current layout but there is one in the alternative layout, correcting the layout by erasing the old text and entering the new one
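
As a rough illustration, the decision in step 3 can be sketched in Python. The word sets below are toy stand-ins for real Hunspell dictionaries, and the name should_switch is hypothetical, not taken from LangBar++:

```python
# Toy word sets standing in for real Hunspell dictionaries (assumption).
DICTIONARIES = {
    "en-US": {"hello", "world", "tot"},
    "ru-RU": {"привет", "мир", "еще"},
}


def should_switch(current_text, alternative_text, current_lang, alternative_lang):
    """Return True when the layout should be corrected.

    The rule follows step 3: no match in the current layout
    and a match in the alternative one.
    """
    in_current = current_text.lower() in DICTIONARIES[current_lang]
    in_alternative = alternative_text.lower() in DICTIONARIES[alternative_lang]
    return (not in_current) and in_alternative


# "ghbdtn" is what the keys for "привет" produce in the English layout.
print(should_switch("ghbdtn", "привет", "en-US", "ru-RU"))  # True
# "tot" is itself an English word, so its Russian twin "еще" does not win.
print(should_switch("tot", "еще", "en-US", "ru-RU"))  # False
```

The sketch only covers the verdict; intercepting keys and retyping the text are left out.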

The program already had a hook, used for manual conversion of typed text. The second point was harder. The Russian-English pair is the most trivial case, since English and Russian keys correspond one to one, although the exact correspondence depends on the layouts in use. With languages like French or German, things get much more complicated: dead keys appear, along with diacritics typed with the right Alt key. In addition, the layout itself changes the virtual key codes, so the ground is pulled out from under your feet.
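
For the trivial English-Russian case, the one-to-one correspondence can be written down as a plain translation table. This is a minimal sketch that assumes the standard US QWERTY and Russian ЙЦУКЕН layouts and handles lowercase keys only; the function names are illustrative:

```python
# Lowercase keys of the US QWERTY layout and the characters the same
# physical keys produce in the standard Russian (ЙЦУКЕН) layout.
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,.`"
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбюё"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def to_russian_layout(text):
    """Text the same keystrokes would give in the Russian layout."""
    return text.translate(EN_TO_RU)


def to_english_layout(text):
    """Text the same keystrokes would give in the English layout."""
    return text.translate(RU_TO_EN)


print(to_russian_layout("ghbdtn"))  # привет
print(to_english_layout("руддщ"))   # hello
```

A fixed table like this is exactly what stops working once dead keys and right-Alt combinations come into play.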

Things were no clearer with dictionaries. Relatively simple dictionaries seemed sufficient; fortunately, the decision to convert is mostly made within the first three or four characters. For the initial experiments I decided to take Hunspell dictionaries, which are lists of words, strip out everything related to affixes (that is, to the formation of word forms), and use what was left.

After the first experiments with the English-Russian pair, it became clear that even with fragments of the first three, four, five, or at most six characters, words on which the automation should trigger were being missed. Unpacking the Russian Hunspell dictionary of 150,000 lines yields 1.3 million word forms! So Hunspell itself has to be used; there is no getting around it.

Here it is worth mentioning the advantages of using Hunspell in software. Designed as a spell checker, it can check the final form of a word and also offer suggestions for misspelled or incomplete words. We submit the entered characters as they are and, if the word is incomplete, also check the variant with an asterisk * at the end and extract the relevant suggestions. All that remains is to filter them, and the verdict “suitable or not” is in our hands.
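
The “complete word or plausible beginning of a word” verdict can be illustrated without Hunspell at all. The sketch below does not call the Hunspell API; it uses a small sorted word list and binary search as a stand-in, and the name classify is hypothetical:

```python
from bisect import bisect_left

# Toy sorted word list standing in for a dictionary (assumption).
WORDS = sorted(["hello", "help", "helmet", "world"])


def classify(fragment, words=WORDS):
    """Return 'word', 'prefix' or 'none' for the typed fragment.

    In a sorted list, all words starting with the fragment sit together,
    beginning at the first entry that is >= the fragment.
    """
    i = bisect_left(words, fragment)
    if i < len(words) and words[i] == fragment:
        return "word"
    if i < len(words) and words[i].startswith(fragment):
        return "prefix"
    return "none"


print(classify("help"))  # word
print(classify("hel"))   # prefix
print(classify("helx"))  # none
```

With real Hunspell the suggestions come back as a list that still has to be filtered, as described above.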

A separate point is the benefit of Hunspell for languages that use diacritics and dead keys. The dictionaries were compiled for our chaotic, liberal times and allow the user the greatest possible latitude. If you want to use diacritics, umlauts and the rest, go ahead; if not, your simplified text will still be checked, as long as it stays within known limits. For example, characters with diacritics that have their own keys often cannot be simplified. At the same time, replacing ё with е is acceptable in Russian but not in Belarusian. Hunspell dictionaries capture the generally accepted notions of this user freedom and its boundaries.

Hence the fundamental principle of the program: we track text input in two languages, submit these parallel texts to Hunspell together with their duplicates stripped of diacritics (there are few of them in the dictionaries!) and check them. The converted text, however, is produced by replaying the physical (!) keystrokes in the new layout, not from these possibly inaccurate attempts to determine the alternative text. So in any case, the user gets exactly what they typed.

This also solves the problem of dead keys: detecting them programmatically is not difficult, so we only need to exclude their presses and feed Hunspell characters without diacritics. This way we can do without configuration files describing the peculiarities of each layout (and there are many layouts, even among the official ones).
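
Producing a duplicate “devoid of diacritics” can be sketched with Unicode decomposition. This is an assumption about one possible approach, not the program's actual code, and the name strip_diacritics is illustrative:

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("café"))    # cafe
print(strip_diacritics("Müller"))  # Muller
print(strip_diacritics("garçon"))  # garcon
```

Note that blanket stripping is cruder than what the dictionaries allow: it also turns ё into е and й into и, so deciding which simplified form is acceptable is still left to the Hunspell check for the language in question.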

The result should be a universal tool for correcting the layout of typed text between sufficiently different languages. For example, the Russian-Belarusian and Russian-Ukrainian pairs work only moderately well because of similarities in vocabulary and layouts, whereas English-Russian and French-German work without problems, apart from some shared or newly coined words in the latter case. Hunspell dictionaries exist for almost a hundred languages. I am not enough of a polyglot to verify this, but given sufficient lexical differences, everything should work for the languages of Europe (plus America, where I come from) and the former USSR.

Another feature of the program is single-language mode. You type, say, an English word in the Russian, Ukrainian or Belarusian layout, and it is corrected automatically by switching to the English layout. The reverse conversion is not performed in this mode! What matters is that you have dictionaries for each of these languages, otherwise there will be many false positives (the program checks that the dictionaries are present).

The program settings are quite simple:

There are sound and on-screen notifications, the ability to undo a conversion, and an option to process words only from three characters up (to avoid false positives). You can also turn off the processing of abbreviations and of single letters (Punto Switcher does not process single letters; here it is possible, in the worst case with minimal editing of the dictionary). You can set word boundaries explicitly, and you can suppress auto-conversion after certain text editing actions.

Whether automation is enabled is shown by the tray icon and by the checkbox next to the text cursor. There is also a tool for enabling or, conversely, disabling automation in the windows of specific programs.

Now about extending the vocabulary, which is probably imperfect. Back when plain text dictionaries were still the plan, I created a tool for building them from whatever material is at hand. You load one or more text files; the tool splits them into words, filters them according to a chosen rule and produces a dictionary with one word per line. The tool is called Dictionary converter and looks like this:

To use it, just drag the file(s) onto the button or paste the contents from the clipboard. You can simply split a text containing the vocabulary you need into words and add the result to the main dictionary by placing the file in the appropriate directory. Duplicates of common words are absorbed when the dictionary is loaded, and everything unique takes part in text processing. By turning on the appropriate filter you can also get a dictionary of abbreviations, and so on.
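
What such a converter does can be approximated in a few lines. This sketch is my own reconstruction of the idea, not the tool's code; the minimum-length filter and the name build_word_list are assumptions:

```python
import re


def build_word_list(text, min_length=3):
    """Split text into words, filter and deduplicate, one word per line."""
    # [^\W\d_]+ matches runs of letters in any alphabet.
    words = re.findall(r"[^\W\d_]+", text.lower())
    unique = sorted({w for w in words if len(w) >= min_length})
    return "\n".join(unique)


print(build_word_list("The cat and the dog, the CAT!"))
# and
# cat
# dog
# the
```

Sorting is only for readability here; as noted above, duplicates would be absorbed on loading anyway.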

Question: so does this work out of the box?
Answer: in general, yes. With the six languages that ship with the program (Russian, Belarusian, Ukrainian, English, French, German), everything should work without problems, with the caveat that the first three are similar to each other, and apart from rare coincidences that interfere with automatic layout correction.

To catch them, the program can visualize its work in a tooltip that shows the words found in the main and alternative layouts and, in brackets, the matching dictionary entries:

en-US  tot tot (tot, toto)
ru-Ru  еще еще (еще)

This is a case with English and Russian layouts, showing why the character sequence tot is not converted to the word еще (to fix this, comment out the word tot in the English dictionary). Such cases are few and concern only short words. In all the time I have worked with the English-Russian pair, I have come across half a dozen such words.

What else? There is an exceptions dictionary where you can enter troublesome words (I have not used it so far), a history of cancelled auto-conversions, and simple statistics on the automation, which I used when fine-tuning the program:

The result is a development of the ideas of “Punto Switcher and Co.”, but open, more customizable, extensible and multilingual, combined with much better visualization, powerful manual text conversion and the ability to adapt to a variety of layouts.

You can try it out
