The simplest algorithm for breaking a word into syllables

One day, during a practical session on programming languages, I was bored and glancing through the list of students in the group. My eye caught the stress mark in the surname Lemzekov, which I had added for myself after mispronouncing the student's name. I read the surname in my head syllable by syllable, and a question arose: “what algorithm does the brain use to break words into syllables?” For some reason, intuition gives “Lem-ze-kov” rather than “Le-mze-kov” or “Lem-zek-ov”. I wrote down a few more examples and, looking at them, started thinking about how to turn this into an algorithm.

The algorithm started out like this (this is almost the same Python code that I wrote in pencil in class at the time).

slog_start = 0
i = 0
while i < len(word):
    if word[i] in vowels_set:
        vowel_pos = i
        i += 1
        while i < len(word):
            if word[i] in vowels_set:
                if i - vowel_pos == 1:
                    hyphens.append(i)
                elif i - vowel_pos == 2:
                    hyphens.append(i - 1)
                else:
                    hyphens.append(vowel_pos + 2)

At this point it dawned on me: it is enough to know the distance between adjacent vowels (let them be at positions a and b). If it is equal to 1, insert a hyphen at position b; if it is equal to 2, at position b − 1; otherwise (i.e. when the distance is greater than 2), at position a + 2.
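
To make the rule concrete, here is a minimal sketch of it as a standalone function, checked against the surname from the introduction written in Cyrillic. The function name `hyphen_position` is my own and does not appear in the original code; the vowel indices are counted by hand.

```python
def hyphen_position(a, b):
    """Index of the hyphen for adjacent vowels at indices a < b (basic rule only)."""
    distance = b - a
    if distance == 1:
        return b        # two vowels in a row: break between them
    if distance == 2:
        return b - 1    # one consonant: it starts the next syllable
    return a + 2        # several consonants: the first one stays on the left

# 'лемзеков' has vowels at indices 1 (е), 4 (е) and 6 (о)
word = 'лемзеков'
cuts = [hyphen_position(1, 4), hyphen_position(4, 6)]
print(cuts)  # [3, 5], i.e. лем-зе-ков
```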

This gives the following Python code:

word = input()

vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
vowels = []
for i in range(len(word)):
    if word[i] in vowels_set:
        vowels.append(i)

import collections
hyphens = collections.deque()
for i in range(1, len(vowels)):
    a, b = vowels[i-1], vowels[i]
    if b - a == 1:
        hyphens.append(b)
    elif b - a == 2:
        hyphens.append(b - 1)
    else:
        hyphens.append(a + 2)

for i in range(len(word)):
    if len(hyphens) and hyphens[0] == i:
        print('-', end = '')
        hyphens.popleft()
    print(word[i], end = '')

You can optimize this code by getting rid of the auxiliary `vowels` list:

word = input()

vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break

import collections
hyphens = collections.deque()
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        if b - a == 1:
            hyphens.append(b)
        elif b - a == 2:
            hyphens.append(b - 1)
        else:
            hyphens.append(a + 2)
        prev_vowel = i

for i in range(len(word)):
    if len(hyphens) and hyphens[0] == i:
        print('-', end = '')
        hyphens.popleft()
    print(word[i], end = '')

All that remains is to add support for the letters “й”, “ь” and “ъ”: if any of them occurs between two vowels, the hyphen goes right after the last such letter.
To do this, slightly modify the chain of conditions starting with `if b - a == 1:`:

for i ...:
    ...
    if b - a == 1:
        hyphens.append(b)
    else:
        for j in reversed(range(a + 1, b)):
            if word[j] in specials_set: # specials_set = set('йьъЙЬЪ')
                hyphens.append(j + 1)
                break
        else:
            if b - a == 2:
                hyphens.append(b - 1)
            else:
                hyphens.append(a + 2)
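
As a quick illustration of how this changes the result, here is a sketch of the modified rule as a standalone function. The name `hyphen_position_with_specials` and the two sample words are my own choices, not part of the original code; the vowel indices are passed in by hand.

```python
SPECIALS = set('йьъЙЬЪ')

def hyphen_position_with_specials(word, a, b):
    """Hyphen index for adjacent vowels at indices a < b, honouring й, ь, ъ."""
    if b - a == 1:
        return b
    # Scan the letters between the vowels right to left:
    # the hyphen goes right after the last special letter.
    for j in reversed(range(a + 1, b)):
        if word[j] in SPECIALS:
            return j + 1
    # No special letters: fall back to the basic distance rule.
    if b - a == 2:
        return b - 1
    return a + 2

# Without the special-letter scan both words would be cut at a + 2 == 3,
# giving 'пис-ьмо' and 'под-ъезд'.
for word, a, b in [('письмо', 1, 5), ('подъезд', 1, 4)]:
    cut = hyphen_position_with_specials(word, a, b)
    print(word[:cut] + '-' + word[cut:])  # пись-мо, подъ-езд
```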

One more optimization (dropping `hyphens` and printing each syllable as soon as its end is known), and this is the result:

word = input()

vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
specials_set = set('йьъЙЬЪ')

prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break

pos = 0
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        if b - a == 1:
            print(word[pos:b], end = '-')
            pos = b
        else:
            for j in reversed(range(a + 1, b)):
                if word[j] in specials_set:
                    print(word[pos:j + 1], end = '-')
                    pos = j + 1
                    break
            else:
                if b - a == 2:
                    print(word[pos:b - 1], end = '-')
                    pos = b - 1
                else:
                    print(word[pos:a + 2], end = '-')
                    pos = a + 2
        prev_vowel = i
print(word[pos:])

In conclusion, I will answer the likely question “what is all this for?”, given that there is already the algorithm of P. Khristov as modified by Dymchenko and Varsanofiev, which, moreover, is used not only for Russian but also for English. Well, firstly, it is not actually suitable for English because of the peculiarities of that language. Secondly, some of its rules are rather dubious: for example, the rule “ghs-ssg” leads to an incorrect split of the word “dismiss”. And thirdly, the algorithm I proposed is much faster.

PS By the way, I would be grateful if someone could give me a link to the original algorithm of P. Khristov, because I wonder what modifications Dymchenko and Varsanofiev made.

PPS A search through a list of all Russian words found a fairly small number of words with 5 consecutive consonants (for example, the Russian words for “agency”, “angstrom”, “wakefulness” and “intelligentsia”). In such cases (i.e. when the distance between adjacent vowels is 6), the hyphen should be inserted at position a + 3 or, equivalently, b − 3.
It is also possible to combine the cases where the distance between vowels is 1 or 2: in both of them the hyphen is inserted at position a + 1.

The final code looks like this:

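vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
specials_set = set('йьъЙЬЪ')

word = input()

# Find the first vowel; if there is none, the loop below does not run
prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break

pos = 0 # start of the syllable that has not been printed yet
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        # The hyphen goes right after the last 'й', 'ь' or 'ъ' between the vowels
        for j in reversed(range(a + 1, b)):
            if word[j] in specials_set:
                npos = j + 1
                break
        else: # no special letters: choose by the distance between the vowels
            if b - a <= 2:
                npos = a + 1
            elif b - a >= 6: # 5 or more consonants in a row
                npos = b - 3
            else:
                npos = a + 2
        print(word[pos:npos], end = '-')
        pos = npos
        prev_vowel = i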
print(word[pos:])
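
For experimenting with more words, it may be handier to have the same rules as a function that returns the syllables instead of printing them. The sketch below is one possible wrapper: the name `syllabify` and the list return value are my own choices, and the test word «агентство» is my assumption for the word rendered above as “agency”.

```python
VOWELS = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
SPECIALS = set('йьъЙЬЪ')

def syllabify(word):
    """Split a Russian word into syllables using the rules of the final code."""
    parts = []
    pos = 0             # start of the syllable not yet emitted
    prev_vowel = None   # index of the previous vowel, if any
    for i, ch in enumerate(word):
        if ch not in VOWELS:
            continue
        if prev_vowel is not None:
            a, b = prev_vowel, i
            for j in reversed(range(a + 1, b)):
                if word[j] in SPECIALS:
                    npos = j + 1        # right after the last й/ь/ъ
                    break
            else:
                if b - a <= 2:
                    npos = a + 1
                elif b - a >= 6:
                    npos = b - 3
                else:
                    npos = a + 2
            parts.append(word[pos:npos])
            pos = npos
        prev_vowel = i
    parts.append(word[pos:])            # the tail after the last hyphen
    return parts

print('-'.join(syllabify('лемзеков')))   # лем-зе-ков
print('-'.join(syllabify('агентство')))  # а-гент-ство
```

A word without vowels comes back as a single piece, which matches the behaviour of the script above.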
