The simplest algorithm for breaking a word into syllables
The algorithm came out like this (this is almost exactly the Python code I wrote in pencil during class at the time).
slog_start = 0
i = 0
while i < len(word):
    if word[i] in vowels_set:
        vowel_pos = i
        i += 1
        while i < len(word):
            if word[i] in vowels_set:
                if i - vowel_pos == 1:
                    hyphens.append(i)
                elif i - vowel_pos == 2:
                    hyphens.append(i - 1)
                else:
                    hyphens.append(vowel_pos + 2)
At this point it dawned on me: it is enough to know the distance between adjacent vowels (let them be at positions a and b). If the distance is 1, insert the hyphen at position b; if it is 2, at position b − 1; otherwise [i.e. when the distance is greater than 2], at position a + 2.
This gives the following Python code:
import collections

word = input()
vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')

# Collect the positions of all vowels in the word.
vowels = []
for i in range(len(word)):
    if word[i] in vowels_set:
        vowels.append(i)

# Compute one hyphen position for every pair of adjacent vowels.
hyphens = collections.deque()
for i in range(1, len(vowels)):
    a, b = vowels[i-1], vowels[i]
    if b - a == 1:
        hyphens.append(b)
    elif b - a == 2:
        hyphens.append(b - 1)
    else:
        hyphens.append(a + 2)

# Print the word, emitting '-' just before each hyphen position.
for i in range(len(word)):
    if len(hyphens) and hyphens[0] == i:
        print('-', end = '')
        hyphens.popleft()
    print(word[i], end = '')
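The same thing without the intermediate list of vowel positions:
import collections

word = input()
vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')

# Find the first vowel; if there is none, prev_vowel stays at len(word).
prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break

# Walk over the remaining vowels, remembering only the previous one.
hyphens = collections.deque()
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        if b - a == 1:
            hyphens.append(b)
        elif b - a == 2:
            hyphens.append(b - 1)
        else:
            hyphens.append(a + 2)
        prev_vowel = i

# Print the word, emitting '-' just before each hyphen position.
for i in range(len(word)):
    if len(hyphens) and hyphens[0] == i:
        print('-', end = '')
        hyphens.popleft()
    print(word[i], end = '')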
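As a quick check of the rule, here is a minimal sketch that wraps the same logic in a function and returns the hyphenated string instead of printing it. The function name split_basic and the sample words are assumptions made for this illustration; they are not part of the original program.

```python
VOWELS = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')

def split_basic(word):
    """Hyphenate using only the distance between adjacent vowels."""
    # Positions of all vowels in the word.
    vowels = [i for i, ch in enumerate(word) if ch in VOWELS]
    cuts = []
    for a, b in zip(vowels, vowels[1:]):
        if b - a == 1:
            cuts.append(b)
        elif b - a == 2:
            cuts.append(b - 1)
        else:
            cuts.append(a + 2)
    # Slice the word at the collected positions.
    parts, pos = [], 0
    for cut in cuts:
        parts.append(word[pos:cut])
        pos = cut
    parts.append(word[pos:])
    return '-'.join(parts)

print(split_basic('молоко'))  # мо-ло-ко
print(split_basic('карта'))   # кар-та
print(split_basic('аист'))    # а-ист
```

A word with a single vowel (or none) produces no cut positions and comes back unchanged.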
All that remains is to add support for the letters “й”, “ь” and “ъ”.
To do this, slightly modify the chain of conditions that starts with if b - a == 1:
for i ...:
    ...
    if b - a == 1:
        hyphens.append(b)
    else:
        for j in reversed(range(a + 1, b)):
            if word[j] in specials_set: # specials_set = set('йьъЙЬЪ')
                hyphens.append(j + 1)
                break
        else:
            if b - a == 2:
                hyphens.append(b - 1)
            else:
                hyphens.append(a + 2)
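The first else in the inner part belongs to the for loop, not to the if: it runs only when the loop finishes without break, that is, when none of the letters between the two vowels is special. The sketch below expresses the same decision with an early return instead of for/else, which may be easier to follow. The helper name hyphen_position is hypothetical and exists only for this illustration.

```python
SPECIALS = set('йьъЙЬЪ')

def hyphen_position(word, a, b):
    """Return the hyphen position for adjacent vowels at indices a < b."""
    if b - a == 1:
        return b
    # Scan the letters between the vowels from right to left.
    for j in reversed(range(a + 1, b)):
        if word[j] in SPECIALS:
            return j + 1  # break right after the last special letter
    # Reached only when no special letter was found (the for/else case).
    return b - 1 if b - a == 2 else a + 2

print(hyphen_position('майка', 1, 4))   # 3, i.e. май-ка
print(hyphen_position('карта', 1, 4))   # 3, i.e. кар-та
print(hyphen_position('пальто', 1, 5))  # 4, i.e. паль-то
```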
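The complete program, which prints each syllable as soon as its boundary is known instead of storing hyphen positions:
word = input()
vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
specials_set = set('йьъЙЬЪ')

# Find the first vowel; if there is none, prev_vowel stays at len(word).
prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break

pos = 0  # start of the syllable that has not been printed yet
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        if b - a == 1:
            print(word[pos:b], end = '-')
            pos = b
        else:
            # Break right after the last special letter between the vowels.
            for j in reversed(range(a + 1, b)):
                if word[j] in specials_set:
                    print(word[pos:j + 1], end = '-')
                    pos = j + 1
                    break
            else:
                if b - a == 2:
                    print(word[pos:b - 1], end = '-')
                    pos = b - 1
                else:
                    print(word[pos:a + 2], end = '-')
                    pos = a + 2
        prev_vowel = i
print(word[pos:])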
In conclusion, I will answer the likely question “what is all this for?”, given that there is already P. Khristov's algorithm as modified by Dymchenko and Varsanofiev, which, moreover, is applied not only to Russian but also to English. Well, firstly, it is in fact not suitable for English because of the peculiarities of that language. Secondly, some of its rules are rather dubious; for example, the rule “ghs-ssg” leads to an incorrect split of the word “dismiss”. And thirdly, the algorithm I have proposed is much faster.
P.S. By the way, I would be grateful if someone could give a link to P. Khristov's original algorithm, because I wonder what modifications Dymchenko and Varsanofiev made.
P.P.S. A search through a list of all Russian words turned up a not very large number of words with 5 consecutive consonants (for example, the Russian words for agency, angstrom, wakefulness and intelligentsia). In such cases (i.e. when the distance between adjacent vowels is 6), the hyphen should be inserted at position a + 3 or [which is the same thing] b − 3.
It is also possible to merge the cases where the distance between vowels is 1 or 2: in both of them the hyphen is inserted at position a + 1.
vowels_set = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
specials_set = set('йьъЙЬЪ')
word = input()
prev_vowel = len(word)
for i in range(len(word)):
    if word[i] in vowels_set:
        prev_vowel = i
        break
pos = 0
for i in range(prev_vowel + 1, len(word)):
    if word[i] in vowels_set:
        a, b = prev_vowel, i
        for j in reversed(range(a + 1, b)):
            if word[j] in specials_set:
                npos = j + 1
                break
        else:
            if b - a <= 2:
                npos = a + 1
            elif b - a >= 6:
                npos = b - 3
            else:
                npos = a + 2
        print(word[pos:npos], end = '-')
        pos = npos
        prev_vowel = i
print(word[pos:])
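For reuse or testing, the final version can be wrapped in a function that returns the hyphenated string. This is only a sketch under the same rules as the program above; the name syllabify and the sample words are assumptions made for illustration.

```python
VOWELS = set('аеёиоуыэюяАЕЁИОУЫЭЮЯ')
SPECIALS = set('йьъЙЬЪ')

def syllabify(word):
    """Split a Russian word into syllables, joined by hyphens."""
    parts = []
    pos = 0            # start of the syllable currently being built
    prev_vowel = None  # index of the previous vowel, if any
    for i, ch in enumerate(word):
        if ch not in VOWELS:
            continue
        if prev_vowel is not None:
            a, b = prev_vowel, i
            # Prefer a break right after the last й/ь/ъ between the vowels.
            for j in reversed(range(a + 1, b)):
                if word[j] in SPECIALS:
                    npos = j + 1
                    break
            else:
                if b - a <= 2:
                    npos = a + 1
                elif b - a >= 6:
                    npos = b - 3
                else:
                    npos = a + 2
            parts.append(word[pos:npos])
            pos = npos
        prev_vowel = i
    parts.append(word[pos:])
    return '-'.join(parts)

print(syllabify('молоко'))     # мо-ло-ко
print(syllabify('майка'))      # май-ка
print(syllabify('подъезд'))    # подъ-езд
print(syllabify('агентство'))  # а-гент-ство
```

Returning a string rather than printing makes it easy to run the function over a word list and compare the results against a reference dictionary.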