Recognizing numbers in text

Who might benefit from this article?
Oddballs doing ML in Java? Or perhaps someone looking for practice material?
But why make excuses? All of this code was written simply because we can.
Below, we look at how to convert a number written out in words, such as “twelve thousand six hundred fifty-nine point four millionths” (in Russian, «двенадцать тысяч шестьсот пятьдесят девять целых четыре миллионных»), into a numeric value such as 12659.000004.

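The Russian language has its own words (aliases) for some numbers. We will translate each such word into an ordinary number, so that a phrase becomes a sequence of numbers. To do this, let’s create a dictionary of aliases, with one number per line followed by the words that denote it: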

0 ноль нуль
1 один
2 два
3 три
4 четыре
5 пять
6 шесть
7 семь
8 восемь
9 девять
11 одиннадцать
12 двенадцать дюжина
13 тринадцать
14 четырнадцать
15 пятнадцать
16 шестнадцать
17 семнадцать
18 восемнадцать
19 девятнадцать
20 двадцать
30 тридцать
40 сорок
50 пятьдесят
60 шестьдесят
70 семьдесят
80 восемьдесят
90 девяносто
200 двести
300 триста
400 четыреста
500 пятьсот
600 шестьсот
700 семьсот
800 восемьсот
900 девятьсот
0.00000000001 стомиллиардный
0.0000000001 десятимиллиардный
0.000000001 миллиардный
0.00000001 стомиллионный
0.0000001 десятимиллионный
0.000001 миллионный
0.00001 стотысячный
0.0001 десятитысячный
0.001 тысячный
0.01 сотый
0.1 десятый
10 десять
100 сто
1000 тысяча
1000000 миллион
1000000000 миллиард
1000000000000 триллион
1000000000000000 квадриллион
1000000000000000000 квинтиллион
1000000000000000000000 секстиллион
1000000000000000000000000 септиллион
1000000000000000000000000000 октиллион

To read a dictionary from resources into memory, we need the following Kotlin code:

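// Each dictionary line has the form "<number> <alias> [<alias> ...]".
// The empty lambda {} only serves as a class to load the resource from.
val numbers: Map<String, Double> =
  {}.javaClass.getResourceAsStream("/dictionary")!!
    .bufferedReader()
    .readLines()
    .flatMap { line ->
      val aliases = line.split(' ')
      val number = aliases.first().toDouble()
      // One (alias, number) pair for every word that follows the number.
      aliases.drop(1).map { Pair(it, number) }
    }.toMap()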

The flatMap is there because one number can have two or more aliases: in the dictionary above, both ноль and нуль stand for 0, and both двенадцать and дюжина stand for 12.
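
To make the alias handling easier to try out, here is a minimal sketch of the same parsing applied to an in-memory string instead of a resource file. The function name parseDictionary is hypothetical and is not part of the library; skipping blank lines is an extra precaution that the original snippet does not take.

```kotlin
// Hypothetical helper (not part of the library): the same parsing as above,
// but reading the dictionary from a string rather than from resources.
fun parseDictionary(text: String): Map<String, Double> =
  text.lines()
    .filter { it.isNotBlank() } // extra precaution: ignore empty lines
    .flatMap { line ->
      val aliases = line.trim().split(' ')
      val number = aliases.first().toDouble()
      // Every word after the number becomes its own key with the same value.
      aliases.drop(1).map { alias -> alias to number }
    }
    .toMap()
```

For the two lines "0 ноль нуль" and "12 двенадцать дюжина", the resulting map has four keys: ноль and нуль both map to 0.0, двенадцать and дюжина both map to 12.0.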

Now the tokenizer and the morphological dictionary take the stage.
Combining them lets us pull out of any string a sequence of our numbers, in any declension the Russian language allows:

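// Numbers before the separator go into integerPart, numbers after it into fractionalPart.
val integerPart = mutableListOf<Double>()
val fractionalPart = mutableListOf<Double>()
var currentPart = integerPart
for (token in words) {
  // A separator word that follows at least one number switches to the fractional part.
  if (integerPart.isNotEmpty() && token.lowercase() in separators) {
    currentPart = fractionalPart
    continue
  }
  // Prefer a numeral or ordinal reading of the token, then map its lemma to a number.
  val number =
    lookupForMeanings(token)
      .run {
        firstOrNull { it.partOfSpeech == Numeral || it.partOfSpeech == OrdinalNumber }
          ?: getOrNull(0)
      }
      ?.lemma
      ?.toString()
      ?.let(numbers::get)
  if (number != null) {
    currentPart += number
    continue
  }
  // Words before the first number are skipped; the first non-number word after it ends the scan.
  if (currentPart.isNotEmpty()) {
    break
  }
}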
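
The loop depends on the library's tokenizer and morphological dictionary, so it cannot be run on its own. As a rough illustration of the control flow only, here is a simplified sketch that replaces lemma lookup with exact matching against a tiny hand-written map. The names sampleNumbers, sampleSeparators and scan, the contents of both collections, and splitting on whitespace are all assumptions made for this example.

```kotlin
// Assumption: a tiny dictionary that already contains the inflected forms we need,
// standing in for the morphological lookup used by the library.
val sampleNumbers = mapOf(
  "сто" to 100.0, "тридцать" to 30.0, "пять" to 5.0,
  "двенадцать" to 12.0, "девять" to 9.0, "тысячных" to 0.001
)

// Assumption: the word that separates the integer part from the fractional part.
val sampleSeparators = setOf("целых")

// Returns the number sequences found before and after the separator word.
fun scan(text: String): Pair<List<Double>, List<Double>> {
  val integerPart = mutableListOf<Double>()
  val fractionalPart = mutableListOf<Double>()
  var currentPart = integerPart
  for (token in text.split(Regex("\\s+")).filter { it.isNotEmpty() }) {
    val word = token.lowercase()
    if (integerPart.isNotEmpty() && word in sampleSeparators) {
      currentPart = fractionalPart
      continue
    }
    val number = sampleNumbers[word]
    if (number != null) {
      currentPart.add(number)
      continue
    }
    // Leading words are skipped; the first unknown word after a number stops the scan.
    if (currentPart.isNotEmpty()) break
  }
  return integerPart to fractionalPart
}
```

With these definitions, scan("Я хотел передать ему сто тридцать пять яблок") yields the integer sequence [100.0, 30.0, 5.0] and an empty fractional sequence, while scan("Двенадцать целых девять тысячных") yields [12.0] and [9.0, 0.001].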

The code is terribly mutable, but I haven’t figured out a better way yet. After that, all that remains is to glue the sequence of ordinary numbers into one. The rule is simple: while the next number is greater than the accumulated one, we multiply; as soon as the next number is not greater, we add the finished “island” of multiplications to the running sum and start a new island.

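// Glues a sequence such as [12, 1000, 600, 50, 9] into a single number (12659).
private fun List<Double>.join(): Double {
  var tokensSum = 0.0
  // The current "island" of multiplications; first() assumes a non-empty list.
  var previousToken = first()
  for (currToken in drop(1)) {
    if (currToken > previousToken) {
      // A larger number scales the island: 12, 1000 -> 12000.
      previousToken *= currToken
    } else {
      // A smaller or equal number closes the island and starts a new one.
      tokensSum += previousToken
      previousToken = currToken
    }
  }
  return tokensSum + previousToken
}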
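
To see the multiply-then-add rule at work, here is a standalone copy of the function with a step-by-step trace in the comments. The name joinNumbers is hypothetical and is used only to keep this sketch separate from the listing above; the logic is the same.

```kotlin
// Standalone copy of the gluing rule for experimentation (hypothetical name).
fun joinNumbers(tokens: List<Double>): Double {
  var tokensSum = 0.0
  var island = tokens.first() // assumes a non-empty list
  for (current in tokens.drop(1)) {
    if (current > island) {
      island *= current       // grow the current island
    } else {
      tokensSum += island     // close the island
      island = current        // and start a new one
    }
  }
  return tokensSum + island
}

// Trace for [12, 1000, 600, 50, 9]:
//   island = 12
//   1000 > 12     -> island = 12000
//   600 <= 12000  -> sum = 12000, island = 600
//   50 <= 600     -> sum = 12600, island = 50
//   9 <= 50       -> sum = 12650, island = 9
//   result = 12650 + 9 = 12659
```

One thing worth checking in your own inputs: taken exactly as listed, this rule turns [100, 20, 1000] (“сто двадцать тысяч”) into 20100 rather than 120000, because 100 is closed off as its own island before the multiplier 1000 arrives.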

It’s time to test our wonderful library!

@Test
fun parseRussianDouble() {
  assertThat("Двенадцать тысяч шестьсот пятьдесят девять целых четыре миллионных".parseRussianDouble())
    .isEqualTo(12659.000004)

  assertThat("Десять тысяч четыреста тридцать четыре".parseRussianDouble())
    .isEqualTo(10434.0)

  assertThat("Двенадцать целых шестьсот пятьдесят девять тысячных".parseRussianDouble())
    .isEqualTo(12.659)

  assertThat("Ноль целых пятьдесят восемь сотых".parseRussianDouble())
    .isEqualTo(0.58)

  assertThat("Сто тридцать пять".parseRussianDouble())
    .isEqualTo(135.0)
}

If you want the .parseRussianDouble method to be available on every string in your Kotlin (or Java) project, you only need to add a couple of lines to your build system:
https://jitpack.io/#demidko/chisla/2021.10.30
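
For a Gradle project using the Kotlin DSL, those lines would look roughly like the fragment below. The dependency coordinates follow JitPack's usual com.github.<user>:<repository>:<tag> convention and are an assumption here; the JitPack page linked above shows the exact snippet for each build system.

```kotlin
// build.gradle.kts (a sketch; coordinates assumed from JitPack's naming convention)
repositories {
  mavenCentral()
  maven { url = uri("https://jitpack.io") }
}

dependencies {
  implementation("com.github.demidko:chisla:2021.10.30")
}
```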

As a demonstration of another feature of the library, here is a piece of code in which the number is surrounded by other words:

"Я хотел передать ему сто тридцать пять яблок".parseRussianDouble()
// 135.0

The source code of the library is available on GitHub: https://github.com/demidko/chisla
Criticism, questions, and suggestions are welcome in the issues or in the comments under the article.
