From Hello to Donaudampfschifffahrtsgesellschaftskapitän

Tokenization issues in different languages

  • English: It seems simple, but here questions arise: what to do with punctuation? How to handle abbreviations? And what to do with words like “isn't”?

  • Chinese: Here tokenization becomes a guessing game, because the characters are consecutive without spaces. How to separate words if there are no spaces?

  • German: I've come across words like Donaudampfschifffahrtsgesellschaftskapitän. That is a word of 40+ letters, and it somehow needs to be split into parts correctly.

  • Russian: Problems with cases, declensions and prefixes. The word “машина” (car) can appear as “машины”, “машине” and so on, and every one of these forms should be matched in search results.

Now let's figure out how to solve this using analyzers and filters in Elasticsearch.

Elasticsearch uses analyzers for tokenization and text processing. These analyzers are not magical creatures that know every language in the world; they are chains of filters applied to the text one after another. Each language has its own nuances, and the right choice of analyzer is the key to success.

Tokenization for English text

Let's start with something simple. For English, you can use the basic Standard Analyzer, which splits text into words, ignoring punctuation and converting everything to lowercase. But sometimes that's not enough.

Example of analyzer setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "english_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stop", "porter_stem"]
        }
      }
    }
  }
}

Here we have added the porter_stem filter, which reduces words to their stems (e.g. “running” becomes “run”), and an english_stop filter that removes English stop words.
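To see what the analyzer actually produces, you can run it through the _analyze API. This is just a sanity check; the index name my_index is a placeholder for whatever index carries these settings:

POST my_index/_analyze
{
  "analyzer": "english_custom",
  "text": "The dogs are running"
}

With the chain above, “the” and “are” should be dropped as stop words, and the remaining tokens should come back stemmed, roughly as “dog” and “run”.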

Tokenization for Chinese text

There are no spaces in Chinese, and each character can be a stand-alone word or part of a larger compound word. For Chinese text, Elasticsearch offers the Smart Chinese Analyzer (available via the analysis-smartcn plugin).

Example:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "chinese": {
          "type": "smartcn"
        }
      }
    }
  }
}

This analyzer uses built-in dictionaries to tokenize text: the Smart Chinese Analyzer automatically splits sequences of Chinese characters into words.
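As a quick check (again, my_index is a placeholder), you can pass a phrase to the analyzer and inspect the tokens:

POST my_index/_analyze
{
  "analyzer": "chinese",
  "text": "我喜欢学习中文"
}

Instead of one token per character, the smartcn dictionaries should group the characters into word-level tokens (for example, 喜欢 and 中文), which is exactly what makes word-based search possible.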

Tokenization for German text

Now German. Tokenization in German can be a real problem because of compound words. A classic example: Donaudampfschifffahrtsgesellschaftskapitän. To work with such words you can use the German Analyzer, which includes filters for stemming and for handling compound words.

Example of setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_stem": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "german_stop", "german_normalization", "german_stem"]
        }
      }
    }
  }
}

Here german_normalization handles language features such as umlauts (ä, ö, ü), german_stem shortens words to their roots while preserving the meaning, and german_stop removes German stop words.
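Note that this chain does not actually split the compound into its parts. If you need that, Elasticsearch has a dictionary_decompounder token filter. Below is a minimal sketch with a hand-written word list just for illustration; in practice you would usually point word_list_path at a proper dictionary file and add the filter to german_custom's chain after lowercase:

{
  "filter": {
    "german_decompound": {
      "type": "dictionary_decompounder",
      "word_list": ["donau", "dampf", "schiff", "fahrt", "gesellschaft", "kapitän"]
    }
  }
}

With such a filter in place, the compound is indexed together with its parts, so a search for “Kapitän” can also match Donaudampfschifffahrtsgesellschaftskapitän.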

Tokenization for Russian text

For the Russian language, problems, as I have already mentioned, arise with cases and declensions. Here the Russian Analyzer helps, since it can handle the morphology of the Russian language.

Example:

{
  "settings": {
    "analysis": {
      "filter": {
        "russian_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        }
      },
      "analyzer": {
        "russian_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "russian_morphology", "russian_stop"]
        }
      }
    }
  }
}

Here the russian_morphology filter works with cases and word forms (note that it comes from a separately installed morphology plugin, not from the Elasticsearch core), and russian_stop filters out words that carry no meaning for search, such as “and”, “in”, “on”.

How Not to Screw Up with Tokenization

Even if you think you've chosen the right analyzer, always test it on real text. Some analyzers may work great on some data and terribly on others.

Sometimes you will have to create your own chains of analyzers by combining several filters. For example, if the standard analyzer for Russian fails, you can add a custom filter to process some specific words.

Don't forget about language peculiarities. For example, in German umlauts can be replaced with vowels without dots, and in Russian it is necessary to take into account that the letters “ё” and “е” can be interchangeable in search.
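For the “ё”/“е” case, one common approach (just a sketch, and the filter name yo_to_ye is made up here) is a mapping character filter that folds “ё” into “е” before tokenization; you then reference it in your analyzer's char_filter list so both spellings are indexed and searched identically:

{
  "char_filter": {
    "yo_to_ye": {
      "type": "mapping",
      "mappings": ["ё => е", "Ё => Е"]
    }
  }
}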

Stemming and Lemmatization

Now let's talk about how not to lose meaning in multilingual search when you have words like “go”, “goes” and “went”, all of which Elasticsearch should understand as the same concept. This is where stemming and lemmatization come in: two powerful tools that help the search engine compare text more intelligently.

Let's look at the terms:

  • Stemming is a process that cuts off word endings, leaving the stem. For example, “playing”, “played” and “plays” are all reduced to “play”. The method is crude, but fast.

  • Lemmatization is a more intelligent process: it reduces a word to its base form (the lemma), taking grammatical context into account. For example, “better” and “good” are reduced to the single lemma “good”. A harder task, but the result is more accurate.

Lemmatization and stemming work differently in different languages. What works for English may not work for Russian or German. The main problem is grammatical differences between languages. For example:

  • In Russian, the word for “go” changes by tense, person and number, and nouns also decline by case.

  • In German, compound words are joined into one long word.

  • Chinese has almost no inflectional morphology, but word boundaries and contextual meaning matter a lot.

Stemming and lemmatization in Elasticsearch

Elasticsearch has built-in analyzers and filters for stemming and lemmatization in different languages.

Stemming for English

For English, stemming can be configured using the Porter stemming algorithm, which reduces words to their base forms.

Example of setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "english_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stemmer"
          ]
        }
      }
    }
  }
}

This example sets up a stemming filter for English. When you send text like “running” or “runs”, the analyzer reduces these words to “run” (irregular forms such as “ran” are better handled by lemmatization than by a plain stemmer).
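For the analyzer to actually be applied, it has to be attached to a text field in the index mapping. The field name description here is only an example; the same pattern works for the German and Russian analyzers below:

{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "english_analyzer"
      }
    }
  }
}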

Stemming for German

For German there are the German Normalization Filter and the German Light Stemmer.

Example of setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "german_normalization": {
          "type": "german_normalization"
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "german_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}

The german_normalization filter processes letters with umlauts and replaces ß with “ss”, which improves the consistency of the resulting tokens.
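A quick way to see the normalization in action (my_index is again a placeholder):

POST my_index/_analyze
{
  "analyzer": "german_analyzer",
  "text": "Donaudampfschifffahrtsgesellschaftskapitän"
}

The token should come back lowercased and with the umlaut folded (ending in “...kapitan”). Note that this chain does not split the compound; for that you would add a decompounder filter, as sketched in the tokenization section.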

Lemmatization for Russian language

With Russian it is more complicated: because of its rich morphology, stemming does not always give accurate results, so for Russian it is often better to use lemmatization. In Elasticsearch, you can use a dedicated morphology filter for the Russian language.

Example of setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "russian_morphology": {
          "type": "russian_morphology"
        }
      },
      "analyzer": {
        "russian_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "russian_morphology"
          ]
        }
      }
    }
  }
}

Here we use russian_morphology: this filter processes different forms of words, reducing them to a single base, so different forms of the same verb (the Russian equivalents of “go”, “goes” and “went”) are treated as one word.
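With the morphology plugin installed, you can verify the behavior through _analyze (placeholder index name; the exact lemmas depend on the plugin's dictionaries):

POST my_index/_analyze
{
  "analyzer": "russian_analyzer",
  "text": "Машины едут по дороге"
}

Inflected forms such as “машины” and “дороге” should come back as their base forms “машина” and “дорога”.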

Lemmatization for Spanish

The Spanish language has its own peculiarities. Here, too, verbs and nouns need to be brought to a normal form; in Elasticsearch this is usually done with a light stemmer.

Example of setup:

{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "spanish_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}

Here we use a lightweight stemmer for Spanish that handles inflectional forms such as “hablando” (speaking) and “hablar” (speak).

Setting stop words for multilingual search

Stop words are words that carry no semantic weight for search (for example, prepositions and conjunctions). Different languages have their own stop words. Elasticsearch has built-in lists of stop words for popular languages, which you can customize as you wish.

Example of setting up an analyzer with stop words:

{
  "settings": {
    "analysis": {
      "filter": {
        "russian_stop": {
          "type": "stop",
          "stopwords": "_russian_"
        }
      },
      "analyzer": {
        "russian_analyzer_with_stop": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "russian_morphology",
            "russian_stop"
          ]
        }
      }
    }
  }
}

Here the russian_stop filter automatically removes Russian prepositions and other frequent words that are not needed for indexing.
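If the built-in _russian_ list is too aggressive or not aggressive enough, the stop filter also accepts an explicit list (the words below are just an illustration) or a stopwords_path parameter pointing to a file:

{
  "filter": {
    "custom_stop": {
      "type": "stop",
      "stopwords": ["и", "в", "на", "по"]
    }
  }
}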

So, if you want to make multilingual search smarter, setting up stemming and lemmatization will be the first step on this path.

Annotation and Metadata

Now let's talk about something that many people underestimate when working with Elasticsearch, especially in multilingual projects: document annotation and metadata management. Proper annotation and metadata not only aid search, they keep results relevant and speed up access to information.

If your users are from different regions and speak different languages, you must take this into account when indexing and setting up your search. Let's say your search must process documents in Russian, English, and Chinese. For each language, you must have your own tokenization settings, filters, and most importantly, language identification. Without proper work with metadata, you will either lose relevance or overload the search with unnecessary information.

Annotation is the process of adding metadata to documents so that Elasticsearch can process and index them correctly. It's like attaching a label to each document so Elasticsearch knows how to handle it. Metadata may include the language, content type, creation date, and more.

_lang field: how to determine the language of a document

The first thing you encounter when creating a multilingual search is the need to determine the language of the document. For this, you can use a special field such as _lang, which you can then use to pick the right analyzers and filters for tokenization and stemming.

Example of data structure with field _lang:

{
  "title": "Добро пожаловать",
  "body": "Это пример русского текста.",
  "_lang": "ru"
}

Here we explicitly indicate that the document language is Russian. This is important because Elasticsearch uses its own tokenizers, stemmers, and filters for each language.

Let's look at how you can set up an index that takes into account language annotation. Let's say you have data in several languages, and you want Elasticsearch to use different analyzers depending on the field value _lang.

Example of index configuration:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "english_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "english_stemmer"]
        },
        "russian_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "russian_morphology"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english_analyzer"
          },
          "russian": {
            "type": "text",
            "analyzer": "russian_analyzer"
          }
        }
      },
      "body": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english_analyzer"
          },
          "russian": {
            "type": "text",
            "analyzer": "russian_analyzer"
          }
        }
      },
      "_lang": {
        "type": "keyword"
      }
    }
  }
}

Here we define different analyzers for English and Russian. The _lang field stores the language of the document; your application can use it to decide which sub-field (title.english or title.russian) to populate and query, and therefore which analyzer is applied.
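For example, a search for Russian-language documents might be routed by your application to the Russian sub-field. This is only a sketch of that routing; the field names follow the mapping above:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "_lang": "ru" } }
      ],
      "must": [
        { "match": { "body.russian": "машина" } }
      ]
    }
  }
}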

How to work with metadata

The _lang field is just one example. You can extend the annotation by adding metadata about the content type, document version, and creation time. This allows you to build more complex queries and filter data on these parameters.

Here are some metadata that will help with multilingual search:

  1. _lang (language) — as we have already said, this is a mandatory field for multilingual projects.

  2. version (version) — useful for searching through document versions. For example, if you store drafts and final versions.

  3. timestamp (creation time) — important for filtering documents by date, especially when you need to search for the latest materials in a given language.

Example of a structure with additional metadata:

{
  "title": "Welcome",
  "body": "This is an example of English text.",
  "_lang": "en",
  "version": "1.0",
  "timestamp": "2024-09-20T12:00:00"
}

Now let's look at how to properly configure an Elasticsearch index to work with metadata. Let's say there are documents in different languages, and you want the search to be fast and efficient, and the relevance of the results to be maintained at a high level.

Example of index setup:

{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      },
      "body": {
        "type": "text"
      },
      "_lang": {
        "type": "keyword"
      },
      "version": {
        "type": "keyword"
      },
      "timestamp": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      }
    }
  }
}

Here we indicate that the _lang field should be indexed as a keyword (an exact, non-analyzed value); version and timestamp get types suited to exact matching and date filtering.

Now that documents contain metadata, you can use it for filtering and sorting. For example, you want to get all Russian-language documents sorted by creation date:

Example request:

{
  "query": {
    "bool": {
      "must": [
        { "match": { "_lang": "ru" } }
      ]
    }
  },
  "sort": [
    { "timestamp": { "order": "desc" } }
  ]
}

This query will return all documents in Russian, sorted by creation time (from newest to oldest).
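If you also need to narrow the results by date, for example to documents created after a certain moment, the same query can carry a range filter (the date value here is illustrative and matches the format declared in the mapping):

{
  "query": {
    "bool": {
      "must": [
        { "match": { "_lang": "ru" } }
      ],
      "filter": [
        { "range": { "timestamp": { "gte": "2024-09-01T00:00:00" } } }
      ]
    }
  },
  "sort": [
    { "timestamp": { "order": "desc" } }
  ]
}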

So don't be lazy about setting up metadata correctly: this is a case where the effort invested pays off in fast, high-quality search.


Conclusion

In this article, we discussed how to properly configure analyzers for different languages, use metadata for filtering, and streamline document processing. Remember that search is not just indexing data; it is creating a comfortable interface between the user and the information.

I hope this guide will help you make this process more convenient.

Our colleagues from OTUS cover application infrastructure in more depth in practical online courses. You can see the full list of courses in the catalogue.
