Five best NLP tools for working with Russian in Python

Natasha

The Natasha library covers tokenization, morphological analysis, lemmatization, syntactic parsing, and named entity extraction.

Splitting text into sentences and tokens in Natasha is handled by the bundled Razdel library:

from razdel import tokenize, sentenize
text = "Текст для анализа."
tokens = list(tokenize(text))
sents = list(sentenize(text))
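
Each item Razdel returns is a substring with character offsets, so you can map tokens back to the original text:

for t in tokens:
    print(t.start, t.stop, t.text)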

The Natasha morphological analyzer, based on the Slovnet model, allows you to extract rich morphological information about each token, such as part of speech, gender, number and case:

from natasha import Segmenter, MorphVocab, NewsEmbedding, NewsMorphTagger, Doc

# set up the segmenter, embeddings and morphological tagger
segmenter = Segmenter()
morph_vocab = MorphVocab()
emb = NewsEmbedding()
morph_tagger = NewsMorphTagger(emb)

doc = Doc(text)
doc.segment(segmenter)
doc.tag_morph(morph_tagger)

for token in doc.tokens:
    print(token.text, token.pos, token.feats)

Lemmatization in Natasha builds on the same morphological analysis and uses Pymorphy2 (wrapped by MorphVocab) to reduce words to their normal form:

for token in doc.tokens:
    token.lemmatize(morph_vocab)
    print(token.text, token.lemma)

The Natasha syntactic parser is also based on a Slovnet model and builds dependency trees for sentences, which can be printed as ASCII art:

from natasha import NewsSyntaxParser

syntax_parser = NewsSyntaxParser(emb)
doc.parse_syntax(syntax_parser)
for sent in doc.sents:
    sent.syntax.print()
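
After parsing, each token also carries the id of its syntactic head and the relation label, which is convenient for traversing the tree programmatically:

for token in doc.tokens:
    print(token.text, token.head_id, token.rel)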

There is also functionality for extracting named entities such as person names, locations, and organizations:

from natasha import NewsNERTagger

ner_tagger = NewsNERTagger(emb)
doc.tag_ner(ner_tagger)
for span in doc.spans:
    print(span.text, span.type)
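
Because Russian entity mentions are inflected, extracted spans can be normalized to their dictionary form with the same MorphVocab:

for span in doc.spans:
    span.normalize(morph_vocab)
    print(span.text, '->', span.normal)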

DeepPavlov

DeepPavlov ships a variety of ready-to-use NLP models tuned for specific tasks: for example, BERT-based models for text classification, named entity recognition, and question answering.

DeepPavlov models can be integrated through several interfaces: the command line, a Python API, a REST API, and even Docker.

DeepPavlov can also run its neural models on a GPU.
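
For example, once a model is served locally over the REST API (started with something like "python -m deeppavlov riseapi ner_ontonotes_bert_mult -p 5000"; the endpoint path and payload shape below follow DeepPavlov's REST conventions and may differ between versions), it can be queried from Python:

import requests

# assumes the REST server started by riseapi is running on port 5000
response = requests.post(
    'http://localhost:5000/model',  # default riseapi endpoint
    json={'x': ['John works at Disney in California']},
)
print(response.json())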

Usage examples

Intent classification helps a chatbot understand what the user wants from their query. You can use a pre-trained model to classify intents:

from deeppavlov import build_model, configs

# load the intent classification model
intent_model = build_model(configs.classifiers.intents_snips, download=True)

# classify the intent of a query
intent_predictions = intent_model(['Book a flight from New York to San Francisco'])
print(intent_predictions)

Named entity recognition identifies and classifies significant pieces of information in text:

ner_model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

# extract entities from the text
ner_results = ner_model(['John works at Disney in California'])
print(ner_results)

A question answering model lets the system find the answer to a question within a provided context:

qa_model = build_model(configs.squad.squad_bert, download=True)

# get the answer to a question from the provided context
# (the model takes a batch of contexts first, then a batch of questions)
qa_result = qa_model(["The capital of France is Paris."], ["What is the capital of France?"])
print(qa_result)

MyStem and PyMorphy2

MyStem is a tool from Yandex for morphological analysis of Russian text. It runs as a console application available for Windows, Linux, and macOS. MyStem uses proprietary algorithms to determine a word's initial form and grammatical characteristics, and can use the surrounding context to disambiguate between homonymous forms.

For example, basic lemmatization and morphological analysis through the pymystem3 Python wrapper:

from pymystem3 import Mystem

m = Mystem()
text = "Мама мыла раму каждый вечер перед сном."
lemmas = m.lemmatize(text)
print(''.join(lemmas))

analyzed = m.analyze(text)
for word_info in analyzed:
    if 'analysis' in word_info and word_info['analysis']:
        gr = word_info['analysis'][0]['gr']
        print(f"{word_info['text']} - {gr}")

PyMorphy2 also provides morphological analysis, but unlike MyStem it is completely open source and developed by the community. The library uses OpenCorpora dictionaries and works with Russian and Ukrainian. PyMorphy2 also offers a convenient Python API.

Usage examples:

Determining the part of speech and normal form of a word:

import pymorphy2

morph = pymorphy2.MorphAnalyzer()
word = 'стали'
parsed_word = morph.parse(word)[0]
print(f"Нормальная форма: {parsed_word.normal_form}")
print(f"Часть речи: {parsed_word.tag.POS}")
print(f"Полное морфологическое описание: {parsed_word.tag}")

Agreeing a word with a numeral:

word = 'книга'
parsed_word = morph.parse(word)[0]
plural = parsed_word.make_agree_with_number(5).word
print(f"Множественное число слова '{word}': {plural}")

Working with different cases:

word = 'город'
parsed_word = morph.parse(word)[0]
for case in ['nomn', 'gent', 'datv', 'accs', 'ablt', 'loct']:
    form = parsed_word.inflect({case})  # inflect() returns None if the form does not exist
    if form:
        print(f"{case}: {form.word}")

Let's compare these libraries in a table:

Characteristic        | MyStem                       | PyMorphy2
----------------------|------------------------------|------------------------------
Source code           | Closed                       | Open
OS support            | Windows, Linux, macOS        | Cross-platform
Development language  | C++ (Python wrapper)         | Python
Uses context          | Yes                          | No
License               | Free for non-commercial use  | MIT
Supported languages   | Russian only                 | Russian and Ukrainian
Speed                 | Faster on large texts        | Reportedly slower than MyStem
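
The "uses context" row is the practical difference you notice first. Here is a minimal sketch comparing the two on the ambiguous word form "стали" (the exact output depends on library versions and dictionaries):

from pymystem3 import Mystem
import pymorphy2

# "стали" is ambiguous: past tense of "стать" or an oblique case of "сталь"
text = "Мы стали работать лучше."

# MyStem looks at the surrounding words and commits to one reading
m = Mystem()
for w in m.analyze(text):
    if w.get('analysis'):
        print(w['text'], '->', w['analysis'][0]['lex'])

# PyMorphy2 analyzes the word in isolation and returns ranked hypotheses
morph = pymorphy2.MorphAnalyzer()
for p in morph.parse('стали'):
    print(p.normal_form, p.tag.POS, round(p.score, 3))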

RuBERT

RuBERT from DeepPavlov is an adaptation of the BERT architecture, specialized for the Russian language. The model was pre-trained on a corpus that included Russian-language Wikipedia and news data. RuBERT uses the multilingual version of BERT-base as initialization.

Using RuBERT, you can build a text classification system, for example, to determine whether a review is positive or negative:

from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax

# load the tokenizer and the model
tokenizer = BertTokenizer.from_pretrained('DeepPavlov/rubert-base-cased')
# note: the classification head is initialized randomly and must be
# fine-tuned on labeled reviews before the predictions are meaningful
model = BertForSequenceClassification.from_pretrained('DeepPavlov/rubert-base-cased', num_labels=2)

# sample text
text = "Этот продукт был отличным!"

# tokenize and build model inputs
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# get predictions
outputs = model(**inputs)
predictions = softmax(outputs.logits, dim=-1)
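
A quick way to inspect the output of this sketch:

print(predictions.tolist())
label = predictions.argmax(dim=-1).item()
print('positive' if label == 1 else 'negative')  # the label-to-class mapping is an assumption here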

To extract named entities, you can adapt Conversational RuBERT by fine-tuning it on a NER task:

from transformers import BertTokenizer, BertForTokenClassification
import torch

tokenizer = BertTokenizer.from_pretrained('DeepPavlov/rubert-base-cased-conversational')
model = BertForTokenClassification.from_pretrained('DeepPavlov/rubert-base-cased-conversational', num_labels=9)  # assuming 9 entity classes; the head must be fine-tuned before the labels mean anything

text = "Анна поехала в Москву на выставку."

inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# run prediction without computing gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# map predictions back to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, prediction in zip(tokens, predictions[0]):
    print(f"{token} - {model.config.id2label[prediction.item()]}")

You can read more about the libraries here:

MyStem:
MyStem on GitHub

PyMorphy2:
PyMorphy2 on GitHub

RuBERT and other DeepPavlov models:
RuBERT on Hugging Face
Conversational RuBERT on Hugging Face
DeepPavlov on GitHub


You can immerse yourself in NLP, master various language models, and build your own Telegram bot in an online course taught by expert practitioners.
