Yandex releases the largest Russian-language dataset of organization reviews

Today we want to share some news for everyone involved in data analysis, linguistics, and machine learning. Yandex is making publicly available the largest Russian-language dataset of reviews of organizations published on Yandex Maps: 500 thousand reviews from all over Russia, collected from January to July 2023.

In this article, I will explain why reviews are useful from a research point of view, what is special about this dataset, and show examples of problems it can help solve.

The big role of reviews in product creation

On Yandex services, users leave a lot of ratings and reviews in a variety of areas: from goods and services to entertainment. This makes it easier for people to choose, and allows us to better understand their needs and preferences, improve products and develop new features.

The Yandex geoservices team processes a large amount of user content about organizations: reviews, ratings, photos, and even videos. This content helps solve at least two key tasks for Maps users: finding and choosing places.

To find relevant results for the query "Where to eat delicious borscht", we need to know a lot about nearby organizations. Part of this is stored in the directory, a knowledge catalog that contains factual data about each organization (address, type of organization, operating hours, and much more). But to give a truly good answer, we must also understand that it is the borscht that visitors of a given establishment rate highly. We can answer this question thanks to reviews and ratings from restaurant visitors, by analyzing their content and extracting keywords.
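As a toy illustration of this idea (the organization names, review data, and the `find_places` helper below are made up for the sketch and are not Yandex's actual ranking code), matching a query keyword against well-rated review texts might look like this:

```python
# Made-up mini-catalog: each organization has (rating, review text) pairs
catalog = {
    "Кафе №1": {"reviews": [(5, "отличный борщ"), (4, "вкусно и уютно")]},
    "Бар №2": {"reviews": [(2, "борщ остыл"), (5, "хорошее пиво")]},
}

def find_places(keyword, min_rating=4):
    """Return places where the keyword appears in a well-rated review."""
    return [
        name
        for name, info in catalog.items()
        if any(keyword in text and rating >= min_rating
               for rating, text in info["reviews"])
    ]

print(find_places("борщ"))  # only the cafe has well-rated borscht
```

A real system would of course lemmatize the texts and weigh many more signals, but the core combination is the same: directory facts plus evidence extracted from reviews.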

The second task, choosing a place, is about displaying the most useful information on the organization card in the context of the current task. Several points are worth highlighting here.

Firstly, it is important for the user to understand why this particular organization was shown in response to their query. This problem can be solved in different ways, for example, with summaries: short extracts from user reviews.

An example of a review from which we learned that visitors praise the borscht at this establishment

Secondly, we can display the aspects or features that most users consider important for this type of organization, giving the user a more detailed assessment of the establishment and more information for making a decision.

Examples of aspects (food, staff, service) and the percentage of users who rated them positively. Such aspects can also be extracted from reviews

These are just a couple of examples of how data shared by users helps us understand their needs, the characteristics of organizations, and places.

Features of the organization dataset

So, our dataset consists of 500 thousand records collected from January to July 2023. Each record includes the address and name of the organization, a list of categories (for example, "cafe" or "restaurant"), the user's rating, and the review text. The dataset has also been cleaned of personal data that users may have accidentally left.
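As a minimal sketch of loading and unfolding such data with pandas, consider the inline two-row stand-in below. Its column names (`rubrics`, `rating`, `text`) are assumptions inferred from the code later in this article, so check the repository for the actual schema:

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV, which holds ~500 thousand rows.
# Column names here are assumed from the snippets in this article.
csv_data = io.StringIO(
    'address,name,rubrics,rating,text\n'
    '"Москва, ул. Ленина, 1",Кафе №1,"Кафе;Ресторан",5,"Отличный борщ и уютная атмосфера"\n'
    '"Казань, ул. Баумана, 2",Магазин №2,"Магазин продуктов",3,"Средний ассортимент"\n'
)

df = pd.read_csv(csv_data)

# One organization may belong to several semicolon-separated rubrics,
# so we split the column and unfold one row per rubric
df['rubrics'] = df['rubrics'].str.split(';')
print(df.explode('rubrics')[['rubrics', 'rating']])
```

This per-rubric view is the shape used for the category-level analysis shown later in the article.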

An example of working with the dataset

The most obvious application of a review dataset is probably sentiment analysis: a statistical approach to determining the emotional coloring of a text. Linguistic analysis of reviews helps us understand how people talk about different types of organizations, what words, phrases, or language constructs they use, and how this varies across geographic contexts.
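Since every review in the dataset comes with a star rating, a crude sentiment baseline is possible even before any modeling: derive labels from the ratings and compare word frequencies. The tiny review list and the >=4 / <=2 thresholds below are my own illustrative assumptions, not part of the dataset tooling:

```python
from collections import Counter

# Toy (rating, text) pairs; labels are derived from the rating:
# >= 4 counts as positive, everything else as negative here
reviews = [
    (5, "вкусный борщ приятный персонал"),
    (5, "уютная атмосфера вкусный кофе"),
    (1, "грубый персонал холодный борщ"),
    (2, "грязно и долго"),
]

pos_words, neg_words = Counter(), Counter()
for rating, text in reviews:
    counter = pos_words if rating >= 4 else neg_words
    counter.update(text.split())

# Words that appear only in positive reviews
only_positive = set(pos_words) - set(neg_words)
print(sorted(only_positive))
```

On real data, words like "персонал" or "борщ" will appear on both sides, and the interesting signal is in which side dominates for each word.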

As an example, let's look at how this data, together with open text-mining libraries, can yield the characteristics or aspects that users consider important when evaluating organizations: an analogue of what we saw above in the Yandex Maps example.

For this we will use:

  • the pymorphy2 library for morphological analysis of the text: lemmatization and part-of-speech tagging;

  • the TF-IDF implementation from the scikit-learn library (TfidfVectorizer) to highlight important words.

import pymorphy2
import re
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

From a morphological point of view, aspects (food, service, atmosphere) are usually expressed in Russian as nouns, while their coloring and evaluation are, as a rule, expressed with adjectives. Therefore, we will start by processing the texts of the dataset and extracting the parts of speech that interest us.

# Initialize a MorphAnalyzer instance
morph = pymorphy2.MorphAnalyzer(lang='ru')

# Download the stop-word list needed by the lemmatization function
nltk.download('stopwords')
stops = nltk.corpus.stopwords.words('russian')

# We can also extend the list with additional stop words
stops.extend(['что', 'это', 'так', 'вот', 'быть', 'как', 'в', 'к', 'на', 'руб', 'мой', 'твой', 'его', 'её', 'наш', 'ваш', 'их', 'свой', 'еще', 'очень', 'поэтому', 'однако', 'конечно'])
unique_stops = set(stops)

# Define a function for lemmatizing text and extracting parts of speech
def extract_nouns(text):
    nouns = []

    # Strip digits and non-word characters, collapse whitespace
    clean_text = re.sub(r'\s+', ' ', re.sub(r'[\d\W]', ' ', text))

    # Split the text into words
    words = clean_text.split()

    for word in words:
        parsed_word = morph.parse(word)[0]

        # Reduce the word to its normal form (lemma)
        normalized_word = parsed_word.normal_form
        if normalized_word not in unique_stops:

            # Determine the word's part of speech
            pos = parsed_word.tag.POS
            case = parsed_word.tag.case
            anim = parsed_word.tag.animacy

            # Keep nouns, but filter out proper names (animate nominatives)
            if pos == 'NOUN' and not (case == 'nomn' and anim == 'anim'):
                nouns.append(normalized_word)

    return ' '.join(nouns)

# Add a column with the processed text to the dataset
df = pd.read_csv('reviews.csv')
df['aspects'] = df['text'].apply(extract_nouns)

Now we can identify keywords for each category using TF-IDF analysis.

# Define a function that processes the texts and stores the analysis results
def find_top_words_by_rubric(vectorizer):

    result = {
        'rubrics': [],
        'words': [],
        'reviews': [],
        'scores': []
    }

    # Iterate over the rubrics
    for rubric in df_flattened['rubrics'].unique():
        texts = df_flattened[df_flattened['rubrics'] == rubric]['aspects']
        total_count = texts.shape[0]

        # Only analyze rubrics that have at least a few texts
        if total_count >= 5:
            tfidf_matrix = vectorizer.fit_transform(texts)
        else:
            continue

        result['rubrics'].append(rubric)
        result['reviews'].append(total_count)
        feature_names = vectorizer.get_feature_names_out()
        tfidf_scores = tfidf_matrix.max(axis=0).toarray().ravel()

        # Take the top 20 words for each rubric
        top_words_indices = tfidf_scores.argsort()[-20:][::-1]
        top_words = [feature_names[i] for i in top_words_indices]
        result['words'].append(', '.join(top_words))
        top_scores = [str(tfidf_scores[i]) for i in top_words_indices]
        result['scores'].append(', '.join(top_scores))

    return result

# Explode the dataset by rubric, since one organization may belong to several rubrics
df['rubrics'] = df['rubrics'].apply(lambda x: x.split(";"))
df_flattened = df.explode('rubrics')

# Initialize the TF-IDF vectorizer
aspects_vectorizer = TfidfVectorizer(use_idf=True, max_df=0.8, min_df=0.1)

# Build a dataframe with the analysis results
tf_idf_aspects = pd.DataFrame(find_top_words_by_rubric(aspects_vectorizer)).sort_values(by='reviews', ascending=False)

As a result, we see that words that characterize each rubric well are highlighted: for example, "assortment" stands out for stores, "atmosphere" for restaurants, and "master" for service-related categories.

An example of outputting the results of TF-IDF analysis of review texts by category

Having obtained the important words for each rubric, we can deepen the research using additional data from the dataset:

  • Evaluate which words correlate with negative or positive user ratings by breaking the texts down by rating.

  • Analyze other parts of speech: for example, find out which adjectives are associated with positive or negative reviews of organizations.

  • Move on to bigram analysis and try to understand how individual aspects are evaluated.

  • Find out whether descriptions of organizations have any geographic specificity.


To learn what other knowledge can be extracted from reviews, read the recent Yandex study “How restaurants and bars are praised and criticized”.

We hope that our dataset will be useful to the community for academic research related to text analysis in the context of reviews and geography. We will be glad to receive feedback: share your research ideas with us, or tell us how else this dataset could be useful.

To learn more about the dataset and start using it, go to our repository on GitHub.
