Determining the genre of a film by description

Author of the article: Oleg Blokhin

OTUS graduate

While searching for a topic for the final project of the Machine Learning. Professional course, I decided to experiment with data on films, cartoons, TV series and similar products. Slightly regretting that I have almost no time to watch films myself, let's get started.

Data collection

Let's collect data (movie descriptions) from the Kinopoisk website and then, based on the description of a film, determine its genre.

The URL structure of the pages listing films turned out to be trivial, and the layout of an individual movie page is also easy to parse.

With a little effort, a simple data collection algorithm was written.

Source
import numpy as np   # matrices, vectors and linear algebra
import pandas as pd  # dataframes
import time          # pauses between requests

from selenium import webdriver 

browser = webdriver.Firefox()
time.sleep(0.3)
browser.implicitly_wait(0.3) 

from bs4 import BeautifulSoup
from lxml import etree
from tqdm.notebook import tqdm

def get_dom(page_link):
    browser.get(page_link)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser') 
    return etree.HTML(str(soup)) 

def get_listpage_links(page_no):
    # page_link = f'https://www.kinopoisk.ru/lists/movies/year--2010-2019/?page={page_no}'
    page_link = f'https://www.kinopoisk.ru/lists/movies/year--2021/?page={page_no}'
    dom = get_dom(page_link) 
    return dom.xpath("body//div[@data-tid='8a6cbb06']/a[@href]/@href")

def get_moviepage_info(movie_link):
    page_link = 'https://www.kinopoisk.ru' + movie_link
    dom = get_dom(page_link)

    elem = dom.xpath("body//div/span//span[@data-tid='939058a8']")
    rating = elem[0].text if elem else ''
    elem = dom.xpath("body//div//h1[@data-tid='f22e0093']/span")
    name = elem[0].text if elem else ''

    features = {}
    elem = dom.xpath("body//div/div[@data-test-id='encyclopedic-table' and @data-tid='bd126b5e']")[0]
    for child in elem.getchildren():
        #print(etree.tostring(child))
        feature = child.xpath('div[1]')[0].text
        ahrefs = child.xpath('div[position()>1]//a[text()] | div[position()>1]//div[text() and not(*)] | div[position()>1]//span[text()]')
        values = [ahr.text for ahr in ahrefs]
        features[feature] = values

    elem = dom.xpath("body//div/p[text() and not(*) and @data-tid='bfd38da2']")
    short_descr = elem[0].text if elem else ''

    elem = dom.xpath("body//div/p[text() and not(*) and @data-tid='bbb11238']")
    descr=" ".join([x.text for x in elem])

    return (name, rating, short_descr, descr, features)

_df = pd.DataFrame(columns=['id', 'type', 'name', 'rating', 'short_descr', 'descr', 'features'])
for page_number in tqdm(range(1, 912), desc="List pages"):
    try:
        links = get_listpage_links(page_number)
        _df = pd.DataFrame(columns=['id', 'type', 'name', 'rating', 'short_descr', 'descr', 'features'])
        for movie_link in links:
          movie_id = movie_link.split('/')[1:3]
          name, rating, short_descr, descr, features = get_moviepage_info(movie_link)
          data_row = {'id':movie_id[1], 'type':movie_id[0], 'name':name, 'rating':rating, 'short_descr':short_descr, 'descr':descr, 'features': features}
          _df = pd.concat([_df, pd.DataFrame([data_row])], ignore_index=True)
        with open('kinopoisk_2010-2019.csv', 'a') as f:
            _df.to_csv(f, mode="a", header=f.tell()==0, index=False)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")

The algorithm's run (to be honest, I simply got tired of waiting) yielded over 50 thousand records about films. For our research, we only need the description and the list of genres associated with each film.
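For the analysis that follows, the collected file can be loaded back into a DataFrame and the genre list pulled out of the features dictionary. A rough sketch, assuming the Kinopoisk encyclopedic table uses the Russian key 'Жанр' for the genre row (this key name is an assumption, not shown in the scraper above):

import ast
import pandas as pd

# Sketch: load the scraped CSV and build the comma-separated genre column (key name assumed)
data = pd.read_csv('kinopoisk_2010-2019.csv')
data['features'] = data['features'].apply(ast.literal_eval)        # stringified dict -> dict
data['genre_multi'] = data['features'].apply(
    lambda f: ','.join(f.get('Жанр', [])))                         # genre list -> "drama,comedy,..."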

Data set

Looking at the number of films in each of the available genres, it becomes clear that the sample is extremely imbalanced across classes:

[Figure: number of films per genre]

Let's discard the genres with fewer than 100 films and also remove the unintelligible genre "–". After this filtering, 26 genres and more than 53 thousand films remain for experimentation. That should be enough 🙂

[Figure: number of films per genre after filtering]

Initially, I tried multi-class classification algorithms, where each record is assigned a single class label. With this approach the metric values turned out to be quite modest (which is intuitively understandable: it is often hard even for a human viewer to decide what a film has more of – drama, comedy or perhaps melodrama), so I will not waste the readers' time on it. Moreover, in my view, the genres presented on Kinopoisk are in many cases not classes in the strict mathematical sense of the classification problem: an action movie can be presented in animated form (and "cartoon" and "action movie" are separate genres in the Kinopoisk annotation), while anime is always a cartoon (may fans of this genre forgive my ignorance if I am wrong 🙂). In short, let's accept that a model where each film belongs to exactly one genre oversimplifies reality.

Instead, we will solve a multilabel classification problem, in which each film is assigned one or more genre labels.

Technically, preparing the data for this problem turned out to be quite simple with the sklearn library. In our case, the DataFrame (pandas.DataFrame) has a genre_multi column containing comma-separated genre names (for example, "drama, crime, biography, comedy"). The following code adds columns whose names match the genre classes and which contain zeros or ones, depending on whether a particular genre is listed for the picture.

Source

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb_result = mlb.fit_transform([str(data.loc[i,'genre_multi']).split(',') for i in range(len(data))])
data = pd.concat([data, pd.DataFrame(mlb_result, columns = list(mlb.classes_))], axis=1)

target_strings = mlb.classes_

The output of this code looks something like this:

(Only part of the 26 genre indicator columns is shown.)

| name | descr | lemmatized_descr | genre_multi | anime | biography | action movie | Western | military | detective | musical | adventures | real TV | family | sport | talk show | thriller | horror | fantastic | fantasy |
|------|-------|------------------|-------------|-------|-----------|--------------|---------|----------|-----------|---------|------------|---------|--------|-------|-----------|----------|--------|-----------|---------|
| 1+1 (2011) | After being injured in an accident, the rich… | suffer result accident rich… | drama, comedy, biography | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gentlemen (2019) | One cunning American from his student years… | cunning American student year bargaining… | crime, comedy, action | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The Wolf of Wall Street (2013) | 1987. Jordan Belfort becomes a broker… | Jordan Belfort became a successful broker… | drama, crime, biography, comedy | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Dividing data into training and test sets

One of the first difficulties that inevitably arises when splitting the data into training and test samples is the huge number of possible "label" combinations: with multilabel classification we predict not a single class label but a vector of zeros and ones of length N, where N is the number of genres. In our case the number of theoretically possible outcomes is 2^26, which is many times larger than the size of the whole data set.
The standard train_test_split function with the stratify option from sklearn.model_selection, as expected, could not cope with this task. A search of the web suggested the following solution, based on a paper from 2011:

Source
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.utils import indexable, _safe_indexing
from sklearn.utils.validation import _num_samples
from sklearn.model_selection._split import _validate_shuffle_split
from itertools import chain

def multilabel_train_test_split(*arrays,
                                test_size=None,
                                train_size=None,
                                random_state=None,
                                shuffle=True,
                                stratify=None):
    """
    Train test split for multilabel classification. Uses the algorithm from: 
    'Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data'.
    """
    if stratify is None:
        return train_test_split(*arrays, test_size=test_size,train_size=train_size,
                                random_state=random_state, stratify=None, shuffle=shuffle)
    
    assert shuffle, "Stratified train/test split is not implemented for shuffle=False"
    
    n_arrays = len(arrays)
    arrays = indexable(*arrays)
    n_samples = _num_samples(arrays[0])
    n_train, n_test = _validate_shuffle_split(
        n_samples, test_size, train_size, default_test_size=0.25
    )
    cv = MultilabelStratifiedShuffleSplit(test_size=n_test, train_size=n_train, random_state=random_state)
    train, test = next(cv.split(X=arrays[0], y=stratify))

    return list(
        chain.from_iterable(
            (_safe_indexing(a, train), _safe_indexing(a, test)) for a in arrays
        )
    )
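For completeness, here is a minimal usage sketch. Variable names such as data, lemmatized_descr and target_strings follow the article; the 20% test share is the one stated below, while the random_state value is an assumption:

# Usage sketch: stratified multilabel split into 80% train / 20% test (names assumed)
X = data['lemmatized_descr']   # preprocessed descriptions
y = data[target_strings]       # 0/1 genre indicator columns from MultiLabelBinarizer

train_texts, test_texts, train_y, test_y = multilabel_train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)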

All subsequent reports on classification results will be based on the same test sample, the size of which is 20% of all available data.

Classic methods

Classical methods work with pre-processed text. The standard lemmatization technique was used; the only difference was adding the stop word "фильм" ("film").

Source
import nltk
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('russian')
stop_words.append('фильм')
#word_tokenizer = nltk.WordPunctTokenizer()

import re
# Keep only Cyrillic and Latin letters plus hyphens (A-Za-z instead of the buggy A-z range)
regex = re.compile(r'[А-Яа-яA-Za-zёЁ-]+')

def words_only(text, regex=regex):
    try:
        return " ".join(regex.findall(text)).lower()
    except TypeError:  # e.g. NaN instead of a string
        return ""

from pymystem3 import Mystem
from string import punctuation

mystem = Mystem() 

#Preprocess function
def preprocess_text(text):
    text = words_only(text)
    tokens = mystem.lemmatize(text.lower())
    tokens = [token for token in tokens if token not in stop_words\
              and token != " " \
              and token.strip() not in punctuation]
    
    text = " ".join(tokens)
    
    return text
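Applied to the scraped descriptions, this function produces the lemmatized_descr column seen in the table above; a minimal sketch, assuming the DataFrame is named data:

# Sketch: build the lemmatized description column used as model input (names assumed)
data['lemmatized_descr'] = data['descr'].apply(preprocess_text)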

Logistic regression and TF-IDF

To begin with, a logistic regression model was trained, with text vectorization done using TF-IDF. Multilabel classification is achieved by wrapping the standard sklearn model in MultiOutputClassifier from the same library. Combining all these components into a single Pipeline made it possible to tune hyperparameters for the vectorizer and the logistic regression model at the same time. Convenient!

Source
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1700, min_df=0.0011, max_df=0.35, norm='l2')),
    ('logregr', MultiOutputClassifier(estimator=LogisticRegression(max_iter=10000, class_weight="balanced", multi_class="multinomial", C=0.009, penalty='l2'))),
])

pipe.fit(train_texts, train_y)

pred_y = pipe.predict(test_texts)
print(classification_report(y_true=test_y, y_pred=pred_y, target_names=target_strings))

The classification report is shown below:

[Figure: classification report, logistic regression + TF-IDF]

Looking ahead a little: the recall values for this model turned out to be the highest among all the models in the experiment.
And in general, the model's genre predictions are clearly much better than a random choice.

Catboost+TF-IDF

Let's do the same with Catboost. Although Catboost itself "can do" multilabel classification, we will take a different route (why will become clear a little later): we will likewise wrap CatBoostClassifier in MultiOutputClassifier. At the same time, let's see how the native Catboost multilabel classification performs. Looking ahead: the classification results differed little, but with MultiOutputClassifier the algorithm ran on the CPU in 89 minutes versus 150 minutes for the native Catboost multilabel classification.

Source code with MultiOutputClassifier
from catboost import CatBoostClassifier

pipe2 = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1700, min_df=0.0031, max_df=0.4, norm='l2')),
    ('gboost', MultiOutputClassifier(estimator= CatBoostClassifier(task_type="CPU", verbose=False))),
])

pipe2.fit(train_texts, train_y)

pred_y = pipe2.predict(test_texts)
print(classification_report(y_true=test_y, y_pred=pred_y, target_names=target_strings))
Source code without MultiOutputClassifier
pipe3 = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1700, min_df=0.0031, max_df=0.4, norm='l2')),
    ('gboost', CatBoostClassifier(task_type="CPU", loss_function='MultiLogloss',  class_names=target_strings, verbose=False)),
])

pipe3.fit(train_texts, train_y)

pred_y = pipe3.predict(test_texts)
print(classification_report(y_true=test_y, y_pred=pred_y, target_names=target_strings))

Catboost classification results with MultiOutputClassifier:

[Figure: Catboost classification report, with MultiOutputClassifier]

Catboost classification results without MultiOutputClassifier:

[Figure: Catboost classification report, without MultiOutputClassifier]

You can see that Catboost is biased towards precision, while its recall is noticeably worse than that of logistic regression.

Examples of characteristic words

Now it's time to explain why I needed MultiOutputClassifier even for gradient boosting: it makes it possible to extract genre-specific words from each per-genre estimator. That's what we'll do now, looking at the results in the form of word clouds 🙂

Source
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def gen_wordcloud(words, importances):
    d = {}
    for i, word in enumerate(words):
        d[word] = abs(importances[i])

    wordcloud = WordCloud()
    wordcloud.generate_from_frequencies(frequencies=d)

    return wordcloud

for idx, x in enumerate(target_strings):
    c1 = pipe['logregr'].estimators_[idx].coef_[0]
    words1 = pipe['tfidf'].get_feature_names_out()
    wc1 = gen_wordcloud(words1, c1)

    c2 = pipe2['gboost'].estimators_[idx].feature_importances_
    words2 = pipe2['tfidf'].get_feature_names_out()
    wc2 = gen_wordcloud(words2, c2)

    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 10))

    ax1.imshow(wc1, interpolation="bilinear")
    ax1.set_title(f'Log regr - {x}')
    ax1.axis('off')
    ax2.imshow(wc2, interpolation="bilinear")
    ax2.set_title(f'Catboost - {x}')
    ax2.axis('off')

    plt.tight_layout()
[Figures: word clouds of characteristic words per genre – logistic regression (left) vs. Catboost (right)]

The remaining 23 genres
[Figures: word clouds for the remaining 23 genres]

In my subjective opinion, the most striking difference between logistic regression and Catboost is in the characteristic words for the "drama" genre, which is the most richly represented in our data.

This is where we finish the experiments with classical models; we will return to them only briefly at the end of the article, when comparing their results with those of the transformer models.

Transformer models

Speaking of transformer models: let's try fine-tuning pre-trained transformer NLP models to solve our problem of determining a film's genre from its description.

The experiment was carried out on the following pre-trained models from the Hugging Face hub:

cointegrated/rubert-tiny2
ai-forever/ruBert-base
ai-forever/ruBert-large
ai-forever/ruRoberta-large

Transformer models work in conjunction with their own vectorizers (tokenizers). The source code for text tokenization is given below.

Source
from transformers import BertTokenizer, AutoTokenizer


selected_model="ai-forever/ruBert-base"
# Load the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(selected_model)

import torch
from torch.utils.data import TensorDataset


def make_dataset(texts, labels):
    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids = []
    attention_masks = []
    token_type_ids = []

    # For every sentence...
    for sent in texts:
        # `encode_plus` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        #   (5) Pad or truncate the sentence to `max_length`
        #   (6) Create attention masks for [PAD] tokens.
        encoded_dict = tokenizer.encode_plus(
                            sent,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                            max_length = 500,          # Pad & truncate all sentences.
                            padding='max_length',
                            return_attention_mask = True,   # Construct attn. masks.
                            return_tensors="pt",     # Return pytorch tensors.
                            truncation=True,
                            return_token_type_ids=True
                    )

        # Add the encoded sentence to the list.
        input_ids.append(encoded_dict['input_ids'])

        token_type_ids.append(encoded_dict['token_type_ids'])

        # And its attention mask (simply differentiates padding from non-padding).
        attention_masks.append(encoded_dict['attention_mask'])

    # Convert the lists into tensors.
    input_ids = torch.cat(input_ids, dim=0)
    token_type_ids = torch.cat(token_type_ids, dim=0)
    attention_masks = torch.cat(attention_masks, dim=0)
    labels = torch.tensor(labels.values)
    dataset = TensorDataset(input_ids, token_type_ids, attention_masks, labels)

    return dataset
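The train_dataloader and test_dataloader referenced in the training code below are built from these datasets; a minimal sketch (the batch size is per-model, see the table further down, and the shuffle choices are assumptions):

from torch.utils.data import DataLoader

# Sketch: wrap the tokenized datasets in DataLoaders (batch size is per-model, see table below)
batch_size = 8  # e.g. the value used for ai-forever/ruBert-base

train_dataset = make_dataset(train_texts, train_y)
test_dataset = make_dataset(test_texts, test_y)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)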

The transformers library provides a set of classes that equip a model with the head needed for standard tasks. In particular, the AutoModelForSequenceClassification class suits our problem. Using the parameter problem_type="multi_label_classification" we indicate that we are interested in multilabel classification; in this case the BCEWithLogitsLoss loss function is used.

Source
import transformers

model = transformers.AutoModelForSequenceClassification.from_pretrained(
    selected_model, 
    problem_type="multi_label_classification",
    num_labels = 26, # One output per genre class (26 genres after filtering).
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)
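The device variable used in the training and validation functions below is not shown in the article; it is presumably set up in the usual way:

# Pick the GPU if available and move the model to it (assumed, not shown in the article)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)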

Next, standard neural network training was performed. To track metrics that change as training progresses, I connected the excellent tensorboard utility. The graphs below were obtained using it.

Source
from torch.utils.tensorboard import SummaryWriter
from sklearn import metrics

def log_metrics(writer, loss, outputs, targets, postfix):
    # Binarize the sigmoid outputs with a 0.5 threshold
    outputs = np.array(outputs)
    predictions = np.zeros(outputs.shape)
    predictions[np.where(outputs >= 0.5)] = 1
    outputs = predictions
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score_micro = metrics.f1_score(targets, outputs, average="micro")
    f1_score_macro = metrics.f1_score(targets, outputs, average="macro")
    recall_score_micro = metrics.recall_score(targets, outputs, average="micro")
    recall_score_macro = metrics.recall_score(targets, outputs, average="macro")
    precision_score_micro = metrics.precision_score(targets, outputs, average="micro", zero_division=0.0)
    precision_score_macro = metrics.precision_score(targets, outputs, average="macro", zero_division=0.0)

    writer.add_scalar(f'Loss/{postfix}', loss, epoch)
    writer.add_scalar(f'Accuracy/{postfix}', accuracy, epoch)
    writer.add_scalar(f'F1 (Micro)/{postfix}', f1_score_micro, epoch)
    writer.add_scalar(f'F1 (Macro)/{postfix}', f1_score_macro, epoch)
    writer.add_scalar(f'Recall (Micro)/{postfix}', recall_score_micro, epoch)
    writer.add_scalar(f'Recall (Macro)/{postfix}', recall_score_macro, epoch)
    writer.add_scalar(f'Precision (Micro)/{postfix}', precision_score_micro, epoch)
    writer.add_scalar(f'Precision (Macro)/{postfix}', precision_score_macro, epoch)

The network was trained in the standard way. Since I had a computer with an NVIDIA GeForce RTX 2080 Ti (12 GB) video card, training was performed on the GPU. Different models required different batch sizes, and the time to reach the minimum of the loss function varied significantly; for ease of reading I have collected this data in the table below.
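The model_setup dictionary referenced in the code below is not shown in the article. A plausible sketch is given here: the batch sizes follow the summary table further down, while the epoch counts are purely illustrative placeholders.

# Hypothetical per-model settings: batch sizes match the summary table, epoch counts are placeholders
model_setup = {
    'cointegrated/rubert-tiny2':  {'batch_size': 32, 'epochs': 4},
    'ai-forever/ruBert-base':     {'batch_size': 8,  'epochs': 4},
    'ai-forever/ruBert-large':    {'batch_size': 2,  'epochs': 4},
    'ai-forever/ruRoberta-large': {'batch_size': 2,  'epochs': 4},
}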

Source
optimizer = torch.optim.AdamW(model.parameters(),
                  lr = 2e-5,  # learning rate (the AdamW default is 5e-5)
                  eps = 1e-8  # numerical stability term
                )

from transformers import get_linear_schedule_with_warmup

# Number of training epochs (set per model, see the table below).
epochs = model_setup[selected_model]['epochs']

# Total number of training steps is [number of batches] x [number of epochs].
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

from tqdm import tqdm

def train(epoch):
    # print(f'Epoch {epoch+1} training started.')
    total_train_loss = 0
    model.train()
    fin_targets=[]
    fin_outputs=[]
    with tqdm(train_dataloader, unit="batch") as tepoch:
        for data in tepoch:
            tepoch.set_description(f"Epoch {epoch+1}")
            ids = data[0].to(device, dtype = torch.long)
            mask = data[2].to(device, dtype = torch.long)
            token_type_ids = data[1].to(device, dtype = torch.long)
            targets = data[3].to(device, dtype = torch.float)

            res = model(ids,
                             token_type_ids=None,
                             attention_mask=mask,
                             labels=targets)
            loss = res['loss']
            logits = res['logits']

            total_train_loss += loss.item()
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(logits).cpu().detach().numpy().tolist())

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            tepoch.set_postfix(loss=loss.item())
    
    return total_train_loss / len(train_dataloader), fin_outputs, fin_targets


def validate(epoch):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    total_loss = 0.0
    with torch.no_grad():
        for step, data in enumerate(test_dataloader, 0):
            ids = data[0].to(device, dtype = torch.long)
            mask = data[2].to(device, dtype = torch.long)
            token_type_ids = data[1].to(device, dtype = torch.long)
            targets = data[3].to(device, dtype = torch.float)
            res = model(ids,
                             token_type_ids=None,
                             attention_mask=mask,
                             labels=targets)
            loss = res['loss']
            logits = res['logits']
            total_loss += loss.item()
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(logits).cpu().detach().numpy().tolist())
    return total_loss/len(test_dataloader), fin_outputs, fin_targets

writer = SummaryWriter(comment="-" + selected_model.replace('/', '-'))

for epoch in range(epochs):
    avg_train_loss, outputs, targets = train(epoch)

    log_metrics(writer, avg_train_loss, outputs, targets, 'train')
    
    avg_val_loss, outputs, targets = validate(epoch)
    log_metrics(writer, avg_val_loss, outputs, targets, 'val')

Now let's look at the graphs. For convenience, a legend is included nearby, from which it is easy to tell which model each curve belongs to. Let's start with the loss function.

[Figures: training and validation loss curves for the four models]

It can be seen that the best result was obtained by the "mid-size" model "ai-forever/ruBert-base". "cointegrated/rubert-tiny2" lagged far behind the winner, which is understandable. Interestingly, the "large" models "ai-forever/ruBert-large" and "ai-forever/ruRoberta-large" were inferior in quality to the base model. In the case of "ai-forever/ruBert-large" this is most likely caused by not-quite-optimal training hyperparameters; reducing the learning rate, for example, might have made this model the leader.

Let's also look at the other graphs. It wasn't for nothing that I spent time on them 🙂

[Figures: accuracy, precision, recall and F1 (micro/macro) curves for the training and validation sets]

It can be seen that even though the loss on the validation set had already begun to increase, the recall and F1 metrics continued to improve, while precision deteriorated slightly.

And now the promised table. The best values for each metric are shown in bold. The last column is the time until the minimum of the loss function on the validation set was reached.

| Model name | Loss | Precision (micro / macro) | Recall (micro / macro) | F1 (micro / macro) | Batch size | Time, min |
|------------|------|---------------------------|------------------------|--------------------|------------|-----------|
| cointegrated/rubert-tiny2 | 0.1819 | 0.6307 / 0.4246 | 0.373 / 0.2136 | 0.4688 / 0.2661 | 32 | 25 |
| ai-forever/ruBert-base | **0.1553** | **0.6863** / 0.5963 | 0.5039 / 0.4105 | 0.5811 / 0.4907 | 8 | 57 |
| ai-forever/ruBert-large | 0.1582 | 0.6673 / 0.5817 | 0.4922 / 0.3824 | 0.5665 / 0.4482 | 2 | 112 |
| ai-forever/ruRoberta-large | 0.1644 | 0.6672 / **0.6275** | **0.5457** / **0.4672** | **0.6004** / **0.5285** | 2 | 320 |

You can see that "ai-forever/ruRoberta-large" collected the largest number of best metric values, despite not having the best loss. If it weren't for the training time, I would probably have declared it the winner. Still, "ai-forever/ruBert-base" is declared the winner.

Below we will consider the results of this model only.

ruBERT-base classification report

[Figure: classification report for ai-forever/ruBert-base]

The metric values look better than those of the classical models.

Comparison classification report

Let's compare the results by looking at the classification report tables.

Logistic regression

[Figure: classification report, logistic regression]

CatBoost

[Figure: classification report, CatBoost]

ruBert-base

[Figure: classification report, ruBert-base]

As previously mentioned, the recall metric is best for our simplest model – logistic regression. The transformer model has the best precision and F1 values.

Examples

Let's look at a few examples.
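Before going through the examples, here is a rough sketch of how per-film predictions could be obtained from the fine-tuned model. The 0.5 threshold matches the one used in log_metrics above; the helper itself is an assumption, not taken from the article:

# Sketch: predict genres for a single description with the fine-tuned model (helper is assumed)
def predict_genres(text, threshold=0.5):
    enc = tokenizer(text, max_length=500, padding='max_length',
                    truncation=True, return_tensors='pt').to(device)
    model.eval()
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits)[0].cpu().numpy()
    return [genre for genre, prob in zip(target_strings, probs) if prob >= threshold]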

Gentlemen (2019)

One cunning American had been selling drugs since his student days, and eventually came up with a scheme for illegal enrichment using the estates of the impoverished English aristocracy, becoming very rich in the process. One day a nosy journalist comes to Ray, the American's right-hand man, and offers to sell him a film script that details his boss's crimes, involving other figures of the London criminal world – a Jewish partner, the Chinese diaspora, black athletes and even a Russian oligarch.

Kinopoisk genres: crime, comedy, action
Logistic regression: biography, action, documentary, crime
Catboost: crime
BERT: drama, crime

How to Train Your Dragon (2010)

You will learn the story of teenager Hiccup, who is not too close to the traditions of his heroic tribe, which has been waging war against dragons for many years. Hiccup's world is turned upside down when he unexpectedly meets the dragon Toothless, who will help him and other Vikings see the familiar world from a completely different perspective…

Kinopoisk genres: cartoon, fantasy, comedy, adventure, family

Logistic regression: anime, military, documentary, history, short, cartoon, adventure, family, fantasy
Catboost: Drama
BERT: cartoon, adventure, family, fantasy

How to make good use of skipping school (2017)

Following the city boy Paul, viewers will have to learn what they don’t teach at school. Namely, how to live in the real world. At least if it's a forest world. There is an owner here – the gloomy count, there is power – the good-natured but strict forester Borel, and there is the poacher Totosh – a man who decided to be outside the law, and in general a suspicious and unpleasant type. Which side will Paul choose: the respectable forester Borel or the poacher Totosh? Or maybe the young tomboy will become the arrogant count’s best friend?

Kinopoisk genres: drama, comedy, family

Logistic regression: anime, children's, cartoon, musical, adventure, family, fantasy
Catboost:
BERT: cartoon, family

My comment: the film is so strange that Catboost refused to classify it 🙂

How I Met Your Mother

The series takes place in two timelines: in the future – 2034 – where the father tells his children about meeting their mother and the stages of building their family, and in the present, where we see how it all began. The hero of the present day is the young architect Dima, who does not yet know how his life will turn out. One day he even thinks that he has met the girl of his dreams… But is this really so? His friends help Dima figure it out: Pasha and Lyusya, a young couple about to get married, and their mutual friend Yura, a confirmed bachelor and cynic whose ideal is a casual one-night stand. He believes there is nothing more stupid than a long relationship, and that marriage is an outdated concept.

Kinopoisk genres: comedy
Logistic regression: drama, comedy, romance
Catboost: drama, melodrama
BERT: comedy

Conclusion

The data used in my experiment is, as often happens, imperfect. For example, for the animated film "How to Train Your Dragon", in my subjective opinion, the comedy is not evident from the description – and that description was not written with the goal of preparing a good data set for machine learning 🙂 The genre information only complements the description, and it is rather subjective at that.

Nevertheless, the experiment turned out to be interesting. I hope for the readers too 🙂
