Analysis of negative comments under TRUE CRIME videos

Hi! I'm actively trying to cover different areas of Data Science, and I thought it would be interesting to dig into Natural Language Processing (NLP) using YouTube comments as an example. Since I often watch Sasha Sulim's videos after work, I asked myself: “I wonder whether viewers rate videos about maniacs differently depending on gender? Or does it not matter to us whether the killer was a man or a woman?”

So I decided that classifying comments by how negative they are would make a good pet project. I suggest you judge for yourself how well it turned out.

All the code can be found on GitHub; in this article I will walk through the research process in more detail.
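
For reference, the snippets in this article come from a notebook and rely on imports that are not shown explicitly. Here is a sketch of what they most likely are, inferred from the calls used below (the exact list in the repository may differ):

# standard library
import string
import time

# data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# NLP: tokenization, stop words, stemming
# (nltk.download('punkt') and nltk.download('stopwords') are needed on first run)
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

# machine learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, confusion_matrix

# scraping YouTube comments
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from chromedriver_py import binary_path

# word clouds
from wordcloud import WordCloud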

Dataset

For training I chose a dataset from Kaggle with comments collected from 2ch.hk and pikabu.ru. The average comment is 175 characters long, the shortest is 21 characters, and the longest is 7,403.

EDA (Exploratory Data Analysis)

First, let's see what our dataset looks like. To do this, we'll run a standard analysis:

df = pd.read_csv("./data/labeled.csv", sep=',')
df.shape
>>> (14412, 2)

# cast the "toxic" column to int for convenience
df["toxic"] = df["toxic"].apply(int)

df["toxic"].value_counts()
>>> 0    9586
>>> 1    4826

# check that there are no empty values
df[df["toxic"] == 0]["comment"].isna().sum()
>>> 0

So, the dataset contains 14,412 comments: 4,826 negative (toxic) and 9,586 neutral.

Text preprocessing

Any raw data needs to be preprocessed. The important steps here are tokenization, removal of punctuation and stop words, and stemming. Let's get started!

# take one comment as an example
example = df.iloc[1]["comment"]
print(f"Original text: {example}")
>>> Original text: Хохлы, это отдушина затюканого россиянина, мол, вон, а у хохлов еще хуже. Если бы хохлов не было, кисель их бы придумал.

# split into tokens
tokens = word_tokenize(example, language="russian")
print(f"Tokens: {tokens}")
>>> Tokens: ['Хохлы', ',', 'это', 'отдушина', 'затюканого', 'россиянина', ',', 'мол', ',', 'вон', ',', 'а', 'у', 'хохлов', 'еще', 'хуже', '.', 'Если', 'бы', 'хохлов', 'не', 'было', ',', 'кисель', 'их', 'бы', 'придумал', '.']

# remove all punctuation and stop words
tokens_without_punct = [i for i in tokens if i not in string.punctuation]
stop_words = stopwords.words("russian")
tokens_without_punct_and_stopwords = [i for i in tokens_without_punct if i not in stop_words]
print(f"Tokens without punctuation: {tokens_without_punct}")
print(f"Tokens without punctuation and stop words: {tokens_without_punct_and_stopwords}")
>>> Tokens without punctuation: ['Хохлы', 'это', 'отдушина', 'затюканого', 'россиянина', 'мол', 'вон', 'а', 'у', 'хохлов', 'еще', 'хуже', 'Если', 'бы', 'хохлов', 'не', 'было', 'кисель', 'их', 'бы', 'придумал']
>>> Tokens without punctuation and stop words: ['Хохлы', 'это', 'отдушина', 'затюканого', 'россиянина', 'мол', 'вон', 'хохлов', 'хуже', 'Если', 'хохлов', 'кисель', 'придумал']

# next comes stemming: reducing words to their base/root form
snowball = SnowballStemmer(language="russian")
stemmed_tokens = [snowball.stem(i) for i in tokens_without_punct_and_stopwords]
print(f"Tokens after stemming: {stemmed_tokens}")
>>> Tokens after stemming: ['хохл', 'эт', 'отдушин', 'затюкан', 'россиянин', 'мол', 'вон', 'хохл', 'хуж', 'есл', 'хохл', 'кисел', 'придума']

Since preprocessing will be repeated, let's wrap all of the above transformations into a helper function for convenience.

snowball = SnowballStemmer(language="russian")
russian_stop_words = stopwords.words("russian")

def tokenize_sentence(sentence: str, remove_stop_words: bool = True):
    tokens = word_tokenize(sentence, language="russian")
    tokens = [i for i in tokens if i not in string.punctuation]
    if remove_stop_words:
        tokens = [i for i in tokens if i not in russian_stop_words]
    tokens = [snowball.stem(i) for i in tokens]
    return tokens
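
As a quick sanity check, calling the helper on the same example comment reproduces the stemmed tokens we obtained step by step above:

# the helper bundles tokenization, punctuation/stop-word removal and stemming
tokenize_sentence(example)
>>> ['хохл', 'эт', 'отдушин', 'затюкан', 'россиянин', 'мол', 'вон', 'хохл', 'хуж', 'есл', 'хохл', 'кисел', 'придума']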

Great, now let's split the dataset into training and test samples and compare their class distributions.

train_df, test_df = train_test_split(df, test_size = 500, random_state=234)
print(train_df.shape)
print(test_df.shape)
>>> (13912, 2)
>>> (500, 2)

# compare the distribution of the target variable
for sample in [train_df, test_df]:
    print(sample[sample['toxic'] == 1].shape[0] / sample.shape[0])
>>> 0.3356095457159287
>>> 0.314

Resulting distribution:

  • Training sample: 33.56% toxic comments

  • Test sample: 31.4% toxic comments

The target class is distributed similarly across the two samples, so our future model should get a fair evaluation on the test data.

TF-IDF

Before we can train our model, we need to transform our comments into numeric arrays. To do this, we'll use TF-IDF vectorization.

TF measures how often a term (word) appears in a document. The formula for calculating TF is:

\text{TF}(t, d) = \frac{f(t, d)}{N_d}

where f(t, d) is the number of occurrences of term t in document d, and N_d is the total number of terms in document d.

IDF measures the importance of a term relative to the entire corpus of documents. The less frequently a term occurs in a corpus, the higher its IDF. The formula for calculating IDF is:

  \text{IDF}(t, D) = \log \left( \frac{N}{\left|\{d \in D : t \in d\}\right|} \right)

where N is the total number of documents in the corpus D, and |{d ∈ D : t ∈ d}| is the number of documents containing the term t.

TF-IDF combines TF and IDF to estimate the importance of a term in a given document. The formula for calculating TF-IDF is:

  \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)
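
For intuition, here is a quick worked example with toy numbers (not from our dataset). Suppose a term appears 3 times in a 30-word comment and occurs in 2 of the 4 documents in the corpus:

  \text{TF} = \frac{3}{30} = 0.1, \qquad \text{IDF} = \log\left(\frac{4}{2}\right) \approx 0.693, \qquad \text{TF-IDF} \approx 0.1 \times 0.693 \approx 0.069

(Note that scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes the resulting vectors by default, so its values will differ slightly from this bare formula.)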

To compute TF-IDF we will use the scikit-learn library.

# initialize the vectorizer and apply it to our samples
count_idf_1 = TfidfVectorizer(ngram_range = (1,1), tokenizer=lambda x: tokenize_sentence(x, remove_stop_words=True))
tf_idf_base_1 = count_idf_1.fit(df['comment'])
tf_idf_train_base_1 = count_idf_1.transform(train_df['comment'])
tf_idf_test_base_1 = count_idf_1.transform(test_df['comment'])

# print the matrix shapes to make sure everything is correct:
print(tf_idf_train_base_1.shape)
print(tf_idf_test_base_1.shape)
>>> (13912, 36122)
>>> (500, 36122)

As an example, let's look at what TF-IDF produces for one of the comments.

sample = test_df.sample(n=1)['comment']
sample_tf_idf = count_idf_1.transform(sample)
sample_tf_idf.shape
>>> (1, 36122)

array = sample_tf_idf.toarray()
array
>>> array([[0., 0., 0., ..., 0., 0., 0.]])

# what our comment looks like before vectorization
sample
>>> 12391    Что касается 3 млн, у Кия самая дорогая машина...

# extract and print the non-zero elements, which correspond to the meaningful words:
array[array != 0]
>>> array([0.27552192, 0.25845753, 0.24785363, 0.19574676, 0.13724815,
           0.25845753, 0.13854953, 0.21636683, 0.18436214, 0.2040751 ,
           0.25845753, 0.23449431, 0.13459448, 0.37887959, 0.20099479,
           0.14063173, 0.15832929, 0.10074052, 0.11669742, 0.25845753,
           0.25845753, 0.06473031])

Now that our comments have a vector representation, we can move on to training the model.

Model training

I used logistic regression as a baseline because it is well suited for binary classification problems.

If you are not yet familiar with this model but have already heard of linear regression, you are almost there: logistic regression is essentially a linear regression whose output is passed through a logistic function (for example, the sigmoid).

Sigmoid function formula:

\sigma(z) = \frac{1}{1 + e^{-z}}

where z is a linear combination of the features and their weights:

z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n

The value of σ(z) lies between 0 and 1, which is interpreted as a probability.
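
To make this concrete, here is a tiny sketch with made-up weights (not our trained model) showing how the sigmoid squashes the linear combination z into a probability:

# toy illustration of logistic regression: sigmoid applied to a linear combination
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

beta_0 = -0.3                      # hypothetical intercept β0
beta = np.array([0.5, -1.2, 2.0])  # hypothetical weights β1..β3
x = np.array([1.0, 0.4, 0.7])      # hypothetical feature vector

z = beta_0 + beta @ x              # the linear-regression part
print(round(sigmoid(z), 3))        # squashed into (0, 1), interpreted as P(toxic)
>>> 0.754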

# initialize the model
model_lr_base_1 = LogisticRegression(solver="lbfgs", random_state=234, max_iter=10000, n_jobs=-1)

# train the model
model_lr_base_1.fit(tf_idf_train_base_1, train_df['toxic'])

# get the predicted class probabilities
predict_lr_base_proba = model_lr_base_1.predict_proba(tf_idf_test_base_1)
predict_lr_base_proba
>>> array([[0.85603587, 0.14396413],
           [0.29448938, 0.70551062],
           [0.41543358, 0.58456642],
           [0.77011541, 0.22988459],
           [0.62820949, 0.37179051],
           ...
           [0.82299013, 0.17700987]])

Each row of predict_lr_base_proba is a pair of numbers: the probability that the comment is non-toxic (the first number) and the probability that it is toxic (the second number).
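
As a sanity check (a snippet I'm adding for illustration, not from the original notebook): the probabilities in each row sum to one, and hard class labels are obtained by thresholding the toxic-class probability, which is exactly what the confusion matrix below does with a 0.5 threshold.

# each row of predicted probabilities sums to 1
print(predict_lr_base_proba.sum(axis=1)[:5])
>>> [1. 1. 1. 1. 1.]

# hard labels: 1 (toxic) if the probability in the second column exceeds 0.5, else 0
predict_lr_base = (predict_lr_base_proba[:, 1] > 0.5).astype(int)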

Model evaluation

I also suggest comparing the quality of our model against a random ("coin-flip") classifier.

# random classifier: assigns each comment a uniform random score in [0, 1]
def coin_classifier(X: np.array) -> np.array:
    predict = np.random.uniform(0.0, 1.0, X.shape[0])
    return predict

coin_predict = coin_classifier(tf_idf_test_base_1)

Let's visualize the ROC curves and output the confusion matrix.

# for our logistic regression model
fpr_base, tpr_base, _ = roc_curve(test_df['toxic'], predict_lr_base_proba[:, 1])
roc_auc_base = auc(fpr_base, tpr_base)

# for the random classifier
fpr_coin, tpr_coin, _ = roc_curve(test_df['toxic'], coin_predict)
roc_auc_coin = auc(fpr_coin, tpr_coin)

fig = make_subplots(1,1,
                    subplot_titles = ["Receiver operating characteristic"],
                    x_title="False Positive Rate",
                    y_title = "True Positive Rate"
                   )
fig.add_trace(go.Scatter(
    x = fpr_base,
    y = tpr_base,
    #fill="tozeroy",
    name = "ROC base (area = %0.3f)" % roc_auc_base,
    ))
fig.add_trace(go.Scatter(
    x = fpr_coin,
    y = tpr_coin,
    mode="lines",
    line = dict(dash="dash"),
    name="Coin classifier (area = 0.5)"
    ))
fig.update_layout(
    height = 600,
    width = 800,
    xaxis_showgrid=False,
    xaxis_zeroline=False,
    template="plotly_dark",
    font_color="rgba(212, 210, 210, 1)"
    )

# confusion matrix
confusion_matrix(test_df['toxic'],
                 (predict_lr_base_proba[:, 1] > 0.5).astype('float'),
                 normalize="true",
                )
>>> array([[0.97959184, 0.02040816],
           [0.35031847, 0.64968153]])

  • The AUC of the random classifier is close to 0.5, indicating that this classifier is unable to effectively distinguish between classes.

  • The logistic regression model shows significantly better results than the random classifier, confirming its value in the task of comment classification.
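
One more note: later in the article the share of negative comments is computed at an "optimal" threshold of about 0.576, but its derivation is not shown. A common way to pick such a threshold from the ROC curve is Youden's J statistic (maximize TPR minus FPR); here is a sketch under that assumption, not necessarily how the author chose it:

# recompute the ROC curve, this time keeping the thresholds
fpr_base, tpr_base, thresholds = roc_curve(test_df['toxic'], predict_lr_base_proba[:, 1])

# Youden's J: the threshold that maximizes TPR - FPR
best_idx = np.argmax(tpr_base - fpr_base)
optimal_threshold = thresholds[best_idx]
print(optimal_threshold)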

Parsing comments

Finally, let's move on to the last part: the comments under Sasha Sulim's videos! First, let's parse all the comments under the video about female maniacs.

# initialize the Chrome WebDriver using chromedriver-py
# (note: executable_path is the Selenium 3 API; Selenium 4 expects a Service object instead)
driver = webdriver.Chrome(executable_path=binary_path)

# create a list for the scraping results
scrapped = []

# set the wait timeout in seconds and the video URL
wait = WebDriverWait(driver, 10)
driver.get("https://www.youtube.com/watch?v=Bru4DtUe_CE&t=4s")

# set the number of scrolls needed to load the comments
for item in tqdm(range(200)):
    wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
    time.sleep(2)

# collect the comments via the "#content" selector
for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content"))):
    scrapped.append(comment.text)

# close the browser
driver.quit()

Now let's strip the scraped text of everything unnecessary and save the comments.

comments = []
for part in scrapped[0].split('назад'):
    split_part = part.split('\nОТВЕТИТЬ')[0].split('\n')
    if len(split_part) > 1:
        comments.append(split_part[1])
comments = comments[3:]  # drop the extra service entries

comments_woman = comments + scrapped[1:]
comments_woman_df = pd.DataFrame({'comment':comments_woman})

comments_woman_df.to_csv('/Users/amakarshina/Desktop/Toxic_comments/Pet-projects/Toxic_comments/data/' + 'comments_woman.csv')
comments_woman_df = comments_woman_df[comments_woman_df['comment'].str.len() > 0]
comments_woman_df
Example of comments from the video about female killers.

At the time of writing, there were 2,358 comments under the video about female killers.

Now let's repeat the parsing for the video about a male maniac.

driver = webdriver.Chrome(executable_path=binary_path)
scrapped_man = []
wait = WebDriverWait(driver, 10)
driver.get("https://www.youtube.com/watch?v=_8bXHh3pOvA&t=156s")
for item in tqdm(range(200)):
    wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
    time.sleep(2)
for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content"))):
    scrapped_man.append(comment.text)
driver.quit()
# clean up the extra text
comments_man = []
for part in scrapped_man[0].split('назад'):
    split_part = part.split('\nОТВЕТИТЬ')[0].split('\n')
    if len(split_part) > 1:
        comments_man.append(split_part[1])
# save (appending scrapped_man[1:] here, so the two videos' comments don't get mixed)
comments_man = comments_man + scrapped_man[1:]
comments_man_df = pd.DataFrame({'comment':comments_man})

comments_man_df.to_csv('/Users/amakarshina/Desktop/Toxic_comments/Pet-projects/Toxic_comments/data/' + 'comments_man.csv')
comments_man_df = comments_man_df[comments_man_df['comment'].str.len() > 0]
An example of comments from the video about a male killer.

At the time of writing, the Jack the Ripper video had 2,323 comments.

Keywords

For greater clarity, let's visualize the keywords that appear most often in the comments.

man_counter = CountVectorizer(ngram_range=(1, 1))
woman_counter = CountVectorizer(ngram_range=(1, 1))

# apply the counters to the texts
# ('text_clear' is the preprocessed comment text; the cleaning step itself is not shown here)
man_count = man_counter.fit_transform(comments_man_df['text_clear'])
woman_count = woman_counter.fit_transform(comments_woman_df['text_clear'])

# build DataFrames with word frequencies
man_frequence = pd.DataFrame(
    {'word': man_counter.get_feature_names_out(),
     'frequency': man_count.toarray().sum(axis=0)}
).sort_values(by='frequency', ascending=False)

woman_frequence = pd.DataFrame(
    {'word': woman_counter.get_feature_names_out(),
     'frequency': woman_count.toarray().sum(axis=0)}
).sort_values(by='frequency', ascending=False)
display(man_frequence.shape[0])
display(woman_frequence.shape[0])

# keep only the words that are unique to each video (top 100 by frequency)
man_frequence_filtered = man_frequence.query('word not in @woman_frequence.word')[:100]
woman_frequence_filtered = woman_frequence.query('word not in @man_frequence.word')[:100]

# build the word cloud for the male-maniac video
wordcloud_man = WordCloud(
    background_color="black",
    colormap='Blues',
    max_words=200,
    width=1600,
    height=1600
).generate_from_frequencies(dict(man_frequence_filtered.values))

# build the word cloud for the female-maniac video (using the filtered top-100, mirroring the male one)
wordcloud_woman = WordCloud(
    background_color="black",
    colormap='Oranges',
    max_words=200,
    width=1600,
    height=1600
).generate_from_frequencies(dict(woman_frequence_filtered.values))

# visualize
fig, ax = plt.subplots(1, 2, figsize=(20, 12))

ax[0].imshow(wordcloud_man, interpolation='bilinear')
ax[1].imshow(wordcloud_woman, interpolation='bilinear')

ax[0].set_title(
    'Top 100 most frequent words unique to\nthe comments under the male-maniac video',
    fontsize=20
)
ax[1].set_title(
    'Top 100 most frequent words unique to\nthe comments under the female-maniac video',
    fontsize=20
)

ax[0].axis("off")
ax[1].axis("off")

plt.show()

Model evaluation on our videos

Let's move on to the final assessment: we will compute the proportion of negative comments at the optimal threshold value of about 0.576.
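
The scoring step that produces the negative_proba column is not shown in the article; presumably it applies the already-fitted vectorizer and model to the parsed comments. A minimal sketch of how it could look:

# score the parsed comments: TF-IDF transform with the fitted vectorizer,
# then take the toxic-class probability from the trained logistic regression
for frame in (comments_woman_df, comments_man_df):
    frame['negative_proba'] = model_lr_base_1.predict_proba(
        count_idf_1.transform(frame['comment'])
    )[:, 1]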

woman_share_neg = (comments_woman_df['negative_proba'] >  0.575758).sum() / comments_woman_df.shape[0]
woman_share_neg
>>> 0.766156462585034

man_share_neg = (comments_man_df['negative_proba'] >  0.575758).sum() / comments_man_df.shape[0]
man_share_neg
>>> 0.7492447129909365

Conclusions

  • High proportion of negative comments: Both videos have a significant share of negatively coloured comments, exceeding 70%. This indicates that the majority of comments under TRUE CRIME videos are indeed negative.

  • Little difference between the sexes: the share of negative comments under the video about female killers is slightly higher than under the video about the male maniac (0.766 vs. 0.749). Overall, the difference in comment tone between the two videos is small; a quick way to check this formally is sketched below.
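
To check how small this difference really is, one option is a two-proportion z-test; the sketch below is not part of the original analysis and assumes statsmodels is installed:

# two-proportion z-test on the shares of negative comments
from statsmodels.stats.proportion import proportions_ztest

counts = np.array([
    (comments_woman_df['negative_proba'] > 0.575758).sum(),
    (comments_man_df['negative_proba'] > 0.575758).sum(),
])
nobs = np.array([comments_woman_df.shape[0], comments_man_df.shape[0]])

stat, p_value = proportions_ztest(counts, nobs)
print(p_value)  # a large p-value supports "no meaningful difference between the two videos"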

I hope this little study was interesting for you. I will be glad if you subscribe to me here or on my Telegram channel, where I write about my development in Data Science and share my progress. I wish everyone great projects!
