How Do Simple NLP Models See Words? | NLP

How do models see our text?

When you start diving into NLP, you immediately wonder: how do models perceive our text and words? After all, it would hardly make sense for a model to process words as a plain sequence of letters; that would be inconvenient and hard to interpret (how would you perform operations on words?).

There are different methods of transforming words into something a model can use. One of the best known, used with relatively simple models, is TF-IDF.

How does TF-IDF work?

TF-IDF (Term Frequency-Inverse Document Frequency) is a method that converts words into numerical vectors, making them understandable for machine learning models.

Moreover, these numerical vectors contain TF-IDF values, not just arbitrary numbers.

TF-IDF values try to capture how important a word is within a given document (a piece of our text data).

When we have text data, we need to break it into pieces (documents). This can be done by sentence, by semantic paragraph, by whole text, or in some other way.

Here are the formulas for calculating TF-IDF:

TF(t, d) = \frac{n_t}{\sum_{k} n_k}

t – our word
d – our document

In this formula, we divide the number of occurrences of our word in the given document by the total number of words in that document.

TF (Term Frequency) indicates how frequently our word appears in the document.
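As a minimal sketch of the TF formula above (function and sample sentence are illustrative, not from the article's code):

```python
def tf(word: str, document: str) -> float:
    """Occurrences of `word` in `document` divided by the document length in words."""
    words = document.split()
    return words.count(word) / len(words)

doc = "the cat sat on the mat"
print(tf("the", doc))  # 2 occurrences out of 6 words -> 0.333...
```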

IDF(t, D) = \log\left(\frac{|D|}{|\{ d_i \in D \mid t \in d_i \}|}\right)

t – our word
D – document corpus (the collection of all our text data)

In this formula, we take the logarithm of the total number of documents (|D|) divided by the number of documents in the corpus D that contain our word.

The logarithm is used to smooth the values.

IDF (Inverse Document Frequency) denotes how rare our word is across the documents of the corpus.
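The IDF formula can be sketched the same way on a toy three-document corpus (names and corpus are illustrative; note this sketch assumes the word occurs in at least one document, otherwise the division fails):

```python
import math

def idf(word: str, documents: list) -> float:
    """log( total documents / documents containing the word )."""
    containing = sum(1 for d in documents if word in d.split())
    return math.log(len(documents) / containing)

corpus = ["the cat sat", "the dog ran", "a bird flew"]
print(idf("the", corpus))   # appears in 2 of 3 docs -> log(3/2) ~ 0.405
print(idf("bird", corpus))  # appears in 1 of 3 docs -> log(3)   ~ 1.099
```

A common word like "the" gets a low IDF, while a rare word like "bird" gets a high one.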

Multiplying TF and IDF, we obtain the TF-IDF formula:

TF\_IDF(t, d, D) = TF(t, d) \cdot IDF(t, D)

Words that appear frequently within a specific document but rarely in the other documents will receive large TF-IDF weights.
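A quick worked example of the product, using the same toy corpus idea (purely illustrative, not the article's data):

```python
import math

# TF-IDF of "bird" in the third document of a toy corpus.
corpus = ["the cat sat", "the dog ran", "a bird flew"]
doc = corpus[2]

tf = doc.split().count("bird") / len(doc.split())                             # 1/3
idf = math.log(len(corpus) / sum(1 for d in corpus if "bird" in d.split()))   # log(3/1)
print(tf * idf)  # ~ 0.366
```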

A plan for writing our own TF-IDF

To consolidate this and understand it better, let's write TF-IDF ourselves.

Plan:

  • agree on the format in which the text data will be received

  • compute TF-IDF

  • return the necessary information

We will receive data as a list of documents (in our case, a list of sentences).

We will return a matrix of TF-IDF values for each word/document pair.

Code

First, we import the required libraries.

import numpy as np
import pandas as pd

We read the data and bring it into the required format: each document (in our case, a sentence) is on its own line, so we split on "\n" and get a list.
(The text was taken from a random Wikipedia article.)

with open('data.txt', 'r') as file:
    content = file.read()
    print(content)

data = content.split("\n")

Now the hardest part: let's write the tf_idf function.

Let's break it down into four parts:

#1
vocab = [] # build the list of all words
for text in texts: # go through each document (sentence)
  words = text.split() # split it into words
  for word in words: # go through each word
    if not word in vocab: # if the word is not in our list yet
      vocab.append(word) # add the new word

This is how we fill the list with unique words.

#2

# our TF dictionary, in the format "word": [tf_in_doc1, tf_in_doc2, ..., tf_in_docN]
tf_dict = {}
for word in vocab: # go through each word in the vocabulary
  tf_dict_this_word = [] # list of tf values (one per document) for this word
  for text in texts: # go through all documents
    if word in text.split(): # if the word occurs in the document (compare whole words, not substrings)
      count_word = text.split().count(word) # count how many times it occurs there
      
      # compute tf for this document and append it to the list of tf values for this word
      # tf = occurrences of the word in the document / document length in words
      tf_dict_this_word.append(count_word/len(text.split()))
    else:
      tf_dict_this_word.append(0) # if the word is not in this document, append 0
  tf_dict[word] = tf_dict_this_word # add the new entry to our dictionary in the required format

Now we have a TF table for every word in every document.

#3
idf_dict = {} # IDF dictionary in the format "word": its_idf
for word in vocab: # go through all words

  # count how many documents contain this word
  count_word = sum(1 for text in texts if word in text.split())

  # compute idf and store it in the dictionary
  # idf = log( total number of documents / number of documents containing the word )
  idf_dict[word] = np.log(len(texts) / count_word)

IDF is computed; all that's left is to multiply and return.

#4

# dictionary of computed tf-idf values ( "word": [tf-idf_in_doc1, tf-idf_in_doc2, ..., tf-idf_in_docN] )
tf_idf = {}
for word in vocab: # go through each word

  # element-wise multiply the word's idf by its list of tf values
  # tf-idf = tf * idf
  tf_idf[word] = np.array(tf_dict[word])*idf_dict[word]

We collect everything into one function and return the result:

def tf_idf(texts: list):

    #vocab
    vocab = []
    for text in texts:
        words = text.split()
        for word in words:
            if not word in vocab:
                vocab.append(word)

    #tf
    tf_dict = {}
    for word in vocab:
        tf_dict_this_word = []
        for text in texts:
            if word in text.split(): # compare whole words, not substrings
                count_word = text.split().count(word)
                tf_dict_this_word.append(count_word/len(text.split()))
            else:
                tf_dict_this_word.append(0)
        tf_dict[word] = tf_dict_this_word

    #idf
    idf_dict = {}
    for word in vocab:
        count_word = sum(1 for text in texts if word in text.split())
        idf_dict[word] = np.log(len(texts) / count_word)

    #tf-idf
    tf_idf = {}
    for word in vocab:
        tf_idf[word] = np.array(tf_dict[word])*idf_dict[word]

    return tf_idf

Example of use

(The text was taken from this article.)

Our text data (data.txt):

Altov's works were performed by Gennady Khazanov (Hercules, Vobla, Choir at the Embassy, Wolves and Sheep, Swimming Trunks), Klara Novikova (Carmen), Efim Shifrin (Penitent Mary Magdalene, Assassination, Wandering Breast, Cinderella, Oasis, Sexonfu, Bull, Personal Example), Vladimir Vinokur (Somersault of Fate).
In addition, the author also performs his own works.
Semyon Altov stands out among other performing humorous writers with his unique performing style – Altov reads his monologues with an impenetrable and even gloomy expression on his face, in a monotonous low voice with a unique accent, without even smiling.
Altov's manner of pronunciation is parodied by many pop artists (the Ponomarenko Brothers, Igor Khristenko, etc.).

my_tf_idf = tf_idf(data) # call our function, passing in our data

# for a nicer view, create a DataFrame and transpose it (again, purely for looks)
tfidf_table = pd.DataFrame(my_tf_idf).T

print(tfidf_table) # look at the result

We get a table of TF-IDF values: one row per word, one column per document.

Hooray, we did it: we wrote our own TF-IDF.

Thank you♥

Resources

Code on Google Colab
Wikipedia
The article we used
TF_IDF
Github
Kaggle
I'm on Github
I'm on Kaggle
My Kaggle Dataset
