What is a tokenizer and why is it needed?

Imagine you are reading a book and want to find every place where the word “cat” is mentioned. Why you need this doesn't matter; for now, let's just agree that you do.

So how do you do this?

You could simply flip through the book from beginning to end, finding all the cats by hand, but that takes a lot of time and effort. It would be much easier to use an index at the back of the book that lists every place where the word “cat” appears. The problem is that regular printed books rarely have such an index, but if you are reading an e-book, you can simply use word search.

You can do this, but a computer, on its own, cannot.

Computers cannot simply read text and understand what it means. They need the help of tokenizers, which convert text into a set of tokens: individual units of information that can be analyzed and processed.

Tokenization is the first step in processing text data. Without tokenization, computers would not be able to understand text and find useful information in it. Tokenizers help convert text into data that can be analyzed and used to solve various problems such as text classification, speech recognition, machine translation and many others.

Tokenizers help computers efficiently find and organize relevant information, much like word search in an e-book makes it easier to find a specific phrase. Without them, it would be much harder for computers to “understand” and analyze text data.
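Before we get to real libraries, here is a minimal sketch of the idea in Python: even a naive whitespace split turns a string into tokens a program can count and index (real tokenizers handle punctuation, contractions and much more).

# A toy illustration of tokenization: split a sentence into pieces
# that a program can count, index and search.
text = "The cat sat on the mat, next to another cat."
tokens = text.split()  # naive whitespace split
print(tokens)
print(tokens.count("cat"))  # finds only 1, because "cat." keeps its punctuation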

So far so good, right? If so, let's move on.

Popular tokenizers: who will trample whom in the fight for better tokenization?

There are a lot of tokenizers, and each of them has its “light” and “dark” sides. Some handle certain types of text data better, while others do better with other kinds.

Now let's look at some of the most popular tokenizers in these libraries and compare them, so that later, in real work, you can choose the one that best fits your task.

NLTK (Natural Language Toolkit)

NLTK (Natural Language Toolkit) is one of the most famous libraries for processing text data in Python. It includes several different tokenizers; we'll look at three: RegexpTokenizer, TreebankWordTokenizer, and WhitespaceTokenizer.

All three split text into tokens (individual words or other units of text), but each takes a different approach.

RegexpTokenizer uses regular expressions to split text into tokens. For example, you can use it to pull out words separated by spaces, or to split text on punctuation; both are shown below.

Here's an example of using RegexpTokenizer to extract words (sequences of word characters):

from nltk.tokenize import RegexpTokenizer

text = "This is an example text."
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
print(tokens)

This code will output the following:

['This', 'is', 'an', 'example', 'text']

Note that the trailing period is not matched by \w+ and is simply dropped.
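As mentioned above, RegexpTokenizer can also be used the other way around: with gaps=True the pattern describes the separators (whitespace and punctuation) instead of the tokens themselves. A small sketch:

from nltk.tokenize import RegexpTokenizer

# Split on runs of whitespace and common punctuation
# instead of matching the tokens themselves.
text = "Hello, world! This is an example."
tokenizer = RegexpTokenizer(r'[\s,.!?]+', gaps=True)
tokens = tokenizer.tokenize(text)
print(tokens)  # ['Hello', 'world', 'This', 'is', 'an', 'example']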

TreebankWordTokenizer uses the tokenization rules of the Penn Treebank corpus to divide text into tokens. It is more accurate than RegexpTokenizer, but can be slower and more difficult to use.

It works like this:

from nltk.tokenize import TreebankWordTokenizer

text = "This is an example text."
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

Output:

['This', 'is', 'an', 'example', 'text', '.']

Here the final period becomes its own token.
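One place where the Treebank rules pay off is contractions and punctuation, which it splits apart. A quick sketch:

from nltk.tokenize import TreebankWordTokenizer

text = "Don't panic, it's fine."
tokens = TreebankWordTokenizer().tokenize(text)
# Contractions are split and punctuation becomes separate tokens:
# ['Do', "n't", 'panic', ',', 'it', "'s", 'fine', '.']
print(tokens)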

WhitespaceTokenizer, in turn, simply splits text on whitespace. It is the simplest and fastest of the three, but also the least precise, since punctuation stays attached to the neighbouring word.

It works like this:

from nltk.tokenize import WhitespaceTokenizer

text = "This is an example text."
tokenizer = WhitespaceTokenizer()
tokens = tokenizer.tokenize(text)
print(tokens)

We get:

['This', 'is', 'an', 'example', 'text.']

Notice that “text.” keeps its period, because only spaces are used as boundaries.

SpaCy

SpaCy is another popular library for processing text data in Python. It is known for its high speed and accuracy, as well as support for many languages.

Tokenization in SpaCy is built into its processing pipeline, and the library also provides DocBin, a container for efficiently storing large amounts of processed text data.

Let's take a closer look.

import spacy
from spacy.tokens import DocBin

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Create an empty DocBin object
doc_bin = DocBin()

# Texts to tokenize
texts = [
    "Hello, world!",
    "This is a sample sentence.",
    "SpaCy is an awesome tool for NLP!",
    "I love working with natural language processing.",
]

# Tokenize each text and add it to the DocBin
for text in texts:
    doc = nlp(text)
    doc_bin.add(doc)

# Save the DocBin to a file
with open("processed_texts.spacy", "wb") as f:
    f.write(doc_bin.to_bytes())

Here we use DocBin to save multiple documents (texts) after tokenizing them with spaCy. We load the en_core_web_sm model for tokenization, create an empty DocBin object, and add each processed document to it using the add() method. Once all the documents have been added, we save the DocBin to a file with the .spacy extension. This file can then be used for further text processing or analysis in spaCy.
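For completeness, here is a sketch of the reverse step: loading the saved .spacy file back and restoring the Doc objects from it (it assumes the same en_core_web_sm model is available to supply the vocabulary):

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")

# Read the serialized DocBin back from disk
with open("processed_texts.spacy", "rb") as f:
    doc_bin = DocBin().from_bytes(f.read())

# Restore the Doc objects and print their tokens
for doc in doc_bin.get_docs(nlp.vocab):
    print([token.text for token in doc])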

Gensim

Gensim is a Python library that specializes in text mining algorithms and topic modeling. It provides a simple interface for working with text data, including functionality for text vectorization, creating topic modeling models, text comparison, and other tasks.

One of the key components of Gensim is Word2Vec, a model designed to create vector representations of words based on the context in which they occur.

Let's look at an example of using Word2Vec in Gensim:

from gensim.models import Word2Vec

sentences = [["I", "love", "natural", "language", "processing"],
             ["Gensim", "is", "an", "awesome", "library", "for", "NLP"],
             ["Word2Vec", "creates", "word", "embeddings"]]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the vector representation of the word "Word2Vec"
vector = model.wv["Word2Vec"]
print(vector)

This code creates and trains a Word2Vec model on a small corpus of text consisting of three sentences. It then obtains the vector representation of the word “Word2Vec” using the trained model and outputs the resulting vector.
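Continuing that example, the trained embeddings can also be queried for the nearest neighbours of a word; on such a tiny corpus the neighbours are essentially random, so treat this purely as an API sketch:

# Words closest to "word" by cosine similarity in the trained model
similar = model.wv.most_similar("word", topn=3)
print(similar)  # a list of (word, similarity) pairs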

Other models are also available in Gensim, such as LDA (Latent Dirichlet Allocation) for topic modeling, TF-IDF for text vectorization, and others.

Here is an example of combining text tokenization and TF-IDF vectorization with the Gensim library:

# Note: gensim.sklearn_api ships with Gensim 3.x and was removed in Gensim 4.0
from gensim.sklearn_api import TfIdfTransformer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load the data
data = fetch_20newsgroups(subset="train")['data']

# Tokenize the text (nltk.download('punkt') may be needed on first run)
tokenized_data = [word_tokenize(text.lower()) for text in data]

# Turn the tokenized texts into count vectors with CountVectorizer
count_vectorizer = CountVectorizer(tokenizer=lambda x: x, lowercase=False)
X_counts = count_vectorizer.fit_transform(tokenized_data)

# Gensim's TfIdfTransformer expects bag-of-words documents,
# i.e. lists of (token_id, count) pairs, so convert the sparse rows first
bow_corpus = [list(zip(row.indices, row.data)) for row in X_counts]

# Turn the count vectors into TF-IDF vectors with Gensim's TfIdfTransformer
tfidf = TfIdfTransformer()
X_tfidf = tfidf.fit_transform(bow_corpus)

# Print the TF-IDF vector of the first document
print("TF-IDF vector of the first document:")
print(X_tfidf[0])

Here we use fetch_20newsgroups to load the “20 Newsgroups” dataset from sklearn.datasets. We then tokenize the lowercased texts with word_tokenize from NLTK and turn the token lists into count vectors with CountVectorizer from scikit-learn. Since Gensim's TfIdfTransformer works with bag-of-words documents rather than scipy sparse matrices, the count vectors are converted into lists of (token_id, count) pairs and then transformed into TF-IDF vectors. Finally, we print the TF-IDF vector of the first document in the dataset.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in the context of a collection of documents. TF-IDF vectorization helps to identify the most significant words in a document, taking into account how often a word occurs in the document and how rare it is across the whole corpus.
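As a back-of-the-envelope illustration of that idea, here is one common variant of the formula in plain Python (real implementations, including Gensim's, differ in smoothing and normalization):

import math

def tfidf(term, doc, corpus):
    # tf: how often the term occurs in this document
    tf = doc.count(term)
    # df: how many documents in the corpus contain the term
    df = sum(1 for d in corpus if term in d)
    # tfidf(t, d) = tf(t, d) * log(N / df(t))
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cat", "and", "dog"]]
print(tfidf("cat", corpus[0], corpus))  # 1 * log(3/2) ≈ 0.405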

A brief comparative analysis of tokenizers: who is better, who is worse, and who is simply elusive?

Let's start with NLTK and SpaCy, two of the most popular text processing libraries in Python.

NLTK is a simpler and more accessible library that provides many tokenizers, while SpaCy is a more powerful and faster library known for its high accuracy and support for many languages. If anyone in this fight deserves to be called “elusive,” it's SpaCy.

Let's imagine that you are working on a social media text mining project and you need to tokenize a large number of short texts, such as tweets. In this case, SpaCy is best suited as it quickly and accurately tokenizes texts, even if they contain non-standard abbreviations and emoji.

NLTK, in turn, can be a little slower and less accurate in this case, so don't waste your time.
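To make the tweet example concrete, here is a quick sketch with an invented tweet-like string (assuming en_core_web_sm is installed); with spaCy's default rules the emoji becomes its own token and the hashtag is typically split into “#” and the tag:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("omg this tokenizer is sooo good 😂 #NLP")
print([token.text for token in doc])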

Now let's look at Gensim. It is best used for analyzing and processing text data of various types: news, blogs and social networks.

For example, Gensim has a tool called Phrases that can be useful for tokenizing texts containing multi-word expressions and idioms.

Phrases is a module that scores how often words occur together and merges frequent pairs into single bigram tokens. This is useful for text classification and topic modeling, where it is important to consider the context and meaning of the text.

An example of using Phrases to tokenize text:

from gensim.models.phrases import Phrases, Phraser

sentences = [["I", "love", "natural", "language", "processing"],
             ["Gensim", "is", "an", "awesome", "library", "for", "NLP"],
             ["Word2Vec", "creates", "word", "embeddings"]]

# Train the Phrases model
bigram = Phrases(sentences, min_count=1, threshold=1)

# Wrap the model in a Phraser for faster transformation
phrases = Phraser(bigram)

# Tokenize the text, merging detected multi-word expressions
for sent in sentences:
    print(phrases[sent])

After running this code, a tokenized version of each sentence is printed, with detected multi-word expressions merged into single tokens.

A related tool is BigramCollocationFinder, which actually lives in NLTK rather than Gensim. It is used to identify and process bigrams and collocations in texts. Bigrams are pairs of words that often appear together, such as “white house.”

Collocations are a more general concept that describes the relationships between words in a text, including bigrams, trigrams, and more complex phrases. For example, “play football” may be a collocation often found in texts about football.

Gensim's own way of handling this is, again, Phrases/Phraser. Here is an example that combines NLTK tokenization with Gensim bigram detection:

from gensim.models.phrases import Phrases, Phraser
from nltk.tokenize import word_tokenize

# Sentences to tokenize
sentences = ["I love natural language processing.",
             "Gensim is an awesome library for NLP.",
             "Word2Vec creates word embeddings."]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sent) for sent in sentences]

# Build bigrams with Phrases
bigram = Phrases(tokenized_sentences, min_count=1, threshold=1)
phraser = Phraser(bigram)

# Apply the bigram model to the tokenized sentences
for sent in tokenized_sentences:
    print(phraser[sent])

NLTK's BigramCollocationFinder serves a similar purpose: it detects frequently co-occurring word pairs (collocations), which can then be merged into single tokens for further text processing and analysis.
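Since BigramCollocationFinder itself comes from NLTK, here is a short sketch of how it is typically used there; PMI is just one of the available association measures, and the toy text is invented for the example:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.tokenize import word_tokenize

text = ("I love natural language processing. "
        "Natural language processing is fun. "
        "Gensim and NLTK are libraries for natural language processing.")
tokens = word_tokenize(text.lower())

# Find bigrams, keep those that occur at least twice,
# and rank them by pointwise mutual information (PMI)
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
print(finder.nbest(BigramAssocMeasures.pmi, 5))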

How to choose a tokenizer for your task

Before settling on a specific tokenizer, it is important to test its performance on your data. Try several, compare the results, and pick the most suitable one.

For example, you can compare the accuracy and speed of different tokenizers to choose the best option.
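A rough sketch of such a comparison (the sample texts are made up, and we time only the tokenization step; results will vary with your data and hardware):

import time
import spacy
from nltk.tokenize import word_tokenize

texts = ["This is a sample sentence for a quick benchmark."] * 1000

# NLTK (nltk.download('punkt') may be needed on first run)
start = time.perf_counter()
nltk_tokens = [word_tokenize(t) for t in texts]
print("NLTK:", round(time.perf_counter() - start, 3), "s")

# spaCy, with the heavier pipeline components disabled so we time tokenization only
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
start = time.perf_counter()
spacy_tokens = [[tok.text for tok in doc] for doc in nlp.pipe(texts)]
print("spaCy:", round(time.perf_counter() - start, 3), "s")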

Sometimes a single tokenizer cannot give the best results for all types of text data. In such cases it can be useful to use several for one task: for example, one tokenizer for texts with multi-word expressions and idioms, and another for scientific texts.

Try different methods and experiment. This will help you achieve the best results.

And some more tips on tokenization:

Remove stop words, normalize the text, strip punctuation, and do similar cleanup. This improves the quality of tokenization and gives better results (a small sketch of such preprocessing follows after these tips).

Neural models that take the context and meaning of the text into account tend to tokenize more accurately, especially texts with multi-word expressions and idioms.

For scientific texts with specific terms and abbreviations, it is recommended to use tokenizers that take into account this specific language and terminology.

For texts rich in bigrams and collocations, tokenizers that account for word frequency and co-occurrence (such as Gensim's Phrases) can be more accurate than those that do not.

The same goes for texts full of domain-specific terms and abbreviations: tokenizers that take these features into account will generally outperform generic ones.
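Here is the small preprocessing sketch promised above, using NLTK; the example sentence and the choice of English stop words are arbitrary:

import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# nltk.download('punkt') and nltk.download('stopwords') may be needed on first run
text = "This is an example text, and it mentions the cat twice: cat!"

# Lowercase, tokenize, then drop stop words and punctuation
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
cleaned = [t for t in tokens if t not in stop_words and t not in string.punctuation]
print(cleaned)  # e.g. ['example', 'text', 'mentions', 'cat', 'twice', 'cat']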

Conclusions: who won the battle of tokenizers, and who remained in the shadows?

In the battle between NLTK and SpaCy, each tool has its own strengths and weaknesses. NLTK has a rich set of tokenizers and broad functionality, making it:

  1. Accessible;

  2. A universal tool.

SpaCy is a more powerful and faster library with high accuracy and support for multiple languages. Because of this alone, SpaCy is often preferable for tasks that require high speed and accuracy of text processing.

In general, as we have already noted above, when choosing a tokenizer, it is important to take into account the specifics of the task and the features of text data. For example, for short texts such as tweets, SpaCy is more suitable due to its ability to handle non-standard abbreviations and emoji.

But when it comes to working with scientific texts or texts with specific terminology, tokenizers based on rules or statistics can be more effective.

There is no clear winner in the battle of tokenizers. Each tool has its own advantages and disadvantages, and the choice of a specific tokenizer depends on the task at hand, the requirements for speed and accuracy of text processing, and the characteristics of your data.

In short, everything needs to be tested in practice, but we have tried to explain the general principles. If you have read this article to the end, thank you very much for your attention. We hope everything was clear and interesting.
