Parsing Text Data with NLTK and Python

NLTK gives you access to text corpora and lexical resources such as WordNet, allowing you to work with huge amounts of text data. This makes NLTK a powerful tool for analyzing and processing text in different languages.

NLTK is a freely available Python library designed to work with human language. It is a comprehensive set of tools designed for symbolic and statistical natural language processing. It provides easy access to more than 50 text corpora and lexical resources such as WordNet, as well as a set of libraries for classification, tokenization, stemming, part-of-speech tagging, parsing, and semantic reasoning.

Let’s quickly install it (assuming you already have Python).

Open a command prompt or terminal and run the following command:

pip install nltk
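
If you want to make sure the package is importable, you can print its version (the exact number depends on what pip installed):

python -c "import nltk; print(nltk.__version__)"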

NLTK provides access to a variety of text corpora and pre-trained models that can be useful in various NLP tasks. This data is not automatically installed with the library, so it must be downloaded separately. To do this, use the following code:

import nltk

nltk.download('popular')

The command nltk.download('popular') downloads the most commonly used corpora and models. If you require specific resources, you can download them by replacing 'popular' with the corresponding name. For example, to download WordNet use nltk.download('wordnet').
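
To check that the data is in place, you can print the English stop word list (the stopwords corpus is part of the 'popular' collection):

import nltk
from nltk.corpus import stopwords

# Print the built-in English stop word list
print(stopwords.words('english'))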

If you see a list of English stop words, then the NLTK library is installed correctly.

Text Preprocessing Techniques in NLP Using NLTK

Tokenization

Tokenization is the process of breaking text into smaller parts, such as words or sentences. This is the first step in text analysis, which allows you to convert continuous text into discrete elements that can be worked on separately. This process helps in identifying key words and phrases and facilitates subsequent text analysis.

  1. In NLTK you can tokenize text like this:

    • Splitting into words:

      import nltk
      nltk.download('punkt')

      This code downloads the punkt data, which is used to tokenize text.

      from nltk.tokenize import word_tokenize
      
      text = "NLTK упрощает обработку текста."
      word_tokens = word_tokenize(text)
      print(word_tokens)

      Code result: ['NLTK', 'упрощает', 'обработку', 'текста', '.']

    • Splitting into sentences:

      from nltk.tokenize import sent_tokenize
      
      text = "OTUS. Наш сайт https://otus.ru/."
      sentence_tokens = sent_tokenize(text)
      print(sentence_tokens)

Tokenization is useful in tasks where you need to analyze individual words or phrases, such as identifying keywords in text, analyzing word frequency, or training machine learning models to classify text.
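
As a quick illustration of the word-frequency use case, here is a minimal sketch built on NLTK's FreqDist (the sample sentence is just an invented example):

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "Text processing is useful, and text processing is everywhere."
tokens = word_tokenize(text.lower())

# Count how often each token occurs and show the three most common
freq = FreqDist(tokens)
print(freq.most_common(3))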

Removing stop words

Stop words are common words in a language that usually carry little meaning (for example, “and”, “in”, “on”). Removing them allows you to reduce the amount of data for analysis and focus on more meaningful words, which increases the accuracy and efficiency of text processing.

  1. Code examples:

    • Filtering stop words in Russian:

      First, download the stopwords data using the NLTK Downloader:

      import nltk
      nltk.download('stopwords')
      from nltk.corpus import stopwords
      from nltk.tokenize import word_tokenize
      
      text = "NLTK помогает в удалении стоп-слов из текста."
      tokens = word_tokenize(text)
      stop_words = set(stopwords.words('russian'))
      filtered_tokens = [word for word in tokens if word not in stop_words]
      
      print(filtered_tokens)

      Result: ['NLTK', 'помогает', 'удалении', 'стоп-слов', 'текста', '.']

    • Filtering stop words in English:

      text = "NLTK helps in removing stopwords from the text."
      tokens = word_tokenize(text)
      filtered_tokens = [word for word in tokens if not word in stopwords.words('english')]
      print(filtered_tokens)
      

Stop word removal is often used in text processing tasks such as sentiment analysis, text classification, word cloud generation, and information retrieval where it is important to extract key information from text.

Stemming

Stemming is the process of reducing words to their basic (root) form by removing endings and suffixes. This helps reduce text complexity and improve the performance of analysis algorithms.

  1. Code examples:

    • Stemming in English:

      from nltk.stem import PorterStemmer
      from nltk.tokenize import word_tokenize
      
      stemmer = PorterStemmer()
      text = "The stemmed form of leaves is leaf"
      tokens = word_tokenize(text)
      stemmed_words = [stemmer.stem(word) for word in tokens]
      print(stemmed_words)
      

      Result: ['the', 'stem', 'form', 'of', 'leav', 'is', 'leaf']

    • Stemming in Russian:

      from nltk.stem.snowball import SnowballStemmer
      from nltk.tokenize import word_tokenize
      
      stemmer = SnowballStemmer("russian")
      text = "Листовые листочки лист листва листве почему так"
      tokens = word_tokenize(text)
      stemmed_words = [stemmer.stem(word) for word in tokens]
      print(stemmed_words)

Stemming is most useful in tasks where it is important to reduce the variety of word forms, such as text indexing for search engines, large-volume text analytics, and training machine learning models to classify or cluster texts.

Lemmatization

Unlike stemming, lemmatization reduces words to their lemma, the dictionary form of a word. It is a more complex process that takes the morphological analysis of words into account, so lemmatization handles words more accurately than stemming.

  1. Code examples:

    • Lemmatization in English:

      You need to download the wordnet and omw-1.4 data using the NLTK Downloader (see the note on part-of-speech tags below):

      import nltk
      nltk.download('wordnet')
      nltk.download('omw-1.4')
      from nltk.stem import WordNetLemmatizer
      from nltk.tokenize import word_tokenize
      
      lemmatizer = WordNetLemmatizer()
      text = "The lemmatized form of leaves is leaf"
      tokens = word_tokenize(text)
      lemmatized_words = [lemmatizer.lemmatize(word) for word in tokens]
      print(lemmatized_words)
    • Lemmatization in Russian (using a stemmer, since NLTK does not provide a direct lemmatizer for Russian):

      from nltk.stem.snowball import SnowballStemmer
      from nltk.tokenize import word_tokenize

      stemmer = SnowballStemmer("russian")
      text = "Лемматизированная форма слова листья это лист"
      tokens = word_tokenize(text)
      lemmatized_words = [stemmer.stem(word) for word in tokens]
      print(lemmatized_words)
      

Lemmatization is important in tasks where high accuracy of text processing is required, such as machine translation, semantic text analysis and the creation of question-answer systems, where it is important to accurately understand the meaning of words in context.
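
One more note on the English example above: WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag via the pos argument, and the tag can change the result. A small sketch (the outputs in the comments are what WordNet typically returns and are meant as an illustration):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS tag the word is treated as a noun
print(lemmatizer.lemmatize("leaves"))           # leaf
# With pos='v' the same word is treated as a verb form
print(lemmatizer.lemmatize("leaves", pos="v"))  # leave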

Sentiment Analysis

Sentiment analysis, sometimes called “sentiment detection,” involves the use of NLP, statistical, or machine-learned algorithms to examine, identify, and extract sentiment information from texts. It can be as simple as identifying whether a review is positive or negative, or as complex as identifying more subtle emotional states such as irony or disappointment.

Sentiment analysis is not without its challenges. One of the main difficulties lies in the interpretation of sarcasm, irony and figurative language. For example, the phrase “Well, of course, I really liked it when my phone stopped working” actually expresses disappointment, although at first glance it may seem positive. Recognizing such subtleties requires advanced algorithms and, often, contextual analysis.

Sentiment analysis in NLTK often boils down to classifying text as positive or negative, and it can be implemented with several different approaches.

Simple classification using pre-trained data

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
text = "NLTK is amazing for natural language processing!"
print(sia.polarity_scores(text))
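
polarity_scores returns neg, neu, pos and a combined compound value. Continuing the snippet above, a common convention (not part of NLTK itself, and the 0.05 cutoff is just a typical choice) is to map compound to a label:

scores = sia.polarity_scores(text)

# Map the compound score to a coarse label; the 0.05 threshold is a tunable convention
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(label)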

Classification using custom training data

import nltk
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import subjectivity
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import mark_negation, extract_unigram_feats

nltk.download('subjectivity')

# Take 100 subjective and 100 objective sentences from the subjectivity corpus
n_instances = 100
subj_docs = [(sent, 'subj') for sent in subjectivity.sents(categories="subj")[:n_instances]]
obj_docs = [(sent, 'obj') for sent in subjectivity.sents(categories="obj")[:n_instances]]

# Split each class 80/20 into training and test documents
train_subj_docs = subj_docs[:80]
test_subj_docs = subj_docs[80:100]
train_obj_docs = obj_docs[:80]
test_obj_docs = obj_docs[80:100]
training_docs = train_subj_docs + train_obj_docs
testing_docs = test_subj_docs + test_obj_docs

# Mark negations, build unigram features that occur at least 4 times,
# and turn the documents into feature sets
sentim_analyzer = SentimentAnalyzer()
all_words_neg = sentim_analyzer.all_words([mark_negation(doc) for doc in training_docs])
unigram_feats = sentim_analyzer.unigram_word_feats(all_words_neg, min_freq=4)
sentim_analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_feats)
training_set = sentim_analyzer.apply_features(training_docs)
test_set = sentim_analyzer.apply_features(testing_docs)

# Train a Naive Bayes classifier and evaluate it on the held-out documents
trainer = NaiveBayesClassifier.train
classifier = sentim_analyzer.train(trainer, training_set)

for key,value in sorted(sentim_analyzer.evaluate(test_set).items()):
    print('{0}: {1}'.format(key, value))

Using TextBlob for Sentiment Analysis

from textblob import TextBlob
import nltk

nltk.download('movie_reviews')
nltk.download('punkt')

text = "I love NLTK. It's incredibly helpful!"
blob = TextBlob(text)
print(blob.sentiment)
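
blob.sentiment is a named tuple with polarity (from -1 to 1) and subjectivity (from 0 to 1). Continuing the example, a minimal way to turn that into a label (the zero cutoff is just an illustrative choice):

polarity = blob.sentiment.polarity

# Positive polarity -> positive text; anything else is treated as negative or neutral here
print("positive" if polarity > 0 else "negative or neutral")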

Sentiment analysis using tokenizer and stopword list

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('stopwords')
nltk.download('vader_lexicon')

stop_words = set(stopwords.words('english'))
text = "NLTK is not bad for learning NLP."
filtered_text = " ".join([word for word in word_tokenize(text) if word not in stop_words])

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(filtered_text))
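
One caveat with this approach: NLTK's English stop word list includes negations such as "not", so filtering can change the sentiment that VADER detects. Continuing the example, comparing the raw and filtered text makes this visible:

# Compare the scores before and after stop word removal
print(sia.polarity_scores(text))
print(sia.polarity_scores(filtered_text))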

Combining Lemmatization and Sentiment Analysis

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

nltk.download('wordnet')
nltk.download('vader_lexicon')

lemmatizer = WordNetLemmatizer()
text = "The movie was not good. The plot was terrible!"
lemmatized_text=" ".join([lemmatizer.lemmatize(word) for word in word_tokenize(text)])

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(lemmatized_text))

It is important to note that for some examples to work correctly, you may need to install additional libraries, such as textblob.

The Bag of Words (BoW) model is one of the simplest and most widely used methods for representing text data in Natural Language Processing (NLP). It converts text into a numeric vector in which each word is represented by the number of its occurrences.

BoW model

In the BoW model, a text (for example, a sentence or a document) is represented as a “bag” of its words, without taking into account grammar and word order, but maintaining multiplicity. This conversion of text into a set of numbers allows the use of standard machine learning techniques that work on numeric data.

Each unique word in the text corresponds to a specific index (or "slot") in the vector. If a word occurs in the text, the number of its occurrences is recorded in the corresponding slot. For example, the text "apple banana apple" turns into the vector [2, 1] if index 0 corresponds to the word "apple" and index 1 corresponds to the word "banana".
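
The same idea in code, using scikit-learn's CountVectorizer (a recent scikit-learn is assumed for get_feature_names_out; the vectorizer orders its vocabulary alphabetically, so "apple" gets index 0 and "banana" index 1):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(["apple banana apple"])

# The learned vocabulary and the resulting count vector
print(vectorizer.get_feature_names_out())  # ['apple' 'banana']
print(bow.toarray())                       # [[2 1]]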

In sentiment analysis, the BoW model is used to transform text data into a format suitable for machine learning algorithms. Thus, text data (for example, user reviews) is converted into numeric vectors on which classifiers can be trained to determine, for example, positive or negative attitudes.

Creating BoW with NLTK and using it for classification

import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

nltk.download('punkt')

# Example data
texts = ["I love this product", "This is a bad product", "I dislike this", "This is the best!"]
labels = [1, 0, 0, 1]  # 1 - positive, 0 - negative

# Tokenization
tokens = [word_tokenize(text) for text in texts]

# Building the BoW model
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([' '.join(token) for token in tokens])

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(bow, labels, test_size=0.3)

# Training the classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluating the classifier
predictions = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

Example 2: Simple Sentiment Analysis with BoW and NLTK

import nltk
import random
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

nltk.download('movie_reviews')

# Loading the data
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Preparing the data
texts = [' '.join(doc) for doc, _ in documents]
labels = [1 if category == 'pos' else 0 for _, category in documents]

# Building the BoW model
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(bow, labels, test_size=0.3)

# Training the classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Evaluating the classifier
predictions = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))

BoW serves as a bridge between textual data and numerical methods.


NLTK’s ease of use makes it an ideal choice for a wide range of text processing tasks.

You can find out more about NLTK in its official documentation.

And you can get practical analytics skills from industry experts through online courses from my colleagues at OTUS. More details are in the course catalogue.
