From Text to Summary: Sumy Library

Let's start by installing it:

pip install sumy
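
Sumy's tokenizer relies on NLTK's Punkt sentence models, which are downloaded separately from the pip package. Below is a minimal sketch of that one-time step. The helper name download_punkt is hypothetical, and which resource name your NLTK version expects ("punkt" or the newer "punkt_tab") is an assumption, so the sketch simply requests both.

```python
def download_punkt(resources=("punkt", "punkt_tab")):
    """Request NLTK's sentence-tokenizer data and collect the results.

    Assumption: older NLTK releases look for "punkt", newer ones for
    "punkt_tab"; asking for both covers either case.
    """
    import nltk  # imported lazily; NLTK is installed as a Sumy dependency

    # Map each resource name to whatever nltk.download reports for it.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_punkt() once after installation should be enough; the data is cached locally by NLTK.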

Main features of Sumy:

  1. Sumy supports several summarization methods including LSA, TextRank and LexRank.

  2. With a minimal amount of code, you can start summarizing text fairly quickly.

  3. Sumy integrates easily with other Python libraries.

  4. In addition to English, Sumy also provides support for Russian.

Basic syntax

Before you start, import the modules you need. This basic template covers most cases:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer

Parsers convert text into a format that can be used for summarizing.

Sumy ships with two parsers for working with text:

  1. PlaintextParser: for processing plain text and strings. Good for working with regular text files.

  2. HtmlParser: for extracting text from HTML documents and web pages.

Other sources, such as JSON structures or Microsoft Word documents, have no dedicated parser in the sumy.parsers package. Extract their text first and pass it to PlaintextParser.
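
For JSON input, one route that needs nothing beyond the standard library is to pull the text out yourself and hand it to PlaintextParser. The sketch below does that; the helper name text_from_json and the key name "text" are assumptions for illustration, not part of Sumy.

```python
import json


def text_from_json(raw_json, field="text"):
    """Collect every string stored under `field`, at any nesting depth."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == field and isinstance(value, str):
                    found.append(value)
                else:
                    walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(raw_json))
    # A blank line between fragments keeps them as separate paragraphs.
    return "\n\n".join(found)
```

The returned string can then be passed to PlaintextParser.from_string together with a Tokenizer, exactly like any other plain text.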

Example with PlaintextParser:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

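# Russian sample: "The text you want to summarize. It can be quite long and contain many sentences."
text = """Текст, который вы хотите резюмировать. Он может быть довольно длинным и содержать множество предложений."""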
parser = PlaintextParser.from_string(text, Tokenizer("russian"))

To work with HTML content:

from sumy.parsers.html import HtmlParser

url = "http://example.com"
parser = HtmlParser.from_url(url, Tokenizer("russian"))

Tokenizers split the text into sentences and words. Sumy's Tokenizer builds on NLTK, so you can turn to NLTK's own tokenizers when you need more control, but for Russian the usual choice is simply Tokenizer("russian"):

from sumy.nlp.tokenizers import Tokenizer

tokenizer = Tokenizer("russian")
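
To make the two steps concrete, here is a toy stand-in written with regular expressions. It is a deliberate simplification and not how Sumy or NLTK tokenize (abbreviations such as "e.g." will trip it up); the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Lowercase the sentence and keep runs of letters, digits and underscores."""
    return re.findall(r"\w+", sentence.lower())


for sentence in split_sentences("Hello world. How are you? Fine!"):
    print(sentence, "->", split_words(sentence))
```

With Sumy itself, the corresponding calls are tokenizer.to_sentences(text) and tokenizer.to_words(sentence).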

Applying the algorithms looks like this. As a reminder, the supported methods include Latent Semantic Analysis (LSA), TextRank and LexRank.

LSA:

from sumy.summarizers.lsa import LsaSummarizer

summarizer = LsaSummarizer()
summary = summarizer(parser.document, 3)  # get 3 sentences

for sentence in summary:
    print(sentence)

TextRank:

from sumy.summarizers.text_rank import TextRankSummarizer

summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 3)  # get 3 sentences

for sentence in summary:
    print(sentence)
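
To give a feel for what a TextRank-style summarizer does, here is a toy version in plain Python. It is a sketch under simplifying assumptions (word overlap as similarity, a fixed number of PageRank-style iterations, no stemming or stop words) and not Sumy's implementation; all function names are hypothetical.

```python
import math
import re


def word_set(sentence):
    """Lowercased set of the words in a sentence."""
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalized by the log of both sentence lengths."""
    words_a, words_b = word_set(a), word_set(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # log(1) == 0 would make the denominator unusable
    shared = len(words_a & words_b)
    return shared / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_scores(sentences, damping=0.85, iterations=50):
    """Run a fixed number of PageRank-style updates over the sentence graph."""
    n = len(sentences)
    if n == 0:
        return []
    # weights[i][j] is the edge weight between sentences i and j.
    weights = [
        [similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    totals = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n)
                if totals[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, count):
    """Keep the `count` best-scoring sentences, in their original order."""
    scores = textrank_scores(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(best[:count])]
```

Sentences that share words with many other sentences collect the highest scores, which is why an off-topic sentence tends to drop out of the summary first.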

You can also choose how many sentences the summary should contain. The count is passed as the second argument when you call the summarizer:

summary = summarizer(parser.document, 2)  # get 2 sentences
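
A fixed count does not scale with the length of the text, so it can help to derive it from the document size. A small sketch, assuming you want a whole-number percentage of the sentences; the helper name sentences_for_percent is hypothetical.

```python
def sentences_for_percent(total_sentences, percent):
    """How many sentences make up `percent` of the text, rounded up.

    `percent` is expected to be a whole number from 1 to 100. A non-empty
    text always yields at least one sentence.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    wanted = (total_sentences * percent + 99) // 100  # integer ceiling
    return min(total_sentences, max(1, wanted))
```

The total can be taken from len(parser.document.sentences), and the result is then passed to the summarizer in place of the literal 2.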

Application examples

Summarizing news articles

Let's say you want short summaries of news articles so you can quickly catch up on the day's main events. Sumy fits this well, since it extracts the key sentences from long texts:

from sumy.parsers.html import HtmlParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

# URL of the news article
url = "https://example-news-site.com/article"

# initialize the parser and the summarizer
parser = HtmlParser.from_url(url, Tokenizer("english"))
summarizer = LsaSummarizer()

# generate the summary
summary = summarizer(parser.document, 3)  # get 3 sentences

print("Article summary:")
for sentence in summary:
    print(sentence)

We use HtmlParser to download a news article by URL and LsaSummarizer to create a three-sentence summary.
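
On real articles, results tend to improve when the summarizer ignores stop words and compares word stems rather than raw word forms. The sketch below follows the setup pattern from Sumy's README. The helper name build_lsa_summarizer is hypothetical, and stop-word lists may not be bundled for every language, so treat the default "english" as the safe case.

```python
def build_lsa_summarizer(language="english"):
    """Create an LSA summarizer that uses stemming and a stop-word list."""
    # Imported here so this module still loads where Sumy is not installed.
    from sumy.nlp.stemmers import Stemmer
    from sumy.summarizers.lsa import LsaSummarizer
    from sumy.utils import get_stop_words

    summarizer = LsaSummarizer(Stemmer(language))
    summarizer.stop_words = get_stop_words(language)
    return summarizer
```

The returned object is called the same way as before: summarizer(parser.document, 3).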

Report Analysis

You often need to review long reports or research papers quickly.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# source text of the research report
text = """In this study, we analyze the impact of climate change on ecosystems.
Various methods are used to assess changes in temperature and precipitation.
The results show a significant impact on vegetation and wildlife."""

# initialize the parser and the summarizer
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()

# generate the summary
summary = summarizer(parser.document, 2)  # get 2 sentences

print("Report summary:")
for sentence in summary:
    print(sentence)

We use PlaintextParser to work with the text content of a scientific report and TextRankSummarizer to obtain a short summary containing two sentences.

Processing user feedback

Sumy can condense a batch of reviews into a few representative sentences, which helps you spot the general opinion and recurring themes.

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# source user reviews
text = """This product is excellent! It met all my expectations.
I did not like the quality of the packaging, but the product itself is of excellent quality.
I recommend it to anyone looking for a reliable product."""

# initialize the parser and the summarizer
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LexRankSummarizer()

# generate the summary
summary = summarizer(parser.document, 2)  # get 2 sentences

print("Review summary:")
for sentence in summary:
    print(sentence)

We use LexRankSummarizer to process review texts and create a two-sentence summary.
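
In practice, reviews usually arrive as a list of separate strings rather than one tidy text. Here is a small sketch of how they might be merged before parsing; the helper name reviews_to_text is hypothetical, and adding a final period is an assumption made so that the sentence tokenizer does not glue neighboring reviews together.

```python
def reviews_to_text(reviews):
    """Join individual reviews into one text that a tokenizer can split."""
    cleaned = []
    for review in reviews:
        review = " ".join(review.split())  # collapse newlines and extra spaces
        if not review:
            continue  # skip empty submissions
        if review[-1] not in ".!?":
            review += "."  # make sure every review ends a sentence
        cleaned.append(review)
    return " ".join(cleaned)
```

The result can be passed to PlaintextParser.from_string in place of the hard-coded text above.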


You can read more about the library here.

Finally, I'd like to invite you to a free webinar where you will learn what RAG is, why NLP services need it, and where the technology is used. We will also look at the types of RAG, methods for assessing the quality of a RAG service, and a practical example on a Question Answering (QA) task.
