Making transformers answer questions


Intelligent systems are meant to make life easier by taking routine tasks off a person's hands. One such task is searching for information in a large amount of text. Can this task be handed over to an intelligent system? I decided to find out.


How it started

I was faced with the task of searching a pile of economic contracts for answers to questions of interest to us. With just a few documents it would not be hard to look through them manually, but there were hundreds of them, each 50 or more pages long, while the needed information might be found in only one of them.

And it is not even about the amount of time spent: because of simple fatigue, it is easy to miss the needed information. To avoid such problems, I decided to use the power of machine learning, namely question-answering systems.

Such systems take a question asked by the user in natural language and give an answer based on some text. A well-known example of such a system is the much-discussed ChatGPT. For the task at hand such a powerful system is overkill, so I will build something simpler.

Simpler question-answering systems cannot generalize an answer to a question; instead they try to find the span of words that best matches the meaning of the question. Take this text, for example:

Yesterday a lot of snow fell, but today, because of the warm weather, it thawed and then froze over. We will have to take a crowbar at the weekend and break up the ice a little.

If the system is asked "How many commas are there in the text?", it will not be able to answer, since it will not "think" to count them. But if you ask it "When should the ice be removed?", it will confidently answer "at the weekend", since the word "ice" occurs only in that sentence and the phrase "at the weekend" refers to time.

Model selection

I will take an already trained model for this task from the transformers library. The library provides a convenient interface for loading, fine-tuning and using more than 20,000 pre-trained models for processing text, images and audio. The models are provided by the Hugging Face machine learning community, and their platform offers a convenient search for suitable models.

The transformers library also supports interoperability between the PyTorch, TensorFlow and JAX frameworks: you can pass data to a model in one framework and get the results in another.

To select a model suitable for the task, I set the appropriate filters on the Hugging Face platform: the task type is question-answering, the language is Russian. Three models met these criteria: mdeberta-v3-base-squad2, xlm-roberta-large-qa-multilingual-finedtuned-ru and model-QA-5-epoch-RU. I will try them all.

There are two ways to download and use the models. In the first, everything is done through the pipeline class, which is quick and convenient. A small example of using this class is below:

from transformers import pipeline

question = 'Когда будем убирать лёд?'
context = ('Вчера выпало много снега, а сегодня из-за'
         ' тёплой погоды он подтаял и заледенел.'
         ' Придётся на выходных взять лом и немного разбить лёд.')

model_pipeline = pipeline(
   task='question-answering',
   model="timpal0l/mdeberta-v3-base-squad2"
)

model_pipeline(question=question, context=context)

Result:

{'score': 0.6851194500923157,
 'start': 88,
 'end': 100,
 'answer': ' на выходных'}
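
The start and end fields here are character offsets into the context string, so the same answer can also be pulled out by slicing the context directly (a small check using the variables from the example above):

result = model_pipeline(question=question, context=context)

# 'start' and 'end' are character offsets into the context string,
# so slicing the context gives the same substring as 'answer'
print(result['answer'])
print(context[result['start']:result['end']])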

In the second method, the model and the tokenizer are downloaded separately, the data is prepared by hand and fed to the model's input. This gives more flexibility than the first method. For the current task I will use only the second one, since the amount of text is too large and has to be processed in a special way; more on that below.
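
For a text that fits into a single block, this second method comes down to just a few lines. Below is a minimal sketch with the same model and the short example from above, before any of the long-text handling described later:

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("timpal0l/mdeberta-v3-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("timpal0l/mdeberta-v3-base-squad2")

question = 'Когда будем убирать лёд?'
context = ('Вчера выпало много снега, а сегодня из-за'
           ' тёплой погоды он подтаял и заледенел.'
           ' Придётся на выходных взять лом и немного разбить лёд.')

# Tokenize the question together with the context and run the model
inputs = tokenizer(question, context, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Most likely start and end token positions of the answer
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits)

# Decode the answer span back into readable text
print(tokenizer.decode(inputs['input_ids'][0][start:end + 1]))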

Program code

To keep the code simple, I will show the solution for only one model; the others work with the same algorithm.

First, I import the necessary libraries and download the model and its tokenizer:

import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("timpal0l/mdeberta-v3-base-squad2")

model = AutoModelForQuestionAnswering.from_pretrained("timpal0l/mdeberta-v3-base-squad2")
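
Before splitting the text it is worth checking how many tokens the model can take in one pass; the 512-token block size used below comes from this limit. A quick way to look it up, assuming the standard config attribute is present for this model:

# Maximum sequence length supported by the model (512 here)
print(model.config.max_position_embeddings)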

Now the question should be tokenized together with the text. Here I pass a special flag so that the tokenizer does not insert special tokens, such as the sentence-separator token. To demonstrate the system, I will use the first volume of Leo Tolstoy's "War and Peace" as the text:
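
The text itself I simply read from a file with the book; the path below is illustrative, substitute your own:

# Read the source text (the file name here is just an example)
with open('war_and_peace_vol1.txt', encoding='utf-8') as f:
    text = f.read()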

question = 'Кто приехал в гости к Анне Павловне?'

tokenized = tokenizer.encode_plus(
   question, text,
   add_special_tokens=False
)

To be able to print the answer as readable text later, I extract the tokens in their string form in advance:

tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])

Next, the tokens need to be packed in a special way. The selected models are based on the pre-trained BERT model, until recently one of the most advanced in natural language processing, but the memory and compute it needs grow quadratically with the number of processed tokens, so its input length is limited. Therefore I split the tokenized text into blocks of equal length and feed them to the model. Each block must begin with the question, and in addition, to improve quality, I add an overlap: the last tokens of each block are repeated at the start of the next one.

First, I set the length of each block and the length of the overlap:

# Total length of each block
max_chunk_length = 512
# Overlap length
overlapped_length = 30

In the tokenizer output, token_type_ids marks the question tokens with zeros and the main text tokens with ones, so the length of the question in tokens is simply the number of zeros in this element:

# Question length in tokens
answer_tokens_length = tokenized.token_type_ids.count(0)
# Question tokens, encoded as numbers
answer_input_ids = tokenized.input_ids[:answer_tokens_length]

The first block has no overlap, so I will first assemble the remaining blocks with the overlap and then prepend the first one. I calculate the block lengths without the question and the overlap:

# Main-text length of the first block (no overlap)
first_context_chunk_length = max_chunk_length - answer_tokens_length
# New main-text length of every other block (the overlap is added on top of this)
context_chunk_length = (
    max_chunk_length - answer_tokens_length - overlapped_length
)
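
To make the arithmetic concrete: with a question of, say, 10 tokens, the first block holds 512 - 10 = 502 text tokens, and each following block holds 30 overlap tokens plus 512 - 10 - 30 = 472 new tokens. A rough estimate of how many blocks a text produces (the numbers are purely illustrative):

import math

# Illustrative numbers: a 10-token question and a 100,000-token text
question_length = 10
text_length = 100_000

first_chunk = max_chunk_length - question_length                      # 502 new tokens
other_chunk = max_chunk_length - question_length - overlapped_length  # 472 new tokens
num_blocks = 1 + max(0, math.ceil((text_length - first_chunk) / other_chunk))
print(num_blocks)  # 212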

Next, I separate off the first block and process the remaining ones (the code is below):

# Tokens of the main text
context_input_ids = tokenized.input_ids[answer_tokens_length:]
# Main text of the first block
first = context_input_ids[:first_context_chunk_length]
# Main text of the remaining blocks
others = context_input_ids[first_context_chunk_length:]

# If there are blocks besides the first one,
# process all of them
if len(others) > 0:
  # Number of zero tokens needed to pad the last block to full length
  padding_length = context_chunk_length - (len(others) % context_chunk_length)
  others += [0] * padding_length

  # Number of blocks and their length before the overlap is added
  new_size = (
      len(others) // context_chunk_length,
      context_chunk_length
  )

  # Pack the tokens into blocks
  new_context_input_ids = np.reshape(others, new_size)

  # Compute the overlaps
  overlappeds = new_context_input_ids[:, -overlapped_length:]
  # Add the tail of the first block as the first overlap
  overlappeds = np.insert(overlappeds, 0, first[-overlapped_length:], axis=0)
  # Drop the overlap taken from the last block, it is not needed
  overlappeds = overlappeds[:-1]

  # Prepend the overlaps to the blocks
  new_context_input_ids = np.c_[overlappeds, new_context_input_ids]
  # Prepend the first block
  new_context_input_ids = np.insert(new_context_input_ids, 0, first, axis=0)

  # Prepend the question to every block
  new_input_ids = np.c_[
    [answer_input_ids] * new_context_input_ids.shape[0],
    new_context_input_ids
  ]
# otherwise only the first block is processed
else:
  # Number of zero tokens needed to pad the block to full length
  padding_length = first_context_chunk_length - (len(first) % first_context_chunk_length)
  # Add the zero tokens
  new_input_ids = np.array(
    [answer_input_ids + first + [0] * padding_length]
  )
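
At this point a quick sanity check is useful: every block must be exactly max_chunk_length tokens long and must start with the question tokens (a small assertion sketch):

# Every block must have the full length
assert new_input_ids.shape[1] == max_chunk_length

# Every block must begin with the question tokens
assert all(
  np.array_equal(row[:answer_tokens_length], answer_input_ids)
  for row in new_input_ids
)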

Once the blocks of tokens are ready, I build the masks: one that separates the question from the text and the model's attention mask (the code is below):

# Number of blocks
count_chunks = new_input_ids.shape[0]

# Mask separating the question from the text
new_token_type_ids = [
  # question part of the block
  [0] * answer_tokens_length
  # text part of the block
  + [1] * (max_chunk_length - answer_tokens_length)
] * count_chunks

# Attention mask: all tokens except the zero padding in the last block
new_attention_mask = (
  # in all blocks except the last, attend to every token
  [[1] * max_chunk_length] * (count_chunks - 1)
  # in the last block, attend only to the non-padding tokens
  + [([1] * (max_chunk_length - padding_length)) + ([0] * padding_length)]
)

Now I wrap the blocks and masks in tensors and feed them to the model:

# Tokenized text as blocks, packed into torch tensors
new_tokenized = {
  'input_ids': torch.tensor(new_input_ids),
  'token_type_ids': torch.tensor(new_token_type_ids),
  'attention_mask': torch.tensor(new_attention_mask)
}

outputs = model(**new_tokenized)
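
Since this is pure inference, the forward pass can also be wrapped in torch.no_grad(); the results are the same, but memory use drops noticeably, which matters with several 512-token blocks:

# Optional: the same forward pass without building the autograd graph
with torch.no_grad():
    outputs = model(**new_tokenized)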

Finally, I find the most probable positions of the beginning and end of the answer, extract the corresponding words from the previously obtained tokens, strip the auxiliary characters from them and print the answer (the code and the answer are below):

# Positions of the start and end of the most likely answer
# in the 2D list of tokens, as single flattened indices
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)

# Recompute the start and end positions for the 1D list of tokens
# = first block length + (
#   position - first block length
#   - length of the question and overlaps in all blocks except the first
# )
start_index = max_chunk_length + (
  start_index - max_chunk_length
  - (answer_tokens_length + overlapped_length)
  * (start_index // max_chunk_length)
)
end_index = max_chunk_length + (
  end_index - max_chunk_length
  - (answer_tokens_length + overlapped_length)
  * (end_index // max_chunk_length)
)

# Assemble the answer:
# the word-start symbol '▁' is replaced with a space
answer = ''.join(
  [t.replace('▁', ' ') for t in tokens[start_index:end_index + 1]]
)

print('Вопрос:', question)
print('Ответ:', answer)

Result:

Вопрос: Кто приехал в гости к Анне Павловне?
Ответ: высшая знать Петербурга
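
As a side note, instead of replacing the '▁' markers by hand, the tokenizer can assemble the span itself; a standard transformers call that does the same job:

# Let the tokenizer turn the token span back into readable text
answer = tokenizer.convert_tokens_to_string(tokens[start_index:end_index + 1])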

GUI

To finish it off nicely, I will make a small graphical interface using the PySimpleGUI library. The interface consists of a text field for the source text, a field for the question about that text, a launch button and a field for displaying the answer. The code that builds all of this is shown below. All of the code above for finding an answer is packed into a question_answer function, which takes the text as its first argument and the question as its second:

import PySimpleGUI as sg


# Create and lay out the window elements
layout = [
   [sg.Text('Текст', key='-Text-label-')],
   [sg.Multiline('', key='-Text-', expand_x=True, expand_y=True)],
   [sg.Text('Вопрос', key='-Question-label-')],
   [sg.Input('', key='-Question-')],
   [sg.Button('Получить ответ')],
   [sg.Text('Ответ', key='-Answer-label-', visible=False)],
   [sg.Text('', key='-Answer-', font=('Arial Bold', 13), visible=False)],
]
# Create the window
window = sg.Window('', layout, resizable=True, size=(700, 700), finalize=True)

# Process window events until the window is closed
while True:
   event, values = window.read()
   # The window was closed
   if event == sg.WINDOW_CLOSED:
       break
   # The 'Получить ответ' (Get answer) button was pressed
   elif event == 'Получить ответ':
       window['-Answer-label-'].update(visible=True)
       window['-Answer-'].update(
           question_answer(values['-Text-'], values['-Question-']),
           visible=True
       )

window.close()

The resulting interface is a plain window: a large text area for the source text, a question field and a 'Получить ответ' button below it, and the answer displayed underneath.

Comparison of results

Now I will compare how all the previously selected models handle several questions:

| No. | Question | Excerpt from the book with the correct answer | mdeberta-v3-base-squad2 | xlm-roberta-large-qa-multilingual-finedtuned-ru | model-QA-5-epoch-RU |
|---|---|---|---|---|---|
| 1 | Who is related to Montmorency? | Viscount Mortemart, he is related to the Montmorencys through the Rohans | Viscount Mortemar | Mortemar, he | By the way, – Viscount Mortemar |
| 2 | Where has Pierre been since the age of ten? | Pierre, from the age of ten, was sent abroad with a tutor-abbot, where he stayed until the age of twenty | with the tutor-abbot abroad | abroad, where he stayed until the age of twenty | was sent with the tutor-abbot |
| 3 | What did Dolokhov bet with the Englishman Stevens? | Dolokhov bets the Englishman Stevens, a sailor who was there, that he, Dolokhov, will drink a bottle of rum sitting on the third-floor window with his legs hanging outside | drink a bottle of rum, sitting on the window | Dolokhov, drink a bottle of rum | that he, Dolokhov, will drink a bottle |
| 4 | Where is the estate of Prince Nikolai Andreevich Bolkonsky located? | In the Bald Mountains, the estate of Prince Nikolai Andreevich Bolkonsky | In the Bald Mountains | In the Bald Mountains, the estate of the prince | Bald Mountains |
| 5 | Who came to visit Anna Pavlovna? | a paragraph in the second part | the highest nobility of St. Petersburg | people of the most diverse kinds | Prince Hippolyte |
| 6 | What was the name of the son of Count Bezukhov? | a paragraph in the second part (the answer is Pierre) | fat young man | young man | illegitimate |

As the comparison shows, the models answer best when the question has a clear answer, especially if the answer sits next to keywords from the question. For example, on question 5 all the models gave an answer that was, in a sense, correct, while question 6 is more general: in the text the phrase "son of Count Bezukhov" and the name "Pierre" never appear close enough together, so the models could not find the exact answer.

Comparing the effectiveness of the models, in my opinion the first one did best of the three, so that is the one I used in practice for my original task.

Conclusion: application and further development

The result is code for installing and using pre-trained question-answering models from the Hugging Face platform on large amounts of text. Its main purpose is to automate the analysis of arbitrary text through questions in natural language, as opposed to searching for the needed information by keywords.

The tool can be developed further, for example by using other models. Since the task is to process text in Russian, selecting a suitable model is somewhat harder, as most models are built for English. A way out is to further train a model on one's own working texts. That kills two birds with one stone: it both adapts the model to Russian and fits it to the specific kind of texts, which should improve the results where it is actually used.

There is also the option of using available GPT-3-style models, such as Alpaca or LLaMA. These models are far more capable than the ones presented here, and their results should be better as well, but they in turn require much more computing power even for inference, let alone additional training.

Even so, the question-answering system presented here can already be used in practice. First of all, it will suit those who regularly have to dig through large amounts of text for information: for example, to find the required sequence of actions in specific situations in regulatory documents, to get short, quick answers from meeting minutes, or to look up the price of a product in a pricing document.
