Is there a difference in attention between a human and a transformer model?

In order to understand language and draw conclusions, humans reason based on world knowledge and common sense. Although large language models have made significant advances in natural language processing, common-sense reasoning remains one of the most difficult skills for them.

The most common way to assess the common-sense reasoning abilities of models is the Winograd test (the Winograd Schema Challenge, or WSC), named after Terry Winograd, a professor of computer science at Stanford University. The test is based on resolving ambiguities in pronoun reference.

Let's look at an example of a Winograd schema:

“The cup doesn't fit in the brown suitcase because it is too big.”

What is too big in this case: the suitcase or the cup? For a person the answer is obvious, but what about a model?

We will tell you about our study, in which we compared human attention and model attention, and also analyzed which words a person and a model pay attention to when solving a Winograd schema. Although human attention and transformer attention seem completely different, some of the results indicate a relationship between them.

Human attention and transformer attention

Attention is a phenomenon associated both with reading in humans and with natural language processing in large language models. In both cases, attention is used to find the important information needed for reasoning and inference.

This raises the question of how human attention and model attention are related, and whether they share common patterns when processing text.

For such an analysis, we first need to decide what can be used to describe the attention of a human and of a transformer model.

A key component of modern language models based on the transformer architecture is the self-attention mechanism, which finds relationships between input tokens. You can read more about how models based on this architecture and the self-attention mechanism work here.

In short, self-attention determines which words are important to each other in a given context, that is, which words the model should pay attention to at a given moment.

Encoder and Matrix Calculation of Self-Attention. Source: https://jalammar.github.io/illustrated-transformer/
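To make this concrete, below is a minimal sketch of scaled dot-product self-attention in PyTorch. The dimensions and random inputs are purely illustrative and are not taken from any of the models discussed later.

# Minimal sketch of scaled dot-product self-attention (illustrative sizes only).
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16              # 5 tokens, 16-dimensional embeddings
x = torch.randn(seq_len, d_model)     # token embeddings (random, for illustration)

W_q = torch.randn(d_model, d_model)   # query, key and value projections
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / d_model ** 0.5     # pairwise token-to-token relevance
attn = F.softmax(scores, dim=-1)      # each row sums to 1: how much a token attends to the others
output = attn @ V                     # weighted sum of value vectors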

In our study, we take the weights from the self-attention layers of the encoders as a characteristic of model attention.

Now the question arises, how to describe human attention? This is where explicit (visual) attention comes in, which can be described by observing eye movements. Eye movement data is often used to study the cognitive processes of the brain associated with reading. In our study, we used data on the duration and number of fixations of the gaze on individual words during reading and task performance to describe human attention.

Information about gaze movements is collected with special equipment in a procedure called video-oculography (eye tracking). The participant sits in front of a screen while a video-oculograph records their eye movements.

Video-oculography

So, we have decided how to describe both human attention and transformer attention. Now we need to formulate the hypotheses we want to test, namely:

  • Is there a relationship between human attention and transformer attention when solving the Winograd schema?

  • Does this relationship change depending on the model layer?

  • For both the model and the person, is attention spread evenly over the whole context, or is it concentrated on individual words?

If there is a relationship between transformer attention and human visual attention, we can hypothesize that eye movement information could be used to approximate the attention of transformers.

Before moving on to the main experiments, let's look at the Winograd schema task in more detail and sort out the terminology.

Statement of the problem

The Winograd schema is based on coreference resolution in texts. The task of coreference resolution is to identify expressions in a text that refer to the same entity. Here it is cast as binary classification: determine whether a pronoun (the anaphor) refers to a given antecedent (a word or phrase).

In the example above, the pronoun “it” is the anaphor and the noun “cup” is the antecedent.

Data

For the experiment, 148 unique sentences were selected from the Winograd dataset included in the TAPE benchmark. Each sentence contains an anaphoric pronoun, about which two questions are asked: one about the correct antecedent and one about an incorrect one. Thus, the final dataset, EyeWino, consists of 296 sentence-question pairs.

Example with a question about the correct antecedent:

text: It should be noted that the Bison possessed the highest art of the experimenter – he knew how to ask nature questions to which it had to answer yes or no.
question: Is the Bison meant by the highlighted pronoun?
anaphor: he
antecedent: Bison
answer: Yes
label: 1

The same example, but with a question about the incorrect antecedent:

text: It should be noted that the Bison possessed the highest art of the experimenter – he knew how to ask nature questions to which it had to answer yes or no.
question: Is the experimenter meant by the highlighted pronoun?
anaphor: he
antecedent: experimenter
answer: No
label: 0

Experiments

Human attention

Eye movement data collection

The eye movement data was collected by the research group of Professor Olga Dragoy at HSE University. Each participant was shown a sentence on the screen in which the anaphoric pronoun was highlighted in red. After each sentence came a question about the supposed antecedent of the anaphor. Each sentence was read by 38 to 50 participants, and a total of 100 people took part in the experiment.

Participants were required to read a sentence containing an anaphoric pronoun and answer a question about the antecedent using a keyboard: key 1 if they agreed with the proposition in the question, key 0 if they disagreed. Video-oculography was used to track eye movements during the experiment.

Example:

It should be noted that the Bison possessed the highest art of the experimenter – he knew how to ask nature questions to which it had to answer yes or no. Is the highlighted pronoun referring to the Bison? If you think yes, press 1 on the keyboard; if you think no, press 0 on the keyboard.

The experiment resulted in observations of eye movements, which represent the coordinates of a person's gaze fixations on the screen.

Characteristics of attention

Based on the obtained coordinates, various characteristics of human attention were calculated, such as the duration and number of fixations of the study participants on individual words while completing the task.

The following were used as target characteristics of human attention:

  • Fixations — total number of fixations;

  • Total reading time — the total duration of reading a word;

  • Gaze duration — the duration of the first reading of a word.

The observations were normalized by calculating the proportion of attention given to each word in the sentence by an individual. The results were averaged for each word across all participants.
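As a rough illustration of this normalization and averaging, the computation could look like the sketch below. The column names and toy numbers are assumptions for illustration, not the actual EyeWino format.

# Hypothetical sketch: per-participant normalization and cross-participant averaging
# of an eye-movement measure (column names and values are assumed, not the real data).
import pandas as pd

df = pd.DataFrame({
    "participant": [1, 1, 1, 2, 2, 2],
    "sentence_id": [0, 0, 0, 0, 0, 0],
    "word":        ["The", "cup", "is", "The", "cup", "is"],
    "total_reading_time": [120, 340, 90, 150, 280, 100],   # milliseconds
})

# proportion of attention each word received from an individual participant
df["attention_share"] = df.groupby(["participant", "sentence_id"])["total_reading_time"] \
                          .transform(lambda t: t / t.sum())

# average the shares for every word across all participants
human_attention = df.groupby(["sentence_id", "word"], sort=False)["attention_share"].mean()
print(human_attention)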

Transformer attention

Models

For our study, we used different transformer architectures with pre-trained checkpoints.

Pre-trained models for Russian language:

  • ruBERT-base: encoder-only, 178M parameters, 12 layers, 12 heads

  • ruRoBERTa-large: encoder-only, 355M parameters, 24 layers, 16 heads

  • ruT5-base: encoder-decoder, 222M parameters, 12 layers, 12 heads

Multilingual models:

  • mBERT-base: encoder-only, 178M parameters, 12 layers, 12 heads

  • XLM-R-large: encoder-only, 560M parameters, 24 layers, 16 heads

  • mT5-base: encoder-decoder, 222M parameters, 12 layers, 12 heads

Data for training

Each checkpoint was additionally fine-tuned on the WSC task using various open datasets. Two Russian-language sets were used for fine-tuning all models: TAPE: Winograd and MERA: RWSD.

To fine-tune the multilingual models, we additionally used XWINO, a multilingual set containing Winograd schemas in six languages.

The collected data were converted to a single format and filtered. Examples in which the anaphor is not a pronoun were discarded from all sets. Chinese and Japanese were excluded from the multilingual set, since uppercase highlighting cannot be applied to them, and Russian was excluded because it duplicates data from RWSD. Duplicates, where present, were also removed from all sets.
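A rough sketch of this filtering step is shown below; the field names, POS tag, and language codes are assumptions used only for illustration.

# Hypothetical filtering sketch (field names, POS tag and language codes are assumed).
import pandas as pd

PRONOUN_TAG = "PRON"                    # assumed POS tag marking pronominal anaphors
EXCLUDED_LANGS = {"zh", "ja", "ru"}     # no uppercase highlighting (zh, ja); ru duplicates RWSD

def filter_wsc(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["anaphor_pos"] == PRONOUN_TAG]              # keep only pronominal anaphors
    df = df[~df["lang"].isin(EXCLUDED_LANGS)]              # drop excluded languages
    df = df.drop_duplicates(subset=["text", "question"])   # remove exact duplicates
    return df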

As a result of filtering, the following samples were included in the final training set:

| Language   | Train | Validation |
|------------|-------|------------|
| English    | 2846  | 1216       |
| French     | 108   | 56         |
| Portuguese | 358   | 116        |
| Russian    | 872   | 326        |

Data format

The input format repeated the human experiment, except that the anaphoric pronoun was highlighted in capital letters rather than in red:

# Highlight the anaphor with capital letters (the model cannot "see" red highlighting)
text = text.format(anafor=anafor.upper())
# Insert the candidate antecedent into the question template
question = question.format(antecedent=antecedent)
# Concatenate the sentence and the question into the model input
input_text = text + question

Example:

It should be noted that the Bison possessed the highest art of the experimenter – HE knew how to ask nature questions to which it had to answer yes or no. Is the Bison meant by the highlighted pronoun?

The T5 models were tuned using the language-modeling head; for the encoder architectures (BERT, RoBERTa) a classification head was used, with a [SEP] token inserted between the original sentence and the question that follows it.
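A minimal sketch of how such an input pair could be fed to an encoder with a classification head is shown below. The checkpoint name is an assumption, and this is not the exact training setup; the point is that passing the sentence and the question as a pair makes the tokenizer insert the [SEP] token between them.

# Hypothetical sketch: encoding sentence + question for a classification head.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ai-forever/ruBert-base"   # assumed Russian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

text = ("It should be noted that the Bison possessed the highest art of the experimenter – "
        "HE knew how to ask nature questions to which it had to answer yes or no.")
question = "Is the Bison meant by the highlighted pronoun?"

# Passing the pair separately produces [CLS] text [SEP] question [SEP]
inputs = tokenizer(text, question, return_tensors="pt", truncation=True)
logits = model(**inputs).logits          # two logits: the yes/no decision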

Characteristics of attention

As mentioned above, we decided to use the weights from the self-attention layers of the encoders as a description of model attention. The self-attention mechanism is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

We used the output of the softmax function in the encoder's self-attention layers, before multiplication by the matrix V, as the proportion of attention that each token pays to all the others.

The attention value for each token was calculated by averaging over the attention heads and over the rows of the matrix. Special tokens participated in the attention computation but were excluded from the final attention vector. The resulting token weights were summed over the tokens that make up each word to obtain word-level attention values. Attention was also normalized within each sentence by computing the proportion of attention given to each word.

As a result, we obtained a single vector representing the amount of attention given to each word by all other words in the sentence.
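The aggregation described above could be sketched roughly as follows. The checkpoint name is an assumption, and the word alignment via the tokenizer's word indices is a simplification of the actual processing.

# Hypothetical sketch: word-level attention received, aggregated from one encoder layer.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "ai-forever/ruBert-base"    # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

def word_attention(sentence: str, layer: int) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    attn = out.attentions[layer][0]       # (heads, seq_len, seq_len); layer is counted from 0
    attn = attn.mean(dim=0)               # average over attention heads
    token_attn = attn.mean(dim=0)         # average over rows: attention each token receives
    word_ids = enc.word_ids(0)            # token -> word index, None for special tokens
    n_words = max(w for w in word_ids if w is not None) + 1
    weights = torch.zeros(n_words)
    for tok_idx, w_id in enumerate(word_ids):
        if w_id is not None:              # special tokens take part in attention but are dropped here
            weights[w_id] += token_attn[tok_idx]
    return weights / weights.sum()        # proportion of attention per word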

Results

Correlations

To measure the statistical relationship between the normalized estimates of human attention and transformer attention, we used Spearman's correlation. This coefficient ranges from -1 to 1, with 0 indicating no correlation; positive values mean that as one variable increases, the other tends to increase as well.
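For reference, computing such a correlation for one sentence could look like the snippet below; the two vectors are placeholders, not real measurements.

# Minimal sketch: Spearman correlation between two per-word attention vectors.
from scipy.stats import spearmanr

human_attention = [0.05, 0.30, 0.10, 0.40, 0.15]   # per-word shares from eye tracking (placeholder)
model_attention = [0.08, 0.25, 0.12, 0.35, 0.20]   # per-word shares from a self-attention layer (placeholder)

rho, p_value = spearmanr(human_attention, model_attention)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")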

Table 1. Maximum Spearman correlation coefficients between human attention and encoder attention. Layer is the model layer where the maximum correlation is observed.

| Model | Layer | Gaze duration | Fixations | Total reading time |
|---|---|---|---|---|
| ruBERT-base, pre-trained | 1 | 0.606 | 0.601 | 0.592 |
| ruBERT-base, tuned | 1 | 0.608 | 0.603 | 0.595 |
| mBERT-base, pre-trained | 1 | 0.683 | 0.684 | 0.674 |
| mBERT-base, tuned | 1 | 0.683 | 0.684 | 0.673 |
| ruRoBERTa-large, pre-trained | 16 | 0.551 | 0.543 | 0.538 |
| ruRoBERTa-large, tuned | 16 | 0.555 | 0.542 | 0.542 |
| XLM-R-large, pre-trained | 14 | 0.605 | 0.592 | 0.593 |
| XLM-R-large, tuned | 17 | 0.595 | 0.588 | 0.582 |
| ruT5-base, pre-trained | 1 | 0.605 | 0.593 | 0.587 |
| ruT5-base, tuned | 1 | 0.606 | 0.594 | 0.588 |
| mT5-base, pre-trained | 9 | 0.63 | 0.619 | 0.611 |
| mT5-base, tuned | 1 | 0.58 | 0.573 | 0.561 |

Based on the results in the table, we can conclude that there is a moderate correlation between transformer attention and human attention. For most checkpoints, the highest correlation values are observed with the gaze duration indicator (the duration of the first reading of a word).

Spearman correlation coefficients between human attention and model attention on different layers. Pre-trained: checkpoints without tuning on the Winograd schema; tuned: checkpoints after tuning on the Winograd schema.

From the graphs above, the following conclusions can be drawn:

  • For a given checkpoint, the model shows similar correlations with different characteristics of human attention.

  • For each Russian-language model, similar correlations are observed on most layers for both the tuned and the pre-trained checkpoints; the differences are insignificant.

  • For the multilingual mT5-base model, the trend seen in the pre-trained checkpoint is not preserved after tuning.

  • The difference in correlations between multilingual and Russian-language models of the same architecture may be due to a shift in the data distribution: 74% of the multilingual tuning corpus is in English, while the test sample consists only of Russian texts.

Visualization of “important” words

Let's see which words a person and a model pay attention to, using ruRoBERTa-large as an example. Table 1 shows that high correlations are observed at layer 16 of the pre-trained checkpoint with the gaze duration (first reading time) indicator.

To identify important words, for each sentence we selected the 25% of words with the highest attention weight.
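Selecting these words could be done roughly as in the sketch below (the thresholding details are a simplification):

# Sketch: keep the 25% of words with the highest attention weight in a sentence.
import numpy as np

def important_words(words, weights, share=0.25):
    weights = np.asarray(weights)
    k = max(1, int(round(share * len(words))))   # at least one word per sentence
    top_idx = np.argsort(weights)[-k:]           # indices of the k highest weights
    return [words[i] for i in sorted(top_idx)]   # keep the original word order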

Let's look at examples with one sentence but different questions. In the example below, the words that the person considers most important are highlighted:

answer “No”

In her bulging eyes shone the pride of a breadwinner WHO fulfilled her purpose and through that found harmony with herself and with the world. Is pride meant by the highlighted pronoun?

answer “Yes”

In her bulging eyes shone the pride of a breadwinner WHO fulfilled her purpose and through that found harmony with herself and with the world. Is the breadwinner meant by the highlighted pronoun?

Important words for ruRoBERTa-large, pre-trained, layer 16:

answer “No”

In her bulging eyes shone the pride of a breadwinner WHO fulfilled her purpose and through that found harmony with herself and with the world. Is pride meant by the highlighted pronoun?

answer “Yes”

In her bulging eyes shone the pride of a breadwinner WHO fulfilled her purpose and through that found harmony with herself and with the world. Is the breadwinner meant by the highlighted pronoun?

It can be concluded that in this case the person and the model pay attention to similar words: the pronoun, the verb that refers to it, and both antecedents in the text itself (“pride” and “breadwinner”). The difference is that, unlike the person, the model does not pay special attention to the incorrect antecedent in the question.

Integrating Human Attention into Transformer Learning

The analysis revealed significant correlations between human attention and model attention in the anaphora resolution task. It can be assumed that using eye movement data when training models for this task will help improve their accuracy. Based on this, we conducted experiments on integrating eye movement data into the training process, following the approach proposed in Bensemann et al. (2022).

We added an extra term to the loss function to bring the model's attention closer to human attention:

$$\mathcal{L} = H(y, \hat{y}) + H(p, \hat{p})$$

where:

H(y, ŷ) is the cross-entropy that measures the model's performance on the anaphora resolution task;

H(p, p̂) is the cross-entropy that measures the difference between two probability distributions: the distribution of the model's attention values at a particular layer and the distribution of the relative word importance for humans.
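A minimal sketch of what such a combined objective could look like is given below. The aggregation of attention to the word level and the equal weighting of the two terms are simplifications, not the exact training code.

# Hypothetical sketch of the combined loss: task cross-entropy plus an
# attention term pulling model attention towards human attention.
import torch
import torch.nn.functional as F

def combined_loss(logits, labels, model_attention, human_attention):
    # logits: (batch, 2) classification outputs; labels: (batch,) gold yes/no labels
    # model_attention, human_attention: (batch, n_words) normalized attention distributions
    task_loss = F.cross_entropy(logits, labels)                                            # H(y, y_hat)
    attn_loss = -(human_attention * torch.log(model_attention + 1e-9)).sum(dim=-1).mean()  # H(p, p_hat)
    return task_loss + attn_loss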

We use the collected eye movement dataset (EyeWino) as training data and the RWSD set from MERA as test data.

Table 2. Accuracy of the models on the anaphora resolution task with human attention integrated into the training process. Layer is the model layer where the maximum correlation is observed.

| Model | Layer | Without integration | Gaze duration | Fixations | Total reading time |
|---|---|---|---|---|---|
| ruBERT-base, pre-trained | 1 | 55.77 | 56.15 | 55.77 | 55.77 |
| ruBERT-base, tuned | 1 | 58.08 | 58.08 | 58.08 | 58.08 |
| mBERT-base, pre-trained | 1 | 54.62 | 53.08 | 56.15 | 56.15 |
| mBERT-base, tuned | 1 | 56.15 | 55.38 | 56.92 | 55.00 |
| ruRoBERTa-large, pre-trained | 16 | 56.92 | 53.85 | 58.08 | 56.92 |
| ruRoBERTa-large, tuned | 16 | 55.38 | 55.38 | 55.38 | 55.77 |
| XLM-R-large, pre-trained | 14 | 55.00 | 50.38 | 55.38 | 54.23 |
| XLM-R-large, tuned | 17 | 54.62 | 53.46 | 52.31 | 56.92 |
| ruT5-base, pre-trained | 1 | 52.69 | 56.54 | 61.15 | 57.31 |
| ruT5-base, tuned | 1 | 55.77 | 54.62 | 54.62 | 52.31 |
| mT5-base, pre-trained | 9 | 53.08 | 54.23 | 53.08 | 54.62 |
| mT5-base, tuned | 1 | 58.46 | 58.08 | 58.85 | 59.23 |

Based on the results, it can be concluded that using human attention described by the Fixations and Total reading time metrics improves the performance of most models. However, the increase in accuracy is not large, which indicates the need for further research on how to incorporate human attention into transformer training.

Conclusion

We did this work in collaboration with colleagues from the Neurolinguistics Laboratory at HSE University, Olga Dragoy's team. In this article, we compared how people and transformer-based models perceive and process information, in terms of attention, when solving the Winograd schema.

The paper was presented at a workshop of the international ACL conference in Bangkok. This kind of fundamental research opens up new possibilities for studying transformers and, as a result, improving applications based on them. Human attention can potentially be used to approximate transformer attention in order to reduce computational costs.

Subscribe to the SaluteAI Telegram channel, in which my colleagues and I share interesting materials and developments in machine learning.

Authors

AGI NLP team: Anastasia Kozlova, Albina Akhmetgareeva (@Colindonolwe), Alena Fenogenova (@alenusch)

Thanks to our colleagues from the National Research University Higher School of Economics, Professor Olga Dragoy, Aigul Khanova, and Semyon Kudryashov, for creating the eye-tracking dataset.

References

Datasets based on the Winograd schema:

Similar works:
