LOCOST and SPECTRUM: two approaches to summarization

The usual input size for a language model is two to three paragraphs of text. Anything longer is hard because computational complexity grows quadratically with sequence length. Hence the ongoing battle to extend the context, with new approaches, general and not so general, appearing all the time. In this review we look at two approaches to summarizing long text. The first, LOCOST, targets long documents (articles and entire books). The second, SPECTRUM, targets long dialogues.

LOCOST

LOCOST is an encoder-decoder architecture, but one built on a state space model (we wrote a little about SSMs in our channel). The very development of SSMs (state space models) in the era of total transformer dominance is motivated precisely by the fact that they can handle contexts several orders of magnitude longer, and their complexity is linear. Until now, SSM-based architectures have used either only a decoder or only an encoder: the former for unconditional autoregressive generation, the latter for sequence classification. Conditional text generation with SSMs, such as producing a summary, had not yet shown strong results.

This is exactly the gap LOCOST (article) aims to fill. The authors propose an SSM-based encoder-decoder architecture for text summarization, and it turns out to be capable of producing a short retelling of an entire book from an input of as many as 600 thousand tokens.

So, instead of the attention mechanism we use a state space model. The hidden state and the output are defined by a system of recurrence relations: x_k = A x_{k-1} + B u_k for the state and y_k = C x_k for the output. Because the link between consecutive states is linear, the whole recurrent chain can be unrolled in one move into a convolution of the input with the kernel K_j = C A^j B. A naive convolution, however, has the same quadratic complexity as the transformer (and the same trouble with input length), which would make the SSM pointless. But the convolution can be computed with the fast Fourier transform, bringing the complexity down to O(L log L). The next important question is that the model must not only perceive a long context but also avoid losing local connections. To that end, LOCOST runs the convolution in two opposite directions and simply sums the results (in the paper's encoder diagram this is labeled BiSSM).
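To make the trick concrete, here is a minimal sketch in Python. This is not the authors' code: the parameters are toy scalars, and a real model would use learned multi-channel kernels. It just shows how the unrolled recurrence becomes an FFT convolution, plus the bidirectional summation.

```python
import numpy as np

# The kernel K_j = C * A**j * B comes from unrolling the scalar recurrence
# x_k = A*x_{k-1} + B*u_k, y_k = C*x_k (A, B, C here are toy scalars).
def ssm_kernel(A, B, C, L):
    return C * (A ** np.arange(L)) * B           # K_0 .. K_{L-1}

def fft_conv(u, K):
    # Causal convolution of input u with kernel K in O(L log L):
    # zero-pad to 2L so the circular FFT convolution becomes a linear one.
    n = 2 * len(u)
    y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(K, n), n)
    return y[: len(u)]

def bissm(u, K_fwd, K_bwd):
    # Bidirectional variant: convolve left-to-right and right-to-left,
    # then sum the two results (what the LOCOST diagram labels BiSSM).
    return fft_conv(u, K_fwd) + fft_conv(u[::-1], K_bwd)[::-1]

u = np.random.randn(1024)                        # toy input sequence
y = bissm(u, ssm_kernel(0.9, 1.0, 1.0, 1024), ssm_kernel(0.9, 1.0, 1.0, 1024))
```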

For the decoder, LOCOST does not introduce anything new: since the model is designed to generate short text from long text, the authors simply use the decoder of a vanilla transformer.

The model was evaluated with ROUGE-1/2/Lsum, as well as BERTScore and BLANC. Fine-tuning was carried out on scientific articles from arXiv and PubMed (with the papers' actual abstracts as targets) and on datasets with retellings of movies, books, and US government reports.
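For reference, here is how the ROUGE part of such an evaluation looks with Google's rouge-score package (the two texts below are made-up examples; BERTScore and BLANC have their own packages and are omitted):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

reference = "The paper proposes an SSM-based encoder-decoder for long-document summarization."
candidate = "An encoder-decoder built on state space models summarizes long documents."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(name, round(score.fmeasure, 3))
```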

Another interesting point: it is not entirely clear how to evaluate the results qualitatively, because that would mean reading all these articles and books in full. You cannot feed a whole book to GPT, and asking living people is too expensive. The authors of LOCOST did not solve this problem completely, but they did score the summaries with GPT-3.5 for relevance and consistency.
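Such LLM-as-a-judge scoring might look roughly like the sketch below. The prompt wording and rubric here are our assumptions, not reproduced from the paper:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_summary(source_excerpt: str, summary: str) -> str:
    # Hypothetical rubric: the paper's actual prompt is not public in this post.
    prompt = (
        "Rate the summary below for relevance and consistency with the source, "
        "each on a 1-5 scale. Answer as 'relevance: X, consistency: Y'.\n\n"
        f"Source:\n{source_excerpt}\n\nSummary:\n{summary}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```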

On inputs from one thousand to 500 thousand tokens, the results are roughly on par with LongT5 and LED, but at much lower computational cost.

The best model, LOCOST-32K, was compared on the task of summarizing an entire book. It beat LongT5 and BART-large despite having the fewest parameters. It was also the only model able to read 600 thousand tokens in one sitting, without splitting the input into parts.

SPECTRUM

Ordinary human dialogue is a very interesting target for summarization, perhaps even more interesting than the large coherent text of an article or book. The point is the internal structure of a dialogue, with its alternating turns, and its formal features. People can hold the general thread of a conversation for a long time without explicitly restating it. And it is in dialogue, more than anywhere else, that tracking the global context matters, since it can completely change the local meaning.

Models, however, perceive a long dialogue as ordinary text, and this structure is lost.

SPECTRUM (article) modifies the transformer's training procedure so that information about the speakers and the internal structure of the dialogue is preserved. Training pursues two objectives: predicting whose turn it is to speak, and masked language modeling. The first builds an understanding of the dialogue as a whole; the second helps trace the context.

The training itself is also split into two paths: one updates only the encoder, the other updates both the encoder and the decoder. The first path handles next-speaker prediction: a special token is added at the beginning of each sentence, and from the encoder output a sequence of zeros and ones is obtained, indicating whether the speaker changes after that sentence.
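A sketch of such a turn-change head, under assumptions: SPECTRUM's exact layers are not shown in this post, so the class and parameter names below are hypothetical. The encoder state at each special token is classified as "speaker changes next: yes/no":

```python
import torch
import torch.nn as nn

class TurnChangeHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, encoder_states, turn_positions):
        # encoder_states: (batch, seq_len, hidden)
        # turn_positions: (batch, n_utterances) - indices of the special tokens
        idx = turn_positions.unsqueeze(-1).expand(-1, -1, encoder_states.size(-1))
        turn_states = torch.gather(encoder_states, 1, idx)
        # one probability per utterance: does the speaker change after it?
        return torch.sigmoid(self.classifier(turn_states)).squeeze(-1)

head = TurnChangeHead(768)
probs = head(torch.randn(2, 128, 768), torch.tensor([[0, 17, 45], [0, 23, 80]]))
```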

The second path, masked modeling, operates at the sentence level (the authors also experimented with the word level, but the sentence level turned out to be best). Randomly selected sentences are replaced with masks, and the text in this form is passed through the full transformer.
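A toy sketch of sentence-level masking; the mask token and the masking ratio here are assumptions, not values from the paper:

```python
import random

def mask_sentences(sentences, mask_token="<mask>", ratio=0.15):
    corrupted, targets = [], []
    for sent in sentences:
        if random.random() < ratio:
            corrupted.append(mask_token)   # the whole sentence becomes one mask
            targets.append(sent)           # the model must reconstruct it
        else:
            corrupted.append(sent)
    return " ".join(corrupted), targets

text, targets = mask_sentences([
    "A: Did you watch the match?",
    "B: Only the second half.",
    "A: You missed the best part.",
])
```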

The authors took datasets with interview transcripts and dialogues from books, and supplemented them with their own dataset of real dialogues taken from movies and TV series. User dialogues with GPT-3.5 (the SODA dataset) were also added. The maximum context length is 4096 tokens.

The results were compared against roughly the same models as in the case of LOCOST, on the same ROUGE metric.

Not perfectly convincing, but SPECTRUM did in places surpass LongT5, LED, and BART-large.

More of our AI article reviews on the Pro AI channel.
