Do Large Language Models Really Hallucinate? An Experiment

It is often said that the main problem with large language models is their tendency to hallucinate, that is, to generate text containing information unrelated to the request.

My name is Polina, and I am an AI software development engineer at YADRO. Together with my colleagues, I develop systems based on generative models, including question-answering assistants. As part of one of our projects, together with the team's expert Andrey Sokolov, we asked ourselves: is the problem of hallucinations really so relevant for modern pre-trained LLMs in a question-answering scenario?

To find out, we ran an experiment on a dataset we collected ourselves. Along the way, we will briefly recall how transformer models work and give a stricter definition of the term "LLM hallucination", which matters for the rigor of the experiment.

Briefly about transformers

It is difficult to find a person who has not heard of large language models such as GPT-4, LLaMA, and the like. They are based on neural networks built on an architecture called the transformer. Since its introduction, the transformer has become remarkably successful in a wide variety of machine learning tasks:

  • Representation learning. The BERT model, introduced in 2018, raised the bar on the GLUE benchmark (a suite of natural language understanding tasks) to 80%, an improvement of about 8 percentage points.

  • Unsupervised training of multi-task NLP models capable of solving several natural language processing tasks at once. For example, GPT-2, trained in an unsupervised manner on a corpus of publicly available texts, showed near record-breaking results on tasks such as translation, summarization, and question answering, even though during training it never saw a single example from the corresponding task datasets.

  • Image classification. The Vision Transformer (ViT) model, introduced in 2020, achieved recognition quality on ImageNet, CIFAR, and other datasets comparable to the best convolutional neural network models of the time.

It is hard to find a machine learning task these days where transformer models are not leaders. However, they also have significant drawbacks. In particular, some models can hallucinate, i.e. generate information that is not related to the request.

To understand how hallucinations arise, it is necessary to delve a little into the architecture of transformer models.

Overview of the Transformer Model Architecture

Below is a general view of a transformer model that converts text into text. A classic example of such a task is machine translation.

The model consists of two parts: an encoder and a decoder. An example of a complete transformer model: BART.

The encoder's task is to take the input sequence of tokens (elementary pieces of text) a_{1},...,a_{n} and produce a set of context-dependent representations, or embedding vectors, b_{1},...,b_{n}. The number of input tokens and output embedding vectors is the same.

The decoder works in steps. At step k, it receives as input the set of context-dependent representations from the encoder, as well as the sequence of output tokens s_{1},...,s_{k-1} that it generated during the previous k-1 steps. It then computes, for every possible token in the vocabulary, the probability of that token appearing at the current step k: p_{1},...,p_{N}. Here N is the total number of tokens in the vocabulary.
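As a rough illustration of these shapes, here is a minimal sketch using a pre-trained BART checkpoint from the Hugging Face transformers library (the checkpoint name and the example sentence are our own illustrative choices, not something taken from the experiment):

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Encoder: n input tokens a_1..a_n -> n context-dependent embeddings b_1..b_n
inputs = tokenizer("The weather is nice today", return_tensors="pt")
encoder_out = model.model.encoder(**inputs).last_hidden_state
print(encoder_out.shape)   # (1, n, hidden_size): one embedding vector per input token

# Decoder: given the encoder outputs and the tokens generated so far (s_1..s_{k-1}),
# produce a probability p_1..p_N for every token in the vocabulary
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
logits = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    decoder_input_ids=decoder_input_ids,
).logits
probs = torch.softmax(logits[:, -1, :], dim=-1)
print(probs.shape)         # (1, vocab_size): a distribution over the whole dictionary
```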

The encoder and decoder can be used independently:

  • If there is no decoder, the output of the model is the context-dependent representation vectors themselves. They can be used for information retrieval, language understanding, and more. Examples of encoder-only models are BERT, RoBERTa, and the like (a minimal sketch of extracting such representations follows this list).

  • If there is no encoder, the input information to the model is supplied only in the form of an initial text prompt – a sequence of tokens s_{1},...,s_{j}. Research shows that in many applications, decoder-only models achieve quality comparable to full transformer models, which have both an encoder and a decoder. An example of a decoder-only model is GPT-2.
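For instance, obtaining context-dependent embeddings from an encoder-only model might look roughly like this (a sketch, assuming the bert-base-uncased checkpoint; mean pooling into a single sentence vector is just one common convention):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")   # encoder-only model

inputs = tokenizer("Is the Earth flat?", return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (1, n_tokens, 768)

# A common way to get a single vector for retrieval: mean-pool the token embeddings
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)   # (1, 768)
```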

Autoregressive generation

The decoder does not itself select the next output token. That is the job of a separate sampling algorithm (or sampling strategy), which, generally speaking, is not part of the model. In practice, various algorithms are used that randomly pick one token from among the several most probable ones. The selected token is appended to the end of the output sequence s_{1},...,s_{k}, and the procedure for generating the next token is repeated (a minimal sketch of such a loop is shown after the list below).

This generation process, which is called autoregressive, ends in one of two cases:

  • the end-of-sequence symbol (represented by a special token in the vocabulary) is selected at some step,

  • or some maximum limit on the length of the output sequence is reached.
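Here is a minimal sketch of such an autoregressive loop with top-k sampling, using GPT-2 from the transformers library (the prompt, the value of k, and the length limit are arbitrary illustrative choices):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Is the Earth flat?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

max_new_tokens = 50
top_k = 40

with torch.no_grad():
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]        # scores for the next token
        top_logits, top_idx = torch.topk(logits, top_k)   # keep the k most probable tokens
        probs = torch.softmax(top_logits, dim=-1)
        sampled = torch.multinomial(probs, num_samples=1) # randomly pick one of them
        next_token = top_idx.gather(-1, sampled)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:   # end-of-sequence token selected
            break

print(tokenizer.decode(input_ids[0]))
```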

One of the problems with autoregressive models is hallucination: the generated sequence contains information that, at first glance, has nothing to do with the information in the input sequence.

For example, here is what the GPT-2 model generated when fed the query (sequence of input tokens) "Is the Earth flat?":

GPT-2 output

Is Earth flat? This is a great question. The answer is that it depends on what you mean by “flat” in the first place. If the Earth is flat, then there is no reason why there should be no life on the planet. And if there were life, there would be nothing to be found on it. In other words, if life were to exist on Earth, it would have to have some form of life. But if it does not exist, why would there be life? Because there are no living things on this planet, and even if they were, they would not be able to survive. So what is the answer to this question? Well, that's a good question to ask. It's not a question that you can answer by saying, “I don't know what life is, but I know it's there.” You can't answer the question “What is life?” because there's no such thing as life in this world. There is only

The model launched into a lengthy discourse about how “if the Earth is flat, there is no reason why there cannot be life on it.” Clearly, such generation is of little value.

Why does this happen? At some step of generation, the probabilities of tokens unrelated to the input query were non-zero, and the sampling algorithm picked one of them. The decoder then continued generating the most probable continuation of that sequence. Such behavior may also indicate that the model was not trained well enough on queries of this kind.

The GPT-2 model used in the example is quite old (it was introduced in 2019), so this result does not characterize the behavior of modern LLMs. Over time, model sizes have grown many times over, as have the sizes of the datasets they are trained on.

Let's try to figure out how acute the problem of hallucinations is when using modern pretrained LLMs.

Hallucinations in Modern LLMs

We will look at the saiga_mistral_7b-GPTQ model in the gptq-4bit-32g-actorder_True configuration. It is the well-known mistral_7b model, further fine-tuned on a set of specially constructed Russian-language datasets.

The authors state that the model was additionally trained to hold a dialogue with the user, so we will analyze its behavior in a similar scenario. Namely, we will use prompts built from the following template:

“You are a helpful assistant who answers questions about the document provided. Don't try to make up an answer if you don't know. Just say that the answer to your question wasn't in the document. Answer in the same language in which the question was asked. The answer should be clear and precise, and should only contain information from the document provided. Document: {passage body}. Question: {question body}.”

A passage is a text (document) on the basis of which the model must answer the question. To obtain a full prompt, a particular passage and question are substituted into the template. The answer to the question may only be based on information from the passage.
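In code, assembling a full prompt from the template is a simple substitution. A minimal sketch (the field names passage and question are our own):

```python
PROMPT_TEMPLATE = (
    "You are a helpful assistant who answers questions about the document provided. "
    "Don't try to make up an answer if you don't know. Just say that the answer to "
    "your question wasn't in the document. Answer in the same language in which the "
    "question was asked. The answer should be clear and precise, and should only "
    "contain information from the document provided. "
    "Document: {passage}. Question: {question}."
)

def build_prompt(passage: str, question: str) -> str:
    """Substitute a reference passage and a question into the template."""
    return PROMPT_TEMPLATE.format(passage=passage, question=question)

prompt = build_prompt(
    passage=("The modern name of Jupiter comes from the name of the ancient Roman "
             "supreme god of thunder."),
    question="Who is Jupiter named after?",
)
```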

In practice, such prompt templates are used when building question-answering systems, for example ones based on the RAG (retrieval-augmented generation) architecture. In such a system, the user asks a question about a certain document (the reference passage), and the language model must give a precise answer to it.

To understand how often the selected model hallucinates when answering questions, let us first give a stricter definition of hallucination.

What is a hallucination?

We will call an individual word a hallucination if it carries information that is not connected in any way with the words of the reference passage. To determine whether a given word from the generated response is a hallucination, we ask two questions about it:

  • Is there a word with the same or similar meaning in the reference passage?

  • Is there a word in the passage that is semantically related to the word in question? For example, are they synonymous or can they be assigned to a certain general category (set).

If the answer to both questions is negative, we will consider the word in question to be a hallucination. At the same time, auxiliary words (prepositions, conjunctions, introductory words, words from the question, etc.) are not hallucinations by definition.

We will evaluate the overall degree of hallucinatory response as follows:

  • 0 — the answer is not a hallucination: it does not contain a single hallucinated word.

  • 1 — the answer is a partial hallucination: it contains hallucinated words, but they make up less than half of the total number of words in the answer.

  • 2 — the answer is a complete hallucination: most of the words in the answer are hallucinations.

Unfortunately, the definition of the degree of hallucination described above is not fully formal: answers cannot be classified by it automatically. However, the definition can be used by human annotators.
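Although deciding which individual words are hallucinations requires a human, aggregating the per-word labels into the 0/1/2 scale above is mechanical. A small sketch of how such aggregation could look (the exact handling of the boundary case of exactly half is our assumption):

```python
def hallucination_degree(answer_words: list[str], hallucinated_words: set[str]) -> int:
    """Aggregate per-word annotator labels into the 0/1/2 hallucination scale.

    answer_words: content words of the generated answer (auxiliary words excluded);
    hallucinated_words: the subset of those words flagged by an annotator.
    """
    if not hallucinated_words:
        return 0    # no hallucinated words: not a hallucination
    if len(hallucinated_words) < len(answer_words) / 2:
        return 1    # partial hallucination: fewer than half of the words
    return 2        # complete hallucination: most of the words
```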

That's exactly what we did. We collected a dataset template, ruHalAttr, with examples of the form (question, reference passage, generative model response), and then manually classified them by degree of hallucination, based on our definition.


ruHalAttr dataset

As a basis, we took Russian-language examples from the Mr.TyDi dataset. Each entry in it consists of a question, a passage relevant to it (a "positive passage"), and several irrelevant passages ("negative passages").

Here is an example of a question and supporting passage from the dataset:

Question: Who is Jupiter named after?

Passage: The planet has been known to people since ancient times, which is reflected in the mythology and religious beliefs of various cultures: Mesopotamian, Babylonian, Greek and others. The modern name of Jupiter comes from the name of the ancient Roman supreme god of thunder.

We extracted pairs of “question + relevant passage” from the Mr.TyDi dataset and randomly selected 300 of them. Then, for each pair, we generated a response from the saiga_mistral_7b-GPTQ model.

To form each prompt, we used the template given above, substituting the question and the relevant passage into it. That is, we asked the model to answer the question based on the information from the passage passed to it. We used the following fairly typical sampling parameter values: temperature=0.005, top_p=0.95, top_k=40, max_tokens=500. In this way, we formed a dataset template consisting of 300 examples of the form (question, reference passage, generative model response).
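A sketch of how such a dataset template could be generated with the transformers library is given below. The Hugging Face model identifier, the plain-prompt formatting (the saiga family's actual conversation format is omitted here), and the mapping of max_tokens to max_new_tokens are our assumptions; loading a GPTQ checkpoint also requires an appropriate quantization backend to be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TheBloke/saiga_mistral_7b-GPTQ"  # assumed id; gptq-4bit-32g-actorder_True revision

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_answer(question: str, passage: str) -> str:
    prompt = build_prompt(passage, question)  # template from the sketch above
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.005,
        top_p=0.95,
        top_k=40,
        max_new_tokens=500,
    )
    # Keep only the newly generated tokens, i.e. drop the prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# qa_pairs: the 300 randomly selected (question, relevant passage) pairs from Mr.TyDi
dataset_template = [
    {"question": q, "passage": p, "answer": generate_answer(q, p)}
    for q, p in qa_pairs
]
```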

The only thing left to do was to hand the dataset template over to our team of annotators. For each example, they filled in two additional fields: the degree of hallucination (0, 1, or 2, according to the definition above) and whether the answer is factually correct.

In this case, the answer is considered factually correct if the following conditions are met:

  • the answer contains information that answers the question asked,

  • The answer can be fully confirmed by the information from the reference passage.

There is no requirement that a factually correct answer must be exhaustive (as complete as possible).

Dataset analysis

Having received the results of the labeling, we assessed how often hallucinations occurred among the examples we had prepared.

Distribution of generative model responses by degree of hallucination (left) and factual accuracy (right)

About 94% of the generative model's responses received hallucination degree "0", i.e. they were not hallucinations. Only 4% of responses received degree "1", and the share of responses that were complete hallucinations (degree 2) was just 2%. In other words, only 6% of the generative model's responses turned out to be hallucinations to one degree or another.

At the same time, the share of factually incorrect answers is significant, more than 30%. This shows that a much bigger problem with large language models lies precisely in factual errors, which are sometimes mistakenly called hallucinations.

Below is one example of a hallucination produced by the saiga_mistral_7b-GPTQ model.

Question: What language was spoken in the Inca Empire?

Passage: When conquering neighboring tribes, the Incas, on the one hand, used their strong and numerous army, and on the other hand, attracted the elite of the conquered regions. Before taking military action, the Incas three times offered the rulers of the conquered region to voluntarily join the empire. They forced the conquered tribes to study the Quechua language, imposed their customs and introduced their laws. The local nobility and priesthood of the conquered peoples retained their position, and the practice of local religions was not prohibited, provided that they worshiped the all-imperial god of the Sun, Inti. The Incas paid great attention to preserving local folk crafts and costumes, so that the origin and social status of any resident of Tawantinsuyu could be easily determined by his dress.

Answer: There is no context information. This context was taken from the book “The Inca Empire” by Michael Ziegler.

It is clear that almost all the words in the sentence “This context was taken from the book “The Inca Empire” by Michael Ziegler” have no related words either in the question or in the supporting passage. Moreover, an Internet search for a book with this title and author also yielded no results.

The simplest algorithm for detecting hallucinations

The resulting dataset allows us to build a fairly simple algorithm for detecting hallucinations. To determine how well the model's answer matches the question and the passage passed to it, we can use the BERTScore metric.

The metric was developed to automatically assess the quality of text generation. BERTScore calculates the similarity between a candidate sentence and a reference sentence. Both sentences are broken down into tokens, for which context-dependent embeddings are then computed. BERTScore derives the similarity of the two sentences from the cosine similarities between the embeddings of their tokens: each token is matched to the most similar token of the other sentence using a greedy algorithm, and the resulting similarities are averaged.

Illustration of the BERTScore-R (Recall) calculation algorithm

BERTScore is actually not one metric but three (their formal definitions are given after the list):

  • BERTScore-Precision. Characterizes the degree to which each token of the candidate sentence has a semantically close token in the reference sentence.

  • BERTScore-Recall. Characterizes the degree to which each token of the reference sentence has a semantically close token in the candidate.

  • BERTScore-F. Harmonic mean of the first two.
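For reference, the definitions from the BERTScore paper can be written as follows (our transcription: x_{1},...,x_{m} are the embeddings of the reference tokens, y_{1},...,y_{l} are the embeddings of the candidate tokens, cos is cosine similarity, and the maximum implements the greedy matching):

BERTScore-Recall = (1/m) * Σ_{i=1..m} max_{j} cos(x_{i}, y_{j})

BERTScore-Precision = (1/l) * Σ_{j=1..l} max_{i} cos(x_{i}, y_{j})

BERTScore-F = 2 * BERTScore-Precision * BERTScore-Recall / (BERTScore-Precision + BERTScore-Recall)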

The simplest hallucination detection algorithm based on the BERTScore metric looks like this (a code sketch follows the description).

Given: a triple (question, passage, generative model's answer).

Find: Is the answer a hallucination?

Algorithm:

  1. Calculate the BERTScore-Precision value between the model's response (the candidate) and the concatenation "question + passage" (which plays the role of the reference).

  2. Compare the obtained BERTScore-Precision value with some threshold. If the value is below the threshold, consider the answer a hallucination.
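A minimal sketch of this algorithm using the bert_score package (the encoder name matches the one used for the distribution below; the threshold value is discussed further down):

```python
from bert_score import score

def is_hallucination(question: str, passage: str, answer: str,
                     threshold: float) -> bool:
    """Return True if the answer looks like a hallucination.

    The answer plays the role of the candidate, and the concatenation
    of the question and the passage plays the role of the reference.
    """
    reference = f"{question} {passage}"
    P, R, F1 = score(
        [answer], [reference],
        model_type="distilbert-base-multilingual-cased",
        num_layers=6,  # distilbert-base-multilingual-cased has 6 transformer layers
    )
    return P.item() < threshold
```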

A similar hallucination detection algorithm was described earlier in a talk by colleagues from deepset.ai. They selected the decision threshold by experimenting on the English-language AttrEval-GenSearch dataset from the paper Automatic Evaluation of Attribution by Large Language Models. In this dataset, unlike in ruHalAttr, the authors distinguish three classes of answers:

  • attributed, if the answer is fully supported by the supporting passage;

  • extrapolatory, if the supporting passage does not contain enough information to support the answer;

  • contradictory, if the answer directly contradicts its supporting passage.

To separate the answers into classes, colleagues from deepset.ai selected two thresholds: one separating the contradictory and extrapolatory classes (trsh_contr_extr), and another between extrapolatory and attributed (trsh_extr_attr).
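The resulting decision rule is a simple comparison against the two thresholds. A sketch, assuming trsh_contr_extr < trsh_extr_attr and that a lower BERTScore-Precision means weaker support by the passage:

```python
def attribution_class(bertscore_precision: float,
                      trsh_contr_extr: float,
                      trsh_extr_attr: float) -> str:
    """Map a BERTScore-Precision value to one of the three AttrEval-GenSearch classes."""
    if bertscore_precision < trsh_contr_extr:
        return "contradictory"
    if bertscore_precision < trsh_extr_attr:
        return "extrapolatory"
    return "attributed"
```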

We performed our own selection of trsh_contr_extr and trsh_extr_attr values on the AttrEval-GenSearch dataset and obtained results almost identical to those of deepset.ai.

Classification quality of examples from the AttrEval-GenSearch dataset using the BERTScore-Precision-based hallucination detection algorithm

With these threshold values, all answers from the contradictory class end up above the trsh_contr_extr threshold. As a result, the F1 score for examples from this class is 0, just as in deepset.ai's results. In other words, this algorithm cannot distinguish contradictory examples from extrapolatory ones.

To detect hallucinations on the ruHalAttr dataset, we need only one threshold. To determine its value, let's look at the distribution of BERTScore-Precision values (computed with the distilbert-base-multilingual-cased encoder) on examples from the ruHalAttr dataset.

BERTScore-Precision distribution using examples from ruHalAttr

Based on this distribution, we chose BERTScore-Precision = 0.85 as the threshold: anything below this value is considered a hallucination. Clearly, this is a rather rough estimate; obtaining a more accurate threshold would require collecting a larger dataset.

What we found out

Under a stricter definition of the term "hallucination", our experiment showed that across a large number of questions the probability of a hallucination is quite small, about 6%. At the same time, factually incorrect answers make up slightly more than a third (35%) of the total.

This means that when building systems based on generative models, you should pay attention first of all to factually incorrect answers, and it is important not to confuse them with hallucinations. The latter can be detected with a simple threshold algorithm; in our example, we built one based on the BERTScore-Precision metric.
