Eliminating Hallucinations in LLMs

Let’s talk about why LLMs lie and how to fix it

Image generated by Stable Diffusion

Translation of the article by Sergey Savvov.

Large Language Models (LLMs) can generate quick responses to a wide range of user queries. However, their tendency to play fast and loose with facts (to hallucinate) sometimes undermines trust in them.

I think we will get the hallucination problem to a much, much better place… it will take us a year and a half, two years. — OpenAI CEO Sam Altman

Is this answer correct?

Developers actively use these models, but hallucinations throw a spanner in the works, because a production system must meet requirements for quality, reliability, and validity. For example, can you always trust code generated by an LLM? What about answers to questions about insurance, medicine, or the law?

In this article, we will look at what LLM hallucinations are, run some experiments, and try to improve the reliability and veracity of the answers.

The information in this article is current as of August 2023; please keep in mind that things may have changed since then.

Comparison of the Accuracy of Different Approaches to Reduce Hallucinations

LLM hallucinations stem from data compression and inconsistency. Quality assurance is difficult because many training datasets are outdated or unreliable. To reduce the chance of error:

  1. Adjust the temperature setting to limit the creativity of the model.

  2. Pay attention to prompt engineering. Ask the model to think step by step and provide facts and citations in the answer.

  3. Use external sources of knowledge to improve the quality of answer checking.

A combination of these approaches achieves the best results.

What is an LLM hallucination?

An example of fact falsification: in total, 12 people have walked on the Moon

In its paper, the Center for Artificial Intelligence Research defines an LLM hallucination as “generated content that is nonsensical or does not match the provided source content.”

Hallucinations can be divided into several types:

  1. Logical errors: the model makes mistakes in its reasoning and gives incorrect answers.

  2. Falsification of facts: instead of answering “I don’t know”, the model confidently asserts non-existent facts. Example: Google’s AI chatbot Bard made a mistake in its first demo.

  3. Model bias: a lack of impartiality on sensitive topics can lead to unexpected results. Example: political biases found in NLP models.

Why do LLMs hallucinate?

I liked the idea from this article: compressing data to train a model leads to hallucinations. Let’s look at the compression ratios for some popular models:

Training Data Compression

The reason for this compression, of course, is that the generative model stores a mathematical representation of the relationships (probabilities) between the inputs (text or pixels) rather than the inputs themselves. More importantly, this representation lets knowledge be retrieved, either by sampling it or by running queries/prompts.

This compression reduces fidelity, much like the JPEG format, as discussed in The New Yorker article. In effect, fully restoring the original knowledge becomes difficult or nearly impossible. The models’ tendency to imperfectly “fill in the gaps”, that is, to hallucinate, is the price paid for such a compact but useful representation of knowledge.

LLMs also hallucinate when their training data contains limited, outdated, or conflicting information about the question they are asked. This is most noticeable for rare or controversial facts.

Preparing the experiment

The purpose of this article is to develop and test practical steps for reducing hallucinations and improving system performance. To do this, I looked at various datasets and settled on the TruthfulQA benchmark.

Question example

Although the dataset has issues, such as discrepancies between the correct answers and their sources, it remains the most suitable option thanks to its variety of topics and comprehensive coverage. It was also convenient that the data comes in a quiz format, which makes it easy to test the model: you can simply request the response in JSON format:

… Return response in JSON format, for example: [{“class”: “A”}]

I ran GPT-3.5-turbo on an 800-row subset of the dataset.
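To make the setup concrete, here is a minimal sketch of such an evaluation loop, assuming the pre-1.0 `openai` Python SDK and hypothetical row fields (`question`, `options`, `correct_class`) for the quiz data; the real dataset layout may differ.

```python
# Minimal evaluation loop for the multiple-choice setup (a sketch, not the
# exact benchmark code). Row fields "question", "options", "correct_class"
# are assumed; the pre-1.0 openai SDK is used.
import json
import openai

def ask_llm(prompt: str, temperature: float = 0.0) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

def build_prompt(row: dict) -> str:
    options = "\n".join(f"{label}. {text}" for label, text in row["options"].items())
    return (
        f"{row['question']}\n{options}\n"
        'Return response in JSON format, for example: [{"class": "A"}]'
    )

def evaluate(rows: list[dict]) -> float:
    correct = 0
    for row in rows:
        raw = ask_llm(build_prompt(row))
        try:
            predicted = json.loads(raw)[0]["class"]
        except (json.JSONDecodeError, KeyError, IndexError):
            continue  # malformed output is counted as a wrong answer
        correct += predicted == row["correct_class"]
    return correct / len(rows)
```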

Other datasets to assess the impact of hallucinations

Lowering the temperature

Model temperature refers to a scalar value used to adjust the probability distribution predicted by the model. In the case of LLMs, it’s a balance between respecting what the model has learned from the training data and generating more varied or creative responses. Typically, these creative responses are more likely to hallucinate.

Comparison of experimental results on temperature reduction

For tasks that require factual certainty, provide verbose context and set temperature=0 so that responses stay grounded in that context.
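As a rough illustration, this is how the temperature comparison can be wired up with the pre-1.0 `openai` SDK; the helper and the example question below are mine, not part of the benchmark code.

```python
# Same question, two temperature settings: temperature=0 gives the most
# deterministic answer, higher values give more varied (and more
# hallucination-prone) ones. Pre-1.0 openai SDK assumed.
import openai

def answer_with_temperature(prompt: str, temperature: float) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

question = "How many people have walked on the Moon? Answer with a single number."
deterministic = answer_with_temperature(question, temperature=0.0)
creative = answer_with_temperature(question, temperature=1.5)
print(deterministic, creative, sep="\n")
```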

Chain of Thought Prompting and Self-Consistency

Many benchmark errors can be fixed by improving the prompt design, which is why I paid extra attention to this topic.

LLMs often fail on step-by-step reasoning problems such as arithmetic or logic. Recent work shows that providing examples of breaking a task into steps improves performance. Notably, simply adding “Let’s think step by step” to the prompt, without any specific examples, leads to similar improvements.

Many papers are devoted to chain-of-thought reasoning techniques. Essentially, they aim to make the model reason step by step and cross-check itself. Here are some of the most prominent approaches:

Diagram illustrating different approaches to problem solving with LLM

Now let’s dive into each method and evaluate their quality based on the dataset.

1. Chain of Thought (CoT)

The main idea of Chain of Thought is to add the instruction “Think step by step” to the prompt:

Think step by step before answering and return response in JSON format, for example: [{“class”: “A”}]

Metrics: Accuracy = 58%
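The only change against the baseline is the appended instruction, so the sketch below just shows the prompt construction and reuses the same evaluation loop as before; the question and options are illustrative.

```python
# Chain-of-Thought prompt construction: the baseline prompt plus the
# "think step by step" instruction. The question and options are examples.
COT_SUFFIX = (
    "Think step by step before answering and return response in JSON format, "
    'for example: [{"class": "A"}]'
)

def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    formatted = "\n".join(f"{label}. {text}" for label, text in options.items())
    return f"{question}\n{formatted}\n{COT_SUFFIX}"

print(build_cot_prompt(
    "What color is human blood inside the veins?",
    {"A": "Blue", "B": "Dark red"},
))
```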

2. Self-Consistency with CoT (CoT-SC)

This approach is an improved version of the previous idea. We ask the model to give several answers, and then we choose the best one by voting:

Think step by step before answering and give three answers: if a domain expert were to answer, if a supervisor were to answer, and your answer. Here’s the response in JSON format:

Metrics: Accuracy = 57%
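For reference, the classic form of self-consistency samples several completions at a non-zero temperature and takes a majority vote, whereas my prompt above packs three “personas” into a single request. A minimal sketch of the sampling-based variant, again assuming the pre-1.0 `openai` SDK:

```python
# Sampling-based self-consistency: draw several answers with temperature > 0
# and keep the majority vote. Pre-1.0 openai SDK assumed.
import json
from collections import Counter
import openai

def sample_answer(prompt: str) -> str | None:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,  # some randomness so that the samples differ
    )
    raw = response["choices"][0]["message"]["content"]
    try:
        return json.loads(raw)[0]["class"]
    except (json.JSONDecodeError, KeyError, IndexError):
        return None

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str | None:
    votes = [a for a in (sample_answer(prompt) for _ in range(n_samples)) if a]
    return Counter(votes).most_common(1)[0][0] if votes else None
```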

3. Tree of Thoughts (ToT)

This framework generalizes chain-of-thought prompting and encourages the exploration of thoughts that serve as intermediate steps in solving general problems with language models. It allows the LLM to evaluate its own intermediate progress on the problem through a process of deliberate reasoning. An example ToT prompt:

Imagine three different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they’re wrong at any point then they leave. Here’s the response in JSON format:

Metrics: Accuracy = 37%

4. Tagged Context Prompts

This method involves generating questions, creating and validating prompts that include a summary of the context, and asking validation questions.

Given the complexity of generating an additional dataset, I adapted the approach by asking for supporting facts and a link to the source:

Diagram illustrating my version of tagged contextual prompts

Provide details and include sources in the answer. Return response in JSON format, for example: [{“class”: “A”, “details”: “Human blood in veins is not actually blue. Blood is red due to the presence of hemoglobin”, “source”: “https://example.com”}]

Metrics: Accuracy = 61%
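Since the answer now carries extra fields, parsing has to handle them. Below is a small sketch of reading the `class`, `details` and `source` fields and doing a crude sanity check on the source URL; the hard-coded answer string just mirrors the example above.

```python
# Reading the "tagged" answer: class, supporting details and a source URL.
# The raw string mirrors the example above; a real run would get it from the model.
import json

raw_answer = (
    '[{"class": "A", '
    '"details": "Human blood in veins is not actually blue. '
    'Blood is red due to the presence of hemoglobin", '
    '"source": "https://example.com"}]'
)

parsed = json.loads(raw_answer)[0]
prediction = parsed["class"]
has_source = parsed.get("source", "").startswith("http")  # crude sanity check
print(prediction, has_source)
```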

5. Self-Correction

This is perhaps one of the most advanced prompt engineering methods. The idea is to have the model re-check and critique its own results, as illustrated below:

Schematic illustration of output validation

Choose the most likely answer from the list [“A”, “B”, “C”, “D”, “E”]. Then carefully double-check your answer. Think about whether this is the right answer, would others agree with it? Improve your answer as needed.

Return response in JSON format, for example: [{“first_answer”:”A”, “final_answer”:”B”}]

Metrics: Accuracy = 58%
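One nice side effect of this format is that you can measure how often the double-check actually changes the answer. A small sketch, using made-up response strings in the same JSON format:

```python
# Each response carries both the first and the final answer, so it is easy to
# count how often the double-check step actually changed the prediction.
import json

def parse_self_check(raw: str) -> tuple[str, str] | None:
    try:
        item = json.loads(raw)[0]
        return item["first_answer"], item["final_answer"]
    except (json.JSONDecodeError, KeyError, IndexError):
        return None

responses = [
    '[{"first_answer": "A", "final_answer": "B"}]',  # made-up model outputs
    '[{"first_answer": "C", "final_answer": "C"}]',
]
pairs = [p for p in map(parse_self_check, responses) if p]
revised = sum(first != final for first, final in pairs)
print(f"model revised {revised} of {len(pairs)} answers")
```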

6. Multiple Agents

Diagram illustrating the multi-agent approach

Multiple LLM agents propose and debate their individual responses and reasoning processes over several rounds to arrive at a common final answer. This approach uses the following prompts:

Prompt 1

Give the facts and your thoughts step by step to find the right answer to this question: {QUESTION}

Prompt 2

Using the solutions from other agents as additional information, choose the correct answer choice: {QUESTION} {ANSWERS}. Return response in JSON format…

Metrics: Accuracy = 54%

I would not recommend this approach for real applications, because it requires two or more requests per question. This not only increases API costs but also slows down the application. In my case, it took over two hours to generate answers to 800 questions.
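For completeness, here is roughly how the two-round exchange can be organized in code, assuming the pre-1.0 `openai` SDK; the prompt texts follow the two templates above, and the number of agents is a free parameter.

```python
# Two-round multi-agent sketch: each "agent" answers independently, then a
# final call sees all of the solutions and picks an answer. Pre-1.0 openai SDK
# assumed; the prompt texts follow the two templates above.
import openai

def ask_llm(prompt: str, temperature: float = 0.7) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response["choices"][0]["message"]["content"]

def multi_agent_answer(question: str, n_agents: int = 3) -> str:
    # Round 1: independent step-by-step reasoning from each agent.
    round_one = [
        ask_llm(
            "Give the facts and your thoughts step by step to find the right "
            f"answer to this question: {question}"
        )
        for _ in range(n_agents)
    ]
    # Round 2: a final call that sees all of the round-1 solutions.
    answers = "\n---\n".join(round_one)
    return ask_llm(
        "Using the solutions from other agents as additional information, "
        f"choose the correct answer choice: {question}\n{answers}\n"
        'Return response in JSON format, for example: [{"class": "A"}]',
        temperature=0.0,
    )
```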

Using an external knowledge base

As noted earlier, hallucinations in LLMs arise from an attempt to reconstruct compressed information. By feeding relevant data from a knowledge base into the prompt at prediction time, we can turn a pure generation problem into a simpler search or summarization problem over the data provided.

Since extracting relevant data from the knowledge base is not trivial in practice, I focused on a small sample (~ 300 rows) from the dataset I collected.

Schematic illustration of the use of external sources

As a result, my prompt looked like this:

Using this information {INFORMATION} choose the correct answer {QUESTION} and return response in JSON format…

Metrics: Accuracy = 65%
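A toy sketch of how the {INFORMATION} slot can be filled: a naive keyword-overlap retriever picks the most relevant snippet from a tiny hand-made knowledge base. Both the snippets and the retriever here are illustrative stand-ins for real search.

```python
# Toy illustration of the {INFORMATION} slot: a naive keyword-overlap
# "retriever" picks the most relevant snippet from a tiny hand-made knowledge
# base. Both the snippets and the retriever are stand-ins for real search.
KNOWLEDGE_BASE = [
    "Twelve astronauts have walked on the Moon, all during the Apollo program.",
    "Venous blood is dark red; it only looks blue through the skin.",
]

def retrieve(question: str) -> str:
    words = set(question.lower().split())
    return max(KNOWLEDGE_BASE, key=lambda doc: len(words & set(doc.lower().split())))

def build_grounded_prompt(question: str) -> str:
    information = retrieve(question)
    return (
        f"Using this information {information} choose the correct answer "
        f"{question} and return response in JSON format, "
        'for example: [{"class": "A"}]'
    )

print(build_grounded_prompt("How many people have walked on the Moon?"))
```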

In practice, some additional work is needed: you have to filter and rank the retrieved fragments and decide how much of the context to pass to the model. In addition, retrieval and ranking add latency, which matters for real-time interaction.

Another interesting approach is Retrieval-Augmented Generation (RAG), which combines retrieval with the text-generation capabilities of large language models. It pairs a retriever that extracts relevant document fragments from a large corpus with an LLM that generates responses based on the retrieved information.

Schematic illustration of RAG, image by Heiko Hotz
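A minimal RAG sketch under the same assumptions (pre-1.0 `openai` SDK): documents are embedded once, the most similar one is retrieved by cosine similarity, and the model answers from it. A production setup would add chunking, a vector store, and re-ranking.

```python
# Minimal RAG sketch: embed the documents, embed the query, take the most
# similar document by cosine similarity and answer from it. Pre-1.0 openai SDK
# assumed; the document snippets are illustrative.
import math
import openai

DOCUMENTS = [
    "Twelve astronauts walked on the Moon between 1969 and 1972.",
    "Hemoglobin makes human blood red, including the blood in the veins.",
]

def embed(texts: list[str]) -> list[list[float]]:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [item["embedding"] for item in response["data"]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def rag_answer(question: str) -> str:
    doc_vectors = embed(DOCUMENTS)
    query_vector = embed([question])[0]
    best_doc = max(
        zip(DOCUMENTS, doc_vectors), key=lambda pair: cosine(query_vector, pair[1])
    )[0]
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Using this information {best_doc} answer the question: {question}",
        }],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"]
```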

Some related articles

Prompt engineering and external knowledge base

This approach combines the previous ideas: various prompt engineering methods are used together with an external knowledge base. I implemented the logic from the CRITIC framework:

CRITIC framework

Using this information {INFORMATION} choose the correct answer {QUESTION} Then carefully double-check your answer. Think about whether this is the right answer, would others agree with it? Improve your answer as needed.

Return response in JSON format, for example: [{“first_answer”:”A”, “final_answer”:”B”}]

Metrics: Accuracy = 67%
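A compact sketch of how the combination can look in code, with a placeholder `retrieve()` standing in for real knowledge-base search and the CRITIC-style double-check folded into the prompt (pre-1.0 `openai` SDK assumed):

```python
# Retrieved context plus a CRITIC-style double-check in a single prompt.
# retrieve() is a placeholder for real knowledge-base search.
import openai

def ask_llm(prompt: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response["choices"][0]["message"]["content"]

def retrieve(question: str) -> str:
    return "Twelve astronauts have walked on the Moon."  # placeholder context

def critic_style_answer(question: str) -> str:
    information = retrieve(question)
    return ask_llm(
        f"Using this information {information} choose the correct answer {question} "
        "Then carefully double-check your answer. Think about whether this is the "
        "right answer, would others agree with it? Improve your answer as needed.\n"
        "Return response in JSON format, for example: "
        '[{"first_answer":"A", "final_answer":"B"}]'
    )
```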

The quality did not improve much compared to the previous section. I think this is due to the dataset issues I mentioned at the beginning of the article: some of the “correct” answers do not match the information in their sources.

Summary

Using the methods described in the article, we eliminated hallucinations

On the one hand, reducing hallucinations is not that difficult: lower the temperature, play with the prompts, add external knowledge. On the other hand, each approach has many nuances.

My main advice is to prioritize prompt engineering: it is the most economical and effective way to reduce hallucinations.

Useful links

  1. Practical Steps to Reduce Hallucination and Improve Performance of Systems Built with Large Language Models – One of the best articles I’ve found.

  2. Reading list of hallucinations in LLMs – a useful GitHub repository with links on hallucinations in LLMs.

If you have any questions or suggestions, feel free to contact me on LinkedIn.
