Tune in to RAGAS and tune RAGAS to suit you

It is no secret that RAG (Retrieval-Augmented Generation) is now a common technique for using Large Language Models (LLMs) in question-answering systems. And where there are ML models, there is also quality assessment. This article covers how to evaluate RAG systems and how to automate that evaluation for your own task.

Let's consider the standard RAG algorithm:

RAG Process Flowchart

To obtain sufficiently universal metrics and algorithms for evaluating the RAG system, we will select text data common to all variations of the algorithm:

  1. questions asked of the RAG system;

  2. contexts – fragments (chunks) from the knowledge base that were selected to answer the question;

  3. answers given by the RAG system to the questions.

The classic option, and the best in terms of quality, is human evaluation of the RAG system. But this method is resource-intensive and not feasible for everyone. A quick and accessible alternative is automatically calculated metrics, or evaluation aspects.

Based on the extracted text data, RAG metrics can be classified according to the following aspects:

  • context quality metrics, or how well the context for the question was selected by the RAG system;

  • response quality metrics, or how good the answer was;

  • factual accuracy metrics, which estimate the amount of hallucination during response generation;

  • context ranking metrics, or context assessment through the prism of fragment ranking in the RAG system.

Types of aspects for assessing the RAG system. Green arrows indicate the dependency of the RAG component

A tool from a small but proud startup, explodinggradients, is vying for the role of a universal library for automatic evaluation of RAG systems: RAGAS.

RAGAS (Retrieval-Augmented Generation Assessment) is a framework for the automatic evaluation of RAG systems. For your RAG system, RAGAS offers a wide range of metrics for evaluating the answers and contexts produced for questions to your knowledge base. If you have no questions yet, they can be generated along with reference answers. The main tool used in all RAGAS algorithms is a large language model with prompts specially crafted for each task.

The algorithms provided by RAGAS can be divided into two types:

  1. algorithm for generating synthetic questions and answers based on a list of documents;

  2. algorithms for calculating metrics based on the results of the RAG system (and reference responses in some cases).

You can read about how the metrics are calculated and how synthetic data is generated in this blog and in my talk at the Data Fest conference. Here, let's focus on the technical aspects of the implementation.

Basic components of RAGAS generation and evaluation:

  • Generator LLM is a large language model responsible for generation. With its help, questions for reference answers are formed, key phrases are extracted in JSON format, questions flagged as irrelevant are rewritten based on feedback, and translations are made.

  • Critic LLM is a large language model responsible for evaluation. It evaluates generation, forms feedback in JSON format, and calculates metrics.

  • Embeddings — a service for obtaining embeddings. It is used in some metrics, as well as when generating a question based on several fragments from the knowledge base.

These components are used to build the library's tooling classes:

  • Docstore — knowledge base storage.

  • Filters — a class that implements the assessment of the quality of text data at different stages of synthetic generation;

  • Evolution — a class that implements an algorithm for obtaining synthetic data from a fragment of a knowledge base, including the generation of synthetic questions and answers, their rewriting and evaluation;

  • Generator — a class that implements an algorithm for obtaining synthetic data;

  • Metric — a class implementing the RAG system evaluation metrics.

Synthetic generation algorithm

The generation algorithm assumes that RAGAS is given a document base in langchain or llama_index document format, from which it needs to produce a set of diverse questions to the knowledge base, fragments relevant to those questions, and, where possible, reference answers (reference only relative to the generation model).

The process of generating synthetics looks like this:

  1. Formation of a knowledge base in which:

    1. documents are broken into fragments;

    2. embeddings are built for documents;

    3. For each fragment, 3-5 key phrases are generated that characterize different aspects of this fragment.

  2. Fragments from the knowledge base are filtered by a score threshold, where the score is the average over the following criteria:

    1. Clarity (understandability);

    2. Depth (depth of context);

    3. Structure (narrative structure);

    4. Relevance (homogeneity of the text in relation to the topic of the narrative).

  3. A question (Simple Question) is constructed from randomly selected fragments and their key phrases. Then a procedure similar to self-refine follows: the question is checked for adequacy (Question Filter); if the check fails, the question is regenerated; if it fails again, a new fragment is sampled from the knowledge base.

  4. Questions (Simple Question) can be rewritten to make the test sample more complex. The authors call this evolution, and the following options are currently available:

    1. Reasoning Evolution – rewriting a question so that the answer requires more reasoning;

    2. Conditioning Evolution – adding a conditional element to the question;

    3. Multi-Context Evolution – another fragment with similar embedding is added to the existing context, and the question is rewritten so that the answer also requires information from the new context added to the fragment;

    4. Conversational Evolution (in progress) – the question is rewritten in a more user-friendly manner.

After each such evolution, the rewritten question is compared with the original one in “depth” and “breadth” (Evolution Filter). If necessary, the evolution is repeated or the original question is regenerated. Several evolutions can be applied while generating a single question; together they form a tree-like structure (see the picture below). When designing this algorithm, the authors were inspired by Evol-Instruct.

  5. For successfully generated questions, the generator model produces an answer (ground_truth). Of course, it is not necessarily “correct” in substance: we trust it exactly as much as we trust the generator model (Generator LLM).

Possible variation of the evolution tree for the question “How to insure a Bathhouse”

Metrics calculation algorithm

The algorithms for calculating metrics differ quite a lot from each other. Let's take a look at some metrics that might be useful to everyone.

  • Context Relevancy is a context relevance metric. Using a prompt selected by the developers, only the sentences needed to answer the question are extracted from the context. The resulting metric is a fraction: the share of context sentences that are relevant to a possible answer. The number of sentences in the text is counted with the pysbd library, which, generally speaking, does not handle Russian well.

    The segmenter configuration used in the RAGAS implementation does not cope correctly with abbreviations and line breaks in Russian. Even though pysbd is used in this metric only to count the number of sentences in the context, that count comes out wrong, which is quite critical. To use the metric correctly, you can replace pysbd in ragas/metrics/_context_relevancy.py (or, better, in all metric files at once) with a splitter that successfully copes with breaking Russian-language text into sentences; a small illustrative comparison is given below.
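    A minimal sketch of such a comparison (the example sentence is invented, and razdel is shown here only as one splitter known to handle Russian well, not necessarily the one you will end up choosing):

# Illustration only: compare how many sentences each splitter finds in Russian text
# with abbreviations and a line break; this count is the denominator of Context Relevancy.
import pysbd
from razdel import sentenize

text = "Договор действует 1 г. с момента подписания.\nСтрахуется имущество, т.е. баня и дом."

pysbd_sentences = pysbd.Segmenter(language="ru", clean=False).segment(text)
razdel_sentences = [s.text for s in sentenize(text)]

print(len(pysbd_sentences), pysbd_sentences)
print(len(razdel_sentences), razdel_sentences)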

  • Faithfulness is a metric of the factual accuracy of the answer relative to the given context. Here the answer is broken down into sentences (again via pysbd), from which a prompt is built that extracts “claims” – atomic facts, as the model sees them.
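    An illustration of what such a decomposition might look like (the sentences here are invented for illustration, not taken from RAGAS):

# One answer sentence decomposed into atomic claims to be checked against the context.
answer_sentence = "The bathhouse can be insured together with the house under a property policy."
claims = [
    "The bathhouse can be insured.",
    "The bathhouse is insured together with the house.",
    "The insurance is issued under a property policy.",
]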

    The resulting statements and context are then fed to another prompt, which checks the relevance of each statement relative to the context. The proportion of correct statements is the metric of the factual accuracy of the answer.

  • Answer Correctness (not to be confused with Answer Relevancy) — its name translates as answer correctness, but it would be more accurate to call it a metric of agreement with the reference answer. Since the reference answer may itself be factually wrong, especially if it was generated by a large language model, this metric does not reflect how correct the answer really is. Optimizing for such a metric effectively distills your RAG system's answers toward the Generator LLM's answers, which is worth keeping in mind. To calculate the metric, statements are extracted from the answer and from the reference answer, just as in the metric above. Both lists are then fed to a prompt that sorts them into three lists according to the following classification:

    • TP (true positive): the statement from the answer is confirmed by one or more statements from the reference answer;

    • FP (false positive): the statement in the answer is not directly supported by any of the statements in the reference answer;

    • FN (false negative): the assertion is in the reference answer but is not present in the response.

Next, from the sizes of these lists you obtain the TP, FP and FN counts familiar from classification metrics, from which the F1 score is computed. In some cases this alone is not enough, so a second component is added: the cosine similarity between the embeddings of the answer and the reference answer. The weighted sum of the two (the ratio between the weight of the F1 score and the weight of the similarity score is configurable) is the final “correctness” metric.
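To tie this together, here is a minimal sketch (ragas 0.1.x-style API; metric names and the llm/embeddings arguments may differ in other versions) of computing these metrics for a single RAG output. Here critic_llm and embeddings are wrapped models, as shown in the launch examples below, and the sample texts are invented.

# A sketch of running the metrics described above; verify imports against your RAGAS version.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_correctness, context_relevancy, faithfulness

data = {
    "question": ["How do I insure a bathhouse?"],
    "contexts": [["Property insurance covers houses, apartments and outbuildings, including bathhouses."]],
    "answer": ["A bathhouse can be insured under a property insurance policy as an outbuilding."],
    "ground_truth": ["A bathhouse is insured as an outbuilding under a property insurance policy."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_relevancy, faithfulness, answer_correctness],
    llm=critic_llm,          # the Critic LLM, wrapped for RAGAS
    embeddings=embeddings,   # needed for the embedding component of answer_correctness
)
print(result)              # aggregate scores
print(result.to_pandas())  # per-sample breakdown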

How to run RAGAS

Not every large language model can perform all of these tasks well. Initially the developers built the library as an add-on to OpenAI models, so all of the prompt engineering and quality measurements were done on those models. The team is now actively developing support for plugging in custom models and other individual components of the library. Unfortunately, not every option can be found in the documentation, especially at the time of writing. Below are the model customization options that will help you run RAGAS on your favourite LLM.

Whichever way you launch RAGAS, pay attention to the following:

  • The technical parameters of the launched models are governed by the RunConfig() configuration and its default values for timeouts, retries and concurrency (a sketch of overriding them follows this list).

    Accordingly, if your models do not technically meet these parameters, the algorithm will crash, most often with an asynchrony-related error. To avoid passing your configuration to every function and model, you can declare a RunConfig object once with the parameters you need.

  • If max_retries is not small and your generator model is not very strong, you will end up doing a lot of regenerations, a lot. Take care of your budget and monitor the number of requests to the models.
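A minimal sketch of overriding the run configuration (the values below are illustrative, not the library defaults; check ragas.run_config.RunConfig in your installed version for the exact fields):

# Illustrative values only.
from ragas.run_config import RunConfig

my_run_config = RunConfig(
    timeout=120,     # per-request timeout, useful for slow self-hosted models
    max_retries=3,   # fewer retries means fewer paid calls when generations fail
    max_wait=60,     # cap on the backoff between retries
    max_workers=4,   # limit concurrency if your endpoint cannot handle many parallel calls
)

# The same object can then be passed to generation and evaluation calls, e.g.:
# generator.generate_with_langchain_docs(df, TEST_SIZE, run_config=my_run_config)
# evaluate(dataset, metrics=[...], run_config=my_run_config)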

Langchain-compatible model

If the model you want to use is available in the Langchain library, you are in luck: it can be run out of the box. Here is an implementation using GigaChat and GigaEmbeddings as an example. We will also set the distributions parameter, which controls the share of each evolution type among the generated questions.

In this implementation, GigaChat with different temperatures acts as both the generator and the critic.

import os

import pandas as pd
from langchain.chat_models.gigachat import GigaChat
from langchain_community.document_loaders import DataFrameLoader
from langchain_community.embeddings.gigachat import GigaChatEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset.evolutions import multi_context, reasoning, simple
from ragas.testset.generator import TestsetGenerator

embeddings = LangchainEmbeddingsWrapper(GigaChatEmbeddings(
        base_url="https://gigachat.devices.sberbank.ru/api/v1",
        auth_url="https://ngw.devices.sberbank.ru:9443/api/v2/oauth",
        credentials=os.environ['YOUR_CREDS'],
        scope="GIGACHAT_API_PERS",
        verify_ssl_certs=False))

generator_llm = LangchainLLMWrapper(GigaChat(
        base_url="https://gigachat.devices.sberbank.ru/api/v1",
        auth_url="https://ngw.devices.sberbank.ru:9443/api/v2/oauth",
        credentials=os.environ['YOUR_CREDS'],
        scope="GIGACHAT_API_PERS",
        model="GigaChat-Pro",
        timeout=60.0,
        verbose=True,
        verify_ssl_certs=False,
        temperature=1.05,
        top_p=0.36,
        profanity=False,
        max_tokens=200,
    ))
critic_llm = LangchainLLMWrapper(GigaChat(
        base_url="https://gigachat.devices.sberbank.ru/api/v1",
        auth_url="https://ngw.devices.sberbank.ru:9443/api/v2/oauth",
        credentials=os.environ['YOUR_CREDS'],
        scope="GIGACHAT_API_PERS",
        model="GigaChat-Pro",
        timeout=60.0,
        verbose=True,
        verify_ssl_certs=False,
        temperature=1e-8,
        profanity=False,
        max_tokens=200,
    ))

dataframe = pd.read_csv(os.environ['PATH TO YOUR DATAFRAME'])
dataframe['contexts'] = dataframe['document']  # keep the source text as metadata as well
loader = DataFrameLoader(dataframe, page_content_column="YOUR CONTEXT COLUMN")
df = loader.load()

# Share of each evolution type among the generated questions.
distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

TEST_SIZE = 10  # number of synthetic questions to generate

testset = generator.generate_with_langchain_docs(df, TEST_SIZE, distributions=distributions, raise_exceptions=False, with_debugging_logs=True, is_async=False)
testset.to_pandas()

Hugging Face via vLLM

A very common question is how to run RAGAS on models from Hugging Face. Here is one way to do it, using the Command-R model as an example. The model is served via vLLM (see the list of available models) and accessed through ChatOpenAI, wrapped in the Langchain wrapper. It sounds convoluted, but the implementation is quite elegant:

Command to start vLLM:

export HF_TOKEN='*****************'
python -m vllm.entrypoints.openai.api_server --model CohereForAI/c4ai-command-r-v01 --tensor-parallel-size 1 --gpu-memory-utilization 1

The Hugging Face token may or may not be needed, depending on the model used.

It is worth noting that vLLM runs only on Linux with Python 3.8-3.11. In addition, at the time of writing, some quantization libraries such as bitsandbytes are not yet supported in it, so it is currently not possible to run a quantized Command-R this way.

from langchain_openai import ChatOpenAI
from ragas.llms import  LangchainLLMWrapper

inference_server_url = "http://localhost:8000/v1"

# create vLLM Langchain instance
chat = ChatOpenAI(
    model="CohereForAI/c4ai-command-r-v01",
    openai_api_key="no-key",
    openai_api_base=inference_server_url,
    max_tokens=1024,
    temperature=0.3,
)
# use the Ragas LangchainLLM wrapper to create a RagasLLM instance
vllm =  LangchainLLMWrapper(chat)

In the same way, any OpenAI-compatible API can be consumed through ChatOpenAI and wrapped in RAGAS's Langchain-compatible class.
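Besides the LLM, the generation pipeline also needs an embeddings model; in the snippet below it appears as vllm_embeddings. A minimal sketch, assuming any Langchain-compatible embeddings will do (the model name here is just an example, not a recommendation):

# The variable name vllm_embeddings is kept only to match the snippet below.
from langchain_community.embeddings import HuggingFaceEmbeddings

vllm_embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")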

When using custom models, you need to explicitly define the models for the intermediate structures: the knowledge base store (it is important to declare the model used for key phrase extraction) and the question evolutions (with all the filters they include). If you wish, you can also override the prompts for the filters and other tools.

from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
import os
from ragas.testset.extractor import KeyphraseExtractor
from langchain.text_splitter import TokenTextSplitter
from langchain.document_loaders import TextLoader
from ragas.testset.evolutions import ComplexEvolution
import pandas as pd
from ragas.run_config import RunConfig

from ragas.testset.filters import QuestionFilter, EvolutionFilter, NodeFilter
from ragas.llms import LangchainLLMWrapper
from ragas.testset.docstore import InMemoryDocumentStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.prompts import (
    context_scoring_prompt,
    evolution_elimination_prompt,
    filter_question_prompt,
)

generator_llm = vllm
critic_llm = vllm
# vllm_embeddings is any Langchain-compatible embeddings model (see the sketch above)
embeddings = LangchainEmbeddingsWrapper(vllm_embeddings)

qa_filter = QuestionFilter(critic_llm, filter_question_prompt)
node_filter = NodeFilter(critic_llm, context_scoring_prompt=context_scoring_prompt)
evolution_filter = EvolutionFilter(critic_llm, evolution_elimination_prompt)
distributions = {
    simple: 0.5,
    reasoning: 0.25,
    conditional: 0.25
}
splitter = RecursiveCharacterTextSplitter(chunk_size=250, chunk_overlap=20)
keyphrase_extractor = KeyphraseExtractor(llm=generator_llm)

docstore = InMemoryDocumentStore(
    splitter=splitter,
    embeddings=embeddings,
    extractor=keyphrase_extractor
)

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
    docstore,
)

# Explicitly wire the custom models, filters and docstore into each evolution.
for evolution in distributions:
    evolution.generator_llm = generator_llm
    evolution.question_filter = qa_filter
    evolution.node_filter = node_filter
    evolution.docstore = docstore
    evolution.evolution_filter = evolution_filter

# df holds documents loaded via a Langchain loader, as in the previous example.
testset = generator.generate_with_langchain_docs(df, TEST_SIZE, distributions=distributions, raise_exceptions=False, with_debugging_logs=True, is_async=False)

Now you can use the model from Hugging Face.
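Once generation finishes, the result can be inspected and saved like any pandas DataFrame (a small sketch; the exact set of columns depends on the RAGAS version):

# Inspect and persist the generated synthetic test set.
test_df = testset.to_pandas()
print(test_df.columns)  # typically includes question, contexts, ground_truth, evolution_type
test_df.to_csv("synthetic_testset.csv", index=False)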

BaseRagasLLM

This is the most painful, most poorly documented, and at the same time most flexible way to fit your model into RAGAS. It involves creating a class inherited from the abstract class BaseRagasLLM and implementing the synchronous and asynchronous methods for generating a result from a prompt:

  • agenerate_prompt — asynchronous generation from a list of prompts;

  • generate_prompt — synchronous generation from a list of prompts;

  • generate_text — synchronous generation from a single prompt;

  • agenerate_text — asynchronous generation from a single prompt.

The input entities (PromptValue) and the values returned by these methods (LLMResult) are Langchain-compatible. Due to the lack of documentation, clear descriptions and tests, there can be many pitfalls on the way to this approach, so it is hard to recommend.
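A rough skeleton of such a wrapper is sketched below. Treat it as an assumption-heavy sketch: the exact set of abstract methods and their signatures depends on the RAGAS version (check ragas.llms.base before copying), my_client.complete() is a hypothetical client for your own model, and only the two *generate_text methods are shown.

# Sketch only: method names and signatures must be checked against ragas.llms.base in your version.
from langchain_core.outputs import Generation, LLMResult
from ragas.llms.base import BaseRagasLLM


class MyCustomRagasLLM(BaseRagasLLM):
    """Adapter around a hypothetical my_client.complete(prompt: str) -> str API."""

    def __init__(self, my_client):
        self.my_client = my_client

    def generate_text(self, prompt, n=1, temperature=1e-8, stop=None, callbacks=None):
        # prompt is a Langchain PromptValue; the model answer is wrapped back into an LLMResult.
        text = self.my_client.complete(prompt.to_string())
        return LLMResult(generations=[[Generation(text=text)]])

    async def agenerate_text(self, prompt, n=1, temperature=1e-8, stop=None, callbacks=None):
        # Naive async variant that simply delegates to the synchronous call.
        return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)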

Epilogue

So, we have looked at the internals of a tool for automatic evaluation of RAG systems and at the ways to run RAGAS with your own model. You also now know the pitfalls and limitations of RAGAS, in particular when working with Russian. I hope this look at the approach proves useful when you evaluate your own RAG systems.

If you want to join our team and solve interesting problems in machine learning, write here.
