How to communicate with a knowledge base in natural language using an LLM and objectively evaluate the performance of the resulting system

Fine-tuning models

One option for improving the quality of the pipeline is to fine-tune the LLM on the domain covered by the knowledge base, for example using the LoRA/QLoRA approaches.
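
As a rough sketch of the LoRA idea (this is not code from the article: the base model name and hyperparameters below are placeholders), the Hugging Face PEFT library lets you train only small low-rank adapter matrices on top of a frozen model:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; in practice you would pick a model suited to your domain and hardware
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the adapter weights
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable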

Completing the pipeline

GPT-3.5 Turbo was chosen as the generation model, since it provides fairly good generation quality at a noticeably lower cost than, for example, GPT-4.

# Imports for the generation part of the pipeline (legacy LangChain import paths, matching the rest of the article)
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough

# Setting up the LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Building the chat prompt from the system prompt template
system_message_prompt = SystemMessagePromptTemplate.from_template(Settings.PROMT_TEMPLATE)
chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt])

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)


# Loading texts
docs = load_and_split_markdown('data/docs/bank_name_docs.md', Settings.HEADERS_TO_SPLIT)

# Setting up the retriever
ensemble_retriever = get_retriever(
    docs, 
    Settings.BM25_K, Settings.MMR_K, Settings.MMR_FETCH_K, 
    Settings.METADATA_INFO, Settings.CONTENT_DESCRIPTION
)


# RAG pipeline
rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | chat_prompt
    | llm
    | StrOutputParser()
)

rag_chain_with_source = RunnableParallel(
    {"documents": ensemble_retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

Another interesting feature I added is indicating the header of the context section that was used when generating the response (the chain above already returns the metadata of the retrieved chunks).
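
As a small illustrative sketch (assuming the markdown splitter stored the section headers in chunk metadata under keys such as "Header 1"/"Header 2", and using a made-up question), the headers can be pulled out of the chain's output like this:

# Hypothetical question, used only for illustration
response = rag_chain_with_source.invoke("How do I order a new card?")

# Each metadata dict comes from the markdown splitter; the header keys here are assumptions
used_headers = [
    meta.get("Header 2") or meta.get("Header 1")
    for meta in response["documents"]
]

print(response["answer"])
print("Sources:", ", ".join(h for h in used_headers if h))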

For the final touch, I added a ChatGPT-like interface for comfortable interaction with the RAG system. Here's what it looks like:
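
The article does not say which UI library was used, but as one possible minimal sketch, Gradio's ChatInterface can wrap the chain in a few lines:

import gradio as gr

def chat_fn(message, history):
    # history is ignored here; each question is answered independently
    result = rag_chain_with_source.invoke(message)
    return result["answer"]

gr.ChatInterface(chat_fn, title="Knowledge base assistant").launch()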

Evaluating the RAG system

Now that you know how to build a RAG system, it is time to answer the question: how well does it work?

For this purpose, we need to set up an evaluation procedure and test what we have built; the RAGAs library will help us with this.

RAGAs is an open-source framework designed to evaluate the components of a RAG pipeline without human assistance. It allows you to create a test dataset and obtain an evaluation of the RAG system we built.

When building the test dataset, GPT-3.5 is used to generate questions from the text, while the reference answers to them are produced by GPT-4, as the most capable model at the time of writing. You can also add manually written questions and answers to the dataset if you wish.

RAGAs accepts the following inputs:
1. question – the question the user submits to the RAG system;
2. answer – the answer generated by our pipeline;
3. contexts – the contexts used to answer the question;
4. ground_truth – the correct answer to the user's question.

Now let's talk about the metrics we will use. Formally, they can be divided into two independent parts: generation evaluation and retrieval evaluation.

Faithfulness
This metric is aimed at identifying factual inconsistencies between the generated answer and the context. It is computed over all the statements in the answer and shows how many of them are hallucinations, i.e. claims that are incorrect or not grounded in the context.
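
To make the idea concrete, the final score boils down to a simple ratio (the real metric uses an LLM to extract the individual statements and check each one against the context; the numbers below are made up):

def faithfulness_score(supported_statements: int, total_statements: int) -> float:
    # Share of statements in the answer that are grounded in the retrieved context
    return supported_statements / total_statements

faithfulness_score(17, 20)  # -> 0.85: 3 of the 20 statements are hallucinated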

Answer relevance
Shows how well the generated answer matches the question. It helps you understand to what extent the system's responses are incomplete, repetitive, or redundant.

Context precision
A numerical measure of how well the retrieved context matches the information needed to answer the question. It is calculated as the ratio of correctly retrieved chunks to the total number of retrieved chunks. This metric helps you find the optimal chunk size when splitting the text.

Context recall
Measures how well the retrieved context covers the information needed for the ground_truth answers; it is the only metric that uses them.

Let's evaluate the pipeline

First, you need to generate a synthetic dataset with questions based on the knowledge base. RAGAs has a convenient class, TestGenerator, which allows you to create a dataset in a few lines of code. However, English-language prompts are hard-coded inside it, so in order to get output in Russian, we had to adjust all the prompts used by this class.

To do this, I added a phrase to each prompt along the lines of "Your instructions are given in English but the answer should be in the same language as the context," and then, inheriting from the TestGenerator class, overrode the methods that use these prompts.

from langchain.prompts import HumanMessagePromptTemplate

SEED_QUESTION = HumanMessagePromptTemplate.from_template(
    """\
Your instructions are given in English but the answer should be in the same language as the context.
Your task is to formulate a question from given context satisfying the rules given below:
    1.The question should make sense to humans even when read without the given context.
    2.The question should be fully answered from the given context.
    3.The question should be framed from a part of context that contains important information. It can also be from tables,code,etc.
    4.The answer to the question should not contain any links.
    5.The question should be of moderate difficulty.
    6.The question must be reasonable and must be understood and responded by humans.
    7.Do no use phrases like 'provided context',etc in the question
    8.Avoid framing question using word "and" that can be decomposed into more than one question.
    9.The question should not contain more than 10 words, make of use of abbreviation wherever possible.
    
context:{context}
"""  # noqa: E501
)

If you want to understand in detail how the dataset construction cycle works in TestGenerator, I recommend reading this article.

from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from ragas.llms import LangchainLLM
from generator import RussianTestGenerator

loader = TextLoader('data/docs/bank_name_docs.md')
docs = loader.load()


# Add custom llms and embeddings
generator_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-3.5-turbo"))
critic_llm = LangchainLLM(llm=ChatOpenAI(model="gpt-4"))
embeddings_model = OpenAIEmbeddings()

# Change resulting question type distribution
testset_distribution = {
    "simple": 0.25,
    "reasoning": 0.25,
    "multi_context": 0.25,
    "conditional": 0.25,
}


test_generator = RussianTestGenerator(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings_model=embeddings_model,
    testset_distribution=testset_distribution
)

synth_data = test_generator.generate(docs, test_size=15).to_pandas()

After creating the dataset, you need to obtain answers and the accompanying contexts from the RAG system.

import ast
import unicodedata

from datasets import Dataset
from tqdm import tqdm

answers = []
contexts = []

# Collect an answer and the retrieved contexts for every synthetic question
for query in tqdm(synth_data.question.tolist(), desc="Generating answers"):
    answers.append(rag_chain_with_source.invoke(query)['answer'])
    contexts.append([
        unicodedata.normalize('NFKD', doc.page_content)
        for doc in ensemble_retriever.get_relevant_documents(query)
    ])

ground_truth = list(map(ast.literal_eval, synth_data.ground_truth.tolist()))

data = {
    "question": synth_data.question.tolist(),
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truth
}

dataset = Dataset.from_dict(data)

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
).to_pandas()

As a result, we obtained a dataset of 15 examples; three of them are shown in the image.

The average metric values on the dataset are as follows:
context_precision — 0.586185516
context_recall — 0.855654762
faithfulness — 0.852083333
answer_relevancy — 0.836044521
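
For reference, these averages can be read straight off the dataframe returned by evaluate above (RAGAs names the per-sample columns after the metrics):

metric_columns = ["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
print(result[metric_columns].mean())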

Based on these metrics, the system can be improved mainly through experiments with the retriever: context_precision is noticeably lower than the other scores.

The complete example code, along with the data and evaluation, can be found on GitHub.

Today we walked through building a RAG system step by step, looked at the subtleties of each stage, and obtained a numerical evaluation of the system using the RAGAs framework.

If the topic interests you but you lack the resources to dive into the technical details, please contact us: we will advise you and help solve your problem.

You can get acquainted with Doubletapp ML projects on the company website.

Thank you for your attention!
