Solving multidimensional problems using LLMs

The RAG process involves three main steps:

  1. Search: at this stage, the model retrieves relevant information from external sources. Indexing and search methods such as Locality-Sensitive Hashing (LSH) and k-Nearest Neighbors (k-NN) are used for this. The process begins with converting text into embeddings, which are stored in a vector database; these vectors let the model quickly find the most relevant information.

  2. Augmentation: the retrieved information is added to the original input of the model, enriching the context. Relevant data is appended to the user's query so that the model can use the augmented information to produce a more accurate and relevant answer. This is where prompt engineering comes in, to integrate the new data with the original context.

  3. Generation: the model uses the augmented context to create an informed and relevant response. At this stage the answer is synthesized from the model's internal knowledge and the information obtained during the search stage. Using transformer mechanisms such as self-attention, the model generates text that is not only relevant but also grounded in up-to-date data. A minimal end-to-end sketch of these three steps is shown below.
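
A minimal sketch of the three steps, assuming the sentence-transformers library for embeddings, Faiss for the vector index and a small seq2seq model as the generator; the corpus, model names and prompt format here are purely illustrative:

import faiss
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Step 1: Search — embed the corpus and build a k-NN index over the vectors
corpus = [
    "Paris is the capital and largest city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Berlin is the capital of Germany.",
]
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus_emb = embedder.encode(corpus, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatL2(corpus_emb.shape[1])
index.add(corpus_emb)

question = "What is the capital of France?"
query_emb = embedder.encode([question], convert_to_numpy=True).astype("float32")
_, neighbors = index.search(query_emb, 2)  # two nearest neighbors
retrieved = [corpus[i] for i in neighbors[0]]

# Step 2: Augmentation — enrich the prompt with the retrieved passages
prompt = (
    "Answer the question using the context.\n"
    "Context: " + " ".join(retrieved) + "\n"
    "Question: " + question
)

# Step 3: Generation — the model synthesizes an answer from the augmented context
generator = pipeline("text2text-generation", model="google/flan-t5-base")
print(generator(prompt, max_new_tokens=32)[0]["generated_text"])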

The same pipeline can be built with the ready-made RAG classes from Hugging Face Transformers, whose retriever is backed by a Faiss index:

import torch
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# initialize the tokenizer, retriever and model
# (the full wiki_dpr index is large; pass use_dummy_dataset=True for a quick local test)
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

# user input
question = "What is the capital of France?"

# tokenize the question; retrieval happens inside generate(), since the model holds the retriever
input_ids = tokenizer(question, return_tensors="pt").input_ids

# generate the answer
outputs = model.generate(input_ids, num_beams=5, num_return_sequences=1)
generated_answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Question:", question)
print("Answer:", generated_answer)

Multi-Head RAG

Multi-Head RAG is an extension of the traditional RAG architecture that introduces a multi-head approach. Vanilla RAG uses a single retrieval and generation pipeline, which may not be enough for complex, multi-aspect tasks. Multi-Head RAG addresses this as follows:

In Multi-Head RAG, each head in the model is responsible for processing a specific aspect of the task.

Each head can be configured to work with a different type of data, such as text documents, images or audio, and the multi-head architecture speeds up processing because the heads can run in parallel (see the sketch after the list below).

Each head in a Multi-Head RAG can be configured to handle specific types of information or contexts. For example:

  1. Scientific articles: one head can be trained specifically on texts from scientific journals and databases, which makes it well suited to analyzing scientific literature.

  2. Social media data: another head can be configured to analyze data from social networks, taking into account the context and specifics of social media.

  3. Commercial information: a third head can process data from commercial sources, such as news and financial reports, providing up-to-date answers.
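
A minimal sketch of fanning retrieval out across such specialized heads in parallel with a thread pool; the three search functions are hypothetical stand-ins for whatever index each head actually queries:

from concurrent.futures import ThreadPoolExecutor

# hypothetical per-head retrievers; in a real system each queries its own specialized index
def search_scientific_articles(query):
    return [f"[science] passage about: {query}"]

def search_social_media(query):
    return [f"[social] post about: {query}"]

def search_commercial_sources(query):
    return [f"[commercial] report about: {query}"]

HEADS = [search_scientific_articles, search_social_media, search_commercial_sources]

def retrieve_all_heads(query):
    # the heads are independent, so their searches can run concurrently
    with ThreadPoolExecutor(max_workers=len(HEADS)) as pool:
        return list(pool.map(lambda head: head(query), HEADS))

print(retrieve_all_heads("recent advancements in quantum computing"))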

The results from each head are then combined to form the final answer: the retrieved data is collated and filtered to remove duplicate or irrelevant passages, and the data from the different sources is merged into a single, consistent context, as in the sketch below.
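
A minimal sketch of this aggregation step, continuing the hypothetical heads above; deduplication here is by exact text and relevance filtering is a simple keyword check, both of which a real system would replace with score-based ranking:

def aggregate_heads(results_per_head, query):
    # collate passages from all heads, dropping exact duplicates
    seen, merged = set(), []
    for passages in results_per_head:
        for passage in passages:
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    # naive relevance filter: keep passages sharing at least one query term
    query_terms = set(query.lower().split())
    relevant = [p for p in merged if query_terms & set(p.lower().split())]
    # a single, consistent context for the generator
    return "\n".join(relevant)

query = "recent advancements in quantum computing"
context = aggregate_heads(retrieve_all_heads(query), query)
print(context)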

So the architecture would look like this:

  1. Search module: Includes several heads, each of which is responsible for searching for information in specific data sources.

  2. Generation module: uses the aggregated data to create an informed response; the generative model is conditioned on the combined search results.

  3. User Interface.

Example implementation (a sketch: the second retriever checkpoint and the random Faiss vectors below are placeholders):

import torch
import numpy as np
from faiss import IndexFlatL2
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# tokenizer, two retrieval "heads" and a shared generator model
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever_1 = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact")
# placeholder checkpoint name for a second, differently indexed source
retriever_2 = RagRetriever.from_pretrained("facebook/rag-token-wiki", index_name="exact")
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever_1)

# example user question
question = "What are the recent advancements in quantum computing?"
input_ids = tokenizer(question, return_tensors="pt").input_ids

# encode the question once; both heads retrieve against the same query embedding
question_hidden_states = model.question_encoder(input_ids)[0]

# retrieve relevant documents from both sources
docs_1 = retriever_1(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")
docs_2 = retriever_2(input_ids.numpy(), question_hidden_states.detach().numpy(), return_tensors="pt")

# combine the retrieved documents from the two heads
context_input_ids = torch.cat([docs_1["context_input_ids"], docs_2["context_input_ids"]], dim=0)
context_attention_mask = torch.cat([docs_1["context_attention_mask"], docs_2["context_attention_mask"]], dim=0)
retrieved_doc_embeds = torch.cat([docs_1["retrieved_doc_embeds"], docs_2["retrieved_doc_embeds"]], dim=1)

# score every retrieved passage against the question embedding
doc_scores = torch.bmm(
    question_hidden_states.unsqueeze(1),
    retrieved_doc_embeds.float().transpose(1, 2),
).squeeze(1)

# generate the answer from the combined context
outputs = model.generate(
    context_input_ids=context_input_ids,
    context_attention_mask=context_attention_mask,
    doc_scores=doc_scores,
    n_docs=doc_scores.shape[1],
    num_beams=5,
    num_return_sequences=1,
)
generated_answer = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print("Question:", question)
print("Answer:", generated_answer)

# a custom Faiss index as an additional head (random vectors stand in for real document embeddings)
dimension = 768
index = IndexFlatL2(dimension)
doc_embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(doc_embeddings)
documents = [f"document {i}" for i in range(1000)]  # the texts behind the embeddings

query_vector = np.random.random((1, dimension)).astype('float32')
D, I = index.search(query_vector, 10)
print("Nearest Neighbors:", I)

# integrating the custom Faiss index with RAG: embed the query with the question encoder,
# search the index, and map the returned positions back to document texts
def retrieve_custom_index(query, index, documents, tokenizer, model, k=10):
    query_ids = tokenizer(query, return_tensors="pt").input_ids
    query_vector = model.question_encoder(query_ids)[0].detach().numpy().astype('float32')
    D, I = index.search(query_vector, k)
    return [documents[i] for i in I[0]]

# example user question against the custom index
custom_question = "What are the recent trends in AI?"
custom_retrieved_docs = retrieve_custom_index(custom_question, index, documents, tokenizer, model)

# generate an answer using the custom index: join each passage with the question using
# RAG's " // " document separator, tokenize with the generator tokenizer, and use uniform
# doc scores since there are no DPR scores for the custom index
contexts = [doc + " // " + custom_question for doc in custom_retrieved_docs]
ctx = tokenizer.generator(contexts, return_tensors="pt", padding=True, truncation=True)
custom_outputs = model.generate(
    context_input_ids=ctx.input_ids,
    context_attention_mask=ctx.attention_mask,
    doc_scores=torch.ones(1, len(custom_retrieved_docs)),
    n_docs=len(custom_retrieved_docs),
    num_beams=5,
    num_return_sequences=1,
)
custom_generated_answer = tokenizer.batch_decode(custom_outputs, skip_special_tokens=True)[0]

print("Custom Question:", custom_question)
print("Custom Answer:", custom_generated_answer)

Two retrievers are used to work with different data indexes: the user's request is tokenized and encoded once, documents are retrieved from the two sources, and the combined context is passed to the generator.


Multi-Head RAG improves accuracy by processing different aspects of a task in parallel, which allows more context and greater data diversity to be taken into account; traditional single-retriever models are often limited in this regard and more prone to mistakes.

OTUS experts talk about RAG and other models and tools as part of practical machine learning courses. Go to the catalog and choose the appropriate direction.
