Data processing for RAG systems

Hello everyone, my name is Andrey Shakhov, I am a Python developer and Lead Backend Developer at wpp.digital. I only started working with ML, or more precisely with LLMs, at the company at the end of 2023. Now I spend about 40% of my working time on tasks of this kind.

I decided to start leveling up with a simple internal task: using an LLM to cut the time spent searching for information in the corporate wiki. The business goal is transparent – every employee should get an answer to their question in a couple of seconds, without a long trek through every page of the knowledge base.

Judging by the content I had managed to read on Habr by that point, one of the most popular ways to use LLMs is building RAG systems. More on this below.

Retrieval Augmented Generation (RAG) is a system that passes data to the LLM not in the raw form it comes from the user, but enriched with context. It lets you extend the model's knowledge without complex and lengthy training. In some cases it is even enough to work with a remote API such as OpenAI's, so your backend stays small and fast. The alternative is to fine-tune the model on your own data, but the answers will still be inaccurate because of the model's size, and such training requires a lot of resources.
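To make the flow concrete, here is a minimal sketch of a RAG request, assuming the official openai client; retrieve() is a hypothetical helper that returns relevant wiki fragments, and the model name is just an example, not the one used in the project.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def retrieve(question: str) -> list[str]:
    """Hypothetical retriever: return wiki fragments relevant to the question."""
    raise NotImplementedError


def answer(question: str) -> str:
    # Enrich the user's question with retrieved context instead of sending it raw.
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content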

Now I will tell you how to prepare data for RAG systems. There will be no instructions here on how to assemble yet another RAG on top of some GPT – there are plenty of such articles on Habr, and you will easily find one that suits you. Here we are talking about the preliminary step – data processing, specifically my experience processing a single closed corporate wiki.

Process flow

This is how I built the initial data processing after exporting the wiki:

  1. Read the page file;

  2. Cleared content from noise;

  3. Divided the cleaned content into parts;

  4. Turned parts into vectors using embeddings;

  5. Uploaded the vectors into the qdrant database together with meta information (see the sketch after this list).
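Assembled end to end, the pipeline looks roughly like this sketch. The article's listings below go through llama_index instead; this is just the bare flow with qdrant-client and sentence-transformers (assumed choices) to make the five steps explicit – the collection name, model name and the primitive cleaning are placeholders for the real steps described later.

from pathlib import Path

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384-dimensional vectors
client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="wiki",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

point_id = 0
for file in Path("pages/").iterdir():              # 1. read the page file
    content = file.read_text()                     # 2. cleaning goes here (see below)
    for part in content.split("\n"):               # 3. divide into parts
        if len(part.strip()) <= 4:
            continue
        vector = model.encode(part).tolist()       # 4. text -> vector (embedding)
        client.upsert(                             # 5. store the vector with metadata
            collection_name="wiki",
            points=[PointStruct(id=point_id, vector=vector,
                                payload={"text": part, "file_path": file.name})],
        )
        point_id += 1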

What is embedding

This is a function that converts text into a vector of numbers. Thanks to these coordinates you can measure how similar one text is to another – the greater the distance, the less similar the texts are in meaning, and vice versa.
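A quick illustration with sentence-transformers (an assumed model choice, not necessarily the one used in the project): related sentences end up closer to each other than unrelated ones.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

vec_a = model.encode("How do I reset my corporate VPN password?")
vec_b = model.encode("Instructions for changing the VPN password")
vec_c = model.encode("Office kitchen cleaning schedule")

# Cosine similarity: higher means closer in meaning (i.e. smaller distance).
print(util.cos_sim(vec_a, vec_b))  # noticeably higher ...
print(util.cos_sim(vec_a, vec_c))  # ... than this one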

Vectorization feature

An important feature of RAG systems is how we turn text into a vector, that is, into an ordered set of coordinates in a multidimensional space. You have to remember to apply the same embedding to user requests as to the context. This is necessary so that the vectors are positioned correctly relative to each other in terms of encoded meaning; otherwise the system simply will not work. Sometimes such obvious things slip by, especially when switching embeddings for processing user requests.
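A small sketch of that rule, with example model names: vectors produced by different models live in different spaces and may not even have the same dimension, so the query has to go through the very same embedding as the indexed context.

from sentence_transformers import SentenceTransformer

index_model = SentenceTransformer("all-MiniLM-L6-v2")   # model used when building the index

doc_vector = index_model.encode("Ctrl + C - copy the selection")

# Correct: encode the user query with the same model.
query_vector = index_model.encode("how do I copy text?")

# Wrong: a different model puts the query into a different vector space
# (here even a different dimension: 768 vs 384), so distances to doc_vector are meaningless.
other_model = SentenceTransformer("all-mpnet-base-v2")
bad_query_vector = other_model.encode("how do I copy text?")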

What is qdrant

This is a vector database that stores vectors together with meta information. Every record must have the same vector “dimension” so that search works correctly. You can attach any extra information to each record – for example, the text corresponding to the vector, the source of this “piece,” the page where it is located, and so on.
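A hedged sketch of working with qdrant through the official qdrant-client: the collection is created with a fixed vector dimension, each point carries a payload with the source text and file, and search returns that payload back. The collection name, file name and vector size here are illustrative.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Every vector in the collection must have the same dimension (here 384).
client.recreate_collection(
    collection_name="wiki",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

client.upsert(
    collection_name="wiki",
    points=[PointStruct(
        id=1,
        vector=[0.1] * 384,  # placeholder; a real embedding goes here
        payload={"text": "Ctrl + C - copy", "file_path": "hotkeys.md"},
    )],
)

# Search returns the closest vectors along with their payloads.
for hit in client.search(collection_name="wiki", query_vector=[0.1] * 384, limit=3):
    print(hit.score, hit.payload["text"], hit.payload["file_path"])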

The further steps depend heavily on the quality of the content, so of all the steps I will focus on how I built the work in step 2 – cleaning.

I was given a large number of pages in Yandex Wiki. They were all laid out in different formats, with different content and different editors behind them. Some pages were purely navigational, that is, they only stored links to other pages, while others held the data we needed. In total I was able to export about 130 pages in Markdown format.

Why markdown?

Markdown produces the “cleanest” text in terms of the information you actually need. The HTML export, which the wiki also provides, contains more noise, and processing it even with beautifulsoup takes noticeably longer than cleaning Markdown. Earlier I also had experience parsing PDFs (both one- and two-column layouts), and there everything came down to extracting the text while completely ignoring the other elements.

With other formats things are about the same – there is noise everywhere, and everywhere you have to get rid of it. So the choice of format is, first of all, about how simply and quickly it can be reduced to plain text.

For example, here is the code of one of the pages (the others contain corporate information and cannot be disclosed, but this one captures the essence well) in HTML:

<ol>  
<li>  
<p data-line="0"><strong>Ctrl + C</strong> - Копировать</p>  
</li>  
<li>  
<p data-line="2"><strong>Ctrl + X</strong> - Вырезать</p>  
</li>  
<li>  
<p data-line="4"><strong>Ctrl + V</strong> - Вставить</p>  
</li>  
<li>  
<p data-line="6"><strong>Ctrl + Z</strong> - Отменить</p>  
</li>  
<li>  
<p data-line="8"><strong>Ctrl + Y</strong> - Повторить</p>  
</li>  
<li>  
<p data-line="10"><strong>Ctrl + A</strong> - Выделить все</p>  
</li>  
<li>  
<p data-line="12"><strong>Ctrl + S</strong> - Сохранить</p>  
</li>  
<li>  
<p data-line="14"><strong>Ctrl + N</strong> - Создать новый документ/окно</p>  
</li>  
<li>  
<p data-line="16"><strong>Ctrl + O</strong> - Открыть файл</p>  
</li>  
<li>  
<p data-line="18"><strong>Ctrl + P</strong> - Печать</p>  
</li>  
</ol>

And now the same page in Markdown:

1. **Ctrl + C** - Копировать  
2. **Ctrl + X** - Вырезать  
3. **Ctrl + V** - Вставить  
4. **Ctrl + Z** - Отменить  
5. **Ctrl + Y** - Повторить  
6. **Ctrl + A** - Выделить все  
7. **Ctrl + S** - Сохранить  
8. **Ctrl + N** - Создать новый документ/окно  
9. **Ctrl + O** - Открыть файл  
10. **Ctrl + P** - Печать  
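For comparison, extracting text from the HTML variant looks roughly like this with BeautifulSoup; it works, but it is an extra parsing pass that the Markdown export simply does not need.

from bs4 import BeautifulSoup

html = """
<ol>
<li><p data-line="0"><strong>Ctrl + C</strong> - Копировать</p></li>
<li><p data-line="2"><strong>Ctrl + X</strong> - Вырезать</p></li>
</ol>
"""

# get_text() drops the tags and attributes, keeping only the visible text.
text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
print(text)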

Blank Pages

In the wiki I found several pages that we needed but that were completely empty. These are “root” pages – the semantic links between pages are built through them, and they cannot be deleted from the wiki without breaking its structure. There were also pages that became empty during cleaning. Empty pages were dropped right at export time so they would not slow down the rest of the work.
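In code the check is trivial – roughly something like this sketch, which skips files whose content is empty after export or after cleaning:

from pathlib import Path

for file in Path('pages/').iterdir():
    content = file.read_text()
    if not content.strip():
        # an empty "root"/navigation page - nothing to index, skip it
        continue
    ...  # the rest of the processing described below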

So, at first I chose the “find the right pages” path. Each file was loaded with the marko library, and each element was processed recursively through the following steps:

  1. Clear the content of Markdown image inserts using the regular expression {%(.*)%};

  2. Clear the content of Markdown-format links using the regular expression \[(.*)]\((.*)\);

  3. If after this the element contains fewer than 4 characters, skip it;

  4. If the element contains 4 or more characters, it is most likely meaningful text, so keep it.

About the cons

This approach did not turn out to be ideal; there were disadvantages.

Firstly, useless scraps of text still slipped through regularly (for example, ones cut off in the middle of a sentence), which only increased the share of noise and therefore reduced the chance of an accurate answer from the LLM. Secondly, some text pieces were tiny, sometimes literally a few words. They were uninformative, while for vectorization I needed the most informative parts of the text, containing complete thoughts and sentences.

I tried to solve the second problem at the request-preparation stage – I added the file name to the meta information of each vector. When qdrant returned the parts of text it considered relevant, these names were extracted from them, and the complete files were included in the request to the LLM. This was supposed to let the model see the full context of the page and give a more accurate answer.

A miracle did not happen – on the contrary, the model's focus became blurred. If it received several pages of context along with the question, the answer was most often wrong, roughly speaking about something else entirely. I regularly got responses like “bot bot bot..”, which is just a repetition of the initial token from the prompt rather than a real answer. The listing for that solution is as follows:

import re
from pathlib import Path

from llama_index.schema import Document
from marko import Markdown
from marko.block import BlankLine, BlockElement
from marko.element import Element
from tqdm import tqdm

from utils import get_index


def get_docs() -> list[Document]:
    md = Markdown()
    pages = []
    for file in Path('pages/').iterdir():
        content = file.read_text()
        document = md.parse(content)
        for block in document.children:
            for text in get_text(block):
                # file_path is stored as metadata only and excluded from embedding
                doc = Document(text=text, metadata={'file_path': file.name})
                doc.excluded_embed_metadata_keys = ['file_path']
                pages.append(doc)
    return pages


def get_text(element: BlockElement | Element) -> list[str]:
    if type(element) is BlankLine:
        return []

    if type(element.children) is str:
        text = str(element.children)
        # drop wiki inserts, image links and fragments shorter than 4 characters
        match_insert = re.match(r'{%(.*)%}', text)
        match_image = re.match(r'!\[(.*)](.*)', text)
        if match_insert or match_image or len(text) < 4:
            return []
        return [text]

    # non-leaf element: collect text from children recursively
    texts = []
    for el in element.children:
        texts.extend(get_text(el))
    return texts


def main():
    docs = get_docs()
    index = get_index()
    for doc in tqdm(docs):
        index.insert(doc)


if __name__ == "__main__":
    main()
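The query side of that experiment is not shown in the listing; roughly, it looked like the sketch below. The collection and directory names are illustrative, the embedding model must be the same one used for indexing, and, as described above, stuffing whole pages into the prompt ended up hurting rather than helping.

from pathlib import Path

from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model
client = QdrantClient(url="http://localhost:6333")


def build_context(question: str) -> str:
    hits = client.search(
        collection_name="wiki",
        query_vector=model.encode(question).tolist(),
        limit=5,
    )
    # Collect the source pages of the relevant fragments...
    file_names = {hit.payload["file_path"] for hit in hits}
    # ...and put the *entire* pages into the prompt, not just the fragments.
    return "\n\n".join(Path("pages", name).read_text() for name in file_names)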

Process flow No. 2

In the next iteration, I reviewed the entire content processing flow and made the following changes:

  1. Cleaning the content with regular expressions moved to load time, so there ended up being slightly fewer files;

  2. Added a new replacement regex: [\n]+ was collapsed into a single newline \n;

  3. Dropped marko in favor of splitting the content on line breaks.

As a result, the request to the LLM began to include only the meaningful parts of text that the database returned in response to the query.

These changes also helped with questions whose answers can appear on different pages.

Overall, such a system is already viable. The answers became more meaningful; where previously there was noise or simply a “crooked” answer, the model began to respond coherently and closer to the truth. You can work with it.

The final listing was as follows:

import re
from pathlib import Path

from llama_index.core.schema import Document
from tqdm import tqdm

from config import settings
from utils import get_index


def get_docs() -> list[Document]:
    pages = []
    for file in Path(settings.pages_path).iterdir():
        content = file.read_text()
        # cleaning happens with regexes right at load time
        content = re.sub(r'{%(.*)%}', "", content)          # wiki inserts
        content = re.sub(r'\[(.*)]\((.*)\)', "", content)   # markdown links
        content = re.sub('[\n]+', '\n', content)            # collapse repeated newlines
        pages.extend(get_doc_by_text(content, file))
    return pages


def get_doc_by_text(content: str, file: Path) -> list[Document]:
    pages = []
    for block in content.split('\n'):
        doc = Document(text=block, metadata={'file_path': file.name})
        doc.excluded_embed_metadata_keys = ['file_path']
        if not is_doc_valid(doc):
            continue
        pages.append(doc)
    return pages


def is_doc_valid(doc: Document) -> bool:
    if type(doc.text) is not str:
        return False
    return len(doc.text.strip()) > 4


def main():
    docs = get_docs()
    index = get_index()
    for doc in tqdm(docs):
        index.insert(doc)


if __name__ == "__main__":
    main()

On to the results

From the obvious truths:

  1. The smaller the request size to LLM, the lower the frequency of inaccurate/incorrect answers;

  2. The quality of the answers also depends on the prompt, so prompts need further tuning.

Now to my personal conclusions:

  • the simpler the data format, the fewer resources it takes to clean the content of noise;

  • it is impossible to pick a single cleaning recipe that works 100% of the time for all documents, but you should aim for one so that you do not end up cleaning data until the second coming;

  • decomposing the process lets you find the weak points of the cleaning stage and optimize or replace them.

Welcome to the comments, I will be glad to receive constructive criticism.
