Vector databases vs Accuracy - part 1

I decided to “quickly” assemble a local RAG (retrieval augmentation generation), which will find terms from Ozhegov’s dictionary. After studying the Internet, I understood. It all comes down to the recipe (simplified interpretation):

We take the text we need and cut it into pieces
Using the “embedder” we turn it into embeddings
Loading into a vector database
Chaining ChatGPT or an alternative using Langchain
We write a prompt
We rejoice

I was not happy for long. After a while I began to understand that he is confused in terms. I ask for one thing, but get something completely different. Let's figure out why.

You can see it in the diagram above. For the answer, “search result + question” is used – you need to check the request and answer from the database. Since we load the dataset ourselves and know what the answer should be, we can automatically check whether the vector database gives us the correct answer to our request.

Preparing the data

We take the text we need and cut it into pieces.

import re
import pandas as pd

with open("ozhegov.txt", mode="r", encoding="UTF-8") as file:
    text_lines = file.readlines()

def return_first_match(pattern, text):
    result = re.findall(pattern,text)
    result = result[0] if result else ""
    return result

data = []

for line in text_lines:

    title = return_first_match(r"^[а-яА-Я]{2,}(?=,)", line)
    text = return_first_match(r"\.\s([А-Я]+.*)\n", line)

    if(len(title) > 3):
        data.append(
            {
                "title": title,
                "text" : text
            })

dataset = pd.DataFrame(data)
dataset.to_csv('ozhegov_dataset.csv')
dataset.head()

	title	text
0	SHADE	Cap for a lamp, lamp. Green a. 11 p…
1	ABAZINSKY	Relating to the Abazins, their language, national…
2	ABAZINS	The people living in Karachay-Cherkessia and Adygea…
3	ABBAT	Abbot of a male Catholic monastery. 2…
4	ABBEY	Catholic monastery.

For our experiment, we will not use the entire dataset, but will take only 1000 random pieces.

#бывают пустые в моем датасете
dataset = dataset[dataset['title'].str.len() > 0]
dataset = dataset[dataset['text'].str.len() > 0]
dataset = dataset.sample(n=1000)

dataset.astype({"text": str, "title": str})
dataset.info(show_counts=True)

dataset.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 28934 to 16721
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   title       1000 non-null   object
 2   text        1000 non-null   object
dtypes: int64(1), object(2)
memory usage: 31.2+ KB

Unnamed: 0	title	text
28934	THICKENING	A thickened area on something. U. trunk. U. vessel.
30193	BLACK HUNDRED	In Russia at the beginning. 20th century: member of the chauvinist…
14378	IMPOSSIBLE	One that is impossible to overcome. Irresistible…
27420	THERMAL	Related to the application of thermal energy in…
27021	SCHEMATIZE	Present (-vlyat) in the form of a diagram (in 2 values)…

Turning it into embeddings

I didn’t choose “Embedder” for a long time, I used the Rating of Russian-language encoders offers

I won’t tell you about the vector databases themselves and how they work, what they are. Articles have already been written on this matter. I will use Chroma (top 1 from this article).

For the purity of the experiment, I decided to take the top 5 models and 3 different distance functions available in Chroma

models = [
    "intfloat/multilingual-e5-large",
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli",
    "cointegrated/LaBSE-en-ru",
    "sentence-transformers/LaBSE"
]

distances = [
    "l2",
    "ip",
    "cosine"
]

Getting ready Chroma

We put the necessary ones pip packages

%pip install -U sentence-transformers ipywidgets chromadb chardet charset-normalizer

There is an error with the installation, there is a solution in the error itself

HINT: This error might have occurred since this system does not have Windows Long Path support enabled. You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=powershell#enable-long-paths-in-windows-10-version-1607-and- later

Let's run Chroma in Docker

docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma

Features for working with Chroma

We define the necessary functions for creating and deleting collections. And also a search function, in which we take a record with the minimum distance + add data by index from the dataset. Indexes in the database are equal to indexes in the dataset.

from chromadb.utils import embedding_functions
import chromadb
chroma_client = chromadb.HttpClient(host="localhost", port=8000)

def create_collection(model_name, distance):
    
    chroma_client = chromadb.HttpClient(host="localhost", port=8000)

    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    
    #в этом эксперименте не будем использовать, нам нужно найти термин
    #text_collection = chroma_client.create_collection(name="text", embedding_function=sentence_transformer_ef)
    
    title_collection = chroma_client.create_collection(name="title", embedding_function=sentence_transformer_ef, metadata={"hnsw:space": distance})

    ids = list(map(str, dataset.index.values.tolist()))
    #text_collection.add(ids = ids, documents=dataset["text"].tolist())
    title_collection.add(ids = ids, documents=dataset["title"].tolist())

    return title_collection

def delete_collection():
    chroma_client.delete_collection("title")


def query_collection(collection, query, max_results, dataframe, model_name, distance):
    results = collection.query(query_texts=query, n_results=max_results, include=['distances']) 
    #print(results)
    df = pd.DataFrame({
                'id':results['ids'][0], 
                'score':list(map(float,results['distances'][0])),
                'query': query,
                'title': dataframe[dataframe.index.isin(list(map(int,results['ids'][0])))]['title'],
                'content': dataframe[dataframe.index.isin(list(map(int,results['ids'][0])))]['text'],
                'model_name': model_name,
                'distance': distance
                })
    
    # Забираем с минимальной дистанцией, значит он ближе и больше похож
    df = df[df.score == df.score.min()]
    df['is_found'] = df.apply(lambda row: row.query == row.title, axis=1)
    
    return df

We create a test dataset and start

We form a test dataset of random 100 pieces from the dataset loaded into Chroma.

test_dataset = dataset.sample(n=100)
test_dataset.head()
test_results = pd.DataFrame()

We run and collect the results for each model with a different distance function.

for model in models:
    for distance in distances:
        print(f"{model} - {distance}")
        try:
            delete_collection()
        except Exception as ex:
            print(f"delete_collection error: {ex}")

        collection = create_collection(model, distance)

        for title in test_dataset["title"].tolist():
            test_results = test_results._append(query_collection(
            collection=collection,
            query=title,
            max_results=5,
            dataframe=dataset,
            model_name=model,
            distance=distance))

            print(f"{len(test_results)}")

        

test_results.to_csv("results_ozhegov2.csv")
test_results.head()

	id	score	query	title	content	model_name	distance	is_found
10363	10363	1.315708e-12	GERKICHONS	GERKICHONS	Small unripe cucumbers intended for…	intfloat/multilingual-e5-large	l2	True
8566	8566	7.252605e-13	IMMIGRANT	IMMIGRANT	A person who has immigrated somewhere. II immy…	intfloat/multilingual-e5-large	l2	True
12175	17352	1.157366e-12	PENSION	MENSTRUATION	Monthly bleeding from a woman's uterus (…	intfloat/multilingual-e5-large	l2	False
18297	11029	7.939077e-13	BUDDLE	STRAY	Walking without knowing the way, wandering. P. through the forest.	intfloat/multilingual-e5-large	l2	False
14052	5394	1.371903e-12	DECADE	A WEEK	A unit of time equal to seven days…	intfloat/multilingual-e5-large	l2	False

Let's look at the results

Now let's count the number of found (correct) results.

finally_result = pd.DataFrame()
for model in models:
    for distance in distances:
        df = test_results.loc[test_results['model_name'].str.contains(model) == True]
        df = df.loc[df['distance'].str.contains(distance) == True]

        finally_result = finally_result._append(pd.DataFrame({
                'founded': [len(df[df['is_found'] == True])],
                'model_name': [model],
                'distance': [distance]
                }))
        
finally_result.head(15)

founded	model_name	distance
24	intfloat/multilingual-e5-large	l2
24	intfloat/multilingual-e5-large	ip
24	intfloat/multilingual-e5-large	cosine
17	sentence-transformers/paraphrase-multilingual-…	l2
0	sentence-transformers/paraphrase-multilingual-…	ip
19	sentence-transformers/paraphrase-multilingual-…	cosine
23	symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli	l2
25	symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli	ip
21	symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli	cosine
14	cointegrated/LaBSE-en-ru	l2
14	cointegrated/LaBSE-en-ru	ip
14	cointegrated/LaBSE-en-ru	cosine
14	sentence-transformers/LaBSE	l2
14	sentence-transformers/LaBSE	ip
14	sentence-transformers/LaBSE	cosine

Conclusion

It has not yet been possible to assemble a local RAG “quickly” for working with terms and produce a ready-made recipe.

Current ~25% – can hardly be called good accuracy.

What options do I see for solving the problem with accuracy:

Use hybrid search with BM25, not all vector databases support it
1. Chroma – in progress
2. Pinecone – public preview – not available in Docker – feature request
3. Milvus – in progress
4. Weaviate – supports
5. Qdrant – supports
6. Elasticsearch – supports
7. Opensearch – supports
Attach Postgres + BM25 as a second retriever and search in two at once – it sounds so-so + duplicate information…
Tune the model, but this is far from “quick”
Work with the text (lemmatization, normalization, etc.) – I highly doubt it will help.

Source
In the second part I will try to use hybrid search
PS I’ll wait in the comments for what other option to try, so that I can “quickly” and deploy it locally.

Vector databases vs Accuracy – part 1

Preparing the data

Turning it into embeddings

Getting ready Chroma

Features for working with Chroma

We create a test dataset and start

Let's look at the results

Conclusion

Source
In the second part I will try to use hybrid search
PS I’ll wait in the comments for what other option to try, so that I can “quickly” and deploy it locally.

How to increase the speed of decision making in a company by implementing research without a budget

The big Martian problem is energy

Parallels isn’t just Windows on Mac

Announcement of the new Blender Studio project

Come out and come in normally: why do IT universities train builders and is it possible to make humanities a developer

Ford’s AI to boost electric scooter culture

Leave a Reply Cancel reply

Preparing the data

Turning it into embeddings

Getting ready Chroma

Features for working with Chroma

We create a test dataset and start

Let's look at the results

Conclusion

SourceIn the second part I will try to use hybrid searchPS I’ll wait in the comments for what other option to try, so that I can “quickly” and deploy it locally.

Similar Posts

Leave a Reply Cancel reply

Source
In the second part I will try to use hybrid search
PS I’ll wait in the comments for what other option to try, so that I can “quickly” and deploy it locally.