Vector databases vs Accuracy – part 1

I decided to “quickly” put together a local RAG (retrieval-augmented generation) that would look up terms from Ozhegov’s dictionary. After digging around the Internet, I realized it all comes down to the following recipe (a simplified interpretation; a minimal code sketch follows the list):

  1. We take the text we need and cut it into pieces

  2. We turn the pieces into embeddings using an “embedder”

  3. We load them into a vector database

  4. We chain in ChatGPT or an alternative via LangChain

  5. We write a prompt

  6. We rejoice
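
To make the recipe concrete, here is a minimal sketch of the steps with LangChain and Chroma. It is only an illustration: import paths differ between LangChain versions, and the model name, chunk sizes and the question are placeholder assumptions, not what I actually use below.

# A rough sketch of the recipe, assuming the classic (0.0.x) LangChain layout;
# in newer versions these classes live in langchain_community / langchain_openai.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Take the text and cut it into pieces
with open("ozhegov.txt", encoding="UTF-8") as f:
    raw_text = f.read()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_text(raw_text)

# 2-3. Turn the pieces into embeddings and load them into a vector database
embeddings = HuggingFaceEmbeddings(model_name="cointegrated/LaBSE-en-ru")
vectordb = Chroma.from_texts(chunks, embedding=embeddings)

# 4-5. Chain in an LLM; RetrievalQA uses its default prompt under the hood
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(), retriever=vectordb.as_retriever())
print(qa.run("What does the word 'абажур' mean?"))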

Scheme of work

My joy did not last long. After a while I began to realize that it confuses terms: I ask for one thing and get something completely different. Let’s figure out why.

You can see it in the diagram above: the answer is built from “search result + question”, so we need to check what the database returns for a given request. Since we load the dataset ourselves and know what the answer should be, we can automatically check whether the vector database returns the correct record for our query.

Preparing the data

We take the text we need and cut it into pieces.

import re
import pandas as pd

with open("ozhegov.txt", mode="r", encoding="UTF-8") as file:
    text_lines = file.readlines()

def return_first_match(pattern, text):
    result = re.findall(pattern,text)
    result = result[0] if result else ""
    return result

data = []

for line in text_lines:

    # headword: at least two Cyrillic letters at the start of the line, up to the first comma
    title = return_first_match(r"^[а-яА-Я]{2,}(?=,)", line)
    # definition: the text after the first ". ", starting with a capital Cyrillic letter
    text = return_first_match(r"\.\s([А-Я]+.*)\n", line)

    if len(title) > 3:
        data.append(
            {
                "title": title,
                "text": text
            })

dataset = pd.DataFrame(data)
dataset.to_csv('ozhegov_dataset.csv')
dataset.head()

    title       text
0   SHADE       Cap for a lamp, lamp. Green a. 11 p…
1   ABAZINSKY   Relating to the Abazins, their language, national…
2   ABAZINS     The people living in Karachay-Cherkessia and Adygea…
3   ABBAT       Abbot of a male Catholic monastery. 2…
4   ABBEY       Catholic monastery.

For our experiment we will not use the entire dataset; we will take only 1000 random entries.

# some entries in my dataset are empty
dataset = dataset[dataset['title'].str.len() > 0]
dataset = dataset[dataset['text'].str.len() > 0]
dataset = dataset.sample(n=1000)

dataset = dataset.astype({"text": str, "title": str})
dataset.info(show_counts=True)

dataset.head()
<class 'pandas.core.frame.DataFrame'>
Index: 1000 entries, 28934 to 16721
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   title       1000 non-null   object
 2   text        1000 non-null   object
dtypes: int64(1), object(2)
memory usage: 31.2+ KB

Unnamed: 0   title           text
28934        THICKENING      A thickened area on something. U. trunk. U. vessel.
30193        BLACK HUNDRED   In Russia at the beginning of the 20th century: member of the chauvinist…
14378        IMPOSSIBLE      One that is impossible to overcome. Irresistible…
27420        THERMAL         Related to the application of thermal energy in…
27021        SCHEMATIZE      Present (-vlyat) in the form of a diagram (in 2 values)…

Turning it into embeddings

I didn’t spend long choosing an “embedder” – I went with what the Rating of Russian-language encoders suggests.

I won’t go into what vector databases are and how they work – articles have already been written on that. I will use Chroma (top 1 in that article).

For the purity of the experiment, I decided to take the top 5 models and the 3 distance functions available in Chroma:

models = [
    "intfloat/multilingual-e5-large",
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli",
    "cointegrated/LaBSE-en-ru",
    "sentence-transformers/LaBSE"
]

distances = [
    "l2",
    "ip",
    "cosine"
]

Getting Chroma ready

We install the necessary pip packages:

%pip install -U sentence-transformers ipywidgets chromadb chardet charset-normalizer
The installation may fail with an error, but the solution is right there in the error message:

HINT: This error might have occurred since this system does not have Windows Long Path support enabled. You can find information on how to enable this at https://pip.pypa.io/warnings/enable-long-paths

https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation?tabs=powershell#enable-long-paths-in-windows-10-version-1607-and-later

Let's run Chroma in Docker

docker pull chromadb/chroma
docker run -p 8000:8000 chromadb/chroma
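
A quick sanity check that the client can actually reach the server in the container – the heartbeat and list_collections calls are part of the standard chromadb client API:

import chromadb

# connect to the Chroma server started in Docker and make sure it responds
chroma_client = chromadb.HttpClient(host="localhost", port=8000)
print(chroma_client.heartbeat())         # returns a nanosecond timestamp if the server is up
print(chroma_client.list_collections())  # empty list on a fresh start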

Functions for working with Chroma

We define functions for creating and deleting the collection, plus a search function that takes the record with the minimum distance and attaches the data from the dataset by index. The ids in the database are equal to the indexes in the dataset.

from chromadb.utils import embedding_functions
import chromadb
chroma_client = chromadb.HttpClient(host="localhost", port=8000)

def create_collection(model_name, distance):
    
    chroma_client = chromadb.HttpClient(host="localhost", port=8000)

    sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_name)
    
    # not used in this experiment – we only need to find the term itself
    #text_collection = chroma_client.create_collection(name="text", embedding_function=sentence_transformer_ef)
    
    title_collection = chroma_client.create_collection(name="title", embedding_function=sentence_transformer_ef, metadata={"hnsw:space": distance})

    ids = list(map(str, dataset.index.values.tolist()))
    #text_collection.add(ids = ids, documents=dataset["text"].tolist())
    title_collection.add(ids = ids, documents=dataset["title"].tolist())

    return title_collection

def delete_collection():
    chroma_client.delete_collection("title")


def query_collection(collection, query, max_results, dataframe, model_name, distance):
    results = collection.query(query_texts=query, n_results=max_results, include=['distances']) 
    #print(results)
    df = pd.DataFrame({
                'id':results['ids'][0], 
                'score':list(map(float,results['distances'][0])),
                'query': query,
                'title': dataframe[dataframe.index.isin(list(map(int,results['ids'][0])))]['title'],
                'content': dataframe[dataframe.index.isin(list(map(int,results['ids'][0])))]['text'],
                'model_name': model_name,
                'distance': distance
                })
    
    # Take the record with the minimum distance: the closer, the more similar
    df = df[df.score == df.score.min()]
    df['is_found'] = df.apply(lambda row: row.query == row.title, axis=1)
    
    return df
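
Before running the whole grid, it is worth sanity-checking these functions on a single query. A sketch under the same setup; the model and distance here are just one pair from the lists above, and the headword is simply the first one from the sampled dataset:

# build a collection for one model/distance pair and look up a single headword
collection = create_collection("cointegrated/LaBSE-en-ru", "cosine")

term = dataset["title"].iloc[0]  # a headword we know is in the collection
hit = query_collection(
    collection=collection,
    query=term,
    max_results=5,
    dataframe=dataset,
    model_name="cointegrated/LaBSE-en-ru",
    distance="cosine",
)
print(hit[["query", "title", "score", "is_found"]])

delete_collection()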

We create a test dataset and start

We form a test dataset of 100 random entries from the dataset loaded into Chroma.

test_dataset = dataset.sample(n=100)
test_dataset.head()
test_results = pd.DataFrame()

We run and collect the results for each model with a different distance function.

for model in models:
    for distance in distances:
        print(f"{model} - {distance}")
        try:
            delete_collection()
        except Exception as ex:
            print(f"delete_collection error: {ex}")

        collection = create_collection(model, distance)

        for title in test_dataset["title"].tolist():
            test_results = test_results._append(query_collection(
            collection=collection,
            query=title,
            max_results=5,
            dataframe=dataset,
            model_name=model,
            distance=distance))

            print(f"{len(test_results)}")

        

test_results.to_csv("results_ozhegov2.csv")
test_results.head()

       id     score         query       title         content                                                              model_name                      distance  is_found
10363  10363  1.315708e-12  GERKICHONS  GERKICHONS    Small unripe cucumbers intended for…                                 intfloat/multilingual-e5-large  l2        True
8566   8566   7.252605e-13  IMMIGRANT   IMMIGRANT     A person who has immigrated somewhere. II immy…                      intfloat/multilingual-e5-large  l2        True
12175  17352  1.157366e-12  PENSION     MENSTRUATION  Monthly bleeding from a woman's uterus (…                            intfloat/multilingual-e5-large  l2        False
18297  11029  7.939077e-13  BUDDLE      STRAY         Walking without knowing the way, wandering. P. through the forest.   intfloat/multilingual-e5-large  l2        False
14052  5394   1.371903e-12  DECADE      A WEEK        A unit of time equal to seven days…                                   intfloat/multilingual-e5-large  l2        False

Let's look at the results

Now let's count the number of found (correct) results.

finally_result = pd.DataFrame()
for model in models:
    for distance in distances:
        df = test_results.loc[test_results['model_name'].str.contains(model) == True]
        df = df.loc[df['distance'].str.contains(distance) == True]

        finally_result = finally_result._append(pd.DataFrame({
                'founded': [len(df[df['is_found'] == True])],
                'model_name': [model],
                'distance': [distance]
                }))
        
finally_result.head(15)

founded  model_name                                        distance
24       intfloat/multilingual-e5-large                    l2
24       intfloat/multilingual-e5-large                    ip
24       intfloat/multilingual-e5-large                    cosine
17       sentence-transformers/paraphrase-multilingual-…   l2
0        sentence-transformers/paraphrase-multilingual-…   ip
19       sentence-transformers/paraphrase-multilingual-…   cosine
23       symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli   l2
25       symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli   ip
21       symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli   cosine
14       cointegrated/LaBSE-en-ru                          l2
14       cointegrated/LaBSE-en-ru                          ip
14       cointegrated/LaBSE-en-ru                          cosine
14       sentence-transformers/LaBSE                       l2
14       sentence-transformers/LaBSE                       ip
14       sentence-transformers/LaBSE                       cosine

Conclusion

So far I have not managed to assemble a local RAG “quickly” for working with terms, nor to produce a ready-made recipe.

The current ~25% can hardly be called good accuracy.
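
For reference, the ~25% comes from the table above: at best 25 correct answers out of 100 test queries. A small sketch that turns the counts into a share, reusing the frames already built:

# hits out of the 100 test queries, expressed as a share
finally_result["accuracy"] = finally_result["founded"] / len(test_dataset)
print(finally_result.sort_values("accuracy", ascending=False).head())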

What options do I see for solving the accuracy problem:

  1. Use hybrid search with BM25 – not all vector databases support it (a rough sketch of the idea follows after this list):

    1. Chroma – in progress

    2. Pinecone – public preview – not available in Docker – feature request

    3. Milvus – in progress

    4. Weaviate – supports

    5. Qdrant – supports

    6. Elasticsearch – supports

    7. Opensearch – supports

  2. Attach Postgres + BM25 as a second retriever and search in both at once – sounds so-so, plus the information gets duplicated…

  3. Fine-tune the model, but that is far from “quick”

  4. Work with the text (lemmatization, normalization, etc.) – I strongly doubt it will help.
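
To give an idea of what the BM25 side of a hybrid search could look like, here is a rough sketch using the rank_bm25 package. The whitespace tokenization is deliberately naive and the mixing weight alpha is an arbitrary assumption, not something I have validated:

# pip install rank-bm25
from rank_bm25 import BM25Okapi

# naive whitespace tokenization over the same titles we loaded into Chroma
corpus = dataset["title"].tolist()
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def hybrid_scores(query, vector_scores, alpha=0.5):
    # mix lexical BM25 scores with (already normalized) vector similarities;
    # alpha is an arbitrary weight – tuning it is exactly the open question
    lexical = bm25.get_scores(query.lower().split())
    if lexical.max() > 0:
        lexical = lexical / lexical.max()  # bring BM25 scores to [0, 1]
    return alpha * lexical + (1 - alpha) * vector_scores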

Source

In the second part I will try hybrid search.

P.S. I’ll be waiting in the comments for other options to try – something that can be done “quickly” and deployed locally.
