Implementing BM25+ from scratch

Hello, my name is Boris. I am the author of the Telegram channel Boris again. Every once in a while something interesting catches my eye and I dig deep into it. This time it is the BM25+ search algorithm.

This article began when I came across a loud and funny result: BM25, an algorithm developed back in the eighties, beat advanced LLM-based vector search methods.

Let's figure out what kind of beast this is and why it works so well. BM25 is not only surprisingly effective, but also very simple. In this article, we will implement it in Python from scratch. We'll start with the simplest possible search, move on to TF-IDF, and then derive BM25+ from it.

The article is suitable for those who know nothing at all about search; more experienced readers can skip ahead to the implementation of the algorithm.

The code is available in Google Colab.

When the baseline beat neural network methods

Even in the era of LLM-based search, BM25 holds up well and remains a strong baseline. It is also useful in practice: it can complement modern systems. The output of BM25 can serve as one of the ranking stages, for example by feeding its results to neural networks for further processing.

First, we need data. I liked this dataset with descriptions of ecommerce products. Let's imagine we need to build search for an online store. We will start with the simplest methods and gradually improve them until we arrive at BM25+.

import numpy as np
import pandas as pd
from collections import Counter
from tqdm.auto import tqdm

# https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
data_url="https://raw.githubusercontent.com/sugatagh/E-commerce-Text-Classification/main/Dataset/ecommerceDataset.csv"
dataset = pd.read_csv(
    data_url,
    names = ['label', 'description']
)
dataset = dataset.dropna()
dataset = dataset.drop_duplicates()

In the dataset, we are interested in only one column, the text description of the product: description. In total we have 27,802 such products.
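A quick sanity check of the size (the exact count may drift if the upstream CSV changes):

print(len(dataset))  # 27802 product descriptions after dropna/drop_duplicates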

Stupid, naive search

The simplest thing we can do: given a user query, find the documents that contain it verbatim.

documents = dataset.description.tolist()
def search_dummy(documents, query, limit=10):
    return [doc for doc in documents if query in doc][:limit]

[doc[:100] for doc in search_dummy(documents, 'smartphone', limit=2)]
# Output:
# ['Shinco 80 cm (32 Inches) HD Ready Smart LED TV SO32AS (Black) (2019 model) Size name:32 Inches   Thi',
#  'Amazon Brand - Solimo 12-inch Wall Clock - Checkered (Silent Movement, Black Frame) Bring function a']

This method is full of problems, and the most obvious one is that a query like search_dummy(documents, 'SmartphonE', limit=2) will not return anything.

We need to preprocess the texts. To keep things simple, we will remove all characters except letters and digits, as well as typical suffixes.

Removing suffixes can be seen as simple stemming: isolating word roots. In practice, more advanced methods are used for this, for example the Porter stemmer from the nltk package. But this is not an NLP course.
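For reference, this is roughly how the nltk stemmer mentioned above could be swapped in (a sketch; it assumes nltk is installed). We will stick with the toy stem function below to keep things transparent.

from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem('running'))       # e.g. 'run'
print(porter.stem('motivational'))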

import string

def stem(word):
    for suffix in set(['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def preprocess_document(document):
    new_doc="".join(
        c for c in document if c.isalnum() or c == ' '
    ).lower().strip()
    new_doc=" ".join([
        stem(word) for word in new_doc.split(' ')
    ])
    return new_doc

def preprocess_documents(documents):
    new_documents = []
    for doc in documents:
        new_doc = preprocess_document(doc)
        new_documents.append(new_doc)
    return new_documents

documents_preprocessed = preprocess_documents(documents)
documents_preprocessed[:1][0][:50]

# Output:
# 'paper plane design fram wall hang motivational off'

def search_dummy(documents, query, limit=10):
    query = preprocess_document(query)
    return [doc for doc in documents if query in doc][:limit]

search_dummy(documents_preprocessed, 'SmartphonE', limit=1)[0][:50]
# Output:
# 'shinco 80 cm 32 inches hd ready smart led tv so32a'

Great, problem solved. But any typo in the query immediately breaks everything: for example, search_dummy(documents_preprocessed, 'smrtaphone', limit=1) will not return anything. Partial queries like smar, which are common in ecommerce, do not work either.

Term Frequency on N-grams

N-grams will help us solve these problems: we split each word in the query into character sequences of length N.

from nltk.util import ngrams
import nltk

N_GRAM_SIZE = 3


def documents_to_ngrams(documents, n_gram_size=N_GRAM_SIZE, progress=False):
    document_ngrams = []
    iterator = documents
    if progress:
        iterator = tqdm(documents)  # progress bar, since this step is not fast
    for doc in iterator:
        doc_ngrams = []
        for word in doc.split(' '):
            word_ngrams = ngrams(word, n_gram_size)
            for ngram in word_ngrams:
                doc_ngrams.append(''.join(ngram))
        document_ngrams.append(tuple(doc_ngrams))
    document_ngrams = tuple(document_ngrams)

    return document_ngrams

documents_ngrams = documents_to_ngrams(documents_preprocessed, progress=True)
documents_ngrams[0][:5]
# Output:
# ('pap', 'ape', 'per', 'pla', 'lan')

Now it is no longer enough to simply return everything that contains at least one N-gram: we would get a lot of random matches. It makes sense to show first the results with the most matches.

This is where the concept of ranking arises. You need not only to find documents, but also to sort them so that the most relevant results are on top. For this you need a ranking function that maps each (query, document) pair to a number, which is larger the more relevant the document is to the user's query. In this case, our ranking function is the frequency of matches, or term frequency.

def display_search_results(documents, idx_scores, char_limit=100):
    for idx, score in idx_scores:
        print(f'{score:0.2f}: {documents[idx][:char_limit]}')

def query_to_ngrams(query, n_gram_size=N_GRAM_SIZE):
    return documents_to_ngrams([query], n_gram_size=n_gram_size)[0]

def search_tf(documents_ngrams, query, limit=5, n_gram_size=N_GRAM_SIZE):
    index = [Counter(doc_ngrams) for doc_ngrams in documents_ngrams]
    query = query_to_ngrams(query, n_gram_size)

    match_scores = []
    for ngram_counts in tqdm(index):
        score = 0
        for query_ngram in query:
            score += ngram_counts.get(query_ngram, 0)
        match_scores.append(score)

    idx_scores = zip(range(len(documents_ngrams)), match_scores)
    idx_scores = sorted(idx_scores, key=lambda pair: -pair[1])

    return idx_scores[:limit]

idx_scores = search_tf(documents_ngrams, 'smratphone')
display_search_results(documents, idx_scores)

```
Output:
116.00: Risk Savvy: How to Make Good Decisions About the Author GERD GIGERENZER is director of the Max Planc
116.00: Risk Savvy: How to Make Good Decisions About the Author Gerd Gigerenzer is the author of Gut Feeling
105.00: HP B4B09PA Headphones with Mic Product Description HP Headphones Overview
With HP B4B09PA Over ear H
98.00: iBall Rocky Over-Ear Headphones with Mic Product Description  iBall Rocky Headset Over-Ear Headphone
96.00: The Global War on Christians: Dispatches from the Front Lines of Anti-Christian Persecution About th
```

Now our search works for queries with typos, but the results are terrible. Let's figure out what's going on.

search_tf performs the search: it takes the documents' N-grams and a user query as input and returns a list of (document index, relevance) tuples, where relevance is the number of N-gram matches. The results are sorted, and the function returns only the limit most relevant documents.

Internally, this function first indexes the documents: for each one, it counts how many times each N-gram occurs in it. For example, the document "smasmart" contains the following N-grams: {"sma": 2, "mas": 1, "asm": 1, "mar": 1, "art": 1}. Such an index lets us quickly see how many N-grams from the query appear in a document. Strictly speaking, it could be computed once rather than recalculated on every request.
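A quick check of that example using the helpers defined above:

Counter(documents_to_ngrams(['smasmart'])[0])
# Counter({'sma': 2, 'mas': 1, 'asm': 1, 'mar': 1, 'art': 1})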

Why is the result so bad? Long documents contain more random matches and will always rank higher in our search results.

Let's look at a toy example:

def documents_to_index(documents, n_gram_size=N_GRAM_SIZE):
    """Helper: preprocess documents and split them into N-grams"""
    documents_preprocessed = [
        preprocess_document(doc) for doc in documents
    ]

    documents_ngrams = documents_to_ngrams(documents_preprocessed, n_gram_size)
    return documents_ngrams

dummy_documents = [
    'smartphone',
    'frying pan',
]
dummy_documents_ngrams = documents_to_index(dummy_documents)
idx_scores = search_tf(dummy_documents_ngrams, 'smratphone')
display_search_results(dummy_documents, idx_scores)
```
Output:
4.00: smartphone
0.00: frying pan
```

dummy_documents = [
    'smartphone',
    'frying pan',
    'headphones for your smartphone',
]
dummy_documents_ngrams = documents_to_index(dummy_documents)
idx_scores = search_tf(dummy_documents_ngrams, 'smratphone')
display_search_results(dummy_documents, idx_scores)

```
Output:
7.00: headphones for your smartphone
4.00: smartphone
0.00: frying pan
```

To solve this problem, we would like the N-gram sma to contribute more to relevance the larger a share of the document it makes up. For example, in the word smart it is one of only three N-grams, while in the word smartphone it carries less weight.

def search_tf_weighted(documents_ngrams, query, limit=5, n_gram_size=N_GRAM_SIZE):
    index = [Counter(doc_ngrams) for doc_ngrams in documents_ngrams]
    query = query_to_ngrams(query, n_gram_size)
    match_scores = []
    for ngram_counts in index:
        score = 0
        total_ngrams = sum(ngram_counts.values())
        if total_ngrams == 0:
            match_scores.append(0)  # keep scores aligned with document indexes
            continue
        for query_ngram in query:
            score += ngram_counts.get(query_ngram, 0)
        score = score/total_ngrams
        match_scores.append(score)

    idx_scores = zip(range(len(documents_ngrams)), match_scores)
    idx_scores = sorted(idx_scores, key=lambda pair: -pair[1])

    return idx_scores[:limit]

idx_scores = search_tf_weighted(dummy_documents_ngrams, 'smratphone')
display_search_results(dummy_documents, idx_scores)
```
Output:
0.50: smartphone
0.39: headphones for your smartphone
0.00: frying pan
```

Already better.

Even searching through all documents is no longer useless:

idx_scores = search_tf_weighted(documents_ngrams, 'smratphone')
display_search_results(documents, idx_scores) 
```
Output:
0.23: Apple iPhone Xs (Space Grey, 4GB RAM, 64GB Storage, 12 MP Dual Camera, 458 PPI Display) Size name:64
0.19: AmazonBasics Lighting to USB A Cable for iPhone and iPad - 4 Inches (10 Centimeters) - Black (Ideal 
0.18: VALUEACTIVE Screen Guard for Samsung Galaxy S10 Plus Tempered Glass Full Screen Coverage 5D Curved F
0.17: TSIM Lifetime Global SIM Card(1GB) Size:1GB   Voice / SMS Slabs 100 minutes/100 SMS in: Australia, B
0.17: JBL T450BT Extra Bass Wireless On-Ear Headphones with Mic (Black) Colour:Black   Colour:Black Powerf
```

idx_scores = search_tf_weighted(documents_ngrams, 'iphone 7')
display_search_results(documents, idx_scores)
```
0.31: Apple iPhone Xs (Space Grey, 4GB RAM, 64GB Storage, 12 MP Dual Camera, 458 PPI Display) Size name:64
0.24: VALUEACTIVE Screen Guard for Samsung Galaxy S10 Plus Tempered Glass Full Screen Coverage 5D Curved F
0.22: Apple iPhone 7 (Black, 2GB RAM, 32GB Storage) Colour:Black                                          
0.21: TSIM Lifetime Global SIM Card(1GB) Size:1GB   Voice / SMS Slabs 100 minutes/100 SMS in: Australia, B
0.20: AmazonBasics Lighting to USB A Cable for iPhone and iPad - 4 Inches (10 Centimeters) - Black (Ideal 
```

We implemented a real search on term frequency and are a third of the way to BM25+.

But let's try to make the request more specific by adding adjectives:

idx_scores = search_tf_weighted(documents_ngrams, 'the heavy simple beautiful black iphone')
display_search_results(documents, idx_scores)

```
Output:
0.86: Dick Francis's Refusal Review Praise for Dick Francis’s Refusal “To the delight of Dick/Felix Franci
0.62: Apple iPhone Xs (Space Grey, 4GB RAM, 64GB Storage, 12 MP Dual Camera, 458 PPI Display) Size name:64
0.38: Betting on Horse Racing For Dummies (For Dummies Series) From the Back Cover Cash in big on multiple
0.38: Seagate 2TB Backup Plus Slim (Blue) USB 3.0 External Hard Drive for PC/Mac with 2 Months Free Adobe 
0.30: ahhaaaa Boy's Cotton Waistcoat, Shirt And Trouser Set This product is made from cotton. Ahhaaaa brin
```

The search failed again. The thing is that for our algorithm all words in the query are equally important, whether it's beautiful, the, or iphone.

Inverse Document Frequency

Let's figure out how to give each N-gram of the query a weight that is larger the more informative the N-gram is. Inverse Document Frequency (IDF) is useful for this: a measure of how rare an N-gram is across all documents. Intuitively, the rarer an N-gram is, the more information it carries. For example, if the occurs in every document, this N-gram should not be taken into account at all when calculating relevance. Conversely, if sma occurs in only one document, its presence in the query unambiguously identifies the most relevant result.

We implement IDF using the standard formula from Wikipedia:
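Written out, the smoothed variant used in the code below is:

IDF(q_i) = \ln\frac{1 + N}{1 + n(q_i)}

where N is the total number of documents and n(q_i) is the number of documents containing the N-gram q_i.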

from collections import defaultdict

def calculate_idf(documents_ngrams):
    ngram_appearance = {}
    for doc_ngrams in documents_ngrams:
        for ngram in set(doc_ngrams):
            if ngram not in ngram_appearance:
                ngram_appearance[ngram] = 0
            ngram_appearance[ngram] += 1
    idf = {}
    N = len(documents_ngrams)
    for ngram, appearance_count in ngram_appearance.items():
        idf[ngram] = np.log((1+N)/(1 + appearance_count))
    return idf

dummy_documents = [
    'smartphone',
    'frying pan',
    'headphones for your smartphone',
]
dummy_documents_ngrams = documents_to_index(dummy_documents)
dummy_idf = calculate_idf(dummy_documents_ngrams)
dummy_idf 
```
Output:
{'pho': 0.28768207245178085,
 'mar': 0.28768207245178085,
 'tph': 0.28768207245178085,
 'sma': 0.28768207245178085,
 'one': 0.28768207245178085,
 'art': 0.28768207245178085,
 'hon': 0.28768207245178085,
 'rtp': 0.28768207245178085,
 'pan': 0.6931471805599453,
 'fry': 0.6931471805599453,
 'adp': 0.6931471805599453,
 'our': 0.6931471805599453,
 'for': 0.6931471805599453,
 'you': 0.6931471805599453,
 'ead': 0.6931471805599453,
 'dph': 0.6931471805599453,
 'hea': 0.6931471805599453}
```

It can be seen that N-grams related to the frying pan received a higher weight, since they are less common.

Now the relevance of the document is equal to the sum of the frequencies of N-grams from the query multiplied by the information content of these N-grams.

It's called TF-IDF:

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) TF(q_i, D)

Here D is a document, Q is a query, and q_i is an N-gram from the query. Note that IDF is not document-specific, because it is calculated over all documents in our collection for each N-gram.

IDF and TF can be calculated in different ways depending on the problem.

Let's implement this search algorithm:

def search_tf_idf(
    documents_ngrams,
    query,
    limit=5,
    n_gram_size=N_GRAM_SIZE,
):
    index = [Counter(doc_ngrams) for doc_ngrams in documents_ngrams]
    idf = calculate_idf(documents_ngrams)
    query = query_to_ngrams(query, n_gram_size)

    match_scores = []
    for ngram_counts in tqdm(index):
        score = 0
        total_ngrams = sum(ngram_counts.values())
        if total_ngrams == 0:
            match_scores.append(0)  # keep scores aligned with document indexes
            continue
        for query_ngram in query:
            tf_score = ngram_counts.get(query_ngram, 0)/total_ngrams
            idf_score = idf.get(query_ngram, 1e-3)
            score += tf_score * idf_score
        match_scores.append(score)

    idx_scores = zip(range(len(documents_ngrams)), match_scores)
    idx_scores = sorted(idx_scores, key=lambda pair: -pair[1])

    return idx_scores[:limit]

dummy_documents = [
    'smartphone',
    'frying pan',
    'headphones for your smartphone',
]
dummy_documents_ngrams = documents_to_index(dummy_documents)
idx_scores = search_tf_idf(dummy_documents_ngrams, 'smratphone')
display_search_results(dummy_documents, idx_scores)
```
Output:
0.14: smartphone
0.11: headphones for your smartphone
0.00: frying pan
```

Search over the full collection works better, but it is still not perfect:

idx_scores = search_tf_idf(documents_ngrams, 'health smartwatch')
display_search_results(documents, idx_scores)
```
Output:
0.96: A History of American Literature Review "Richard Gray's real achievement is somehow to have compress
0.88: Samsung Gear Sport Smartwatch (Black) Colour:Black   Sports watch design in premium stainless steel 
0.85: American Pastoral Amazon.com Review Philip Roth's 22nd book takes a life-long view of the American e
0.53: M3 Intelligence Bluetooth Health Wrist Smart Band Watch Monitor/Smart Bracelet/Health Bracelet/Activ
0.52: Textbook of Elements of Veterinary Public Health Veterinary Public Health is a multidisciplinary fie
```

BM25+

In fact, BM25+ is a special case of TF-IDF with a specific choice of TF and IDF. That's why it took us so long to get to it.

Its idea is to construct weights that are not so strongly influenced by document length.

It also has a grounding in information theory: the IDF in BM25+ is essentially Shannon self-information, that is, a measure of how much information a given N-gram carries.

Additionally, BM25+ works best on words rather than N-grams.

Formulas (scary):

TF(q_i, D) = \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})} + \delta

IDF(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)

score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \cdot TF(q_i, D)

Here f(q_i, D) is the number of occurrences of q_i in document D, |D| is the length of the document, avgdl is the average document length in the collection, N is the total number of documents, n(q_i) is the number of documents containing q_i, and k_1, b, \delta are free parameters.

In essence, TF is the weight of a word in a document and IDF is the information content of that word. Documents in which rare words take up a large share receive the highest relevance.

The free parameters control the behavior of the ranking function. They can be left at their defaults or tuned for the task.

How do IDF and TF behave in BM25+?

Let's take a look at TF and IDF to understand what these weights actually do.

IDF:

IDF as a function of the number of documents containing the word.

The information content of rare words is very high. The difference between the information content of a word found in one document and a word found in two is enormous. However, as a word becomes more frequent, saturation sets in: there is virtually no difference between a word contained in 40,000 documents and one contained in 60,000.
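To put numbers on this, here is a small sketch that evaluates the BM25 IDF formula for a hypothetical collection of 100,000 documents (the same formula as the idf_bm25 helper implemented below):

N = 100_000  # hypothetical collection size
for n in [1, 2, 10, 1_000, 40_000, 60_000]:
    idf = np.log((N - n + 0.5) / (n + 0.5) + 1)
    print(f'{n:>6} documents contain the word -> IDF = {idf:.2f}')
# roughly 11.11, 10.60, 9.16, 4.60, 0.92, 0.51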

TF:

TF as a function of the number of occurrences of a word in a document and of the document length.

It can be seen that the more often a word appears in a document, the higher its weight. However, there is a saturation effect: between 10 occurrences and 20 occurrences the difference is small. Also, the weight is higher if the document is short. All this helps reduce the impact of long documents and random coincidences.
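A similar sketch for TF, using the default parameters from the implementation below (k1 = 1.5, b = 0.75, delta = 1) and an assumed average document length of 50 words:

k1, b, delta = 1.5, 0.75, 1
avgdl = 50  # assumed average document length
for doc_len in (25, 200):
    for f in (1, 10, 20):
        tf = f * (k1 + 1) / (f + k1 * (1 - b + b * doc_len / avgdl)) + delta
        print(f'|D| = {doc_len:>3}, occurrences = {f:>2} -> TF = {tf:.2f}')
# the shorter document scores higher, and going from 10 to 20 occurrences adds little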

Implementation of BM25+

All that remains is to implement everything according to the formulas. This time we will do everything beautifully, wrapping the code in a class, and we will not recalculate the index with each request.

def documents_to_words(documents):
    documents_words = tuple(doc.split(' ') for doc in documents)
    return documents_words

def bm25_documents_to_index(documents):
    documents_preprocessed = [
        preprocess_document(doc) for doc in documents
    ]

    documents_words = documents_to_words(documents_preprocessed)
    return documents_words

def bm25_query_to_words(query):
    return bm25_documents_to_index([query])[0]

def idf_bm25(
    number_documents_containing_ngram,
    total_documents,
):
    x = (total_documents - number_documents_containing_ngram + 0.5)/(number_documents_containing_ngram + 0.5)
    return np.log(x + 1)

def tf_bm25(ngram_tf, document_length, average_document_length, k1=1.5, b=0.75, delta=1):
    numerator = ngram_tf*(k1+1)
    denominator = ngram_tf + (k1 * (1 - b + b * document_length/average_document_length))
    return numerator/denominator + delta

def bm25_score(
    ngram_idf,
    ngram_tf,
    document_length,
    average_document_length,
    k1=1.5,
    b=0.75,
):
    return ngram_idf * tf_bm25(ngram_tf, document_length, average_document_length, k1=k1, b=b)


class SearchBM25:
    def __init__(self):
        self.documents = None
        self.documents_ngrams = None
        self.tf = None
        self.idf = None

    def calculate_tf(self, documents_ngrams):
        tf = [Counter(doc_ngrams) for doc_ngrams in documents_ngrams]

        return tf

    def calculate_idf(self, tf, documents_ngrams):
        idf = {}

        documents_containing = {}

        for doc_tf in tqdm(tf):
            for ngram in doc_tf.keys():
                if not ngram in documents_containing:
                    documents_containing[ngram] = 0
                documents_containing[ngram] += 1

        for ngram in tqdm(documents_containing.keys()):
            idf[ngram] = idf_bm25(
                number_documents_containing_ngram=documents_containing[ngram],
                total_documents=len(documents_ngrams),
            )
        return idf

    def fit(
        self,
        documents,
    ):
        self.documents = documents

        self.documents_ngrams = bm25_documents_to_index(
            documents,
        )

        self.tf = self.calculate_tf(self.documents_ngrams)
        self.idf = self.calculate_idf(self.tf, self.documents_ngrams)

    def search_bm25(
        self,
        query,
        limit,
        only_documents=None,
    ):
        avg_document_length = sum([
            len(doc) for doc in self.documents_ngrams
        ])/len(self.documents_ngrams)
        query = bm25_query_to_words(query)
        indexes = []
        match_scores = []

        document_indexes = range(len(self.tf)) if only_documents is None else only_documents
        for i in document_indexes:
            document_tf = self.tf[i]

            document_length = sum(document_tf.values())
            if document_length == 0:
                continue

            score = 0
            for query_ngram in query:
                ngram_score = bm25_score(
                    self.idf.get(query_ngram, 1e-6),
                    document_tf.get(query_ngram, 1e-6),
                    document_length=document_length,
                    average_document_length=avg_document_length,
                )
                score += ngram_score
            match_scores.append(score)
            indexes.append(i)

        idx_scores = zip(indexes, match_scores)
        idx_scores = sorted(idx_scores, key=lambda pair: -pair[1])
        return idx_scores[:limit]


    def search(self, query, limit=5):
        idx_scores = self.search_bm25(
            query,
            limit=limit,
        )
        return idx_scores[:limit]

    def search_and_display(self, query, limit=5, char_limit=100):
        idx_scores = self.search(query, limit=limit)
        display_search_results(self.documents, idx_scores, char_limit=char_limit)


index = SearchBM25()
index.fit(documents)

index.search_and_display('frying')
```
Output:
16.30: Taylor Candy and Deep Fry Stainless Steel Thermometer TAYLOR 5983N Candy/Jelly Deep Fry Thermometer
16.02: Bhavya enterprises Stainless Steel Deep Fat Fryer 8+8 L (Silver) Frying machine used in commercial s
15.96: Hk Villa 2 In 1 Multifuctional Steaming Device egg pan Frying Egg Boiling Roasting Heating Electric 
15.71: HomeFast Multifunction Non-Stick Electric Frying Pan Egg Omelette Pancakes Steak Egg Boiler Electric
15.66: Vishal Smart Mall Multifunction Non-Stick Electric Frying Pan Egg Omelette Pancakes Steak Egg Boiler
```

index.search_and_display('iphone 6s')
```
21.93: Mpow iPhone 6 6s iPhone 7 8 Armband, [Ultra Comfortable] Adjustable Sports Armband Sweatproof Runnin
21.39: MADANYU The Man in Suit for Real Man Designer Printed Hard Back Shell Case for Apple iPhone 6S / App
21.22: POPIO Tempered Glass Screen Protector For iPhone 6 / iPhone 6S / iPhone 7 / iPhone 8 (Pack Of 2) Col
21.21: UNMCORE IPhone 8 Pin Lightning To USB Fast Data Charging Sync Cable Wall Charger With USB Power Adap
21.00: Apple iPhone 6 (Gold, 1GB RAM, 32GB Storage) Colour:Gold   Apple iPhone 6 (Gold, 32GB)
```

Search works well for some queries, but on others it does worse than TF-IDF.

BM25+ is quite sensitive to proper text preprocessing. Another problem is that, in this implementation, it is not typo-tolerant and cannot handle partial queries.

Two-stage search: TF-IDF + BM25+

We can combine the advantages of TF-IDF and BM25+. To do this, we implement a two-stage search: first, TF-IDF on N-grams retrieves a large set of candidate documents, then BM25+ re-ranks these documents and returns the best ones.

from scipy.stats import gmean

class SearchTFIDF:
    def __init__(self, n_gram_size=N_GRAM_SIZE):
        self.n_gram_size = n_gram_size

        self.documents = None
        self.documents_ngrams = None

    def fit(
        self,
        documents,
    ):
        self.documents = documents

        self.documents_ngrams = documents_to_index(
            documents,
            n_gram_size=self.n_gram_size,
        )

    def search(self, query, limit=5):
        idx_scores = search_tf_idf(
            self.documents_ngrams,
            query,
            limit=limit,
            n_gram_size=self.n_gram_size,
        )
        return idx_scores[:limit]

    def search_and_display(self, query, limit=5):
        idx_scores = self.search(query, limit=limit)
        display_search_results(self.documents, idx_scores)
        
class TwoStageSearch:
    def __init__(self, n_gram_size=3):
        self.n_gram_size = n_gram_size
        self.documents = None
        self.tfidf_index = None
        self.bm25_index = None

    def fit(
        self,
        documents,
    ):
        self.documents = documents

        self.tfidf_index = SearchTFIDF(n_gram_size=self.n_gram_size)
        self.tfidf_index.fit(self.documents)

        self.bm25_index = SearchBM25()
        self.bm25_index.fit(self.documents)

    def search(self, query, limit_stage1=100, limit_stage2=5):
        idx_scores_stage1 = self.tfidf_index.search(query, limit=limit_stage1)
        idx_scores_stage1 = [p for p in idx_scores_stage1 if p[1] > 1e-05]
        idx_to_score_stage1 = {
            idx: score for idx, score in idx_scores_stage1
        }
        only_document_indexes = list(idx_to_score_stage1.keys())
        idx_scores_stage2 = self.bm25_index.search_bm25(query, limit=limit_stage2, only_documents=only_document_indexes)

        aggregated_scores = {
            idx: gmean([score, idx_to_score_stage1[idx]]) for idx, score in idx_scores_stage2
        }
        idx_scores = [(idx, idx_to_score_stage1[idx], score, aggregated_scores[idx]) for idx, score in idx_scores_stage2]

        idx_scores = sorted(idx_scores, key=lambda x: (-round(x[-1],3), -round(x[-2],3), -round(x[-3], 3)))

        return idx_scores

    def display_search_results(self, idx_scores, char_limit=100):
        for idx, score_stage1, score_stage2, score_combined in idx_scores:
            print(f'{score_stage1:0.2f}|{score_stage2:0.2f}|{score_combined:0.2f}: {self.documents[idx][:char_limit]}')

    def search_and_display(self, query, limit_stage1=100, limit_stage2=5, char_limit=100):
        idx_scores = self.search(query, limit_stage1=limit_stage1, limit_stage2=limit_stage2)
        self.display_search_results(idx_scores, char_limit=char_limit)

The code may be scary, but the logic is simple:

  1. Build indexes for TF-IDF and BM25+.

  2. When a request arrives, we first run TF-IDF and get the 100 most relevant results.

  3. We launch BM25+ on these results and get the 5 most relevant ones.

  4. To sort the results, we combine the scores from TF-IDF and BM25+ with a geometric mean (unlike the arithmetic mean, it is robust to the scores being on different scales); see the sketch after this list.

  5. We sort the results by the combined score. In case of a tie, we fall back to the BM25+ score and then the TF-IDF score.
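A tiny illustration of why the geometric mean is used here, with made-up scores on the two scales we saw above:

from scipy.stats import gmean
import numpy as np

scores = [0.3, 15.0]     # a TF-IDF-scale score and a BM25+-scale score for one document
print(np.mean(scores))   # 7.65 -- dominated by the larger-scale score
print(gmean(scores))     # ~2.12 -- responds to relative changes in both scores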

Let's try it.

index = TwoStageSearch()
index.fit(documents)
index.search_and_display('iph')

```
Output:
0.12|0.00|0.00: Apple iPhone 8 (Space Grey, 2GB RAM, 64GB Storage)
0.09|0.00|0.00: Natation 3D VR Box Virtual Reality Glasses (VR_Headset) (VR Basic)
0.08|0.00|0.00: Tidy Up! Wire Bin (Black)
0.06|0.00|0.00: Orion Premium Telescope Accessory Kit - 1.25" Orion Orion 08890 1.25-Inch Premium Telescope Accessor
0.04|0.00|0.00: Apple iPhone 6S (Rose Gold, 2GB RAM, 32GB Storage)
```

index.search_and_display('iphone xs 64GB')
```
Output:
0.50|7.29|1.91: Apple iPhone 8 Plus (Space Grey, 64GB) Colour:Space Grey                                            
0.36|7.49|1.64: SKYVIK Beam QI Certified 7.5W & 10W Fast Wireless Charger for iPhone X XS Max XR iPhone 8 & 8 Plus S
0.29|7.40|1.46: Apple iPhone 8 (Space Grey, 2GB RAM, 64GB Storage)
0.19|8.58|1.28: STORETHATSAYS Vivo V9 Compatible Camera Tripod Portable & Foldable Stand with Mobile Clip Holder Com
0.19|8.56|1.28: STORETHATSAYS Google Pixel 2 and 2 XL Compatible Camera Tripod Portable & Foldable Stand with Mobile
```

index.search_and_display('iphone charger')
```
Output:
0.81|15.16|3.50: SKYVIK Beam QI Certified 7.5W & 10W Fast Wireless Charger for iPhone X XS Max XR iPhone 8 & 8 Plus S
0.36|16.79|2.47: Belkin Boost Up Wireless Charging Pad 7.5W - Wireless Charger Optimized for iPhone, Compatible with 
0.31|14.71|2.13: Baseus [Certified] Fast Qi Wireless Charger Leather Pad Stand Baseus Wireless Charger For Iphone X 8
0.31|13.87|2.06: Syncwire Lightning Charger Cable Cord for iPhone 5 - iPhoneX Smartphones, iPad Mini, iPad Air, iPad 
0.26|14.87|1.97: Baseus B® Smart 2 in 1 Wireless Charger for Apple IWatch and Qi Enabled Phone Charger for Apple iPho
```

Our search now handles partial queries, is typo-tolerant, and produces useful results. Of course, it is still far from a real production search.


In this article, we implemented the BM25+ algorithm and used it to re-rank the results of another search algorithm. It is worth noting the limitations of BM25+:

  • It understands nothing about meaning, unlike vector search based on neural network models.

  • As a consequence of the previous point, it knows nothing about synonyms.

  • Without additional optimizations, it requires traversing the entire collection of documents. Neural Approximate Nearest Neighbors search does not suffer from this problem.

If you want to practice, here are some ideas that you can implement:

  • Make it so that you can add a new document to the index without recalculating the whole thing.

  • Speed it up! The implementation above is completely naive; there is huge room for improvement.

  • Add a smarter tokenizer than word splitting. For example, you can look at FastText.

  • Implement the NDCG ranking quality metric and check the quality of the results. Then select the best parameters for BM25+.


If you liked this, you might like other things I write: subscribe to my Telegram channel, where I post useful content about IT, machine learning, and more a bit too often (for my psyche).
