Setting Up and Customizing the Scoring Model

There's no magic “Make Search Perfect” button, but there are ways to control how the model scores documents.

BM25

First, let's refresh the basics. BM25 is an improved version of the classic TF-IDF model. BM25 is considered smarter because it takes into account not only the frequency of a term in a document, but also the document's length. This allows the algorithm to handle both long and short documents better.

The main parameters of BM25 are k1 and b. Their settings affect how the algorithm evaluates the relevance of a document relative to a search query.

Parameters k1 and b

Parameter k1: the intensity of growth of the term's influence

This parameter controls how much the document's score will increase when new occurrences of the term are added to it. The default value of k1 is 1.2, which is a compromise option for most cases.

For example, in a document with many repetitions of a term, at k1=1.2 each new repetition still affects the final _score, but not linearly. As k1 increases, the term's influence on the result grows more aggressively. If k1 equals 0, term frequency will not be taken into account at all.

Example of setting up an index with a modified k1:

PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.8,  // increase k1 for more aggressive weighting of term frequency
          "b": 0.75   // leave b at its default
        }
      }
    }
  }
}

When to increase k1:

If documents contain terms that occur frequently and you want to enhance their impact on the result, or when working with long documents, where it is important to take into account each additional repetition of a term.

When to reduce k1:

If the documents are short and the keywords are rare. In this case, a low k1 value makes each occurrence more significant.
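To make the effect of k1 concrete, here is a minimal Python sketch of BM25's term-frequency saturation component. It is a simplification, not Lucene's full implementation: the IDF factor is left out, and `dl`/`avgdl` stand for the document length and the average document length.

```python
def bm25_tf(tf, k1=1.2, b=0.75, dl=100, avgdl=100):
    """Simplified BM25 term-frequency component (IDF left out)."""
    norm = 1 - b + b * (dl / avgdl)            # length normalization
    return (tf * (k1 + 1)) / (tf + k1 * norm)

# With the default k1 = 1.2, each extra occurrence adds less and less:
print(round(bm25_tf(1), 3), round(bm25_tf(5), 3), round(bm25_tf(20), 3))  # 1.0 1.774 2.075
# With k1 = 0, frequency beyond mere presence is ignored:
print(bm25_tf(1, k1=0), bm25_tf(20, k1=0))  # 1.0 1.0
```

Notice the saturation: going from 5 to 20 occurrences adds far less than going from 1 to 5 — exactly the non-linear behavior described above.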

Parameter b: normalization by document length

The parameter b controls how much the length of a document is taken into account when assessing its relevance. The default value is 0.75: document length influences the result, but not too aggressively.

Example:

PUT /my-index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.5   // reduce the influence of document length
        }
      }
    }
  }
}

When to increase b:

If there are a lot of long documents, and you want them not to “outweigh” short documents with relevant terms.

When document length is important for search accuracy (e.g. long texts may contain a lot of “noise”).

When to reduce b:

If the documents are approximately the same length, it is better to reduce the influence of length and focus on other factors (term frequency, popularity).
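The role of b is easiest to see in the normalization factor itself. A small illustrative sketch (simplified, with `dl` the document length and `avgdl` the average length across the index):

```python
def length_norm(dl, avgdl, b=0.75):
    """BM25 length-normalization factor: 1.0 for an average-length document,
    above 1.0 (a penalty) for longer-than-average documents."""
    return 1 - b + b * (dl / avgdl)

print(length_norm(200, 100))        # default b=0.75: 1.75, a noticeable penalty
print(length_norm(200, 100, b=0))   # b=0: 1.0, length is ignored entirely
```

This factor sits in the denominator of the BM25 score, so the larger it is, the more a long document is penalized.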

How does all this affect search results?

Now let's see how these parameters affect the result. Let's say there are two documents:

  1. Document 1: “Elasticsearch Elasticsearch Elasticsearch is all about search.”

  2. Document 2: “Elasticsearch is a cool search engine.”

With the default settings (k1=1.2, b=0.75), Document 1 will have a higher _score because it contains more occurrences of the term “Elasticsearch”. But if we decrease k1, term frequency will no longer have such a strong influence on the final result, and Document 2 may appear higher in the search results if it is relevant by other parameters.
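Using the same kind of simplification as before (IDF omitted — it is identical for both documents anyway, since they live in one index), the two toy documents can be scored by hand:

```python
def bm25_score(tf, dl, avgdl, k1=1.2, b=0.75):
    """Simplified BM25 (IDF omitted): tf = term occurrences, dl = doc length in terms."""
    norm = 1 - b + b * (dl / avgdl)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

avgdl = 6.5                                  # average of the two lengths (7 and 6 terms)
doc1 = bm25_score(tf=3, dl=7, avgdl=avgdl)   # "Elasticsearch" appears 3 times
doc2 = bm25_score(tf=1, dl=6, avgdl=avgdl)   # "Elasticsearch" appears once
print(doc1 > doc2)                           # True: frequency wins with default k1
print(bm25_score(3, 7, avgdl, k1=0) == bm25_score(1, 6, avgdl, k1=0))  # True: k1=0 levels them
```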

Example query using the new model:

GET /my-index/_search
{
  "query": {
    "match": {
      "title": "Elasticsearch"
    }
  }
}

BM25 is a powerful tool for improving search relevance in Elasticsearch, but it needs to be configured correctly.

Let's move on to the next point of the article – custom ranking models using Function Score and Painless.

Custom Ranking Models with Function Score and Painless

Ranking is the soul of search. When a user types a query, they expect to see relevant results, not random documents. But what if ordinary ranking based on BM25 or TF-IDF is not enough? For example, you may need to take into account not only the occurrences of a term in a document, but also metrics such as product popularity, article view counts, or the time elapsed since the last update.

Function Score Query and Painless scripts make it possible to modify the _score calculation and create custom models suited to a particular case.

Function Score

Function Score Query is a query that allows you to change the final _score of a document by applying various functions to the search results.

Example of a query based on product popularity:

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "description": "laptop"
        }
      },
      "field_value_factor": {
        "field": "popularity",   // field holding the product's popularity
        "factor": 1.2,           // multiplier
        "modifier": "log1p"      // how the value is transformed
      }
    }
  }
}

This query doesn't just search for products by the word “laptop”; it also factors the popularity field into the _score. We multiply popularity by a factor of 1.2 and apply a logarithmic transformation via log1p to reduce the influence of large values.

When to use Function Score:

When there is a field that reflects the popularity of a document (for example, the number of product sales or article views), and you want it to influence search results.

Painless

Sometimes a function just isn't enough. You need something more to accommodate dynamic changes, more complex calculations, or conditional dependencies. That's where Painless comes in — a scripting language built into Elasticsearch that allows you to embed logic directly into queries.

Let's say we need to take into account time decay (for example, how long the document has been relevant). In this case, a Painless script does the job:

GET /articles/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "Elasticsearch"
        }
      },
      "script_score": {
        "script": {
          "source": "doc['popularity'].value / Math.log(1 + (params.now - doc['publish_date'].value.toInstant().toEpochMilli()))",
          "params": {
            "now": 1726876800000   // 2024-09-21T00:00:00Z as epoch milliseconds
          }
        }
      }
    }
  }
}

Here we combine popularity with time decay: older articles get less weight even if they are popular. The Math.log function helps control the influence of time on the final _score. Note that the date comparison is done in epoch milliseconds, so the now parameter must be passed as a number, not a date string.

Another scenario: a product or article stays relevant only for a while, and its importance should decrease over time. For this there are ready-made decay functions: exp, gauss and linear.

Example with exp function:

GET /articles/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "AI"
        }
      },
      "exp": {
        "publish_date": {
          "origin": "now",
          "scale": "10d",   // the distance over which the score decays
          "offset": "5d",   // for the first 5 days the document loses no relevance
          "decay": 0.5      // at scale distance (10 days past the offset) the score drops to 0.5
        }
      }
    }
  }
}

The exp function applies an exponential decrease to the document's weight depending on the publication date. Thus, new articles will be ranked higher, and older ones will gradually lose their positions.
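The curve itself is easy to reproduce. A sketch of the exp decay multiplier with the parameters from the query above (age measured in days from origin = now):

```python
import math

def exp_decay(age_days, scale=10.0, offset=5.0, decay=0.5):
    """Multiplier applied to _score: 1.0 inside the offset, then exponential decay."""
    distance = max(0.0, age_days - offset)
    return math.exp(math.log(decay) / scale * distance)

print(exp_decay(3))    # 1.0 — within the 5-day offset, no decay yet
print(exp_decay(15))   # ≈ 0.5  — one scale (10 days) past the offset
print(exp_decay(25))   # ≈ 0.25 — two scales past the offset
```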

Combining functions and scripts

The biggest strength of Function Score and Painless Script is the ability to combine them to create complex ranking models.

An example of combining weight by popularity and decay by time:

GET /articles/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "content": "Elasticsearch"
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "popularity",
            "factor": 1.5,
            "modifier": "sqrt"
          }
        },
        {
          "exp": {
            "publish_date": {
              "origin": "now",
              "scale": "30d",
              "offset": "7d",
              "decay": 0.6
            }
          }
        }
      ],
      "score_mode": "multiply",  // combine the functions' results by multiplication
      "boost_mode": "sum"        // add the combined result to the score of the main query
    }
  }
}

Here we combine two factors: popularity (via field_value_factor) and freshness (via exp decay). The results of the two functions are multiplied together (score_mode), and the product is then added to the query's own _score (boost_mode: sum).
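The whole pipeline can be mimicked in a few lines. An illustrative sketch of this combination, using the same parameters as the query above, with article age expressed in days:

```python
import math

def combined_score(query_score, popularity, age_days,
                   factor=1.5, scale=30.0, offset=7.0, decay=0.6):
    """Sketch of the combined query: score_mode=multiply, boost_mode=sum."""
    fvf = math.sqrt(factor * popularity)                      # field_value_factor, modifier=sqrt
    distance = max(0.0, age_days - offset)
    freshness = math.exp(math.log(decay) / scale * distance)  # exp decay on publish_date
    return query_score + fvf * freshness                      # multiply functions, add to query

# A fresh popular article beats an old, equally popular one:
print(combined_score(1.0, popularity=100, age_days=10) >
      combined_score(1.0, popularity=100, age_days=100))  # True
```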

Write queries, customize rankings, and let your Elasticsearch find exactly what you need!

Now we move on to tuning via Field Boosting and multi-field search.

Tuning via Field Boosting and Multi-Field Search

As we know, Elasticsearch uses _score to determine the relevance of a document to a query. The standard ranking model (e.g. BM25) already takes into account term frequency, document length and other factors. But what if you need to tell the search engine: “Field A is more important than field B”? This is where Field Boosting comes in.

Boosting allows you to change the weight of fields or terms, controlling how much a particular field contributes to the overall _score. This can be done at index time or at query time.

Boosting at the query level

Boosting during a query is the easiest way to customize your search for specific needs. For example, if a user is searching for a product, you can make the product name have more weight than its description.

Example of a query with boosting for the title field:

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "laptop",
      "fields": ["title^3", "description^1"]
    }
  }
}

In this query the title field has a boost of 3, and the description field has 1 (the default value). This means that documents in which the term “laptop” appears in title will rank higher than those in which the term appears only in description.

Working with multiple fields

Elasticsearch allows you to search across multiple fields at once. Boosting can be configured for each of these fields separately.

Example with multiple fields:

GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "Elasticsearch",
      "fields": ["title^4", "summary^2", "content^1"]
    }
  }
}

Here the term is searched in three fields at once: title, summary and content. The title has the greatest weight, the short description has medium weight, and the main content has the least. This is a classic scheme for blogs and news sites, where titles and summaries often matter more for search than the article body itself.

How multi_match works:

  1. “best_fields” (the default) – the _score is based on the single best-matching field; other fields can contribute via the tie_breaker parameter.

  2. “most_fields” – the scores of all matching fields are added together, so the more fields contain the term, the higher the _score.

  3. “cross_fields” – the listed fields are treated as one combined field.
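Switching the mode is a one-line change. For example, the query above with per-field scores summed via most_fields (field names follow the article's earlier examples):

```json
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "Elasticsearch",
      "type": "most_fields",
      "fields": ["title^4", "summary^2", "content"]
    }
  }
}
```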

Index Boosting

Although Elasticsearch recommends using query-time boosting, it is also possible to set the weight of fields at the indexing stage. This can be useful if there is a predefined structure where the value of some field should always take priority.

Example of index-time boosting (since version 5.0, query-time boosting is recommended instead):

PUT /my-index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "boost": 2  // increase the weight of title at the indexing stage
      },
      "content": {
        "type": "text"
      }
    }
  }
}

This method is less flexible because it requires reindexing the data when priorities change.

Setting up via Function Score

For those who want to dig deeper, there is Function Score Query. This tool lets you not only boost fields, but also apply more complex logic — for example, you can use field values to change the _score dynamically.

Example of Function Score for boosting product popularity:

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "laptop",
          "fields": ["title^2", "description^1"]
        }
      },
      "field_value_factor": {
        "field": "popularity",
        "factor": 1.5,
        "modifier": "log1p"
      }
    }
  }
}

In this query we take into account not only title and description, but also the product's popularity, which contributes on a logarithmic scale.

Don't overdo it with boosting. Raising a field's weight to absurd values (e.g. title^1000) can produce search results that are too narrow and skewed. Find a balance between relevance and breadth of coverage.


Conclusion

To achieve the best results, it is important not only to set up your algorithms correctly, but also to test them on real data, ensuring that the changes actually improve the search engine. Approach Elasticsearch setup wisely, test at every step, and your search will be not only fast, but also as accurate as possible.

