index architecture, search execution, and data replication

This is a translation of my article in my blog about Apache Lucene architectureabout one of the most famous search index implementation libraries. Elasticsearch and Solr, both well-known implementations of scalable search solutions, use this library under the hood. I work on creating search solutions for the e-commerce industry, and I constantly come across this library in my daily work. Apache Lucene implements most of the necessary functionality for building a search engine. Starting with a tokenization process that extracts canonical forms of words as tokens, continuing with a full implementation of an inverted index, and ending with near real-time segment replication. The number of practically useful features implemented over the two decades of the library’s existence is colossal. This library integrates knowledge from linguistics, mathematics and computer science.

Inverted index

Apache Lucene implements an inverted index architecture. At the implementation level, a logical index contains a collection of immutable segments stored as files in the file system. Each segment is itself an inverted index. Such an index is a dictionary data structure with terms as keys and postings as values. A posting is a list of document identifiers and the number of occurrences of a term in a given document. This dictionary uses Finite State Transducers, FST [1] to search for terms, which can be thought of as something similar to sorted lists with gaps [2]. This sorted navigation map is the cornerstone for efficiently searching through huge volumes of documents. Lucene is also very memory efficient. Among other algorithms, it uses difference coding algorithms to compress document identifiers in postings [3]. To put it simply, the idea behind this compression is to sort a list of integers and store the deltas between them. This also improves disk I/O performance.

In addition to the inverted index, Lucene can use a column-oriented store known as docValues ​​for specific fields within a document. It is an important data structure in the query pipeline for grouping and sorting operations. It stores all the values ​​of a particular field, sorted by document IDs [4]. This turns the operation of retrieving field values ​​for a given set of documents into one sequential read operation from disk. Previously, multiple repeating operations of disk reading of document fields and their caching using FieldCache were actively used.

In terms of Java classes, Lucene contains two main ones for accessing index files: IndexWriter And IndexReader. The JavaDoc Lucene documentation is a really good source of information. The IndexWriter class is intended to be used in a single instance of an application that accumulates updates in memory and, under certain conditions, updates, commits, or saves them to the file system. In addition, Writer applies a segment merging rule to control the number of segments. Each commit creates new immutable segment files, marks old ones inactive and deletes documents marked for deletion. Currently the default implementation is the class TieredMergePolicy.

Search

At first glance, implementing a search may seem like a simple task, but with practical implementation, many nuances arise. The request text is converted into a list of tokens using various analyzers. For each token, an index search is performed. Intermediate results are stored in a bitset data structure. Interface Implementations Bits Lucene makes it easier to perform union/intersection operations on search results for multiple tokens [5]. The bitset simplifies caching during the filter application phase. When filtering or sorting is required, this is done by intersecting the bitset with docValues ​​for a given set of document IDs.

Replication

There are two types of replication: document replication and segment replication. For a long time, Elasticsearch and OpenSearch (a fork of Amazon's Elasticsearch) only supported replication of individual documents. Apache Lucene has added support for near real-time segment replication for some time now. [6]. Implementing document replication in Elasticsearch requires the coordinating node to collect responses from all replicas to a write operation, which creates additional CPU load. Yelp, which created the opensource project Nrtsearch, shared an interesting experience in implementing segment replication. [7]. Amazon followed this approach [8] with segment replication in OpenSearch. They reduce the load on the CPU by increasing the load on the network and increasing the load on the main node [9]. Now OpenSearch is going further: to reduce network bandwidth, they are implementing support for network storage. They are also experimenting with peer-to-peer segment replication. [10].

Links

1. Using Finite State Transducers in Lucene – https://dzone.com/articles/using-finite-state-transducers

2. Dissecting Lucene – The Index formathttps://kandepet.com/dissecting-lucene-the-index-format/

3. Lucene Inside Out – Dealing With Integer Encoding and Compression – https://towardsdatascience.com/lucene-inside-out-dealing-with-integer-encoding-and-compression-fe28f9dd265d

4. What is an Apache Lucene Codec? — https://www.elastic.co/blog/what-is-an-apache-lucene-codec

5. Exploring Apache Lucene – Part 2: Search and Ranking – https://j.blaszyk.me/tech-blog/exploring-apache-lucene-search-and-ranking/

6. Lucene's near-real-time segment index replication – https://www.javacodegeeks.com/2017/09/lucenes-near-real-time-segment-index-replication.html

7. Nrtsearch: Yelp's Fast, Scalable and Cost Effective Search Engine – https://engineeringblog.yelp.com/2021/09/nrtsearch-yelps-fast-scalable-and-cost-effective-search-engine.html

8. OpenSearch Segment Replication [RFC] — https://github.com/opensearch-project/OpenSearch/issues/1694

9. First Bite of OpenSearch Segment Replication – https://medium.com/@rockybean/first-bite-of-opensearch-segment-replication-b22d0808e4c9

10. Reduce compute costs and increase throughput with segment replication, generally available in OpenSearch 2.7 — https://opensearch.org/blog/segment-replication/

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *