How we built the index in Elasticsearch


What was the task and what did we want to achieve

Hello! My name is Daniil, and at Just AI we are developing a platform for building all kinds of chatbots. To make that process as simple as possible, and in particular the process of writing a bot script, we have our own DSL.

With it you can describe the behavior of your bot, and with JavaScript you can add any custom logic on top. Bot developers on the platform write these scripts in our web IDE, which supports the DSL.

A bot script can consist of a large number of files, and you need to navigate them and search for the information you are interested in.

A few words about what kind of search we wanted to end up with. Simply put, the same as in any IDE we are all used to: search not only by partial match, but also by regex and by whole-word match, both case-sensitive and case-insensitive.

In fact, exactly what is shown in the image below:

What will be in the article and what will not be

In this article, we will walk through our path of building an index in Elasticsearch. It will be much easier to read if you already have some idea of what kind of beast this is.

We will describe a few concepts of this search engine, but this article is definitely not a tutorial on working with Elasticsearch. It is our experience of designing an index structure for searching in code files.

Why Elasticsearch?

The first question to answer is “why Elasticsearch?”. The answer is rather prosaic. At the time, our team had no experience with search engines, so choosing the most popular one was the reasonable and obvious thing to do. On top of that, our operations team already had experience running Elasticsearch as part of Graylog. Almost everyone has heard of this search engine, it is widely used for log search, there are a huge number of articles and examples about it, and the documentation is quite good. That is how the choice fell on it.

How do we store our files

As described above, what we want to bolt our search onto are files with code. So let's first take a closer look at our source files. We store the data in MongoDB, and each individual file is stored as a separate document in a collection.

Let’s look at a code example to make it more clear:

theme: /

    state: Start
        q!: $regex</start>
        a: Let's start.

    state: Hello
        intent!: /hello
        a: Hello hello

    state: Bye
        intent!: /bye
        a: Bye bye

    state: NoMatch
        event!: noMatch
        a: I do not understand. You said: {{$request.query}}

This file is stored in MongoDB in the following structure:

{
	"fileName": String,
	"content": BinData
}

And a concrete example can be represented as:

{
  "fileName": "main.sc",
  "content": "ewogICAgInByb2plY3QiO...0KICAgIF0KfQ=="
}

Now it is clear what we are searching for and where.
But here the main question of this article arises: do we need to transform the current file structure for the search index, and if so, how?

How to get our data into Elasticsearch

There are various tools for migrating data to Elasticsearch. For example, there is the well-known Logstash, which lets you asynchronously move data from various sources into Elasticsearch according to a given configuration, with options for filtering and transforming the data along the way.

Pros of migrating via Logstash:

  • a well-known product, time-tested in terms of load, stability and latency

  • we only need to write configuration, instead of implementing all the transformation logic ourselves

  • an external component that can be scaled independently

Cons of migration via Logstash for our case:

  • it is not obvious how complex a transformation can realistically be expressed in its configuration

  • moving the transformation logic away from the main logic of searching and working with files makes it harder to reason about how the service works as a whole

The advantages of Logstash, or of any similar tool, are significant, but in our particular case they do not outweigh the disadvantages we would run into trying to use it.

So we decided to perform the transformation in our application code and to control the migration from the application ourselves.

The first version of the index

Given that we need to find the location of a matching line, i.e. its number within the file, we can build the index in two ways. First: each document is one single line, so each document is small and of a known size. Second: each document, just like in MongoDB, is a whole file, but its content is split into lines nested in a list along with their line numbers.

A small digression into the basics to explain why. Elasticsearch is a search engine built on an inverted index. In other words, if the search string matches a document, the engine returns the whole document and nothing more. That is why we need to store additional meta information, so that all the data we need comes back together with the document.

Since in MongoDB we store the entire file in one document, as a first step we decided to store all the data in a single Elasticsearch document as well. We also assumed that this way the whole index would take up less space than if each line were a separate document.

The Elasticsearch schema in this case can be described as follows:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "lines": {
                    "type": "nested",
                    "properties": {
                        "line": {
                            "type": "text",
                            "analyzer": "ngram_analyzer"
                        },
                        "lineNumber": {
                            "type": "integer"
                        }
                    }
                }
            }
        }
    }
}

The most important field here is lines, which has type nested. You can find plenty of articles on the Internet about this data type in Elasticsearch, and most of them say “try not to use this data type” or “do not create large nested fields”. Welp, we broke both rules…

In a specific data example, the index document looks like this:

{
  "fileName": "main.sc",
  "lines": [
    {"line": "require: slotfilling", "lineNumber": 1},
    ...
    {"line": "        a: I do not understand. You said: {{$request.query}}", "lineNumber": 19}
  ]
}
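
For context, a search against this nested structure might look roughly like the sketch below. This is an illustration rather than our exact production query; the inner_hits block asks Elasticsearch to return the specific matching line objects (with their lineNumber) instead of just the whole file document.

GET /files_index/_search
{
  "query": {
    "nested": {
      "path": "lines",
      "query": {
        "match": { "lines.line": "noMatch" }
      },
      "inner_hits": {}
    }
  }
}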

And it worked!

But… there is always a “but”.

Updating documents in the index can be a fairly frequent operation, and since each document in the index turned out to be large, under parallel update requests a large number of them simply failed with the 30-second timeout.

It seems this was exactly the problem those articles had warned about, and it had to be solved somehow.

Making the index smaller

Since the option with a whole file per document did not take off, we had to try the second option, where one document is one line of the source file.

Now our index structure looks like this:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "line": {
                    "type": "text",
                    "analyzer": "ngram_analyzer"
                },
                "lineNumber": {
                    "type": "integer"
                }
            }
        }
    }
}

And the document now looks like this:

{
  "fileName": "main.sc",
  "line": "require: slotfilling",
  "lineNumber": 1
}

As a result, Elasticsearch began to update the index much faster and could parallelize indexing to the maximum, since updating each line now went as its own request, in parallel with the others.
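
For illustration, pushing a file into this scheme boils down to one small document per line, for example via the bulk API. This is only a sketch: the "fileName:lineNumber" document id is our own hypothetical convention here, and in reality all of this is driven from the application code.

POST /files_index/_bulk
{ "index": { "_id": "main.sc:1" } }
{ "fileName": "main.sc", "line": "require: slotfilling", "lineNumber": 1 }
{ "index": { "_id": "main.sc:19" } }
{ "fileName": "main.sc", "line": "        a: I do not understand. You said: {{$request.query}}", "lineNumber": 19 }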

And our initial fear that the index would weigh more did not come true: the size barely changed.

So far we have only described the structure of the index: we figured out how to shape it so that it can, in principle, serve our needs and be updated without errors.

But we have not yet touched on the question of what the search query itself will look like.

Default search

Let’s look at this example:

You have a document in Elasticsearch: “hello world”.

The index has default settings, and we want to find the document by typing “hel” into the search box. A perfectly realistic example of searching for something in an IDE, isn't it?

So in this case, Elasticsearch will not give you what you are looking for.

It is all about things like analyzers and tokenizers. They preprocess both the search string and the indexed data, and if the data ends up stored in a form that does not match your queries, there will be no match and the document you need will not be returned.

By default, the text is split on spaces and special characters, so nothing will be found by an arbitrary substring: for example, “wor” will not find the word “world”, while the whole token “world” in the text “hello world” will.
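
You can check this yourself with the _analyze API, which shows how the default standard analyzer tokenizes the text:

GET /_analyze
{
  "analyzer": "standard",
  "text": "hello world"
}

The response contains exactly two tokens, "hello" and "world", so the query string "hel" has nothing to match against.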

How can we search by part of a word?

This is how we met ngram in Elasticsearch. An article from GitLab gave us confidence that this was exactly what we needed.

An n-gram splits text into overlapping sequences of characters; in Elasticsearch it exists as a tokenizer that you wrap into an analyzer, which can then be specified in the mapping for a field.

Example:

We store the string “hello world” in the index. Let's say the ngram analyzer is configured with min_gram=3 and max_gram=5.

This means that the text is divided into parts of 3, 4 and 5 characters.

hel, ell, llo, lo, ow, …, rld, …, o wor, worl, world

And if the input string matches one of these substrings, the “hello world” document will be returned.

So that is what we did. You may have noticed above that we use the ngram analyzer for the line field.

"line": {
  "type": "text",
  "analyzer": "ngram_analyzer"
}
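
The mapping above references ngram_analyzer, but we have not shown its definition. Here is a minimal sketch of what such an analyzer might look like, assuming min_gram=3 and max_gram=5 as in the example above; note that a gram-length spread larger than 1 also requires raising index.max_ngram_diff, and the lowercase filter is added here purely for illustration.

PUT /files_index
{
  "settings": {
    "index": { "max_ngram_diff": 2 },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}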

But … Here, too, there is a “but”.

This option works well, finds everything you need and finds it quickly. The only problem is that in this way we “inflate” the size of our index very significantly.

We began to look for a solution to this problem and found it.

Wildcard

Wildcard, as the name implies, is the ability to use patterns like hel* in the query, where * matches any sequence of characters.
In this case, the search is much faster than with a regexp, and the required “hello world” document is still returned.

All this works thanks to a special combination of n-gram and regexp-like logic under the hood, which keeps searches fast while reducing the space occupied by the index.

In this case it is not an analyzer but a field type: you declare the field as wildcard in the mapping, and match the search string against it with a wildcard query.

Our final index looks like this:

{
    "files_index": {
        "mappings": {
            "properties": {
                "fileName": {
                    "type": "keyword"
                },
                "line": {
                    "type": "wildcard"
                },
                "lineNumber": {
                    "type": "integer"
                }
            }
        }
    }
}
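
A query against such a field might look roughly like this (a sketch; the case_insensitive flag, available in recent Elasticsearch versions, is what lets the same field serve both case-sensitive and case-insensitive searches):

GET /files_index/_search
{
  "query": {
    "wildcard": {
      "line": {
        "value": "*hel*",
        "case_insensitive": true
      }
    }
  }
}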

After switching from the ngram analyzer to wildcard, the index became approximately 4-5 times smaller!

We did not measure this on all of our production data, but on a dataset generated during research and development, on the order of 1 GB, which was enough to see how the index size changes. We first ran POST /<index>/_forcemerge on the index, and then measured its size with GET /_cat/indices/<index>.

Of course, there is a “but” here as well. Our indexes are fairly small, on the order of 10 GB, so everything works very fast for us: almost all requests take no more than 0.1 s. But it is fair to assume that with really large amounts of data, hundreds of gigabytes or terabytes, this option could be significantly slower than the n-gram approach.

Conclusion

  • In this article, you have gone with us from complete ignorance of what Elasticsearch is and how to solve our problem with it, to the point where the details and nuances are clear and you know what to pay attention to when solving your own specific problem.

  • We hope this article was helpful and saves someone a significant amount of time and effort. We also hope it shows that Elasticsearch is not a black box that will find what you need out of the box, but rather a kind of Lego set for which you need to read the instructions carefully.
