ElasticSearch — Search for a sequence in text

Hi! This is Arkady from T-Bank, we are still doing TQM, and in this article I will show how we solved the problem of searching for sequences in the text of communications. This works both on simple chains of phrases in order, and on complex cases – with the time of the phrase, the channel “client – operator”. We are still working with ElasticSearch, leaving the possibility to “wind up” such things as RAG, LLM and other fashionable technologies on the text search.

Some limitations for today's task:

  • Non-linear increase in query complexity with increasing number of phrases. Therefore, our limit is 4.

  • We chose a timing step of 5 seconds. After each phrase, we put a timestamp or several timestamps if the phrase took more than 5 seconds. If you make the step too small, it will allow you to search more accurately, but will litter our field with timestamps. It seems that this is the moment when it is better to agree on the requirements in advance.

And now to the most interesting part. Welcome under the cut!

Finding a solution

In the last article we created an indexlearned to search by it, caught and fixed several problems. This time the managers brought a more difficult task. A typical search scenario looks like this: we have dialogues where the operator says “Hello”, the client replies “Hello”, “Hello” or any other greeting. Find me all the texts where the operator forgot to introduce himself. We need to find the operator's name, company, department and other similar information. We are not talking about a simple search for the phrase “hello”, in this case we are looking for several phrases at the beginning of the dialogue, said in a certain sequence. A phrase from the client may be inserted between them, they themselves may be broken into several lines, but we must find this sequence or say that in this call the operator forgot to fully introduce himself.

The task is decomposed into a query of the following type: Phrase 1 + Channel + Time, no more than → Phrase 2 + Channel + Time, no more than → ! Phrase 3 (the operator introduces himself).

There are several options for solving the problem.

The straightforward solution is to write a script. The command will look like this:

{

	"script: "(doc['phrases'][0] == 'message1' && doc['phrases'][1] == 'message2') || ... "

}

or

{

	"script: "for (item in doc['phrases']) { if (item == 'message1') { ... } }"

}

The script can go through all possible options.

Advantages of the solution:

  • no restrictions, you can write anything you want into the script;

  • no index rework required;

  • easy to implement.

Disadvantages of the solution:

  • Slow, because the query is executed sequentially for each document, has a nested loop, quadratic complexity. With our servers, this means that if there are more than 100,000 documents, the solution will not work.

  • The script will not work search by word forms. This can be solved by storing a field with initial forms of words and converting the query to this form, but this kills almost all the advantages

A less straightforward solution is to search through a continuous array of text using spans. Span allows you to build queries at a lower level, fully controlling the number, sequence and other parameters of fragment inclusion.

We use intervals and span_containing. The intervals query will help return documents taking into account the order of the matching search subqueries. In this case, the query array looks something like this:

"all_of" : { // ищем все совпадения ( any_of )

	"ordered" : true, // значит, что порядок нам важен

	“intervals”: [ // список интервалов

	{ 

		“match” : {}

	}

	….

	]

}

Let's try to transform our document or communication into the required form. Initially, the document looks like this:

{

              "message_source_type" : 1,

              "message" : "Здравствуйте"

            },

            {

              "message_source_type" : 2,

              "message" : "Здравствуйте"

            },

            .....

            {

              "message_source_type" : 1,

              "message" : "До свидания"

            },

We store the entire document as marked text, tagging the text with a channel, such as client or operator, and any other tags such as “negative phrase”, smiley, and so on. It turns out something like:

"sequential_data" : "

_s _1s _dd64f641bf052479288baecd291ec329c _d066d2f79cc114eb9b0f954221d18c558 _db132abccc3774e169c5aad6de4c372d2 _d20df59fe639547e5af5e339286a5dc73 алло _5 _1e

"

This solution is more complex, but works on a large amount of data. There are also disadvantages: there is no direct ability to search with permutations (intervals does not understand slop). Another problem is that for intervals there is an option to set max_gaps, which works A little differently. And it is very difficult to explain to the customer why in one case we find a phrase, but in another – not. This problem arises very rarely, so so far there have been no questions.

Since we can have a single data stream that can take up 20 TB in total, the ability to quickly work on large amounts of data is our main advantage.

First, let's create a new field where we will store solid markup:

 "sequential_data": {

    "type": "text",

    "fields": {

      "exact": {

        "type": "text",

        

      }

    },

    "analyzer": //тут наши кастомные аналайзеры, работу с которыми описали в другой статье

  }

Now let's come up with a markup, make tags depending on the channels:

_s _e — start/end of the document;

_1s _1e — channel number one, for example, the client channel;

_2s 2e — channel number two, for example the operator channel;

t5 5, 10, 15 and until the end of the conversation – time marks, we write them in the index.

The result is a field that stores text of the following type:

"_source" : {

.....

"sequential_data" : "

_s //старт документа

_1s // старт фразы канала 1 (клиент)

алло 

_1e // конец фразы канала 1

_2s здравствуйте аркадий аркадьевич _2e 

_1s алло вы куда звоните там девушка _1e 

_2s меня зовут достоевский федор михайлович отдел премий Т-банка вам 
знаком сидоров михаил михайлович? _2s 

_1s знаком _1e 

_2s спасибо что уделили время всего доброго до свидания _2e 

_e

"

....

}

The hardest part is behind us, now we can start the search itself.

Implementation of search

The simplest query in our task will look like this:

{

  "must": [

    {

      "intervals": {

        "sequential_data": {

          "all_of": {

            "intervals": [

              {

                "any_of": {

                  "intervals": [

                    {

                      "match": {

                        "max_gaps": 2,

                        "query": "меня зовут" — фраза два (порядок обратный)

                      }

                    }

                  ],

                  "filter": {

                    "contained_by": {

                      "match": {

                        "ordered": true,

                        "query": "_1s _1e" — фраза обернута в теги начала и окончания для канала 1

                      }

                    }

                  }

                }

              }

            ],

            "filter": {

              "after": {

                "all_of": {

                  "intervals": [

                    {

                      "any_of": {

                        "intervals": [

                          {

                            "match": {

                              "max_gaps": 2,

                              "query": "здравствуйте" — первая фраза, которую мы хотим найти

                            }

                          }

                        ],

                        "filter": {

                          "contained_by": { 

                            "match": {

                              "ordered": true,

                              "query": "_1s _1e" — фраза обернута в теги начала и окончания для канала 1

                            }

                          }

                        }

                      }

                    }

                  ]

                }

              }

            }

          }

        }

      }

    }

  ]

}

A more complex case, when we are looking for an operator who has not introduced himself:

  "must": [

    {

      "bool": {

        "must": [

          {

            "intervals": {

              "sequential_data": {

                "all_of": {

                  "intervals": [

                    {

                      "any_of": {

                        "intervals": [

                          {

                            "match": {

                              "max_gaps": 2,

                              "query": "здравствуйте"

                            }

                          }

                        ],

                        "filter": {

                          "contained_by": {

                            "match": {

                              "ordered": true,

                              "query": "_2s _2e"

                            }

                          }

                        }

                      }

                    }

                  ]

                }

              }

            }

          }

        ],

        "must_not": [

          {

            "intervals": {

              "sequential_data": {

                "all_of": {

                  "intervals": [

                    {

                      "any_of": {

                        "intervals": [

                          {

                            "match": {

                              "max_gaps": 2,

                              "query": "меня зовут"

                            }

                          }

                        ],

                        "filter": {

                          "contained_by": {

                            "match": {

                              "ordered": true,

                              "query": "_2s _2e"

                            }

                          }

                        }

                      }

                    }

                  ],

                  "filter": {

                    "after": {

                      "all_of": {

                        "intervals": [

                          {

                            "any_of": {

                              "intervals": [

                                {

                                  "match": {

                                    "max_gaps": 2,

                                    "query": "здравствуйте"

                                  }

                                }

                              ],

                              "filter": {

                                "contained_by": {

                                  "match": {

                                    "ordered": true,

                                    "query": "_2s _2e"

                                  }

                                }

                              }

                            }

                          }

                        ]

                      }

                    }

                  }

                }

              }

            }

          }

        ]

      }

    }

In a more complex case, you need to connect two conditions:

  • We are looking for all calls in which the operator was involved: hello.

  • We are looking for all calls where there was no chain “operator: hello” → “operator: my name is”. This is actually a mind-blowing idea that we should create a condition by which we search for a sequence, and then wrap this condition in a must_not operator. It takes some getting used to.

Let's add some coolness to our search by searching for a sequence over a time interval. In this case, we use the timestamps that we added to the text earlier. For us, 5 seconds accuracy is enough, but you can make them arbitrary.

For example: find me all the texts where the operator forgot to introduce himself (operator name, company, department, etc.) within 10 seconds.

If in words, we are looking for Phrase (the operator introduces himself) → timestamp. In the query, we want to find “my name is” before _5 _5 timestamps:

{

  "query": {

    "bool": {

      "must": [

        {

          "bool": {

            "must_not": [

              {

                "intervals": {

                  "sequential_data": {

                    "all_of": {

                      "intervals": [

                        {

                          "any_of": {

                            "intervals": [

                              {

                                "match": {

                                  "max_gaps": 2,

                                  "query": "меня зовут"

                                }

                              }

                            ],

                            "filter": {

                              "contained_by": {

                                "match": {

                                  "ordered": true,

                                  "query": "_2s _2e"

                                }

                              }

                            }

                          }

                        }

                      ],

                      "filter": {

                        "contained_by": {

                          "any_of": { // ищем любое совпадение, или метку времени (10 секунд), или завершение диалога без меток времени перед ним.

                            "intervals": [

                              {

                                "match": {

                                  "ordered": true,

                                  "query": "_s _5 _5"

                                }

                              },

                              { //это условие на случай, если разговор слишком быстро закончится

                                "match": {

                                  "ordered": true,

                                  "query": "_s _e",

                                  "filter": {

                                    "not_containing": {

                                      "match": {

                                        "ordered": true,

                                        "query": "_5 _5"

                                      }

                                    }

                                  }

                                }

                              }

                            ]

                          }

                        }

                      }

                    }

                  }

                }

              }

            ]

          }

        

      ]

    }

  }

} 

With the help of such constructions, you can search for missing phrases, correct or incorrect farewells, and much more. We actively use markup by models and can use them in sequential search. We made another markup format for them and mark up the body of communication with their help.

For dessert, let's look at how to solve a case with a repeated phrase. For example, a person writes “credit” three or more times. How can I find all the chats with this desperate call? Judging by stackoverflow, the problem is relevant.

What's new in this case is we use after — indicating the order of requests. The request will look like this:

{

  "must": [

    {

      "bool": {

        "must": [

          {

            "intervals": {

              "sequential_data": {

                "all_of": {

                  "intervals": [

                    {

                      "any_of": {

                        "intervals": [

                          {

                            "match": {

                              "max_gaps": 2,

                              "query": "кредит"

                            }

                          }

                        ],

                        "filter": {

                          "contained_by": {

                            "match": {

                              "ordered": true,

                              "query": "_2s _2e"

                            }

                          }

                        }

                      }

                    }

                  ],

                  "filter": {

                    "after": {

                      "all_of": {

                        "intervals": [

                          {

                            "any_of": {

                              "intervals": [

                                {

                                  "match": {

                                    "max_gaps": 2,

                                    "query": "кредит"

                                  }

                                }

                              ],

                              "filter": {

                                "contained_by": {

                                  "match": {

                                    "ordered": true,

                                    "query": "_2s _2e"

                                  }

                                }

                              }

                            }

                          }

                        ],

                        "filter": {

                          "after": {

                            "all_of": {

                              "intervals": [

                                {

                                  "any_of": {

                                    "intervals": [

                                      {

                                        "match": {

                                          "max_gaps": 2,

                                          "query": "кредит"

                                        }

                                      }

                                    ],

                                    "filter": {

                                      "contained_by": {

                                        "match": {

                                          "ordered": true,

                                          "query": "_2s _2e"

                                        }

                                      }

                                    }

                                  }

                                }

                              ],

                              "filter": {

                                "after": {

                                  "all_of": {

                                    "intervals": [

                                      {

                                        "any_of": {

                                          "intervals": [

                                            {

                                              "match": {

                                                "max_gaps": 2,

                                                "query": "кредит"

                                              }

                                            }

                                          ],

                                          "filter": {

                                            "contained_by": {

                                              "match": {

                                                "ordered": true,

                                                "query": "_2s _2e"

                                              }

                                            }

                                          }

                                        }

                                      }

                                    ]

                                  }

                                }

                              }

                            }

                          }

                        }

                      }

                    }

                  }

                }

              }

            }

          }

        ]

      }

    }

  ]

}

Conclusion

Elasticsearch is quite suitable for implementing sequential search tasks. Using markup, you can search for cases in dialogues, sequence in a contract, the presence or absence of clauses and other documents where the structure is important.

Everything described works on Opensearch, which is relevant due to license changes. And if you have questions or want to share your experience, I'm waiting in the comments!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *