How KMP and Boyer-Moore Algorithms Improve Search Engines

Knuth-Morris-Pratt (KMP) algorithm

Knuth-Morris-Pratt algorithm was developed by Donald Knuth, James Morris, and Vaughan Pratt in 1977. The algorithm was revolutionary due to its ability to perform search in linear time in the worst case.

The algorithm is designed to find a substring in a string. It solves the problem of naive searchwhich allows multiple comparisons of the same characters by using a pre-computed prefix function. Prefix function for a string P lengths m defined as an array \piWhere \pi[i] specifies the length of the largest proper prefix of a string P[0..i]which is also a suffix of this string.

The construction of the prefix function is performed as follows:

  1. Initialization: \pi[0] = 0.

  2. For each symbol P[i] (beginning with i = 1):

Thus, The prefix function allows you to understand how many characters from the beginning of a string match its end.

An example of constructing a prefix function for the string “ababaca“:

  • P[0] = a : \pi[0] = 0

  • P[1] = b : \pi[1] = 0

  • P[2] = a : \pi[2] = 1(prefix “a” is the same as suffix “a”)

  • P[3] = b : \pi[3] = 2 (prefix “ab” is the same as suffix “ab”)

  • P[4] = a : \pi[4] = 3 (the prefix “aba” is the same as the suffix “aba”)

  • P[5] = c : \pi[5] = 0 (no matches)

  • P[6] = a : \pi[6] = 1 (prefix “a” is the same as suffix “a”)

The search process

The process of searching for a string P in line T is as follows:

  1. Initialization: i = 0index for T,  j = 0 index for P.

  2. Let's repeat until i < n length T:

Thus, the KMP algorithm uses a prefix function to avoid repeated comparisons.

Example

Let's look at an example where we are looking for the substring “ababaca” in line “bacbabababacaca“:

  1. Find the prefix function for “ababaca“: 0, 0, 1, 2, 3, 0, 1.

  2. We apply the search algorithm:

Example in code

Let's implement the algorithm in Python:

def kmp_search(pattern, text):
    def compute_prefix_function(pattern):
        m = len(pattern)
        pi = [0] * m
        k = 0

        for q in range(1, m):
            while k > 0 and pattern[k] != pattern[q]:
                k = pi[k - 1]
            if pattern[k] == pattern[q]:
                k += 1
            pi[q] = k

        return pi

    n = len(text)
    m = len(pattern)
    pi = compute_prefix_function(pattern)
    q = 0

    for i in range(n):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q - 1]
        if pattern[q] == text[i]:
            q += 1
        if q == m:
            print(f"Pattern occurs at index {i - m + 1}")
            q = pi[q - 1]

# пример использования
text = "ABC ABCDAB ABCDABCDABDE"
pattern = "ABCDABD"
kmp_search(pattern, text)

Function compute_prefix_function(pattern) function computes the prefix function for a pattern. The prefix function determines, for each index of the pattern, the length of the largest proper prefix that is also a suffix.

pi – an array containing the values ​​of the prefix function.

k – the length of the current largest prefix that matches the suffix.

Main function kmp_search(pattern, text):

n – text length, m – template length.

Calculate the prefix function for a pattern using compute_prefix_function.

q – the number of characters that match the beginning of the pattern.

For each character in the text, we check whether it matches the current character in the template. If so, we increment q. If q equals the length of the template, then a match has been found, and we print the index of the beginning of the match in the text. Then we reset q to the value from the prefix function to continue the search.

Let's move on to the next algorithm.

Boyer-Moore algorithm

Boyer-Moore algorithm was developed by Robert Boyer and Jay Moore also in 1977.

Algorithm uses two key heuristics To speed up the search: the bad character rule and the good suffix rule. These heuristics allow the algorithm to skip large parts of the string when comparing.

Bad Symbol Rule

The bad character rule is based on the fact that if a mismatch occurs during the comparison of a substring with a part of a string, the algorithm uses the information about the mismatched character to skip several characters of the string rather than checking them again.

Principle of operation:

  1. A table of bad characters is created for the template P. This table contains the positions of each character in the pattern, taking into account its last appearance.

  2. During search, if the character in the string T does not match a character in the pattern P, the pattern is shifted so that the mismatched character in the string matches the last occurrence of that character in the pattern.

Example:

Let P = "EXAMPLE" And T = "HERE IS A SIMPLE EXAMPLE".

At the first mismatch (for example, on the symbol I in line and Ein the template, the algorithm looks at the table of bad symbols and sees that I missing from the template. The pattern is then shifted so that the next character in the line matches the last character in the pattern.

Rule of a good suffix

The good suffix rule works based on matches at the end of a pattern, called suffixes. When a mismatch occurs, the algorithm uses information about the matching suffixes to skip characters.

Principle of operation:

  1. A table of suffixes for the template is created P. This table contains information about the positions of suffixes in the pattern that can be used for offsetting.

  2. If part of the pattern matches part of the string, but then a mismatch occurs, the algorithm shifts the pattern so that the next matching suffix aligns with the corresponding part of the string.

Example:

Let P = "ABCDABD" And T = "ABC ABCDAB ABCDABCDABDE".

When a mismatch occurs on a symbol E in line and D in the pattern, the algorithm looks at the suffix table and determines that the next possible suffix matches the position in the string, which allows the pattern to be shifted several positions to the right.

Example of use

Consider the string T and the pattern P:

  • Step 1: Align the template with the beginning of the line.

  • Step 2: Compare characters from the end of the template.

  • Step 3: If there is a mismatch, we apply the bad symbol rule or the good suffix rule.

  • Step 4: We shift the template and repeat the process.

Example:

Let P = "ABCDABD" And T = "ABC ABCDAB ABCDABCDABDE".

  1. Initialization:

  2. Search:

    • Compare the string and the template.

    • If there is a mismatch, we find the appropriate offset using the tables.

    • We shift the template and continue the comparison.

The Boyer-Moore algorithm has medium complexity. O(n/m) for practical cases where n — the length of the line, and m — template length. This algorithm is very good when working with large strings and complex patterns.

Code example:

We implement it similarly to the previous algorithm in Python:

def boyer_moore_search(pattern, text):
    def bad_character_table(pattern):
        table = {}
        m = len(pattern)
        for i in range(m - 1):
            table[pattern[i]] = i
        return table

    def good_suffix_table(pattern):
        m = len(pattern)
        table = [0] * m
        last_prefix_position = m

        for i in range(m - 1, -1, -1):
            if is_prefix(pattern, i + 1):
                last_prefix_position = i + 1
            table[m - 1 - i] = last_prefix_position - i + m - 1

        for i in range(m - 1):
            slen = suffix_length(pattern, i)
            table[slen] = m - 1 - i + slen

        return table

    def is_prefix(pattern, p):
        m = len(pattern)
        for i in range(p, m):
            if pattern[i] != pattern[i - p]:
                return False
        return True

    def suffix_length(pattern, p):
        m = len(pattern)
        length = 0
        i = p
        j = m - 1
        while i >= 0 and pattern[i] == pattern[j]:
            length += 1
            i -= 1
            j -= 1
        return length

    n = len(text)
    m = len(pattern)
    if m == 0:
        return []

    bad_char_table = bad_character_table(pattern)
    good_suffix_table = good_suffix_table(pattern)

    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            print(f"Pattern occurs at index {s}")
            s += good_suffix_table[0]
        else:
            bad_char_shift = j - bad_char_table.get(text[s + j], -1)
            good_suffix_shift = good_suffix_table[j]
            s += max(bad_char_shift, good_suffix_shift)

# пример
text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"
boyer_moore_search(pattern, text)

Function bad_character_table(pattern) creates a table of bad characters for a pattern. The table contains the index of the last occurrence of each character in the pattern, except for the last character.

Function good_suffix_table(pattern) creates a table of good suffixes for the pattern. The table contains offsets that are defined for each pattern suffix.

Secondary functions is_prefix(pattern, p) And suffix_length(pattern, p):

is_prefix checks if a substring is from position p to the end of the template by prefixing the entire template.

suffix_length determines the length of the largest pattern suffix that ends at position p.

Main function boyer_moore_search(pattern, text)calculates tables of bad characters and good suffixes for a pattern. Then it searches for the pattern in a string, using both tables to efficiently skip characters when there are no matches. And finally it outputs the indices of the occurrences of the pattern in the string.


As you can see, these algorithms are very powerful and each offers unique approaches and advantages.

KMP uses a prefix function to control the bias on mismatches, achieving linear worst-case complexity.

The Boyer-Moore algorithm, however, uses two key heuristics: the bad character rule and the good suffix rule, which significantly speed up the search process by skipping many characters.

The choice between KMP and Boyer-Moore depends on the specific use case and performance requirements.

You can develop algorithmic thinking and learn to increase the productivity of programs on the online course “Algorithms and Data Structures” in Otus.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *