How KMP and Boyer-Moore Algorithms Improve Search Engines
Knuth-Morris-Pratt (KMP) algorithm
Knuth-Morris-Pratt algorithm was developed by Donald Knuth, James Morris, and Vaughan Pratt in 1977. The algorithm was revolutionary due to its ability to perform search in linear time in the worst case.
The algorithm is designed to find a substring in a string. It solves the problem of naive searchwhich allows multiple comparisons of the same characters by using a pre-computed prefix function. Prefix function for a string lengths defined as an array Where specifies the length of the largest proper prefix of a string which is also a suffix of this string.
The construction of the prefix function is performed as follows:
Initialization: .
For each symbol (beginning with ):
Thus, The prefix function allows you to understand how many characters from the beginning of a string match its end.
An example of constructing a prefix function for the string “ababaca“:
(prefix “a” is the same as suffix “a”)
(prefix “ab” is the same as suffix “ab”)
(the prefix “aba” is the same as the suffix “aba”)
(no matches)
(prefix “a” is the same as suffix “a”)
The search process
The process of searching for a string in line is as follows:
Initialization: index for , index for .
Let's repeat until length :
Thus, the KMP algorithm uses a prefix function to avoid repeated comparisons.
Example
Let's look at an example where we are looking for the substring “ababaca” in line “bacbabababacaca“:
Find the prefix function for “ababaca“:
We apply the search algorithm:
Example in code
Let's implement the algorithm in Python:
def kmp_search(pattern, text):
def compute_prefix_function(pattern):
m = len(pattern)
pi = [0] * m
k = 0
for q in range(1, m):
while k > 0 and pattern[k] != pattern[q]:
k = pi[k - 1]
if pattern[k] == pattern[q]:
k += 1
pi[q] = k
return pi
n = len(text)
m = len(pattern)
pi = compute_prefix_function(pattern)
q = 0
for i in range(n):
while q > 0 and pattern[q] != text[i]:
q = pi[q - 1]
if pattern[q] == text[i]:
q += 1
if q == m:
print(f"Pattern occurs at index {i - m + 1}")
q = pi[q - 1]
# пример использования
text = "ABC ABCDAB ABCDABCDABDE"
pattern = "ABCDABD"
kmp_search(pattern, text)
Function compute_prefix_function(pattern)
function computes the prefix function for a pattern. The prefix function determines, for each index of the pattern, the length of the largest proper prefix that is also a suffix.
pi
– an array containing the values of the prefix function.
k
– the length of the current largest prefix that matches the suffix.
Main function kmp_search(pattern, text)
:
n
– text length, m
– template length.
Calculate the prefix function for a pattern using compute_prefix_function
.
q
– the number of characters that match the beginning of the pattern.
For each character in the text, we check whether it matches the current character in the template. If so, we increment q
. If q
equals the length of the template, then a match has been found, and we print the index of the beginning of the match in the text. Then we reset q
to the value from the prefix function to continue the search.
Let's move on to the next algorithm.
Boyer-Moore algorithm
Boyer-Moore algorithm was developed by Robert Boyer and Jay Moore also in 1977.
Algorithm uses two key heuristics To speed up the search: the bad character rule and the good suffix rule. These heuristics allow the algorithm to skip large parts of the string when comparing.
Bad Symbol Rule
The bad character rule is based on the fact that if a mismatch occurs during the comparison of a substring with a part of a string, the algorithm uses the information about the mismatched character to skip several characters of the string rather than checking them again.
Principle of operation:
A table of bad characters is created for the template . This table contains the positions of each character in the pattern, taking into account its last appearance.
During search, if the character in the string does not match a character in the pattern P, the pattern is shifted so that the mismatched character in the string matches the last occurrence of that character in the pattern.
Example:
Let And .
At the first mismatch (for example, on the symbol in line and in the template, the algorithm looks at the table of bad symbols and sees that missing from the template. The pattern is then shifted so that the next character in the line matches the last character in the pattern.
Rule of a good suffix
The good suffix rule works based on matches at the end of a pattern, called suffixes. When a mismatch occurs, the algorithm uses information about the matching suffixes to skip characters.
Principle of operation:
A table of suffixes for the template is created . This table contains information about the positions of suffixes in the pattern that can be used for offsetting.
If part of the pattern matches part of the string, but then a mismatch occurs, the algorithm shifts the pattern so that the next matching suffix aligns with the corresponding part of the string.
Example:
Let And .
When a mismatch occurs on a symbol in line and in the pattern, the algorithm looks at the suffix table and determines that the next possible suffix matches the position in the string, which allows the pattern to be shifted several positions to the right.
Example of use
Consider the string T and the pattern P:
Step 1: Align the template with the beginning of the line.
Step 2: Compare characters from the end of the template.
Step 3: If there is a mismatch, we apply the bad symbol rule or the good suffix rule.
Step 4: We shift the template and repeat the process.
Example:
Let And .
Initialization:
Search:
Compare the string and the template.
If there is a mismatch, we find the appropriate offset using the tables.
We shift the template and continue the comparison.
The Boyer-Moore algorithm has medium complexity. for practical cases where — the length of the line, and — template length. This algorithm is very good when working with large strings and complex patterns.
Code example:
We implement it similarly to the previous algorithm in Python:
def boyer_moore_search(pattern, text):
def bad_character_table(pattern):
table = {}
m = len(pattern)
for i in range(m - 1):
table[pattern[i]] = i
return table
def good_suffix_table(pattern):
m = len(pattern)
table = [0] * m
last_prefix_position = m
for i in range(m - 1, -1, -1):
if is_prefix(pattern, i + 1):
last_prefix_position = i + 1
table[m - 1 - i] = last_prefix_position - i + m - 1
for i in range(m - 1):
slen = suffix_length(pattern, i)
table[slen] = m - 1 - i + slen
return table
def is_prefix(pattern, p):
m = len(pattern)
for i in range(p, m):
if pattern[i] != pattern[i - p]:
return False
return True
def suffix_length(pattern, p):
m = len(pattern)
length = 0
i = p
j = m - 1
while i >= 0 and pattern[i] == pattern[j]:
length += 1
i -= 1
j -= 1
return length
n = len(text)
m = len(pattern)
if m == 0:
return []
bad_char_table = bad_character_table(pattern)
good_suffix_table = good_suffix_table(pattern)
s = 0
while s <= n - m:
j = m - 1
while j >= 0 and pattern[j] == text[s + j]:
j -= 1
if j < 0:
print(f"Pattern occurs at index {s}")
s += good_suffix_table[0]
else:
bad_char_shift = j - bad_char_table.get(text[s + j], -1)
good_suffix_shift = good_suffix_table[j]
s += max(bad_char_shift, good_suffix_shift)
# пример
text = "HERE IS A SIMPLE EXAMPLE"
pattern = "EXAMPLE"
boyer_moore_search(pattern, text)
Function bad_character_table(pattern)
creates a table of bad characters for a pattern. The table contains the index of the last occurrence of each character in the pattern, except for the last character.
Function good_suffix_table(pattern)
creates a table of good suffixes for the pattern. The table contains offsets that are defined for each pattern suffix.
Secondary functions is_prefix(pattern, p)
And suffix_length(pattern, p)
:
is_prefix
checks if a substring is from position p
to the end of the template by prefixing the entire template.
suffix_length
determines the length of the largest pattern suffix that ends at position p
.
Main function boyer_moore_search(pattern, text)
calculates tables of bad characters and good suffixes for a pattern. Then it searches for the pattern in a string, using both tables to efficiently skip characters when there are no matches. And finally it outputs the indices of the occurrences of the pattern in the string.
As you can see, these algorithms are very powerful and each offers unique approaches and advantages.
KMP uses a prefix function to control the bias on mismatches, achieving linear worst-case complexity.
The Boyer-Moore algorithm, however, uses two key heuristics: the bad character rule and the good suffix rule, which significantly speed up the search process by skipping many characters.
The choice between KMP and Boyer-Moore depends on the specific use case and performance requirements.
You can develop algorithmic thinking and learn to increase the productivity of programs on the online course “Algorithms and Data Structures” in Otus.