Doctors Knuth, Morris and Pratt, or How I Learned to Stop Worrying and Love the Prefix Function
If you don't know what a string prefix function is, don't know how it's calculated, or most importantly, don't fully understand why the algorithm for calculating the prefix function works in linear time, then this article is for you.
I went through a series of realizations and insights before reaching enlightenment, and now I invite you to walk this path with me.
Step 1. Definition
My first encounter with the prefix function was back in school. I was preparing for programming olympiads, and of course my “gentleman's set” of preparation included the Knuth-Morris-Pratt algorithm, which allows you to find a substring of length in the length line for time.
So, what is a prefix function? Before we answer that question, let's get to know the function . This function maps a string to a number, which is the length of the largest suffix of the string which matches its prefix of the same length, but does not match the entire string .
Without examples it may not be clear right away. Let's consider, for example, the line . This line has prefixes , , , , and suffixes , , , , The intersection of prefixes and suffixes are strings , And . The longest of these lines that is not the original line is the line its length is 3. Accordingly, .
Other examples: , , .
Write a function that calculates You can, for example, like this (C++ language):
int pi(string s) {
int n = s.size();
for (int suffix_len = n - 1; suffix_len > 0; suffix_len--) {
string suffix = s.substr(n - suffix_len, suffix_len);
string prefix_same_len = s.substr(0, suffix_len);
if (suffix == prefix_same_len) {
return suffix_len;
}
}
return 0;
}
We go through all the suffixes of the string in descending order of their length, for each suffix we find a prefix of the same length and compare them. As soon as we find a matching suffix and prefix, we return their length. Such a function will obviously work in but this doesn’t bother us yet, at this stage we just want to figure out the definition.
Now that we have learned and realized what it is it will be much easier to understand what a prefix function is. Let's denote it . So, the prefix function of a string is an array of numbers Where – string prefix to -th character inclusive, .
That is, the prefix function is array of numbers for all string prefixes (the number of prefixes of a string is equal to its length, i.e. the size of the array is equal to the length of the string).
For example"ababa") = [\pi("a")” alt=”P("ababa") = [\pi("a")” src=”https://habrastorage.org/getpro/habr/upload_files/b92/565/c2a/b92565c2a0413b867ff7f26cb28cd278.svg” width=”207″ height=”22″/>, , , , .
Another example: .
Before moving on, make sure you can look at a string and mentally calculate its prefix function.
Step 2. Application
How can a prefix function help find a substring in a string? The idea is pretty nice: let's glue a substring and a string together, inserting a character between them that is guaranteed not to appear in either the substring or the string. Next, let's calculate the prefix function for the resulting glued string. Then, if at some point the value of the prefix function becomes equal to the length of the sought substring, this will mean that we have found an occurrence of the substring in the string.
Example: searching for a substring in line . We assume that the symbol does not occur in the input alphabet. By gluing the strings together, we obtain the string .
Its prefix function is equal to . We see that in the function we received twice which is the length of the desired substring. Therefore, these positions are where the substring ends in the original string.
Let me explain in a little more detail. Let the length of the desired substring be equal to . We found such a prefix of the glued string for which . That is, we found the length suffix matched with the length prefix . Length prefix – this is our desired substring. The length suffix is guaranteed to be completely contained in the original string in which we are searching for occurrences. This is guaranteed to us by the added service character . Thanks to it, the prefixes and suffixes of the glued string do not intersect (except for the string itself). Therefore, the found suffix is an occurrence of the substring in the string.
Well, if we learn to calculate the prefix function in linear time from the length of the string, then we will get an algorithm for finding a substring of length in the length line working for time.
Stage 3. Misunderstanding
While still in school, I learned that the prefix function can be computed in linear time. Without further ado, here is an algorithm that does this:
vector<int> prefix_function(string s) {
int n = s.size();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
int k = pi[i - 1];
while (k > 0 && s[i] != s[k]) {
k = pi[k - 1];
}
if (s[i] == s[k]) {
k++;
}
pi[i] = k;
}
return pi;
}
Here pi[i]
– This in our notations. That is, pi[0] = 0
– quite obvious, because the function from the length string always equal .
The difficulty in understanding what is happening begins when we find ourselves in a cycle:
By what logic does the meaning change?
k
? And why in the endk
turns out to be equalpi[i]
?Why, despite the cycle
while
inside the loopfor
does the algorithm run in linear time?
When I was a young schoolboy, I tried to understand these issues, but I did not succeed, because I was not interested and motivated enough. Only in college did I understand the beauty of strict mathematical proofs and stopped taking unproven statements on faith (if they are not axioms). And at school, I simply believed that this function works, works quickly (after all, tests pass), memorized how to write it, and periodically applied it when solving problems.
When I was a student, I never got around to figuring out the prefix function. It was only when I became a programming teacher, whose course included the Knuth-Morris-Pratt algorithm, that I found enough motivation to finally dot the i's and cross the t's. pi
.
Now, if you have no other questions other than the two written above, let's try to answer them.
Stage 4. Observation
I have been dealing with the prefix function using two articles: e-maxx and on neerc.ifmo. They overlap in many ways and derive a linear algorithm in the same way. What follows is simply a more detailed retelling of these articles, where I will try to analyze each point with examples and drawings that helped me personally in understanding.
In both papers, the authors make the following observation, which helps to derive the algorithm:
.
Or, in code language: pi[i + 1] <= pi[i] + 1
.
That is meaning cannot increase by more than compared to the value on the previous prefix.
This is proved by contradiction. Let us assume that a case is possible in which
pi[i + 1] > pi[i] + 1
.
Or, in other words, pi[i] < pi[i + 1] - 1
.
In order for this inequality to be satisfied, pi[i + 1]
must be equal to at least otherwise we will get the value pi[i]
less than zero, which is impossible by definition.
So let it be pi[i + 1] = k
, k >= 2
.
That is, the length suffix k
matches the length prefix k
. However, then, taking a step back, we can say that the previous prefix has a length suffix k - 1
matches the length prefix k - 1
In the figure below, from the fact that the red areas coincide, it follows that the yellow ones also coincide.
This means that pi[i] >= k - 1
that is pi[i] >= pi[i + 1] - 1
. We came to a contradiction.
How does this observation help us to calculate the prefix function? Suppose we are at step i
having already calculated pi
for all values less i
. Then, based on observation, we have two possible options:
pi[i] = pi[i - 1] + 1
,pi[i] <= pi[i - 1]
.
Hidden text
While writing the article, I had a question: is it possible that pi[i] == pi[i - 1]?
After thinking a little, I found an example:
In what case will the first equality be satisfied? Only if the prefix and suffix that matched in the previous step have increased and continue to match in the current step. That is, if s[i] == s[pi[i - 1]]
In the picture below, the yellow area increases to red only if the green symbols match.
Let's write some code that uses this observation. If the compared characters do not match, we will call the slow function written in the first stage. Only now can we be sure that when calling the slow function pi[i]
will not exceed pi[i - 1]
. We can pass this number to the function to limit the enumeration:
int pi_slow(string s, int max_suffix_len) {
int n = s.size();
for (int suffix_len = max_suffix_len; suffix_len > 0; suffix_len--) {
string suffix = s.substr(n - suffix_len, suffix_len);
string prefix_same_len = s.substr(0, suffix_len);
if (suffix == prefix_same_len) {
return suffix_len;
}
}
return 0;
}
vector<int> prefix_function(string s) {
int n = s.size();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
int k = pi[i - 1];
if (s[i] == s[k]) {
k++;
} else {
k = pi_slow(s.substr(0, i + 1), k);
}
pi[i] = k;
}
return pi;
}
I specifically indicated pi[i - 1]
for k
so that the naming is the same as in the code we want to end up with (see step 3). Now it becomes clearer where the condition came from in that code
if (s[i] == s[k]) {
k++;
}
Now we just need to figure out how to speed up the case when
s[i] != s[pi[i - 1]]
.
Step 5. Optimization
So, we are in this situation (green and blue symbols do not match):
Now we want to find the maximum prefix that will match the suffix ending in s[i]
. We can see that such a prefix, except for the last character, will completely match some suffix of the yellow area. In the picture below: the purple areas match, pi[i]
will be equal to the length of the purple area together with the blue symbol.
Now if we learn how to quickly find the maximum purple region and check if the next character after it is equal to s[i]
(is it blue), then we significantly optimize our search. How can we quickly find such a purple area? Try to figure it out yourself before reading on.
The key to understanding is that we must not forget about the equality of the yellow areas, we must use this fact. We can complete the picture as follows:
In the previous figure, it was clear that the purple region is both at the beginning and at the end of the yellow one. Now we simply draw it in both yellow regions on both sides. After that, it becomes obvious that the maximum purple region will be equal to the function from the yellow area (by definition of the function ).
We have already calculated this function in one of the previous steps, because the yellow area is one of the prefixes of our string. The yellow area, which is a prefix, ends at the index pi[i - 1] - 1
so the function from it lies in
pi[pi[i - 1] - 1]
.
What do we do if the blue symbols don't match? Look for the next matching area!
Let's remember what k
we designated pi[i - 1]
then the transition to the next suitable area will look like this k = pi[k - 1]
. I'll leave one last drawing before I run out of colors:
How long can we reduce it? k
? Until it becomes equal . When k
will be equal to zero, you need to check if the first character of the string is equal to blue (s[i]
). If equal, then pi[i]
will be equal .
As a result, we get the following code:
vector<int> prefix_function(string s) {
int n = s.size();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
int k = pi[i - 1];
while (k > 0) {
if (s[i] == s[k]) {
k++;
break;
}
k = pi[k - 1];
}
if (k == 0 && s[i] == s[0]) {
k = 1;
}
pi[i] = k;
}
return pi;
}
If you compare it carefully with the code I provided in step 3, you can see that they are completely equivalent:
vector<int> prefix_function(string s) {
int n = s.size();
vector<int> pi(n);
pi[0] = 0;
for (int i = 1; i < n; i++) {
int k = pi[i - 1];
while (k > 0 && s[i] != s[k]) {
k = pi[k - 1];
}
if (s[i] == s[k]) {
k++;
}
pi[i] = k;
}
return pi;
}
In the final version inside while
there is no increase k
it is moved to the end and combined with the check in case k == 0
.
So, I hope I have managed to answer the first of the questions formulated in step 3:
By what logic does the meaning change?
k
? And why in the endk
turns out to be equalpi[i]
?
There's only one left!
Step 6. Proof of linearity
Why, despite the cycle
while
inside the loopfor
does the algorithm run in linear time?
Actually, the answer to this question is much easier than the first. However, all the evidence for linearity that I have seen was formulated in such a way that my head hurt when I tried to comprehend it.
Now I will try to formulate the same idea that is in many other sources, but in other words, more understandable to me personally. I hope this will help you too.
So, inside the loop for
we are engaged in increasing and decreasing the variable k
. Moreover, we always increase it only by one in one pass, and decrease it several times within the cycle. while
by an unknown amount.
Just in case, I'll explain why it's inside while
variable k
necessarily decreases: pi[k - 1] < k
because by definition (function less than the length of the input string).
Next, we note that at the end of the loop body we write pi[i] = k
and at the beginning of the next iteration – k = pi[i - 1]
. That is between iterations value k
does not changeThis is a very important observation, the absence of which prevented me for a long time from realizing the proof.
From this observation we conclude that the variable k
for all iterations will increase by a maximum of because we have everything iterations, in each the increase occurs by a maximum of and between iterations the variable k
does not change. That is, the maximum possible value k
– This (initially we have k = pi[0] = 0
).
It follows that the variable will decrease k
maybe the same at most times. Because otherwise it would be less than zero, which is impossible.
And from this it follows that the total number of iterations of the inner loop while
does not exceed because within each iteration we are guaranteed to decrease k
.
Q.E.D.
Step 7. Conclusion
This is where our journey to enlightenment comes to an end.
I hope that everything written in the article was true and understandable.
Thank you for reading to the end!