Doctors Knuth, Morris and Pratt, or How I Learned to Stop Worrying and Love the Prefix Function

If you don't know what a string prefix function is, don't know how it's calculated, or most importantly, don't fully understand why the algorithm for calculating the prefix function works in linear time, then this article is for you.

I went through a series of realizations and insights before reaching enlightenment, and now I invite you to walk this path with me.

Step 1. Definition

My first encounter with the prefix function was back in school. I was preparing for programming olympiads, and of course my “gentleman's set” of preparation included the Knuth-Morris-Pratt algorithm, which allows you to find a substring of length m in the length line n for O(n+m) time.

So, what is a prefix function? Before we answer that question, let's get to know the function \pi(s). This function maps a string to a number, which is the length of the largest suffix of the string swhich matches its prefix of the same length, but does not match the entire string s.

Without examples it may not be clear right away. Let's consider, for example, the line ababa. This line has prefixes a, ab, aba, abab, ababa and suffixes a, ba, aba, baba, ababaThe intersection of prefixes and suffixes are strings a, aba And ababa. The longest of these lines that is not the original line is the line abaits length is 3. Accordingly, \pi("ababa") = 3.

Other examples: \pi("aaab") = 0, \pi("aaa") = 2, \pi("abba") = 1.

Write a function that calculates \pi(s) You can, for example, like this (C++ language):

int pi(string s) {
    int n = s.size();
    for (int suffix_len = n - 1; suffix_len > 0; suffix_len--) {
        string suffix = s.substr(n - suffix_len, suffix_len);
        string prefix_same_len = s.substr(0, suffix_len);
        if (suffix == prefix_same_len) {
            return suffix_len;
        }
    }
    return 0;
}

We go through all the suffixes of the string in descending order of their length, for each suffix we find a prefix of the same length and compare them. As soon as we find a matching suffix and prefix, we return their length. Such a function will obviously work in O(n^2)but this doesn’t bother us yet, at this stage we just want to figure out the definition.

Now that we have learned and realized what it is \pi(s)it will be much easier to understand what a prefix function is. Let's denote it P(s). So, the prefix function of a string is an array of numbers \pi(s[0..i])Where s[0..i] – string prefix s to i-th character inclusive, i = 0, 1, 2, \ldots, n - 1.

That is, the prefix function is array of numbers \pi for all string prefixes (the number of prefixes of a string is equal to its length, i.e. the size of the array is equal to the length of the string).

For example"ababa") = [\pi("a")” alt=”P("ababa") = [\pi("a")” src=”https://habrastorage.org/getpro/habr/upload_files/b92/565/c2a/b92565c2a0413b867ff7f26cb28cd278.svg” width=”207″ height=”22″/>, \pi("ab"), \pi("aba"), \pi("abab"), \pi("ababa")]= [0, 0, 1, 2, 3].

Another example: P("abacababa") = [0,0,1,0,1,2,3,2,3].

Before moving on, make sure you can look at a string and mentally calculate its prefix function.

Step 2. Application

How can a prefix function help find a substring in a string? The idea is pretty nice: let's glue a substring and a string together, inserting a character between them that is guaranteed not to appear in either the substring or the string. Next, let's calculate the prefix function for the resulting glued string. Then, if at some point the value of the prefix function becomes equal to the length of the sought substring, this will mean that we have found an occurrence of the substring in the string.

Example: searching for a substring aba in line abacaba. We assume that the symbol \# does not occur in the input alphabet. By gluing the strings together, we obtain the string aba\#abacaba.

Its prefix function is equal to [0,0,1,0,1,2,3,0,1,2,3]. We see that in the function we received twice 3which is the length of the desired substring. Therefore, these positions are where the substring ends in the original string.

Let me explain in a little more detail. Let the length of the desired substring be equal to k. We found such a prefix of the glued string s[0..i]for which \pi(s[0..i]) = k. That is, we found the length suffix kmatched with the length prefix k. Length prefix k – this is our desired substring. The length suffix k is guaranteed to be completely contained in the original string in which we are searching for occurrences. This is guaranteed to us by the added service character \#. Thanks to it, the prefixes and suffixes of the glued string do not intersect (except for the string itself). Therefore, the found suffix is ​​an occurrence of the substring in the string.

Well, if we learn to calculate the prefix function in linear time from the length of the string, then we will get an algorithm for finding a substring of length m in the length line nworking for O(n+m) time.

Stage 3. Misunderstanding

While still in school, I learned that the prefix function can be computed in linear time. Without further ado, here is an algorithm that does this:

vector<int> prefix_function(string s) {
    int n = s.size();
    vector<int> pi(n);
    pi[0] = 0;
    for (int i = 1; i < n; i++) {
        int k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) {
            k = pi[k - 1];
        }
        if (s[i] == s[k]) {
            k++;
        }
        pi[i] = k;
    }
    return pi;
}

Here pi[i] – This \pi(s[0..i]) in our notations. That is, pi[0] = 0 – quite obvious, because the function \pi from the length string 1 always equal 0.

The difficulty in understanding what is happening begins when we find ourselves in a cycle:

  • By what logic does the meaning change? k ? And why in the end k turns out to be equal pi[i] ?

  • Why, despite the cycle while inside the loop for does the algorithm run in linear time?

When I was a young schoolboy, I tried to understand these issues, but I did not succeed, because I was not interested and motivated enough. Only in college did I understand the beauty of strict mathematical proofs and stopped taking unproven statements on faith (if they are not axioms). And at school, I simply believed that this function works, works quickly (after all, tests pass), memorized how to write it, and periodically applied it when solving problems.

When I was a student, I never got around to figuring out the prefix function. It was only when I became a programming teacher, whose course included the Knuth-Morris-Pratt algorithm, that I found enough motivation to finally dot the i's and cross the t's. pi .

Now, if you have no other questions other than the two written above, let's try to answer them.

Stage 4. Observation

I have been dealing with the prefix function using two articles: e-maxx and on neerc.ifmo. They overlap in many ways and derive a linear algorithm in the same way. What follows is simply a more detailed retelling of these articles, where I will try to analyze each point with examples and drawings that helped me personally in understanding.

In both papers, the authors make the following observation, which helps to derive the algorithm:

\pi(s[0..(i+1)]) \leqslant \pi(s[0..i]) + 1.

Or, in code language: pi[i + 1] <= pi[i] + 1 .

That is meaning \pi cannot increase by more than 1 compared to the value on the previous prefix.

This is proved by contradiction. Let us assume that a case is possible in which

pi[i + 1] > pi[i] + 1 .

Or, in other words, pi[i] < pi[i + 1] - 1 .

In order for this inequality to be satisfied, pi[i + 1] must be equal to at least 2otherwise we will get the value pi[i] less than zero, which is impossible by definition.

So let it be pi[i + 1] = k , k >= 2 .

That is, the length suffix k matches the length prefix k . However, then, taking a step back, we can say that the previous prefix has a length suffix k - 1 matches the length prefix k - 1 In the figure below, from the fact that the red areas coincide, it follows that the yellow ones also coincide.

Illustration of the proof of observation

Illustration of the proof of observation

This means that pi[i] >= k - 1 that is pi[i] >= pi[i + 1] - 1 . We came to a contradiction.

How does this observation help us to calculate the prefix function? Suppose we are at step i having already calculated pi for all values ​​less i . Then, based on observation, we have two possible options:

  • pi[i] = pi[i - 1] + 1 ,

  • pi[i] <= pi[i - 1] .

Hidden text

While writing the article, I had a question: is it possible that pi[i] == pi[i - 1]? After thinking a little, I found an example:

P("aabaaa") = [0, 1, 0, 1, 2, 2]

In what case will the first equality be satisfied? Only if the prefix and suffix that matched in the previous step have increased and continue to match in the current step. That is, if s[i] == s[pi[i - 1]] In the picture below, the yellow area increases to red only if the green symbols match.

Illustration of the application of observations

Illustration of the application of observations

Let's write some code that uses this observation. If the compared characters do not match, we will call the slow function written in the first stage. Only now can we be sure that when calling the slow function pi[i] will not exceed pi[i - 1] . We can pass this number to the function to limit the enumeration:

int pi_slow(string s, int max_suffix_len) {
    int n = s.size();
    for (int suffix_len = max_suffix_len; suffix_len > 0; suffix_len--) {
        string suffix = s.substr(n - suffix_len, suffix_len);
        string prefix_same_len = s.substr(0, suffix_len);
        if (suffix == prefix_same_len) {
            return suffix_len;
        }
    }
    return 0;
}

vector<int> prefix_function(string s) {
    int n = s.size();
    vector<int> pi(n);
    pi[0] = 0;
    for (int i = 1; i < n; i++) {
        int k = pi[i - 1];
        if (s[i] == s[k]) {
            k++;
        } else {
            k = pi_slow(s.substr(0, i + 1), k);
        }
        pi[i] = k;
    }
    return pi;
}

I specifically indicated pi[i - 1] for k so that the naming is the same as in the code we want to end up with (see step 3). Now it becomes clearer where the condition came from in that code

if (s[i] == s[k]) {
    k++;
}

Now we just need to figure out how to speed up the case when

s[i] != s[pi[i - 1]] .

Step 5. Optimization

So, we are in this situation (green and blue symbols do not match):

s[i] != s[pi[i - 1]]

s[i] != s[pi[i - 1]]

Now we want to find the maximum prefix that will match the suffix ending in s[i] . We can see that such a prefix, except for the last character, will completely match some suffix of the yellow area. In the picture below: the purple areas match, pi[i] will be equal to the length of the purple area together with the blue symbol.

Switch to a lower prefix

Switch to a lower prefix

Now if we learn how to quickly find the maximum purple region and check if the next character after it is equal to s[i] (is it blue), then we significantly optimize our search. How can we quickly find such a purple area? Try to figure it out yourself before reading on.

The key to understanding is that we must not forget about the equality of the yellow areas, we must use this fact. We can complete the picture as follows:

Extension of the previous drawing

Extension of the previous drawing

In the previous figure, it was clear that the purple region is both at the beginning and at the end of the yellow one. Now we simply draw it in both yellow regions on both sides. After that, it becomes obvious that the maximum purple region will be equal to the function \pi from the yellow area (by definition of the function \pi).

We have already calculated this function in one of the previous steps, because the yellow area is one of the prefixes of our string. The yellow area, which is a prefix, ends at the index pi[i - 1] - 1 so the function \pi from it lies in

pi[pi[i - 1] - 1] .

What do we do if the blue symbols don't match? Look for the next matching area!

Let's remember what k we designated pi[i - 1] then the transition to the next suitable area will look like this k = pi[k - 1] . I'll leave one last drawing before I run out of colors:

The last drawing

The last drawing

How long can we reduce it? k ? Until it becomes equal 0. When k will be equal to zero, you need to check if the first character of the string is equal to blue (s[i] ). If equal, then pi[i] will be equal 1.

As a result, we get the following code:

vector<int> prefix_function(string s) {
    int n = s.size();
    vector<int> pi(n);
    pi[0] = 0;
    for (int i = 1; i < n; i++) {
        int k = pi[i - 1];
        while (k > 0) {
            if (s[i] == s[k]) {
                k++;
                break;
            }
            k = pi[k - 1];
        }
        if (k == 0 && s[i] == s[0]) {
            k = 1;
        }
        pi[i] = k;
    }
    return pi;
}

If you compare it carefully with the code I provided in step 3, you can see that they are completely equivalent:

vector<int> prefix_function(string s) {
    int n = s.size();
    vector<int> pi(n);
    pi[0] = 0;
    for (int i = 1; i < n; i++) {
        int k = pi[i - 1];
        while (k > 0 && s[i] != s[k]) {
            k = pi[k - 1];
        }
        if (s[i] == s[k]) {
            k++;
        }
        pi[i] = k;
    }
    return pi;
}

In the final version inside while there is no increase k it is moved to the end and combined with the check in case k == 0 .

So, I hope I have managed to answer the first of the questions formulated in step 3:

By what logic does the meaning change? k ? And why in the end k turns out to be equal pi[i] ?

There's only one left!

Step 6. Proof of linearity

Why, despite the cycle while inside the loop for does the algorithm run in linear time?

Actually, the answer to this question is much easier than the first. However, all the evidence for linearity that I have seen was formulated in such a way that my head hurt when I tried to comprehend it.

Now I will try to formulate the same idea that is in many other sources, but in other words, more understandable to me personally. I hope this will help you too.

So, inside the loop for we are engaged in increasing and decreasing the variable k . Moreover, we always increase it only by one in one pass, and decrease it several times within the cycle. while by an unknown amount.

Just in case, I'll explain why it's inside while variable k necessarily decreases: pi[k - 1] < k because by definition \pi(s[0..i]) < i + 1 (function \pi less than the length of the input string).

Next, we note that at the end of the loop body we write pi[i] = k and at the beginning of the next iteration – k = pi[i - 1] . That is between iterations value k does not changeThis is a very important observation, the absence of which prevented me for a long time from realizing the proof.

From this observation we conclude that the variable k for all iterations will increase by a maximum of n - 1because we have everything n - 1 iterations, in each the increase occurs by a maximum of 1and between iterations the variable k does not change. That is, the maximum possible value k – This n - 1 (initially we have k = pi[0] = 0).

It follows that the variable will decrease k maybe the same at most n - 1 times. Because otherwise it would be less than zero, which is impossible.

And from this it follows that the total number of iterations of the inner loop while does not exceed n-1because within each iteration we are guaranteed to decrease k .

Q.E.D.

Step 7. Conclusion

This is where our journey to enlightenment comes to an end.

I hope that everything written in the article was true and understandable.

Thank you for reading to the end!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *