String algorithms in practice. Part 3 – Rabin-Karp Algorithm

Today we will analyze an ingenious and non-trivial algorithm for finding a substring in a string. It is based not on comparing characters, but on comparing numbers. I already wrote that my main goal is not to write a simple analysis of algorithms, but to see their effectiveness, some interesting places and compare their performance with each other.
And today there is something to see.

Part 1 – Knuth-Morris-Pratt Algorithm
Part 2 – Boyer-Moore Algorithm

Algorithm device

In general, the classical implementation uses a polynomial hash, but in general, any ring hash function will do. But we will take the polynomial version.

The algorithm is extremely simple: we take the hash of the pattern, we take the hash of a piece of text of equal length, we compare them. If they are equal, then we check them character by character, otherwise we move on.

What does it all mean

Everything is very simple.

We have a binary alphabet where $a = 0, b = 1$.

Let’s calculate the hash of the pattern:

$H(P) = 1 * 2^4 + 0 * 2^3 + 1 * 2^2 + 0 * 2^1 + 1 * 2^0 = 21$

Calculate the hash of a substring $T[0, 4]$.

$H(T[0, 4]) = 0 * 2^4 + 0 * 2^3 + 1 * 2^2 + 0 * 2^1 + 1 * 2^0 = 5$

They do not match, so we move the pattern one symbol to the right.

Calculate the hash of a substring $T[1, 5]$.

$H(T[1, 5]) = (H(T[0, 4]) - 0 * 2^4) * 2 + 0 * 2^0 = 10$

The hashes of the substring and the pattern do not match again, so we move the pattern one more character to the right.

Calculate the hash of a substring $T[2, 6]$.

$H(T[2, 6]) = (H(T[1, 5]) - 0 * 2^4) * 2 + 1 * 2^0 = 21$

The hashes matched, so we check the characters line by line.

Hash calculation

To match two strings, we actually turn them into numbers by simply translating them into an arbitrary number system, where the position of the character is the digit, its code is the value of the digit, and the power of the alphabet is the number of characters in the encoding (in general, this is not necessary, you can take any number greater than 1, but then collisions will occur more often.The larger the number, the better, but the main thing is not to run into an overflow🙂).

In general, the hash calculation will look something like this:

$H(P) = p[0] * x^{m-1} + p[1] * x^{m-2} + ... + p[m-1] *x^0$

$H(P)$ – pattern hash

$x$ is some natural number, for example, the cardinality of the alphabet.

Now let’s calculate the hash of the real string. Let’s take a string pattern and for the sake of clarity x take the power of a seven-bit ASCII table – 128.

$H(Pattern) = 80 * 128^6 + 97 * 128^5 + 116 * 128^4 + 116 * 128^3 \\ + 101 * 128^2 + 114 * 128^1 + 110 * 128^0$

Result: 355 207 998 962 030

Looks too much. To write such a number in binary form, you need 49 bits, and from the source data we only have a pattern of 7 characters long and a truncated ASCII encoding. If we take the eight-bit version, then there will be overflow even on 64-bit systems. So let’s help solve this issue. Horner’s scheme and modular arithmetic.

Horner’s scheme

$H(P) = p[m-1] * x^{m-1} + p[m-2] * x^{m-2} + ... + p[0] * x^0 = \\ (...(p[m-1] *x+p[m-2]) * x + ... + p[1]) * x + p[0]$

Its task is to help us get rid of exponentiation, since exponentiation is a relatively voracious arithmetic operation. Also, without Horner’s scheme, the hashing algorithm would become less optimal. Since we will have a large number of duplicate calculations.

And no modular arithmetic will help us if we have a pattern with a length of 128 ASCII characters. In this case, with the usual exponentiation, we will use the number to hash the first character in the pattern $2^{1024}$and this, as you know, is a lot.

Modular arithmetic

In order to avoid overflow, but at the same time to obtain correct hashing results, we will take the remainder of the division for all intermediate results of calculations.
So our final algorithm should look something like this:

$H(P) = (((...((p[m-1] *x+p[m-2])\mod q) * x \\ + ... + p[1]) \mod q) * x + p[0]) \mod q$

At all intermediate stages, we will obtain values ​​not exceeding

$p_{max} * (x + 1)$.

But there is one caveat:

When you manage to shove the unpushable, you piss off the god of collisions

Escape from the god of collisions

If you start digging into research on the topic of collisions in hash functions, then naturally you can not get out of there. In short, the number q should be big and simple. And to be more specific, here it is:

More details

We have text $t[0...n]$ long $len = n + 1$. If we take

If you go over the first thousand prime numbers and plot $N(Q)$where $N$ is the number of collisions, it will be like this:

here

Here I tested on the passage about the oak tree from War and Peace, where I searched for the line broken off. Yes, the pattern is not very large, the numbers are also not the largest, but we can see the dynamics of the number of collisions. And no matter how trivial it is, we got the usual asymptote. Well, as we can see after 7000, the worst thing we get is 1 collision. Of course, it is not a fact that there will be no further surge. But since 2000, the worst thing we’ve had is 4 collisions per text of 1671 characters.

Everything is logical, the larger the prime number, the less we shrink the hash, the less collisions. After all, when we take a range of numbers $[0; 2^{1024}]$(and we can get something like this hash without modular arithmetic), and try to shrink it into the range $[0, 2^{32}-1]$, then it is obvious that we have to pay something for this. And the price of this conflict.

Algorithm code

export function getSubstringRK(text: string, pattern: string): number[] {
    const result = [];

    const alphabetSize = 256;
    const mod = 9973;

    let patternHash = pattern.charCodeAt(0) % mod;
    let textHash = text.charCodeAt(0) % mod;

    let firstIndexHash = 1;

    for(let i = 1; i < pattern.length; i++)
    {
        patternHash *= alphabetSize;
        patternHash += pattern.charCodeAt(i);
        patternHash %= mod;

        textHash *= alphabetSize;
        textHash += text.charCodeAt(i);
        textHash %= mod;

        firstIndexHash *= alphabetSize;
        firstIndexHash %= mod;
    }

    for (let i = 0; i <= text.length - pattern.length; i++) {
        if (patternHash === textHash && compareText(text, i, pattern)) {
            result.push(i);
        }

        if (i === text.length - pattern.length) break;

        textHash -= (text.charCodeAt(i) * firstIndexHash) % mod;
        textHash += mod;
        textHash *= alphabetSize;
        textHash += text.charCodeAt(i + pattern.length);
        textHash %= mod;
    }

    return result;
}

function compareText(text: string, index: number, pattern: string): boolean {
    for (let i = 0; i < pattern.length; i++) {
        if (pattern[i] !== text[index + i]) {
            return false;
        }
    }

    return true;
}

Performance measurements

The contests are the same

Screaming lines

Measurements where both the text and the pattern consist only of the letter “a”.

Text: string 1024 characters long
Sample: string 32 characters long

Ops/sec

getSubstringNaive x 9,141
getSubstringKMP x 84,419
getSubstringNotSoNaive x 9,587
getSubstringBMBadCharacter x 9,103
getSubstringRK x 9,975

Number of comparisons

getSubstringNaive : 31,776
getSubstringKMP : 2,047
getSubstringNotSoNaive : 31,776    
getSubstringBMBadCharacter : 31,776
getSubstringRK : 32,769 

The Rabin-Karp algorithm in terms of complexity in the worst case has the complexity $O(n*m)$, does not stand out much in terms of running time, but at the same time makes more comparisons. After all, before we check the string, we must first check the hash. So here is our experimental outsider.

Pseudo DNA

A string consisting of 1024 alphabetic characters [TGAC].

Sample: GTAGTGTGTCTACGTCTTTCTTTGACAGTACCGCGTA. (37 characters)

Ops/sec

getSubstringNaive x 6,349
getSubstringKMP x 156,780
getSubstringNotSoNaive x 189.694
getSubstringBMBadCharacter x 199,476
getSubstringRK x 229,361

Number of comparisons

getSubstringNaive : 36,556
getSubstringKMP : 1,422
getSubstringNotSoNaive : 1,434    
getSubstringBMBadCharacter : 925
getSubstringRK : 1,136

Despite the fact that the Rabin-Karp algorithm is in second place, in terms of the number of comparisons, it is ahead of Boyer-Moore. Well, in general, BM with a bad symbol heuristic is not so effective on generated strings.

War and Peace

Text:

Essay about oak

There was an oak at the edge of the road. Probably ten times older than the birches that made up the forest, it was ten times thicker, and twice as tall as each birch. It was a huge, two-girth oak, with boughs broken off, apparently long ago, and with broken bark, overgrown with old sores. With his huge clumsy, asymmetrically spread out clumsy hands and fingers, he stood between smiling birches like an old, angry and contemptuous freak. Only he alone did not want to submit to the charm of spring and did not want to see either spring or the sun.

“Spring, and love, and happiness! – as if said this oak. “And how do you not get tired of the same stupid senseless deceit! Everything is the same, and everything is a lie! There is no spring, no sun, no happiness. There, look, crushed dead fir trees are sitting, always the same, and there I spread my broken, peeled fingers, wherever they grew – from the back, from the sides. As you have grown, so I stand, and I do not believe your hopes and deceptions.

Prince Andrei looked back at this oak tree several times as he rode through the forest, as if he was expecting something from him. There were flowers and grass under the oak, but he still, frowning, motionless, ugly and stubbornly, stood in the middle of them.

“Yes, he is right, this oak tree is a thousand times right,” thought Prince Andrei, “let others, young people, again succumb to this deception, and we know life, our life is over! A whole new series of hopeless, but sadly pleasant thoughts arose in the soul of Prince Andrei in connection with this oak tree. During this journey, it was as if he thought over his whole life again and came to the same old, reassuring and hopeless conclusion that he did not need to start anything, that he should live out his life without doing evil, without worrying and desiring nothing. .

Patterns:

  1. oak
  2. Andrei
  3. broken off

results

Ops/sec

Number of comparisons

On real text, the algorithm holds the bar for $O(n*m)$its performance is approximately the same as that of the KMP algorithm and ± the same number of comparisons.

findings

This is far from the fastest reading algorithm, but it has one cool feature: it can perform inaccurate searches. Since it doesn’t do character comparisons directly, it can quickly search the stream for examples of occurrences of a particular pattern. So we can make a system that will remove all punctuation marks, convert text to lowercase and look for the occurrence of a paragraph in the text. Yes, if you take it easy, then this can be done without the RK algorithm, but then it will not be so optimal in terms of time and memory.

Similar Posts

Leave a Reply