Using Boyer-Moore-Horspool Algorithm in Java with Example Solution from LeetCode

Horspool's algorithm is used to find a substring in a string. For example, we have a string “The game is over” and a substring “over”. Horspool's algorithm will return the value of the first occurrence of the substring “over” in the string “The game is over”, which is 12.

In fact, this algorithm is a simplified Boyer-Moore algorithm, which is believed to perform better than the standard algorithm on random texts, but its worst-case speed is |needle| * |haystack| instead of 3 x |haystack|.

However, in my opinion, it is much simpler to perceive.

So, let's go.

Problem condition with leetcode: https://leetcode.com/problems/find-the-index-of-the-first-occurrence-in-a-string/description/

How does the algorithm work?

The string and substring are matched at the first character, and comparisons begin from the last character to the first.

For example, let's take the string: “aabcdadbc” and the substring “adb”

The lines are combined as follows (from left to right):

After combining the lines, a comparison starts from right to left; if the character matches, the previous one is compared until the substring ends.

In our case, the comparison starts with the character “b”, in the string and the substring this character is the same, so we move on to the previous one: the string is “a”, the substring is “d”, i.e. they are not equal, and therefore we can shift our substring by a certain number of characters, which we will take from the offset table.

An offset table is a thing that is built for a substring, where each character is assigned its own number of the maximum possible offset, starting from the end and excluding the last character. For all other characters that are not in the substring, and for the last character of the substring, the size of the substring is assigned as a number.

For example, for the substring “over” the offset table will look like this:

o

v

e

All other elements, including r

3

2

1

4

If elements are repeated, for example aababd, then the offset will be assigned to the last matching character:

You can simplify it to:

For our example, the offset table will be as follows:

In the string, the value that did not match the searched value is “a”, for which the offset is 2, i.e. we can offset the search in our string by 2 elements:

Now we start comparing again from right to left: “d” and “b” are not equal to each other, so we go to the offset table and find the value for the character from the string “d”, it is 1. We shift our substring by 1 to the right for further comparison.

Again we compare the last characters “a” and “b”, they are not equal, we find the offset for “a” from the offset table – 2, we shift the substring 2 characters to the right:

We start the comparison again: b = b, d = d, a = a. We see that the substring completely matches the string, and we return “5” – the index of the character “a”, which is the first character of the match between the string and the substring.

Now let's implement this code in Java:

public static int strStr(String haystack, String needle) {

    int haystackLen = haystack.length();

    int needleLen = needle.length();

    // создаем таблицу смещений для символов искомого текста

    Map<Character, Integer> offsetTable = new HashMap<>();

    for (int i = 0; i < needleLen - 1; i++) {

        offsetTable.put(needle.charAt(i), needleLen - i - 1);

    }

    int shift = 0; // смещение от начала строки

    while (shift <= haystackLen - needleLen) {

        int j = needleLen - 1;

        // ищем с конца подстроки к началу

        while (j >= 0 && needle.charAt(j) == haystack.charAt(shift + j)) {

            j--;

        }

        // если подстрока найдена, возвращаем значение смещения

        if (j < 0) {

            return shift;

        } else {

            // Рассчитываем сдвиг на основе измененной эвристики стоп-символа

            shift += Math.max(1, offsetTable.getOrDefault(haystack.charAt(shift + j), needleLen) - 1);

        }

    }

    // если подстрока не найдена

    return -1;

}

In general, I commented on each step in the code itself, but I would like to draw attention to the line:

shift += Math.max(1, offsetTable.getOrDefault(haystack.charAt(shift + j), needleLen) – 1); and analyze it in more detail.

haystack.charAt(shift + j) – find what character is in the string taking into account the shift. In the first step of our example, we found the character “a” in the string (the character of the first mismatch), here:

Let's look further: offsetTable.getOrDefault(haystack.charAt(shift + j), needleLen) – 1 – since we only placed shifts for characters from the substring in our table, for all the rest we will take the value “needleLen”, i.e. the length of the substring.

Then we decrease our shift by one, because above there is a comparison needle.charAt(j) == haystack.charAt(shift + j), where j is the length of the substring minus 1. If we do not decrease the shift by 1, then we will compare incorrect characters.

At the end we take either the value of the maximum possible shift, or 1, since the maximum possible shift can become equal to 0 (you can try running the example where the string is “abc” and the substring is “c”) and we get:

Math.max(1, offsetTable.getOrDefault(haystack.charAt(shift + j), needleLen) – 1);

In short, this is the thing, the Boyer-Moore-Horspool algorithm.

In the worst case, it will of course give quadratic time complexity (more precisely O(n * m)), but in practice, due to the offset table, it works much more efficiently.

Although on leetCode the algorithm gave the same 1ms Runtime as a regular enumeration (heh):

public static int strStr(String haystack, String needle) {

    if (haystack.length() < needle.length()) {

        return -1;

    }

    for (int i = 0; i <= haystack.length() - needle.length(); i++) {

        int j = 0;

        while (haystack.charAt(i + j) == needle.charAt(j)) {

            j++;

            if (j == needle.length()) {

                return i;

            }

        }

    }

    return -1;

}

But in the long text tests, of course, Horspool won.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *