1. Overview

In computer science, string searching means finding the location of one or more strings (called patterns) in a large text.

In this article, we’ll discuss the Rabin-Karp string searching algorithm. First, we’ll explain the naive string searching algorithm. Next, we’ll explain the Rabin-Karp algorithm and apply it to an example.

Finally, we’ll discuss some variations of this algorithm.

2. Naive Algorithm

2.1. Pseudocode

The idea of this algorithm is simple. It works by trying to search for the pattern starting from every possible location in the text.

Consider the following algorithm that represents the naive approach:

Rendered by QuickLaTeX.com

We try to match the pattern P, starting from every possible position in T. For each starting position, we keep iterating over both the pattern P and the text T trying to match them.

If we reach a match, we increase the answer by one. Otherwise, we break the loop on the first mismatch we find.

2.2. Example

As an example, let’s look at the algorithm’s steps if we wanted to look for P="bab" in T="ababaac":

Rendered by QuickLaTeX.com

The Complexity of this algorithm is O(n \cdot k), where n is the length of the text, and k is the length of the pattern.

3. Rabin-Karp Algorithm

First, we need to have some background on hashing functions that are used in Rabin-Karp algorithms. After that, we’ll move to the Rabin-Karp algorithm itself.

3.1. Hash Function

Previously, we had to try matching the pattern from every single position in the text. To improve this, we’ll create a function that can map any string to an integer value.

In other words, we need a function H (called a hashing function), which takes a string s and maps it to an integer H(s)=x. From now on, we’ll call x the hash value of s.

However, not every function is useful. The hashing function should be easy to calculate. We should also be able to perform precalculation, and in the precalculation, we should be able to calculate the hash value for a string in linear time.

Moreover, after the preprocessing, we should be able to calculate the hash value of any substring s[i...j] in constant time.

3.2. Hash Function Example

Perhaps one of the simplest hashing functions we can use is the sum of ASCII codes of the letters in the string. For simplicity, we’ll use int(x) to denote the ASCII code of x. Formally:

  H(s[0...n-1])=int(s[0])+int(s[1])+....+int(s[n-1])

Since we can quickly calculate the hash value of a substring using prefix sums, this hashing function is a valid choice.

Let’s look at a few examples, assuming s="aabbab":

  • H(s[2...4]) = H("bba") = 98 + 98 + 97 = 293
  • H(s[3...5]) = H("bab") = 98 + 97 + 98 = 293

Notice that H(s[2...4])=H(s[3...5]) even though s[2...4] \neq s[3...5]. This case, when two distinct strings have the same hash value, is called a collision.

As a rule, the fewer collisions a hashing function causes, the better it is. Our hashing function has lots of collisions since it doesn’t take into account the order and position of letters. Later, we’ll discuss better hashing functions.

Check the following implementation of a hash structure using the discussed hashing function:

Rendered by QuickLaTeX.com

The first function Init performs the precalculation on the given string. We iterate over the string s and calculate the prefix sum of the ASCII values of the string’s characters.

The second function getHash is used to calculate the hash value of a given range [L, R]. To do this, we return the prefix sum at the end of the range, after subtracting the prefix sum of the beginning of the range. Therefore, we get the sum of values in the range [L, R].

The complexity of the precalculation function is O(n), where n is the length of the string. Also, the complexity of the hashing calculation function is O(1).

3.3. Rabin-Karp Main Idea

Earlier, we explained what a hashing function is. The most important property of hashing functions is that if we have two strings s_1 and s_2 and H(s_1) \neq H(s_2), we can safely assume that s_1 \neq s_2.

However, if H(s_1) = H(s_2), we can’t guarantee that s_1 = s_2 because it’s possible that two distinct strings have the same hash value.

Now, let’s take a look at the Rabin-Karp string searching algorithm:

Rendered by QuickLaTeX.com

Assume that we’re looking for the pattern P in the text T. We’re currently at location i in T, and we need to check if T[i...i+k-1] is equal to P[0...k-1], where k is equal to the length of P.

If H(P[0...m-1]) \neq H(T[i...i+m-1]) then we can assume that the two strings aren’t equal, and skip the matching attempt at this location. Otherwise, we should attempt to match the strings character by character.

The worst-case complexity of this algorithm is still O(n \cdot k), where n is the length of the text, and k is the length of the pattern.

However, if we use a good hashing function, the expected complexity would become O(n + k \cdot t), where t is the number of matches.

The reason behind this is that a good hashing function would rarely cause collisions. Therefore, we’d rarely need to compare two substrings when there is no match. Furthermore, if we only want to check if the pattern exists or not, the complexity would become O(n+k), because we can break after the first occurrence.

3.4. Rabin-Karp Example

To understand how Rabin-Karp works, let’s revisit the example we used in section 2.2. Unlike the naive algorithm, we do not need to match the pattern at every location.

Let’s take a look at the matching process:

Rendered by QuickLaTeX.com

In comparison to the naive algorithm, we only needed to check 2 out of 5 positions.

In the next section, we’ll discuss a better hashing function that can get rid of most of the false-positives. Thus, it’ll reduce the number of locations that we need to check.

4. A Good Hashing Function

4.1. Definition

Earlier, we discussed a hashing function based on the sum of ASCII codes. However, that hashing function didn’t take into account the order or the positions of the letters. For example, any permutation of the string will have the same hash value.

Instead, we’ll try to represent strings as numbers in base B, where B is a prime number larger than all ASCII codes.

As an example, assume we have the string s="abac". Its representation in base B would be:

int('a') \times B^3 + int('b') \times B^2 + int('a') \times B + int('c')

In general, a string s with length n would have the following hash value:

int(s[0])\times B^{n-1} + int(s[1]) \times B^{n-2} + ... + int(s[n-2]) \times B + int(s[n-1])

Since we’re representing the string in a valid number system, hash values for distinct strings will be distinct. However, the hash values would become huge when representing a long string. Therefore, it would be inefficient to compute and store them.

Instead, we’ll calculate the hash value modulo m, where m is a large prime (usually around 10^9). The larger the prime is, the fewer collisions it would cause.

4.2. Implementation

As discussed earlier, we should be able to calculate the hash value for any substring in constant time after preprocessing.

Let’s try to calculate the hash value for every prefix in a string s of length 5:

Rendered by QuickLaTeX.com

Notice that we have prefix[i] = prefix[i-1] \times B + int(s[i]). This will allow us to calculate the prefix array in linear time. Also, we’d need to calculate an array Pow that stores the different powers of the prime B.

However, we still need to find a way to calculate the hash value of any substring in constant time.

As an example, let’s say we want to calculate the value of H(s[2...4]). The result should be int(s[2]) \times B^2 + int(s[3]) \times B + int(s[4]).

This value is available in prefix[4], but we need to remove the terms corresponding to s[0] and s[1]. These two terms are available in prefix[1]. However, when moving from prefix[1] to prefix[4], they were multiplied 4-1=3 times by B.

To remove them, we need to multiply prefix[1] by B^3 and then subtract it from prefix[4]. Therefore, H(s[2...4]) = prefix[4] - prefix[1] \times B^{4-1}.

As a rule, H(s[L...R]) = prefix[R] - prefix[L-1] \times B^{R-L+1}.

Now that we have these equations ready, let’s implement the following hashing structure:

Rendered by QuickLaTeX.com

The complexity of the precalculation is O(n), where n is the length of the string. Also, the complexity of the hashing function is O(1).

Using this hashing function lowers the probability of collisions. Therefore, we can achieve the expected complexity of Rabin-Karp in most cases. However, the algorithm would still be slow when there are many matches, regardless of the hashing function.

In the next section, we’ll discuss a non-deterministic version of Rabin-Karp.

5. Variations of Rabin-Karp

5.1. Non-Deterministic Version

The idea behind this version is to use multiple hashing functions, each with a different modulo value M and base B.

If two strings have equal hash values in multiple hashing functions, the probability of them being equal is so high that we can safely assume they’re equal in non-sensitive applications.

This allows us to rely completely on these hashing functions instead of having to check the strings character by character when the hash values are equal. Therefore, the complexity of this version is O(n+k), where n is the length of the text, and k is the length of the pattern.

5.2. Multiple Equal-Length Patterns

Rabin-Karp algorithm can be extended to deal with multiple patterns, as long as they have the same length.

First, we calculate the hash value for each pattern and store them in a map. Next, when checking some location in the text, we can check in constant time if the substring’s hash value is available in the map or not.

The expected complexity of this version, if applied to the deterministic version, is O(n+SumPatterns+SumMatches), where n is the length of the text, SumPatterns is the total length of all patterns, and SumMatches is the total length of all matches. On the other hand, if it’s applied to the non-deterministic version, the complexity would be O(n+SumPatterns).

6. Conclusion

In this article, we discussed the string searching problem.

First, we looked at the naive algorithm.

After that, we discussed how Rabin-Karp improves this algorithm by relying on hashing functions. Then, we explained how to choose good hashing functions.

Finally, we looked at some variations of Rabin-Karp like the non-deterministic version and applying Rabin-Karp on multiple patterns.

guest
0 Comments
Inline Feedbacks
View all comments