In computer science, string searching means finding the location of one or more strings (called patterns) in a large text.
In this article, we’ll discuss the Rabin-Karp string searching algorithm. First, we’ll explain the naive string searching algorithm. Next, we’ll explain the Rabin-Karp algorithm and apply it to an example.
Finally, we’ll discuss some variations of this algorithm.
2. Naive Algorithm
The idea of this algorithm is simple. It works by trying to search for the pattern starting from every possible location in the text.
Consider the following algorithm that represents the naive approach:
We try to match the pattern , starting from every possible position in . For each starting position, we keep iterating over both the pattern and the text trying to match them.
If we reach a match, we increase the answer by one. Otherwise, we break the loop on the first mismatch we find.
As an example, let’s look at the algorithm’s steps if we wanted to look for in :
The Complexity of this algorithm is , where is the length of the text, and is the length of the pattern.
3. Rabin-Karp Algorithm
First, we need to have some background on hashing functions that are used in Rabin-Karp algorithms. After that, we’ll move to the Rabin-Karp algorithm itself.
3.1. Hash Function
Previously, we had to try matching the pattern from every single position in the text. To improve this, we’ll create a function that can map any string to an integer value.
In other words, we need a function (called a hashing function), which takes a string and maps it to an integer . From now on, we’ll call the hash value of .
However, not every function is useful. The hashing function should be easy to calculate. We should also be able to perform precalculation, and in the precalculation, we should be able to calculate the hash value for a string in linear time.
Moreover, after the preprocessing, we should be able to calculate the hash value of any substring in constant time.
3.2. Hash Function Example
Perhaps one of the simplest hashing functions we can use is the sum of ASCII codes of the letters in the string. For simplicity, we’ll use to denote the ASCII code of . Formally:
Since we can quickly calculate the hash value of a substring using prefix sums, this hashing function is a valid choice.
Let’s look at a few examples, assuming :
Notice that even though . This case, when two distinct strings have the same hash value, is called a collision.
As a rule, the fewer collisions a hashing function causes, the better it is. Our hashing function has lots of collisions since it doesn’t take into account the order and position of letters. Later, we’ll discuss better hashing functions.
Check the following implementation of a hash structure using the discussed hashing function:
The first function performs the precalculation on the given string. We iterate over the string and calculate the prefix sum of the ASCII values of the string’s characters.
The second function is used to calculate the hash value of a given range . To do this, we return the prefix sum at the end of the range, after subtracting the prefix sum of the beginning of the range. Therefore, we get the sum of values in the range .
The complexity of the precalculation function is , where is the length of the string. Also, the complexity of the hashing calculation function is .
3.3. Rabin-Karp Main Idea
Earlier, we explained what a hashing function is. The most important property of hashing functions is that if we have two strings and and , we can safely assume that .
However, if , we can’t guarantee that because it’s possible that two distinct strings have the same hash value.
Now, let’s take a look at the Rabin-Karp string searching algorithm:
Assume that we’re looking for the pattern in the text . We’re currently at location in , and we need to check if is equal to , where is equal to the length of .
If then we can assume that the two strings aren’t equal, and skip the matching attempt at this location. Otherwise, we should attempt to match the strings character by character.
The worst-case complexity of this algorithm is still , where is the length of the text, and is the length of the pattern.
However, if we use a good hashing function, the expected complexity would become , where is the number of matches.
The reason behind this is that a good hashing function would rarely cause collisions. Therefore, we’d rarely need to compare two substrings when there is no match. Furthermore, if we only want to check if the pattern exists or not, the complexity would become , because we can break after the first occurrence.
3.4. Rabin-Karp Example
To understand how Rabin-Karp works, let’s revisit the example we used in section 2.2. Unlike the naive algorithm, we do not need to match the pattern at every location.
Let’s take a look at the matching process:
In comparison to the naive algorithm, we only needed to check 2 out of 5 positions.
In the next section, we’ll discuss a better hashing function that can get rid of most of the false-positives. Thus, it’ll reduce the number of locations that we need to check.
4. A Good Hashing Function
Earlier, we discussed a hashing function based on the sum of ASCII codes. However, that hashing function didn’t take into account the order or the positions of the letters. For example, any permutation of the string will have the same hash value.
Instead, we’ll try to represent strings as numbers in base , where is a prime number larger than all ASCII codes.
As an example, assume we have the string . Its representation in base would be:
In general, a string with length would have the following hash value:
Since we’re representing the string in a valid number system, hash values for distinct strings will be distinct. However, the hash values would become huge when representing a long string. Therefore, it would be inefficient to compute and store them.
Instead, we’ll calculate the hash value modulo , where is a large prime (usually around ). The larger the prime is, the fewer collisions it would cause.
As discussed earlier, we should be able to calculate the hash value for any substring in constant time after preprocessing.
Let’s try to calculate the hash value for every prefix in a string of length :
Notice that we have . This will allow us to calculate the prefix array in linear time. Also, we’d need to calculate an array that stores the different powers of the prime .
However, we still need to find a way to calculate the hash value of any substring in constant time.
As an example, let’s say we want to calculate the value of . The result should be .
This value is available in , but we need to remove the terms corresponding to and . These two terms are available in . However, when moving from to , they were multiplied times by .
To remove them, we need to multiply by and then subtract it from . Therefore, .
As a rule, .
Now that we have these equations ready, let’s implement the following hashing structure:
The complexity of the precalculation is , where is the length of the string. Also, the complexity of the hashing function is .
Using this hashing function lowers the probability of collisions. Therefore, we can achieve the expected complexity of Rabin-Karp in most cases. However, the algorithm would still be slow when there are many matches, regardless of the hashing function.
In the next section, we’ll discuss a non-deterministic version of Rabin-Karp.
5. Variations of Rabin-Karp
5.1. Non-Deterministic Version
The idea behind this version is to use multiple hashing functions, each with a different modulo value and base .
If two strings have equal hash values in multiple hashing functions, the probability of them being equal is so high that we can safely assume they’re equal in non-sensitive applications.
This allows us to rely completely on these hashing functions instead of having to check the strings character by character when the hash values are equal. Therefore, the complexity of this version is , where is the length of the text, and is the length of the pattern.
5.2. Multiple Equal-Length Patterns
Rabin-Karp algorithm can be extended to deal with multiple patterns, as long as they have the same length.
First, we calculate the hash value for each pattern and store them in a map. Next, when checking some location in the text, we can check in constant time if the substring’s hash value is available in the map or not.
The expected complexity of this version, if applied to the deterministic version, is , where is the length of the text, is the total length of all patterns, and is the total length of all matches. On the other hand, if it’s applied to the non-deterministic version, the complexity would be .
In this article, we discussed the string searching problem.
First, we looked at the naive algorithm.
After that, we discussed how Rabin-Karp improves this algorithm by relying on hashing functions. Then, we explained how to choose good hashing functions.
Finally, we looked at some variations of Rabin-Karp like the non-deterministic version and applying Rabin-Karp on multiple patterns.