Overview of Rabin-Karp Algorithm | Baeldung on Computer Science

1. Overview

In computer science, string searching means finding the location of one or more strings (called patterns) in a large text.

In this article, we’ll discuss the Rabin-Karp string searching algorithm. First, we’ll explain the naive string searching algorithm. Next, we’ll explain the Rabin-Karp algorithm and apply it to an example.

Finally, we’ll discuss some variations of this algorithm.

2. Naive Algorithm

2.1. Pseudocode

The idea of this algorithm is simple. It works by trying to search for the pattern starting from every possible location in the text.

Consider the following algorithm that represents the naive approach:

algorithm NaiveStringSearch(P, T):
    // INPUT
    //    P = The pattern to look for
    //    T = The text to look in
    // OUTPUT
    //    Returns the number of occurrences of P in T

    answer <- 0
    n <- length(T)
    k <- length(P)
    for i <- 0 to n - k:
        valid <- true
        for j <- 0 to k - 1:
            if T[i + j] != P[j]:
                valid <- false
                break
        if valid = true:
            answer <- answer + 1
    return answer

We try to match the pattern $P$ , starting from every possible position in $T$ . For each starting position, we keep iterating over both the pattern $P$ and the text $T$ trying to match them.

If we reach a match, we increase the answer by one. Otherwise, we break the loop on the first mismatch we find.

2.2. Example

As an example, let’s look at the algorithm’s steps if we wanted to look for $P="bab"$ in $T="ababaac"$ :

Location	Comparison							Matching Result
0	a	b	a	b	a	a	c	Mismatch
0	b	a	b					Mismatch
1	a	b	a	b	a	a	c	Match
1		b	a	b				Match
2	a	b	a	b	a	a	c	Mismatch
2			b	a	b			Mismatch
3	a	b	a	b	a	a	c	Mismatch
3				b	a	b		Mismatch
4	a	b	a	b	a	a	c	Mismatch
4					b	a	b	Mismatch

The Complexity of this algorithm is $O(n \cdot k)$ , where $n$ is the length of the text, and $k$ is the length of the pattern.

3. Rabin-Karp Algorithm

First, we need to have some background on hashing functions that are used in Rabin-Karp algorithms. After that, we’ll move to the Rabin-Karp algorithm itself.

3.1. Hash Function

Previously, we had to try matching the pattern from every single position in the text. To improve this, we’ll create a function that can map any string to an integer value.

In other words, we need a function $H$ (called a hashing function), which takes a string $s$ and maps it to an integer $H(s)=x$ . From now on, we’ll call $x$ the hash value of $s$ .

However, not every function is useful. The hashing function should be easy to calculate. We should also be able to perform precalculation, and in the precalculation, we should be able to calculate the hash value for a string in linear time.

Moreover, after the preprocessing, we should be able to calculate the hash value of any substring $s[i...j]$ in constant time.

3.2. Hash Function Example

Perhaps one of the simplest hashing functions we can use is the sum of ASCII codes of the letters in the string. For simplicity, we’ll use $int(x)$ to denote the ASCII code of $x$ . Formally:

$H(s[0...n-1])=int(s[0])+int(s[1])+....+int(s[n-1])$

Since we can quickly calculate the hash value of a substring using prefix sums, this hashing function is a valid choice.

Let’s look at a few examples, assuming $s="aabbab"$ :

$H(s[2...4]) = H("bba") = 98 + 98 + 97 = 293$
$H(s[3...5]) = H("bab") = 98 + 97 + 98 = 293$

Notice that $H(s[2...4])=H(s[3...5])$ even though $s[2...4] \neq s[3...5]$ . This case, when two distinct strings have the same hash value, is called a collision.

As a rule, the fewer collisions a hashing function causes, the better it is. Our hashing function has lots of collisions since it doesn’t take into account the order and position of letters. Later, we’ll discuss better hashing functions.

Check the following implementation of a hash structure using the discussed hashing function:

algorithm HashStructureInit(s):
    // INPUT
    //    s = String for which we want to calculate the hash values
    // OUTPUT
    //    Initializes prefix sum of ASCII codes in string s

    n <- length(s)
    prefix[0] <- int(s[0])
    for i <- 1 to n - 1:
        prefix[i] <- prefix[i - 1] + int(s[i])

algorithm getHash(L, R):
    // INPUT
    //    L, R = The indices determining the substring s[L...R] to calculate the hash value for
    // OUTPUT
    //    Returns the hash value of substring s[L...R]

    if L = 0:
        return prefix[R]
    return prefix[R] - prefix[L - 1]

The first function $Init$ performs the precalculation on the given string. We iterate over the string $s$ and calculate the prefix sum of the ASCII values of the string’s characters.

The second function $getHash$ is used to calculate the hash value of a given range $[L, R]$ . To do this, we return the prefix sum at the end of the range, after subtracting the prefix sum of the beginning of the range. Therefore, we get the sum of values in the range $[L, R]$ .

The complexity of the precalculation function is $O(n)$ , where $n$ is the length of the string. Also, the complexity of the hashing calculation function is $O(1)$ .

3.3. Rabin-Karp Main Idea

Earlier, we explained what a hashing function is. The most important property of hashing functions is that if we have two strings $s_1$ and $s_2$ and $H(s_1) \neq H(s_2)$ , we can safely assume that $s_1 \neq s_2$ .

However, if $H(s_1) = H(s_2)$ , we can’t guarantee that $s_1 = s_2$ because it’s possible that two distinct strings have the same hash value.

Now, let’s take a look at the Rabin-Karp string searching algorithm:

algorithm RabinKarpSearch(P, T, hashT, hashP):
    // INPUT
    //    P = The pattern to look for
    //    T = The text to look in
    //    hashT, hashP = Hash structures for strings T, P
    // OUTPUT
    //    Returns the number of occurrences of P in T

    answer <- 0
    n <- length(T)
    k <- length(P)
    for i <- 0 to n - k:
        textHash <- hashT.getHash(i, i + k - 1)
        patternHash <- hashP.getHash(0, k - 1)
        if textHash = patternHash:
            valid <- true
            for j <- 0 to k - 1:
                if T[i + j] != P[j]:
                    valid <- false
                    break
            if valid = true:
                answer <- answer + 1
    return answer

Assume that we’re looking for the pattern $P$ in the text $T$ . We’re currently at location $i$ in $T$ , and we need to check if $T[i...i+k-1]$ is equal to $P[0...k-1]$ , where $k$ is equal to the length of $P$ .

If $H(P[0...m-1]) \neq H(T[i...i+m-1])$ then we can assume that the two strings aren’t equal, and skip the matching attempt at this location. Otherwise, we should attempt to match the strings character by character.

The worst-case complexity of this algorithm is still $O(n \cdot k)$ , where $n$ is the length of the text, and $k$ is the length of the pattern.

However, if we use a good hashing function, the expected complexity would become $O(n + k \cdot t)$ , where $t$ is the number of matches.

The reason behind this is that a good hashing function would rarely cause collisions. Therefore, we’d rarely need to compare two substrings when there is no match. Furthermore, if we only want to check if the pattern exists or not, the complexity would become $O(n+k)$ , because we can break after the first occurrence.

3.4. Rabin-Karp Example

To understand how Rabin-Karp works, let’s revisit the example we used in section 2.2. Unlike the naive algorithm, we do not need to match the pattern at every location.

Let’s take a look at the matching process:

Location	Comparison							Hashing Values	Check Needed?	Result
0	a	b	a	b	a	a	c	H(“aba”) = 292	No
0	b	a	b					H(“bab”) = 293	No
1	a	b	a	b	a	a	c	H(“bab”) = 293	Yes	Match
1		b	a	b				H(“bab”) = 293	Yes	Match
2	a	b	a	b	a	a	c	H(“aba”) = 292	No
2			b	a	b			H(“bab”) = 293	No
3	a	b	a	b	a	a	c	H(“baa”) = 292	No
3				b	a	b		H(“bab”) = 293	No
4	a	b	a	b	a	a	c	H(“aac”) = 293	Yes	Mismatch
4					b	a	b	H(“bab”) = 293	Yes	Mismatch

In comparison to the naive algorithm, we only needed to check 2 out of 5 positions.

In the next section, we’ll discuss a better hashing function that can get rid of most of the false-positives. Thus, it’ll reduce the number of locations that we need to check.

4. A Good Hashing Function

4.1. Definition

Earlier, we discussed a hashing function based on the sum of ASCII codes. However, that hashing function didn’t take into account the order or the positions of the letters. For example, any permutation of the string will have the same hash value.

Instead, we’ll try to represent strings as numbers in base $B$ , where $B$ is a prime number larger than all ASCII codes.

As an example, assume we have the string $s="abac"$ . Its representation in base $B$ would be:

$int('a') \times B^3 + int('b') \times B^2 + int('a') \times B + int('c')$

In general, a string $s$ with length $n$ would have the following hash value:

$int(s[0])\times B^{n-1} + int(s[1]) \times B^{n-2} + ... + int(s[n-2]) \times B + int(s[n-1])$

Since we’re representing the string in a valid number system, hash values for distinct strings will be distinct. However, the hash values would become huge when representing a long string. Therefore, it would be inefficient to compute and store them.

Instead, we’ll calculate the hash value modulo $m$ , where $m$ is a large prime (usually around $10^9$ ). The larger the prime is, the fewer collisions it would cause.

4.2. Implementation

As discussed earlier, we should be able to calculate the hash value for any substring in constant time after preprocessing.

Let’s try to calculate the hash value for every prefix in a string $s$ of length $5$ :

Index	Prefix	Hash Value
$0$	$s[0]$	$int(s[0])$
$1$	$s[0...1]$	$int(s[0]) \times B + int(s[1])$
$2$	$s[0...2]$	$int(s[0]) \times B^2 + int(s[1]) \times B + int(s[2])$
$3$	$s[0...3]$	$int(s[0]) \times B^3 + int(s[1]) \times B^2 + int(s[2]) \times B + int(s[3])$
$4$	$s[0...4]$	$int(s[0]) \times B^4 + int(s[1]) \times B^3 + int(s[2]) \times B^2 + int(s[3]) \times B + int(s[4])$

Notice that we have $prefix[i] = prefix[i-1] \times B + int(s[i])$ . This will allow us to calculate the prefix array in linear time. Also, we’d need to calculate an array $Pow$ that stores the different powers of the prime $B$ .

However, we still need to find a way to calculate the hash value of any substring in constant time.

As an example, let’s say we want to calculate the value of $H(s[2...4])$ . The result should be $int(s[2]) \times B^2 + int(s[3]) \times B + int(s[4])$ .

This value is available in $prefix[4]$ , but we need to remove the terms corresponding to $s[0]$ and $s[1]$ . These two terms are available in $prefix[1]$ . However, when moving from $prefix[1]$ to $prefix[4]$ , they were multiplied $4-1=3$ times by $B$ .

To remove them, we need to multiply $prefix[1]$ by $B^3$ and then subtract it from $prefix[4]$ . Therefore, $H(s[2...4]) = prefix[4] - prefix[1] \times B^{4-1}$ .

As a rule, $H(s[L...R]) = prefix[R] - prefix[L-1] \times B^{R-L+1}$ .

Now that we have these equations ready, let’s implement the following hashing structure:

algorithm ImprovedHashStructureInit(s, B, m):
    // INPUT
    //    s = String for which we want to calculate hash values
    //    B = The base used in the polynomial rolling hash function, a prime number
    //    m = The modulo value, a large prime number, used to avoid large integers
    // OUTPUT
    //    Prepares the hash structure with precomputed prefix hash values and powers of B modulo m

    n <- length(s)
    
    Pre <- make an empty array
    Pow <- make an empty array
    // Pre will be the array of precomputed prefix hash values for the string s
    // Pow will be the array storing powers of B modulo m
    
    Pre[0] <- int(s[0])
    Pow[0] <- 1
    
    for i <- 1 to n - 1:
        Pre[i] <- (Pre[i - 1] * B + int(s[i])) % m
        Pow[i] <- (Pow[i - 1] * B) % m 

algorithm getHash(L, R, m):
    // INPUT
    //    L, R = Indices determining the substring s[L...R] to calculate the hash value for
    //    m = The modulo value for hash calculation
    // OUTPUT
    //    Returns the hash value of substring s[L...R], computed modulo m

    if L = 0:
        return Pre[R]
    return (Pre[R] - (Pre[L - 1] * Pow[R - L + 1]) % m + m) % m

The complexity of the precalculation is $O(n)$ , where $n$ is the length of the string. Also, the complexity of the hashing function is $O(1)$ .

Using this hashing function lowers the probability of collisions. Therefore, we can achieve the expected complexity of Rabin-Karp in most cases. However, the algorithm would still be slow when there are many matches, regardless of the hashing function.

In the next section, we’ll discuss a non-deterministic version of Rabin-Karp.

5. Variations of Rabin-Karp

5.1. Non-Deterministic Version

The idea behind this version is to use multiple hashing functions, each with a different modulo value $M$ and base $B$ .

If two strings have equal hash values in multiple hashing functions, the probability of them being equal is so high that we can safely assume they’re equal in non-sensitive applications.

This allows us to rely completely on these hashing functions instead of having to check the strings character by character when the hash values are equal. Therefore, the complexity of this version is $O(n+k)$ , where $n$ is the length of the text, and $k$ is the length of the pattern.

5.2. Multiple Equal-Length Patterns

Rabin-Karp algorithm can be extended to deal with multiple patterns, as long as they have the same length.

First, we calculate the hash value for each pattern and store them in a map. Next, when checking some location in the text, we can check in constant time if the substring’s hash value is available in the map or not.

The expected complexity of this version, if applied to the deterministic version, is $O(n+SumPatterns+SumMatches)$ , where $n$ is the length of the text, $SumPatterns$ is the total length of all patterns, and $SumMatches$ is the total length of all matches. On the other hand, if it’s applied to the non-deterministic version, the complexity would be $O(n+SumPatterns)$ .

6. Conclusion

In this article, we discussed the string searching problem.

First, we looked at the naive algorithm.

After that, we discussed how Rabin-Karp improves this algorithm by relying on hashing functions. Then, we explained how to choose good hashing functions.

Finally, we looked at some variations of Rabin-Karp like the non-deterministic version and applying Rabin-Karp on multiple patterns.

Core Concepts

Operating Systems

Neural Networks

Graph Theory

Latex

Full Archive

About Baeldung