Knuth-Morris-Pratt Algorithm | Baeldung on Computer Science

1. Overview

In computer science, we have many string search algorithms. In this article, we’ll present the KMP (Knuth-Morris-Pratt) algorithm that searches for occurrences of a word $W$ inside a large text $T$ .

First, we’ll explain the naive search algorithm.

Next, we’ll explain the theoretical idea behind the KMP algorithm.

Finally, we’ll look at an example to better understand how it works.

2. Naive Search Algorithm

The naive search algorithm is pretty simple. It tries to match the word $W$ starting from every possible index inside the text $T$ . The following algorithm represents the naive approach to search for a word $W$ inside the text $T$ :

algorithm NaiveSearch(W, T, m, n):
    // INPUT
    //    W = The word we are searching for
    //    T = The full text
    //    m = Length of W
    //    n = Length of T
    // OUTPUT
    //    Returns true if W is found inside T, or false otherwise

    for i <- 0 to n - m:
        ok <- true
        for j <- 0 to m - 1:
            if W[j] != T[i + j]:
                ok <- false
                break
        if ok = true:
            return true
    return false

Suppose the word $W$ is $aab$ , and the text $T$ is $aaaab$ . Let’s take a look at the following table that shows the steps of the naive algorithm:

Step	Comparison					Result
i=0	a	a	a	a	b	mismatch
i=0	a	a	b			mismatch
i=1	a	a	a	a	b	mismatch
i=1		a	a	b		mismatch
i=2	a	a	a	a	b	match
i=2			a	a	b	match

As we can see, the algorithm checks all the possible locations for the word $W$ inside the text $T$ . However, in the second and third steps, the algorithm didn’t use the fact that the $aa$ part of the word $W$ has already been matched. Instead, the algorithm started matching the full word $W$ from the beginning.

It’s easy to see that the complexity of the naive approach is $O(n \cdot m)$ , where $n$ is the length of the text $T$ , and $m$ is the length of the word $W$ .

3. KMP Algorithm

The problem with the naive approach is that when it discovers a mismatch, it moves to the next position in the text $T$ , and starts comparing the word $W$ from the beginning. The KMP algorithm performs some analysis on the word $W$ before trying to find its occurrences in the text $T$ .

3.1. Definitions

Before jumping to the analysis performed on the word $W$ , let’s go through some basic definitions:

Prefix: A prefix of a string $S$ is any substring that starts from the beginning of $S$ and ends at any position of $S$ . In other words, it’s a substring that is formed by taking the characters in range $[0, i]$ , where $i$ represents all the possible indexes of string $S$ .
Suffix: A suffix of a string $S$ is any substring that starts from any position of $S$ and ends at the end of $S$ . In other words, it’s a substring that is formed by taking the characters in range $[i, n-1]$ , where $i$ represents all the possible indexes of string $S$ .
Proper Prefix: A proper prefix is any prefix that is not equal to the full string $S$ .
Proper Suffix: A proper suffix is any suffix that is not equal to the full string $S$ .

3.2. LPS Theoretical Idea

As we noticed, when the naive approach found a mismatch, it had to start from the beginning in the next step. To avoid this, the KMP algorithm performs some calculations on the word $W$ first, which is to calculate the LPS array. The term LPS refers to the Longest Proper Prefix that is also a Proper Suffix.

Let’s construct an array that stores, for each index of the word $W$ , the length of LPS (the use of this array will be explained in section 3.4). For easier use later, we will consider the word $W$ to be zero-based indexed, and store the LPS values inside the array starting from index 1. In other words, for each index $i \in [1, m]$ , we’ll store the LPS of the range $[0, i-1]$ .

Take a look at the following table that shows the LPS array of the word $aabaaba$ :

index	0	1	2	3	4	5	6	7
W		a	a	b	a	a	b	a
LPS	-1	0	1	0	1	2	3	4

As we can see, the empty string has LPS equal to -1 because there’s no proper prefix to consider. Also, index 3, which corresponds to the prefix $aab$ , has LPS equal to zero because none of its proper suffixes match any proper prefix. However, index 2, which corresponds to the prefix $aa$ , has LPS equal to one because the string $a$ is both a proper prefix and a proper suffix.

The same holds for index 7, which corresponds to the prefix $aabaaba$ . It has LPS equal to 4 because the string $aaba$ is the longest proper prefix that is also a proper suffix.

3.3. LPS Algorithm

The algorithm for calculating the LPS array might be a little tricky. Let’s take a look at its pseudocode:

algorithm CalculateLPS(W, m):
    // INPUT
    //    W = The word to calculate LPS array for
    //    m = Length of W
    // OUTPUT
    //    Returns the LPS array

    i <- 0
    j <- -1
    LSP <- an empty LSP array
    LPS[0] <- -1
    while i < m:
        while j >= 0 and W[i] != W[j]:
            j <- LPS[j]
        i <- i + 1
        j <- j + 1
        LPS[i] <- j
    return LPS

First, we initialize the base index with -1. Next, we iterate over the word $W$ . Suppose that the previous index had LPS equal to $j$ . The current index $i$ has two options. If the i^th character matches the j^th character, it will have LPS equal to $j+1$ (we assume $W$ is zero-based indexed).

However, if there’s no match, we’ll need to move $j$ back to the second-best LPS. The problem remains that we didn’t calculate the second-best LPS.

Since the last LPS was $j$ , it means that the first $j$ characters of $W$ match the last $j$ characters so far. Let’s go back to the first $j$ characters, and examine the LPS calculated for them. The LPS stored in the j^th position corresponds to the longest proper suffix, which is also a proper prefix. Therefore, this value is the second-best LPS.

After that, we compare the i^th character with the j^th character. If they match, we update the value of the i^th index in LPS array. Otherwise, we repeat this step until we either find a match or $j$ reaches the -1 value.

Although it looks like the algorithm has a square complexity, the time complexity is $O(m)$ , where $m$ is the length of $W$ . The reason behind this is that each value $i$ will cause $j$ to increase by one. No matter how much we move $j$ back, we will only be consuming the increased amount. Thus, the two nested while loops have a linear complexity.

3.4. KMP Search

KMP algorithm depends on the LPS array. Suppose we started matching the beginning part of the word $W$ with the text $T$ . At some point, we noticed a mismatch. Instead of starting matching from the beginning, we’ll use the calculated LPS array.

Suppose at some step we were on the i^th index inside the text $T$ and the j^th index inside the word $W$ . This means that currently, we were able to match the prefix of length $j$ from the word $W$ . If we find a mismatch, we need to find the second-longest matching prefix of the word $W$ , which is $LPS[j]$ .

If we continue to have a mismatch, we will always move $j$ back to $LPS[j]$ and try to match again. This step continues until we either find a match or $j$ reaches its initial -1 value.

When we find a match, we move $i$ to the next step. Also, we increase $j$ because the matched part length has increased by one.

Let’s take a look at the pseudocode of the described algorithm:

algorithm KMPSearch(W, T, m, n, LPS):
    // INPUT
    //    W = The word we are searching for
    //    T = The full text
    //    m = Length of W
    //    n = Length of T
    //    LPS = LPS array for word W
    // OUTPUT
    //    Returns true if W is found inside T, or false otherwise

    i <- 0
    j <- 0
    while i < n:
        while j >= 0 and T[i] != W[j]:
            j <- LPS[j]
        i <- i + 1
        j <- j + 1
        if j = m:
            return true
    return false

The complexity of the described algorithm is $O(n)$ , where $n$ is the length of the text $T$ . The reason for this complexity is similar to when we calculated the $LPS$ array. Since $j$ is increased once for each $i$ , it means that no matter how much we moved $j$ back, it’s only consuming the increments done before for each $i$ . Therefore, $j$ will move back a total of $n$ steps at the worst-case.

The total time complexity for the KMP algorithm is $O(n+m)$ , where $n$ is the length of the text $T$ , and $m$ is the length of the word $W$ . This is because, for each KMP search, we first calculate the LPS array, and then perform the KMP search process.

4. Example

Let’s quickly apply the KMP search algorithm to a small example. Suppose the text $T$ equal to $aaaab$ and the word $W$ equal to $aab$ . First, let’s take a look at the LPS table of the word $W$ .

W		a	a	b
LPS	-1	0	1	0

Now, take a look at the result of applying the KMP search algorithm to the given text.

Step	Value of j	Match					Longest Match
Step	Value of j	a	a	a	a	b	Longest Match
i = 0	j = 0	a					j = 1
i = 1	j = 1	a	a				j = 2
i = 2	Mismatch! Move j one step: j = 1		a	a			j = 2
i = 3	Mismatch! Move j one step: j = 1			a	a		j = 2
i = 4	j = 2			a	a	b	j = 3

As we can see, in the third and fourth steps, the algorithm used the $LPS$ array to set $j$ to its correct position. Therefore, the algorithm didn’t need to start matching the word $W$ from the beginning when a mismatch was detected. This helped the algorithm to efficiently find the occurrence of the word $W$ in the last step.

5. Conclusion

In this article, we presented the KMP search algorithm. First, we took a quick look at the naive algorithm. Next, we moved to discuss the KMP search algorithm itself. We explained the LPS array that is later used for the KMP search.

Finally, we provided an example of applying the KMP algorithm to search for a given word inside the text.

Learn Java Collections

Learn Spring

Learn Maven

View All Courses

Core Concepts

Operating Systems

Neural Networks

Graph Theory

Latex

Full Archive

About Baeldung