1. Overview

In computer science, we have many string search algorithms. In this article, we’ll present the KMP (Knuth-Morris-Pratt) algorithm that searches for occurrences of a word W inside a large text T.

First, we’ll explain the naive search algorithm.

Next, we’ll explain the theoretical idea behind the KMP algorithm.

Finally, we’ll look at an example to better understand how it works.

2. Naive Search Algorithm

The naive search algorithm is pretty simple. It tries to match the word W starting from every possible index inside the text T. The following algorithm represents the naive approach to search for a word W inside the text T:

Rendered by QuickLaTeX.com

Suppose the word W is aab, and the text T is aaaab. Let’s take a look at the following table that shows the steps of the naive algorithm:

Rendered by QuickLaTeX.com

As we can see, the algorithm checks all the possible locations for the word W inside the text T. However, in the second and third steps, the algorithm didn’t use the fact that the aa part of the word W has already been matched. Instead, the algorithm started matching the full word W from the beginning.

It’s easy to see that the complexity of the naive approach is O(n \cdot m), where n is the length of the text T, and m is the length of the word W.

3. KMP Algorithm

The problem with the naive approach is that when it discovers a mismatch, it moves to the next position in the text T, and starts comparing the word W from the beginning. The KMP algorithm performs some analysis on the word W before trying to find its occurrences in the text T.

3.1. Definitions

Before jumping to the analysis performed on the word W, let’s go through some basic definitions:

  1. Prefix: A prefix of a string S is any substring that starts from the beginning of S and ends at any position of S. In other words, it’s a substring that is formed by taking the characters in range [0, i], where i represents all the possible indexes of string S.
  2. Suffix: A suffix of a string S is any substring that starts from any position of S and ends at the end of S. In other words, it’s a substring that is formed by taking the characters in range [i, n-1], where i represents all the possible indexes of string S.
  3. Proper Prefix: A proper prefix is any prefix that is not equal to the full string S.
  4. Proper Suffix: A proper suffix is any suffix that is not equal to the full string S.

3.2. LPS Theoretical Idea

As we noticed, when the naive approach found a mismatch, it had to start from the beginning in the next step. To avoid this, the KMP algorithm performs some calculations on the word W first, which is to calculate the LPS array. The term LPS refers to the Longest Proper Prefix that is also a Proper Suffix.

Let’s construct an array that stores, for each index of the word W, the length of LPS (the use of this array will be explained in section 3.4). For easier use later, we will consider the word W to be zero-based indexed, and store the LPS values inside the array starting from index 1. In other words, for each index i \in [1, m], we’ll store the LPS of the range [0, i-1].

Take a look at the following table that shows the LPS array of the word aabaaba:

Rendered by QuickLaTeX.com

As we can see, the empty string has LPS equal to -1 because there’s no proper prefix to consider. Also, index 3, which corresponds to the prefix aab, has LPS equal to zero because none of its proper suffixes match any proper prefix. However, index 2, which corresponds to the prefix aa, has LPS equal to one because the string a is both a proper prefix and a proper suffix.

The same holds for index 7, which corresponds to the prefix aabaaba. It has LPS equal to 4 because the string aaba is the longest proper prefix that is also a proper suffix.

3.3. LPS Algorithm

The algorithm for calculating the LPS array might be a little tricky. Let’s take a look at its pseudocode:

Rendered by QuickLaTeX.com

First, we initialize the base index with -1. Next, we iterate over the word W. Suppose that the previous index had LPS equal to j. The current index i has two options. If the ith character matches the jth character, it will have LPS equal to j+1 (we assume W is zero-based indexed).

However, if there’s no match, we’ll need to move j back to the second-best LPS. The problem remains that we didn’t calculate the second-best LPS.

Since the last LPS was j, it means that the first j characters of W match the last j characters so far. Let’s go back to the first j characters, and examine the LPS calculated for them. The LPS stored in the jth position corresponds to the longest proper suffix, which is also a proper prefix. Therefore, this value is the second-best LPS.

After that, we compare the ith character with the jth character. If they match, we update the value of the ith index in LPS array. Otherwise, we repeat this step until we either find a match or j reaches the -1 value.

Although it looks like the algorithm has a square complexity, the time complexity is O(m), where m is the length of W. The reason behind this is that each value i will cause j to increase by one. No matter how much we move j back, we will only be consuming the increased amount. Thus, the two nested while loops have a linear complexity.

KMP algorithm depends on the LPS array. Suppose we started matching the beginning part of the word W with the text T. At some point, we noticed a mismatch. Instead of starting matching from the beginning, we’ll use the calculated LPS array.

Suppose at some step we were on the ith index inside the text T and the jth index inside the word W. This means that currently, we were able to match the prefix of length j from the word W. If we find a mismatch, we need to find the second-longest matching prefix of the word W, which is LPS[j].

If we continue to have a mismatch, we will always move j back to LPS[j] and try to match again. This step continues until we either find a match or j reaches its initial -1 value.

When we find a match, we move i to the next step. Also, we increase j because the matched part length has increased by one.

Let’s take a look at the pseudocode of the described algorithm:

Rendered by QuickLaTeX.com

The complexity of the described algorithm is O(n), where n is the length of the text T. The reason for this complexity is similar to when we calculated the LPS array. Since j is increased once for each i, it means that no matter how much we moved j back, it’s only consuming the increments done before for each i. Therefore, j will move back a total of n steps at the worst-case.

The total time complexity for the KMP algorithm is O(n+m), where n is the length of the text T, and m is the length of the word W. This is because, for each KMP search, we first calculate the LPS array, and then perform the KMP search process.

4. Example

Let’s quickly apply the KMP search algorithm to a small example. Suppose the text T equal to aaaab and the word W equal to aab. First, let’s take a look at the LPS table of the word W.

Rendered by QuickLaTeX.com

Now, take a look at the result of applying the KMP search algorithm to the given text.

Rendered by QuickLaTeX.com

As we can see, in the third and fourth steps, the algorithm used the LPS array to set j to its correct position. Therefore, the algorithm didn’t need to start matching the word W from the beginning when a mismatch was detected. This helped the algorithm to efficiently find the occurrence of the word W in the last step.

5. Conclusion

In this article, we presented the KMP search algorithm. First, we took a quick look at the naive algorithm. Next, we moved to discuss the KMP search algorithm itself. We explained the LPS array that is later used for the KMP search.

Finally, we provided an example of applying the KMP algorithm to search for a given word inside the text.

Comments are open for 30 days after publishing a post. For any issues past this date, use the Contact form on the site.