In computer science, we have many string search algorithms. In this article, we’ll present the KMP (Knuth-Morris-Pratt) algorithm that searches for occurrences of a word inside a large text .
First, we’ll explain the naive search algorithm.
Next, we’ll explain the theoretical idea behind the KMP algorithm.
Finally, we’ll look at an example to better understand how it works.
2. Naive Search Algorithm
The naive search algorithm is pretty simple. It tries to match the word starting from every possible index inside the text . The following algorithm represents the naive approach to search for a word inside the text :
Suppose the word is , and the text is . Let’s take a look at the following table that shows the steps of the naive algorithm:
As we can see, the algorithm checks all the possible locations for the word inside the text . However, in the second and third steps, the algorithm didn’t use the fact that the part of the word has already been matched. Instead, the algorithm started matching the full word from the beginning.
It’s easy to see that the complexity of the naive approach is , where is the length of the text , and is the length of the word .
3. KMP Algorithm
The problem with the naive approach is that when it discovers a mismatch, it moves to the next position in the text , and starts comparing the word from the beginning. The KMP algorithm performs some analysis on the word before trying to find its occurrences in the text .
Before jumping to the analysis performed on the word , let’s go through some basic definitions:
- Prefix: A prefix of a string is any substring that starts from the beginning of and ends at any position of . In other words, it’s a substring that is formed by taking the characters in range , where represents all the possible indexes of string .
- Suffix: A suffix of a string is any substring that starts from any position of and ends at the end of . In other words, it’s a substring that is formed by taking the characters in range , where represents all the possible indexes of string .
- Proper Prefix: A proper prefix is any prefix that is not equal to the full string .
- Proper Suffix: A proper suffix is any suffix that is not equal to the full string .
3.2. LPS Theoretical Idea
As we noticed, when the naive approach found a mismatch, it had to start from the beginning in the next step. To avoid this, the KMP algorithm performs some calculations on the word first, which is to calculate the LPS array. The term LPS refers to the Longest Proper Prefix that is also a Proper Suffix.
Let’s construct an array that stores, for each index of the word , the length of LPS (the use of this array will be explained in section 3.4). For easier use later, we will consider the word to be zero-based indexed, and store the LPS values inside the array starting from index 1. In other words, for each index , we’ll store the LPS of the range .
Take a look at the following table that shows the LPS array of the word :
As we can see, the empty string has LPS equal to -1 because there’s no proper prefix to consider. Also, index 3, which corresponds to the prefix , has LPS equal to zero because none of its proper suffixes match any proper prefix. However, index 2, which corresponds to the prefix , has LPS equal to one because the string is both a proper prefix and a proper suffix.
The same holds for index 7, which corresponds to the prefix . It has LPS equal to 4 because the string is the longest proper prefix that is also a proper suffix.
3.3. LPS Algorithm
The algorithm for calculating the LPS array might be a little tricky. Let’s take a look at its pseudocode:
First, we initialize the base index with -1. Next, we iterate over the word . Suppose that the previous index had LPS equal to . The current index has two options. If the ith character matches the jth character, it will have LPS equal to (we assume is zero-based indexed).
However, if there’s no match, we’ll need to move back to the second-best LPS. The problem remains that we didn’t calculate the second-best LPS.
Since the last LPS was , it means that the first characters of match the last characters so far. Let’s go back to the first characters, and examine the LPS calculated for them. The LPS stored in the jth position corresponds to the longest proper suffix, which is also a proper prefix. Therefore, this value is the second-best LPS.
After that, we compare the ith character with the jth character. If they match, we update the value of the ith index in LPS array. Otherwise, we repeat this step until we either find a match or reaches the -1 value.
Although it looks like the algorithm has a square complexity, the time complexity is , where is the length of . The reason behind this is that each value will cause to increase by one. No matter how much we move back, we will only be consuming the increased amount. Thus, the two nested while loops have a linear complexity.
3.4. KMP Search
KMP algorithm depends on the LPS array. Suppose we started matching the beginning part of the word with the text . At some point, we noticed a mismatch. Instead of starting matching from the beginning, we’ll use the calculated LPS array.
Suppose at some step we were on the ith index inside the text and the jth index inside the word . This means that currently, we were able to match the prefix of length from the word . If we find a mismatch, we need to find the second-longest matching prefix of the word , which is .
If we continue to have a mismatch, we will always move back to and try to match again. This step continues until we either find a match or reaches its initial -1 value.
When we find a match, we move to the next step. Also, we increase because the matched part length has increased by one.
Let’s take a look at the pseudocode of the described algorithm:
The complexity of the described algorithm is , where is the length of the text . The reason for this complexity is similar to when we calculated the array. Since is increased once for each , it means that no matter how much we moved back, it’s only consuming the increments done before for each . Therefore, will move back a total of steps at the worst-case.
The total time complexity for the KMP algorithm is , where is the length of the text , and is the length of the word . This is because, for each KMP search, we first calculate the LPS array, and then perform the KMP search process.
Let’s quickly apply the KMP search algorithm to a small example. Suppose the text equal to and the word equal to . First, let’s take a look at the LPS table of the word .
Now, take a look at the result of applying the KMP search algorithm to the given text.
As we can see, in the third and fourth steps, the algorithm used the array to set to its correct position. Therefore, the algorithm didn’t need to start matching the word from the beginning when a mismatch was detected. This helped the algorithm to efficiently find the occurrence of the word in the last step.
In this article, we presented the KMP search algorithm. First, we took a quick look at the naive algorithm. Next, we moved to discuss the KMP search algorithm itself. We explained the LPS array that is later used for the KMP search.
Finally, we provided an example of applying the KMP algorithm to search for a given word inside the text.