String Search Algorithms for Large Texts

Azure Spring Apps is a fully managed service from Microsoft (built in collaboration with VMware), focused on building and deploying Spring Boot applications on Azure Cloud without worrying about Kubernetes.

And, the Enterprise plan comes with some interesting features, such as commercial Spring runtime support, a 99.95% SLA and some deep discounts (up to 47%) when you are ready for production.

>> Learn more and deploy your first Spring Boot app to Azure.

You can also ask questions and leave feedback on the Azure Spring Apps GitHub page.

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

The Jet Profiler was built for MySQL only, so it can do things like real-time query performance, focus on most used tables or most frequent queries, quickly identify performance issues and basically help you optimize your queries.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

Accelerate Your Jakarta EE Development with Payara Server!

With best-in-class guides and documentation, Payara essentially simplifies deployment to diverse infrastructures.

Beyond that, it provides intelligent insights and actions to optimize Jakarta EE applications.

The goal is to apply an opinionated approach to get to what's essential for mission-critical applications - really solid scalability, availability, security, and long-term support:

>> Download and Explore the Guide (to learn more)

The AI Assistant to boost Boost your productivity writing unit tests - Machinet AI.

AI is all the rage these days, but for very good reason. The highly practical coding companion, you'll get the power of AI-assisted coding and automated unit test generation.
Machinet's Unit Test AI Agent utilizes your own project context to create meaningful unit tests that intelligently aligns with the behavior of the code.
And, the AI Chat crafts code and fixes errors with ease, like a helpful sidekick.

Simplify Your Coding Journey with Machinet AI:

>> Install Machinet AI in your IntelliJ

Looking for the ideal Linux distro for running modern Spring apps in the cloud?

Meet Alpaquita Linux: lightweight, secure, and powerful enough to handle heavy workloads.

This distro is specifically designed for running Java apps. It builds upon Alpine and features significant enhancements to excel in high-density container environments while meeting enterprise-grade security standards.

Specifically, the container image size is ~30% smaller than standard options, and it consumes up to 30% less RAM:

>> Try Alpaquita Containers now.

DbSchema is a super-flexible database designer, which can take you from designing the DB with your team all the way to safely deploying the schema.

The way it does all of that is by using a design model, a database-independent image of the schema, which can be shared in a team using GIT and compared or deployed on to any database.

And, of course, it can be heavily visual, allowing you to interact with the database using diagrams, visually compose queries, explore the data, generate random data, import data or build HTML5 database reports.

>> Take a look at DBSchema

Slow MySQL query performance is all too common. Of course it is. A good way to go is, naturally, a dedicated profiler that actually understands the ins and outs of MySQL.

Critically, it has very minimal impact on your server's performance, with most of the profiling work done separately - so it needs no server changes, agents or separate services.

Basically, you install the desktop application, connect to your MySQL server, hit the record button, and you'll have results within minutes:

>> Try out the Profiler

1. Introduction

In this article, we’ll show several algorithms for searching for a pattern in a large text. We’ll describe each algorithm with provided code and simple mathematical background.

Notice that provided algorithms are not the best way to do a full-text search in more complex applications. To do full-text search properly, we can use Solr or ElasticSearch.

2. Algorithms

We’ll start with a naive text search algorithm which is the most intuitive one and helps to discover other advanced problems associated with that task.

2.1. Helper Methods

Before we start, let’s define simple methods for calculating prime numbers which we use in Rabin Karp algorithm:

public static long getBiggerPrime(int m) {
    BigInteger prime = BigInteger.probablePrime(getNumberOfBits(m) + 1, new Random());
    return prime.longValue();
}
private static int getNumberOfBits(int number) {
    return Integer.SIZE - Integer.numberOfLeadingZeros(number);
}

2.2. Simple Text Search

Name of this algorithm describes it better than any other explanation. It’s the most natural solution:

public static int simpleTextSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    int i = 0;

    while ((i + patternSize) <= textSize) {
        int j = 0;
        while (text[i + j] == pattern[j]) {
            j += 1;
            if (j >= patternSize)
                return i;
        }
        i += 1;
    }
    return -1;
}

The idea of this algorithm is straightforward: iterate through the text and if there is a match for the first letter of the pattern, check if all the letters of the pattern match the text.

If m is a number of the letters in the pattern, and n is the number of the letters in the text, time complexity of this algorithms is O(m(n-m + 1)).

Worst-case scenario occurs in the case of a String having many partial occurrences:

Text: baeldunbaeldunbaeldunbaeldun
Pattern: baeldung

2.3. Rabin Karp Algorithm

As mentioned above, Simple Text Search algorithm is very inefficient when patterns are long and when there is a lot of repeated elements of the pattern.

The idea of Rabin Karp algorithm is to use hashing to find a pattern in a text. At the beginning of the algorithm, we need to calculate a hash of the pattern which is later used in the algorithm. This process is called fingerprint calculation, and we can find a detailed explanation here.

The important thing about pre-processing step is that its time complexity is O(m) and iteration through text will take O(n) which gives time complexity of whole algorithm O(m+n).

Code of the algorithm:

public static int RabinKarpMethod(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;      

    long prime = getBiggerPrime(patternSize);

    long r = 1;
    for (int i = 0; i < patternSize - 1; i++) {
        r *= 2;
        r = r % prime;
    }

    long[] t = new long[textSize];
    t[0] = 0;

    long pfinger = 0;

    for (int j = 0; j < patternSize; j++) {
        t[0] = (2 * t[0] + text[j]) % prime;
        pfinger = (2 * pfinger + pattern[j]) % prime;
    }

    int i = 0;
    boolean passed = false;

    int diff = textSize - patternSize;
    for (i = 0; i <= diff; i++) {
        if (t[i] == pfinger) {
            passed = true;
            for (int k = 0; k < patternSize; k++) {
                if (text[i + k] != pattern[k]) {
                    passed = false;
                    break;
                }
            }

            if (passed) {
                return i;
            }
        }

        if (i < diff) {
            long value = 2 * (t[i] - r * text[i]) + text[i + patternSize];
            t[i + 1] = ((value % prime) + prime) % prime;
        }
    }
    return -1;

}

In worst-case scenario, time complexity for this algorithm is O(m(n-m+1)). However, on average this algorithm has O(n+m) time complexity.

Additionally, there is Monte Carlo version of this algorithm which is faster, but it can result in wrong matches (false positives).

2.4. Knuth-Morris-Pratt Algorithm

In the Simple Text Search algorithm, we saw how the algorithm could be slow if there are many parts of the text which match the pattern.

The idea of the Knuth-Morris-Pratt algorithm is the calculation of shift table which provides us with the information where we should search for our pattern candidates.

Java implementation of KMP algorithm:

public static int KnuthMorrisPrattSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    int i = 0, j = 0;

    int[] shift = KnuthMorrisPrattShift(pattern);

    while ((i + patternSize) <= textSize) {
        while (text[i + j] == pattern[j]) {
            j += 1;
            if (j >= patternSize)
                return i;
        }

        if (j > 0) {
            i += shift[j - 1];
            j = Math.max(j - shift[j - 1], 0);
        } else {
            i++;
            j = 0;
        }
    }
    return -1;
}

And here is how we calculate shift table:

public static int[] KnuthMorrisPrattShift(char[] pattern) {
    int patternSize = pattern.length;

    int[] shift = new int[patternSize];
    shift[0] = 1;

    int i = 1, j = 0;
    
    while ((i + j) < patternSize) {
        if (pattern[i + j] == pattern[j]) {
            shift[i + j] = i;
            j++;
        } else {
            if (j == 0)
                shift[i] = i + 1;
            
            if (j > 0) {
                i = i + shift[j - 1];
                j = Math.max(j - shift[j - 1], 0);
            } else {
                i = i + 1;
                j = 0;
            }
        }
    }
    return shift;
}

The time complexity of this algorithm is also O(m+n).

2.5. Simple Boyer-Moore Algorithm

Two scientists, Boyer and Moore, came up with another idea. Why not compare the pattern to the text from right to left instead of left to right, while keeping the shift direction the same:

public static int BoyerMooreHorspoolSimpleSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    int i = 0, j = 0;
    
    while ((i + patternSize) <= textSize) {
        j = patternSize - 1;
        while (text[i + j] == pattern[j]) {
            j--;
            if (j < 0)
                return i;
        }
        i++;
    }
    return -1;
}

As expected, this will run in O(m * n) time. But this algorithm led to the implementation of occurrence and the match heuristics which speeds up the algorithm significantly. We can find more here.

2.6. Boyer-Moore-Horspool Algorithm

There are many variations of heuristic implementation of the Boyer-Moore algorithm, and simplest one is Horspool variation.

This version of the algorithm is called Boyer-Moore-Horspool, and this variation solved the problem of negative shifts (we can read about negative shift problem in the description of the Boyer-Moore algorithm).

Like Boyer-Moore algorithm, worst-case scenario time complexity is O(m * n) while average complexity is O(n). Space usage doesn’t depend on the size of the pattern, but only on the size of the alphabet which is 256 since that is the maximum value of ASCII character in English alphabet:

public static int BoyerMooreHorspoolSearch(char[] pattern, char[] text) {

    int shift[] = new int[256];
    
    for (int k = 0; k < 256; k++) {
        shift[k] = pattern.length;
    }
    
    for (int k = 0; k < pattern.length - 1; k++){
        shift[pattern[k]] = pattern.length - 1 - k;
    }

    int i = 0, j = 0;

    while ((i + pattern.length) <= text.length) {
        j = pattern.length - 1;

        while (text[i + j] == pattern[j]) {
            j -= 1;
            if (j < 0)
                return i;
        }
        
        i = i + shift[text[i + pattern.length - 1]];
    }
    return -1;
}

4. Conclusion

In this article, we presented several algorithms for text search. Since several algorithms require stronger mathematical background, we tried to represent the main idea beneath each algorithm and provide it in a simple manner.

And, as always, the source code can be found over on GitHub.

String Search Algorithms for Large Texts with Java

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Algorithms

2.1. Helper Methods

2.2. Simple Text Search

2.3. Rabin Karp Algorithm

2.4. Knuth-Morris-Pratt Algorithm

2.5. Simple Boyer-Moore Algorithm

2.6. Boyer-Moore-Horspool Algorithm

4. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course:

REST with Spring

Learn Spring Security ▼▲

Learn Spring Security Core

Learn Spring Security OAuth

Learn Spring

Learn Spring Data JPA

Persistence

REST

Security

Full Archive

Baeldung Ebooks

About Baeldung

Write for Baeldung

Get started with Spring and Spring Boot, through the Learn Spring course:

1. Introduction

2. Algorithms

2.1. Helper Methods

2.2. Simple Text Search

2.3. Rabin Karp Algorithm

2.4. Knuth-Morris-Pratt Algorithm

2.5. Simple Boyer-Moore Algorithm

2.6. Boyer-Moore-Horspool Algorithm

4. Conclusion

Get started with Spring and Spring Boot, through the Learn Spring course: