I just announced the new Spring Boot 2 material, coming in REST With Spring:

>> CHECK OUT THE COURSE

1. Overview

The chore of searching for a pattern of characters, or a word, in a larger text string is done in various fields. In bioinformatics, for example, we may need to find a DNA snippet in a chromosome.

In the media, editors locate a particular phrase in a voluminous text. Data surveillance detects scams or spam by looking for suspicious words embedded in data.

In any context, the search is so well-known and daunting a chore that it is popularly called the “Needle in a Haystack Problem”. In this tutorial, we’ll demonstrate a simple algorithm that uses the indexOf(String str, int fromIndex) method of the Java String class to find all occurrences of a word within a string.

2. Simple Algorithm

Instead of simply counting the occurrences of a word in a larger text, our algorithm will find and identify every location where a specific word exists in the text. Our approach to the problem is short and simple so that:

  1. The search will find the word even within words in the text. Therefore, if we’re searching for the word “able” then we’ll find it in “comfortable” and “tablet”.
  2. The search will be case-insensitive.
  3. The algorithm is based on the naïve string search approach. This means that since we’re naïve about the nature of the characters in the word and the text string, we’ll use brute force to check every location of the text for an instance of the search word.

2.1. Implementation

Now that we’ve defined the parameters for our search, let’s write a simple solution:

public class WordIndexer {

    public List<Integer> findWord(String textString, String word) {
        List<Integer> indexes = new ArrayList<Integer>();
        String lowerCaseTextString = textString.toLowerCase();
        String lowerCaseWord = word.toLowerCase();

        int index = 0;
        while(index != -1){
            index = lowerCaseTextString.indexOf(lowerCaseWord, index);
            if (index != -1) {
                indexes.add(index);
                index++;
            }
        }
        return indexes;
    }
}

2.2. Testing the Solution

To test our algorithm, we’ll use a snippet of a famous passage from Shakespeare’s Hamlet and search for the word “or”, which appears five times:

@Test
public void givenWord_whenSearching_thenFindAllIndexedLocations() {
    String theString;
    WordIndexer wordIndexer = new WordIndexer();

    theString = "To be, or not to be: that is the question: "
      + "Whether 'tis nobler in the mind to suffer "
      + "The slings and arrows of outrageous fortune, "
      + "Or to take arms against a sea of troubles, "
      + "And by opposing end them? To die: to sleep; "
      + "No more; and by a sleep to say we end "
      + "The heart-ache and the thousand natural shocks "
      + "That flesh is heir to, 'tis a consummation "
      + "Devoutly to be wish'd. To die, to sleep; "
      + "To sleep: perchance to dream: ay, there's the rub: "
      + "For in that sleep of death what dreams may come,";

    List<Integer> expectedResult = Arrays.asList(7, 122, 130, 221, 438);
    List<Integer> actualResult = wordIndexer.findWord(theString, "or");
    assertEquals(expectedResult, actualResult);
}

When we run our test, we get the expected result. Searching for “or” yields five instances embedded in various ways in the text string:

index of 7, in "or"
index of 122, in "fortune"
index of 130, in "Or
index of 221, in "more"
index of 438, in "For"

In mathematical terms, the algorithm has a Big-O notation of O(m*(n-m)), where m is the length of the word and n is the length of the text string. This approach may be appropriate for haystack text strings of a few thousand characters but will be intolerably slow if there are billions of characters.

3. Improved Algorithm

The simple example above demonstrates a naïve, brute-force approach to searching for a given word in a text string. As such, it will work for any search word and any text.

If we know in advance that the search word does not contain a repetitive pattern of characters, such as “aaa”, then we can write a slightly more efficient algorithm.

In this case, we can safely avoid doing the backup to re-check every location in the text string for as a potential starting location. After we make the call to the indexOf( ) method, we’ll simply slide over to the location just after the end of the latest occurrence found. This simple tweak yields a best-case scenario of O(n).

Let’s look at this enhanced version of the earlier findWord( ) method.

public List<Integer> findWordUpgrade(String textString, String word) {
    List<Integer> indexes = new ArrayList<Integer>();
    StringBuilder output = new StringBuilder();
    String lowerCaseTextString = textString.toLowerCase();
    String lowerCaseWord = word.toLowerCase();
    int wordLength = word.length();

    int index = 0;
    while(index != -1){
        index = lowerCaseTextString.indexOf(lowerCaseWord, index + wordLength);  // Slight improvement
        if (index != -1) {
            indexes.add(index);
        }
    }
    return indexes;
}

4. Conclusion

In this tutorial, we presented a case-insensitive search algorithm to find all variations of a word in a larger text string. But don’t let that hide the fact that the Java String class’ indexOf() method is inherently case-sensitive and can distinguish between “Bob” and “bob”, for example.

Altogether, indexOf() is a convenient method for finding a character sequence buried in a text string without doing any coding for substring manipulations.

As usual, the complete codebase of this example is over on GitHub.

I just announced the new Spring Boot 2 material, coming in REST With Spring:

>> CHECK OUT THE LESSONS

Leave a Reply

avatar
  Subscribe  
Notify of