1. Overview

Strings commonly contain a mixture of words and delimiters. Sometimes, a string separates its words only by a change in case, without whitespace. For example, camel case capitalizes each word after the first, and title case (or Pascal case) capitalizes every word.

We may wish to parse these strings back into words in order to process them.

In this short tutorial, we'll look at how to find the words in mixed case strings using regular expressions, and how to convert them into sentences or titles.

2. Use Cases for Parsing Capitalized Strings

A common use case for processing camel case strings might be the field names in a document. Let's say a document has a field “firstName” – we may wish to display that on-screen as “First name” or “First Name”.

Similarly, if we were to scan the types or functions in our application via reflection, in order to produce reports using their names, we would commonly find camel case or title case identifiers that we may wish to convert.

An extra problem we need to solve when parsing these identifiers is that single-letter words cause consecutive capital letters.

For clarity:

  • thisIsAnExampleOfCamelCase
  • ThisIsTitleCase
  • thisHasASingleLetterWord

Now that we know the sorts of identifiers we need to parse, let's use a regular expression to find the words.

3. Find Words Using Regular Expressions

3.1. Defining a Regular Expression to Find Words

Let's define a regular expression to locate words that are either made of lowercase letters only, a single uppercase letter followed by lowercase letters, or a single uppercase letter on its own:

private static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

This expression provides the regular expression engine with two options. The first uses “[A-Z]?” to mean “an optional first capital letter” and then “[a-z]+” to mean “one or more lowercase letters”. After that, there's the “|” character to provide “or” logic, followed by the expression “[A-Z]”, which means “a single capital letter”. Note that these character classes only cover the ASCII letters A to Z.
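To make the two options concrete, here is a minimal sketch that reports which branch of the expression produced each match. The class name WordFinderGroupsDemo and the describe method are hypothetical names used only for illustration. It relies on the fact that capturing group 2 is the “optional capital plus lowercase letters” branch and group 3 is the “single capital letter” branch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code base.
class WordFinderGroupsDemo {
    // Same expression as above: group 2 = "[A-Z]?[a-z]+", group 3 = "[A-Z]".
    static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    // Returns each match followed by the branch that matched it.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = WORD_FINDER.matcher(text);
        while (matcher.find()) {
            // group(2) is null whenever the second alternative matched instead.
            String branch = matcher.group(2) != null ? "word" : "single";
            out.append(matcher.group()).append(':').append(branch).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // prints: this:word Has:word A:single Single:word Letter:word Word:word
        System.out.println(describe("thisHasASingleLetterWord"));
    }
}
```

The engine tries the first branch at every position and only falls back to the single-capital branch when no lowercase letter follows, which is why “A” is the only match labeled “single” here.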

Now that we have the regular expression, let's parse our strings.

3.2. Finding Words in a String

We'll define a method to use this regular expression:

public List<String> findWordsInMixedCase(String text) {
    Matcher matcher = WORD_FINDER.matcher(text);
    List<String> words = new ArrayList<>();
    while (matcher.find()) {
        words.add(matcher.group(0));
    }
    return words;
}

This uses the Matcher created by the regular expression's Pattern to help us find the words. We iterate over the matcher while it still has matches, adding them to our list.
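As an alternative sketch, and assuming Java 9 or later is available, the same loop can be written with Matcher.results(), which streams the successive matches. The class name StreamWordFinder is hypothetical. This is one possible variation rather than the approach used in the rest of this tutorial.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical variant of the word finder; requires Java 9+ for Matcher.results().
class StreamWordFinder {
    static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    static List<String> findWordsInMixedCase(String text) {
        return WORD_FINDER.matcher(text)
          .results()                 // one MatchResult per successful find()
          .map(MatchResult::group)   // the full text of each match
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints: [this, Is, Camel, Case, Text]
        System.out.println(findWordsInMixedCase("thisIsCamelCaseText"));
    }
}
```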

This should extract anything that meets our word definition. Let's test it.

3.3. Testing the Word Finder

Our word finder should be able to find words that are separated by any non-word characters, as well as by changes in the case. Let's start with a simple example:

assertThat(findWordsInMixedCase("some words"))
  .containsExactly("some", "words");

This test passes and shows us that our algorithm is working. Next, we'll try camel case:

assertThat(findWordsInMixedCase("thisIsCamelCaseText"))
  .containsExactly("this", "Is", "Camel", "Case", "Text");

Here we see that the words are extracted from a camel case String and come out with their capitalization unchanged. For example, “Is” started with a capital letter in the original text, and is capitalized when extracted.

We can also try this with title case:

assertThat(findWordsInMixedCase("ThisIsTitleCaseText"))
  .containsExactly("This", "Is", "Title", "Case", "Text");

Plus, we can check that single-letter words are extracted as we intended:

assertThat(findWordsInMixedCase("thisHasASingleLetterWord"))
  .containsExactly("this", "Has", "A", "Single", "Letter", "Word");
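The tests above only use whitespace and case changes as separators. As a small sketch of the claim that any non-word characters act as separators, the following self-contained class (NonWordSeparatorsDemo is a hypothetical name) runs the same finder over text containing punctuation and a digit. Because the expression only matches the letters A to Z in either case, everything else is skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class repeating the finder from section 3.2.
class NonWordSeparatorsDemo {
    static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    static List<String> findWordsInMixedCase(String text) {
        Matcher matcher = WORD_FINDER.matcher(text);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group(0));
        }
        return words;
    }

    public static void main(String[] args) {
        // Underscore, comma, space and exclamation mark are all skipped.
        // prints: [first, name, last, Name]
        System.out.println(findWordsInMixedCase("first_name, lastName!"));
        // The digit is not a letter, so it is dropped rather than kept as a word.
        // prints: [version, Update]
        System.out.println(findWordsInMixedCase("version2Update"));
    }
}
```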

So far, we've built a word extractor, but these words are capitalized in a way that may not be ideal for output.

4. Convert Word List to Human Readable Format

After extracting a list of words, we probably want to use methods like toUpperCase or toLowerCase to normalize them. Then we can use String.join to join them back into a single string with a delimiter. Let's look at a couple of ways to achieve real-world use cases with these.
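As a minimal sketch of that idea, the hypothetical helper below lowercases every word and joins the result with a caller-supplied delimiter. The names JoinWordsDemo and joinLowercase are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing normalization followed by String.join.
class JoinWordsDemo {
    static String joinLowercase(List<String> words, String delimiter) {
        List<String> normalized = new ArrayList<>();
        for (String word : words) {
            normalized.add(word.toLowerCase());
        }
        // String.join places the delimiter between elements, not after the last one.
        return String.join(delimiter, normalized);
    }

    public static void main(String[] args) {
        // prints: first name
        System.out.println(joinLowercase(Arrays.asList("first", "Name"), " "));
    }
}
```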

4.1. Convert to Sentence

Sentences start with a capital letter and end in a period (“.”). We're going to need to be able to make a word start with a capital letter:

private String capitalizeFirst(String word) {
    return word.substring(0, 1).toUpperCase()
      + word.substring(1).toLowerCase();
}

Then we can loop through the words, capitalizing the first, and making the others lowercase:

public String sentenceCase(List<String> words) {
    List<String> capitalized = new ArrayList<>();
    for (int i = 0; i < words.size(); i++) {
        String currentWord = words.get(i);
        if (i == 0) {
            capitalized.add(capitalizeFirst(currentWord));
        } else {
            capitalized.add(currentWord.toLowerCase());
        }
    }
    return String.join(" ", capitalized) + ".";
}

The logic here is that the first word has its first character capitalized, and the rest are in lowercase. We join them with a space as the delimiter and add a period at the end.

Let's test this out:

assertThat(sentenceCase(Arrays.asList("these", "Words", "Form", "A", "Sentence")))
  .isEqualTo("These words form a sentence.");
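To connect this back to the field-name use case from section 2, here is a sketch that chains the word finder and the sentence converter. The class FieldNameToSentence and the toSentence method are hypothetical. The other methods repeat the code shown earlier so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical end-to-end example combining sections 3 and 4.1.
class FieldNameToSentence {
    static final Pattern WORD_FINDER = Pattern.compile("(([A-Z]?[a-z]+)|([A-Z]))");

    static List<String> findWordsInMixedCase(String text) {
        Matcher matcher = WORD_FINDER.matcher(text);
        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group(0));
        }
        return words;
    }

    static String capitalizeFirst(String word) {
        return word.substring(0, 1).toUpperCase()
          + word.substring(1).toLowerCase();
    }

    static String sentenceCase(List<String> words) {
        List<String> capitalized = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String currentWord = words.get(i);
            if (i == 0) {
                capitalized.add(capitalizeFirst(currentWord));
            } else {
                capitalized.add(currentWord.toLowerCase());
            }
        }
        return String.join(" ", capitalized) + ".";
    }

    // Hypothetical convenience method: identifier in, sentence out.
    static String toSentence(String identifier) {
        return sentenceCase(findWordsInMixedCase(identifier));
    }

    public static void main(String[] args) {
        // prints: First name.
        System.out.println(toSentence("firstName"));
    }
}
```

Note that sentenceCase always appends a period, so a display label such as “First name” would need a variant without it.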

4.2. Convert to Title Case

Title case has slightly more complex rules than a sentence. Each word must have a capital letter, unless it's a special stop word that isn't normally capitalized. However, the whole title must start with a capital letter.

We can achieve this by defining our stop words:

Set<String> STOP_WORDS = Stream.of("a", "an", "the", "and", 
  "but", "for", "at", "by", "to", "or")
  .collect(Collectors.toSet());

After this, we can modify the if statement in our loop to capitalize any word that's not a stop word, as well as the first. As before, the remaining words are converted to lowercase:

if (i == 0 || 
  !STOP_WORDS.contains(currentWord.toLowerCase())) {
    capitalized.add(capitalizeFirst(currentWord));
} else {
    capitalized.add(currentWord.toLowerCase());
}

The algorithm to combine the words is the same, though we don't add the period at the end.
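The full method is not listed in this section, so the following is only a sketch of how the pieces might fit together. It assumes the method is called capitalizeMyTitle, the name used by the tests in this section, and that stop words other than the first word are lowercased. The enclosing class TitleCaseDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical container class for the title-case sketch.
class TitleCaseDemo {
    static final Set<String> STOP_WORDS = Stream.of("a", "an", "the", "and",
      "but", "for", "at", "by", "to", "or")
      .collect(Collectors.toSet());

    static String capitalizeFirst(String word) {
        return word.substring(0, 1).toUpperCase()
          + word.substring(1).toLowerCase();
    }

    static String capitalizeMyTitle(List<String> words) {
        List<String> capitalized = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            String currentWord = words.get(i);
            if (i == 0 || !STOP_WORDS.contains(currentWord.toLowerCase())) {
                capitalized.add(capitalizeFirst(currentWord));
            } else {
                // Assumption: stop words after the first position stay lowercase.
                capitalized.add(currentWord.toLowerCase());
            }
        }
        // Same join as sentenceCase, but without the trailing period.
        return String.join(" ", capitalized);
    }

    public static void main(String[] args) {
        // prints: War and Peace
        System.out.println(capitalizeMyTitle(Arrays.asList("war", "and", "peace")));
    }
}
```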

Let's test it out, assuming we've wrapped this logic in a method named capitalizeMyTitle:

assertThat(capitalizeMyTitle(Arrays.asList("title", "words", "capitalize")))
  .isEqualTo("Title Words Capitalize");

assertThat(capitalizeMyTitle(Arrays.asList("a", "stop", "word", "first")))
  .isEqualTo("A Stop Word First");

5. Conclusion

In this short article, we looked at how to find the words in a String using a regular expression. We saw how to define this regular expression to find different words using capitalization as a word boundary.

We also looked at some simple algorithms for taking a list of words and converting them into the correct capitalization for a sentence or a title.

As always, the example code can be found over on GitHub.
