1. Overview

Non-capturing groups are important constructs within Java Regular Expressions. They create a sub-pattern that functions as a single unit but does not save the matched character sequence. In this tutorial, we’ll explore how to use non-capturing groups in Java Regular Expressions.

2. Regular Expression Groups

Regular expression groups can be one of two types: capturing and non-capturing.

Capturing groups save the matched character sequence. Their values can be used as backreferences in the pattern and/or retrieved later in code.
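To make both uses concrete, here is a minimal sketch. The pattern, class name, and method name are our own illustrations and are not part of the URL examples used later. Group 1 captures a word, the backreference \1 requires the same word again, and group(1) retrieves the captured text in code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CapturingGroupSketch {

    // Group 1 captures a lowercase word; \1 requires the same text to repeat.
    static final Pattern REPEATED_WORD = Pattern.compile("([a-z]+) \\1");

    // Returns the captured word, or null when the input is not "word word".
    static String repeatedWord(String input) {
        Matcher matcher = REPEATED_WORD.matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(repeatedWord("hello hello")); // hello
        System.out.println(repeatedWord("hello world")); // null
    }
}
```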

Although they don’t save the matched character sequence, non-capturing groups can alter pattern matching modifiers within the group. Some non-capturing groups can even discard backtracking information after a successful sub-pattern match.

Let’s explore some examples of non-capturing groups in action.

3. Non-Capturing Groups

A non-capturing group is created with the operator “(?:X)”. The “X” is the pattern for the group:

Pattern.compile("[^:]+://(?:[.a-z]+/?)+")

This pattern has a single non-capturing group. It will match a value if it is URL-like. A full regular expression for a URL would be much more involved. We’re using a simple pattern to focus on non-capturing groups.

The pattern “[^:]+://” matches the protocol, for example, “http://”. The non-capturing group “(?:[.a-z]+/?)” matches the domain name with an optional slash. Since the “+” operator matches one or more occurrences of this group, we’ll match the subsequent path segments as well. Let’s test this pattern on a URL:

Pattern simpleUrlPattern = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher
  = simpleUrlPattern.matcher("http://www.microsoft.com/some/other/url/path");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();

Let’s see what happens when we try to retrieve the text matched by the group:

Pattern simpleUrlPattern = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher = simpleUrlPattern.matcher("http://www.microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();
Assertions.assertThatThrownBy(() -> urlMatcher.group(1))
  .isInstanceOf(IndexOutOfBoundsException.class);

The regular expression is compiled into a java.util.regex.Pattern object. Then, we create a java.util.regex.Matcher to apply our Pattern to the provided value.

Next, we assert that matches() returns true.

We used a non-capturing group to match the domain name in the URL. Since non-capturing groups do not save matched text, we cannot retrieve the matched text “www.microsoft.com/”. Attempting to retrieve the domain name will result in an IndexOutOfBoundsException.
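The following sketch illustrates what remains available. The whole match can still be read with group(), and groupCount() reports zero capturing groups for our pattern. The second pattern is a hypothetical variant that wraps the protocol in a capturing group for contrast, and the class and method names are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeMatchSketch {

    // The pattern from this section: it contains no capturing groups.
    static final Pattern NON_CAPTURING
      = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");

    // Hypothetical variant: the protocol is wrapped in a capturing group.
    static final Pattern CAPTURING_PROTOCOL
      = Pattern.compile("([^:]+)://(?:[.a-z]+/?)+");

    // group() (the same as group(0)) always holds the entire match.
    static String wholeMatch(String url) {
        Matcher matcher = NON_CAPTURING.matcher(url);
        return matcher.matches() ? matcher.group() : null;
    }

    // group(1) is available only because the variant has a capturing group.
    static String protocol(String url) {
        Matcher matcher = CAPTURING_PROTOCOL.matcher(url);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.microsoft.com/";
        System.out.println(NON_CAPTURING.matcher(url).groupCount()); // 0
        System.out.println(wholeMatch(url)); // http://www.microsoft.com/
        System.out.println(protocol(url));   // http
    }
}
```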

3.1. Inline Modifiers

By default, regular expressions are case-sensitive. If we apply our pattern to a mixed-case URL, the match will fail:

Pattern simpleUrlPattern
  = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher
  = simpleUrlPattern.matcher("http://www.Microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isFalse();

If we want to match uppercase letters as well, there are a few options we could try.

One option is to add the uppercase character range to the pattern:

Pattern.compile("[^:]+://(?:[.a-zA-Z]+/?)+")

Another option is to use modifier flags. So, we can compile the regular expression to be case-insensitive:

Pattern.compile("[^:]+://(?:[.a-z]+/?)+", Pattern.CASE_INSENSITIVE)

Non-capturing groups allow for a third option: We can change the modifier flags for just the group. Let’s add the case-insensitive modifier flag (“i”) to the group:

Pattern.compile("[^:]+://(?i:[.a-z]+/?)+");

Now that we’ve made the group case-insensitive, let’s apply this pattern to a mixed-case URL:

Pattern scopedCaseInsensitiveUrlPattern
  = Pattern.compile("[^:]+://(?i:[.a-z]+/?)+");
Matcher urlMatcher
  = scopedCaseInsensitiveUrlPattern.matcher("http://www.Microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();
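Because “[^:]+” accepts any letter case anyway, the example above doesn’t show that the flag is limited to the group. As a rough sketch, we can use a hypothetical pattern with a literal “http://” prefix (our own assumption, not the article’s pattern) to make the scope visible:

```java
import java.util.regex.Pattern;

class ScopedFlagSketch {

    // Hypothetical pattern: the literal protocol stays case-sensitive,
    // while the (?i:...) group ignores case.
    static final Pattern LITERAL_PROTOCOL
      = Pattern.compile("http://(?i:[.a-z]+/?)+");

    static boolean matches(String url) {
        return LITERAL_PROTOCOL.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Uppercase letters inside the group are accepted.
        System.out.println(matches("http://WWW.Example.COM/")); // true
        // Uppercase letters outside the group are rejected.
        System.out.println(matches("HTTP://www.example.com/")); // false
    }
}
```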

When a pattern is compiled to be case-insensitive, we can turn case-insensitivity off for a single group by adding the “-” operator in front of the modifier. Let’s apply this pattern to another mixed-case URL:

Pattern scopedCaseSensitiveUrlPattern
  = Pattern.compile("[^:]+://(?-i:[.a-z]+/?)+/ending-path", Pattern.CASE_INSENSITIVE);
Matcher urlMatcher
  = scopedCaseSensitiveUrlPattern.matcher("http://www.Microsoft.com/ending-path");
  
Assertions.assertThat(urlMatcher.matches()).isFalse();

In this example, the final path segment “/ending-path” sits outside the group, so it remains case-insensitive and will match both uppercase and lowercase characters.

When we turned off the case-insensitive option within the group, the non-capturing group only supported lowercase characters. Therefore, the mixed-case domain name did not match.
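To check both halves of this explanation, the sketch below reuses the same pattern with two inputs. The second input, with an uppercase final segment, is our own addition, and the class and method names are illustrative:

```java
import java.util.regex.Pattern;

class NegatedFlagSketch {

    // Same pattern as above: case-insensitive overall, but the group
    // switches case-insensitivity off with (?-i:...).
    static final Pattern SCOPED_CASE_SENSITIVE = Pattern.compile(
      "[^:]+://(?-i:[.a-z]+/?)+/ending-path", Pattern.CASE_INSENSITIVE);

    static boolean matches(String url) {
        return SCOPED_CASE_SENSITIVE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Uppercase inside the group (the domain name) fails.
        System.out.println(matches("http://www.Microsoft.com/ending-path")); // false
        // Uppercase outside the group (the final segment) still matches.
        System.out.println(matches("http://www.microsoft.com/ENDING-PATH")); // true
    }
}
```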

4. Independent Non-Capturing Groups

Independent non-capturing groups, called atomic groups in some other regular expression flavors, are a type of regular expression group. These groups discard backtracking information after finding a successful match. When using this type of group, we need to be aware of when backtracking can occur. Otherwise, our patterns may not match the values we think they should.

Backtracking is a feature of Nondeterministic Finite Automaton (NFA) regular expression engines. When the engine fails to match text, the NFA engine can explore alternatives in the pattern. The engine will fail the match after exhausting all available alternatives. We only cover backtracking as it relates to independent non-capturing groups.

An independent non-capturing group is created with the operator “(?>X)” where X is the sub-pattern:

Pattern.compile("[^:]+://(?>[.a-z]+/?)+/ending-path");

We have added “/ending-path” as a constant path segment. Having this additional requirement forces a backtracking situation. The group, which matches the domain name and other path segments, can also consume the slash character. To match “/ending-path”, the engine will need to backtrack. By backtracking, the engine can remove the slash from the group and apply it to the “/ending-path” portion of the pattern.

Let’s apply our independent non-capturing group pattern to a URL:

Pattern independentUrlPattern
  = Pattern.compile("[^:]+://(?>[.a-z]+/?)+/ending-path");
Matcher independentMatcher
  = independentUrlPattern.matcher("http://www.microsoft.com/ending-path");
    
Assertions.assertThat(independentMatcher.matches()).isFalse();

The group matches the domain name and the slash successfully. So, we leave the scope of the independent non-capturing group.

This pattern requires a slash to appear before “ending-path”. However, our independent non-capturing group has matched the slash.

Normally, the NFA engine would backtrack at this point. Since the slash is optional at the end of the group, the engine would remove the slash from the group and try again. However, the independent non-capturing group has discarded the backtracking information, so the NFA engine cannot backtrack.
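The same effect is easier to see on a smaller input. The following sketch uses two hypothetical patterns of our own (not from the URL example) that differ only in the group type:

```java
import java.util.regex.Pattern;

class AtomicGroupSketch {

    // Regular non-capturing group: a+ can give back one "a" when needed.
    static boolean regularGroup(String input) {
        return Pattern.matches("(?:a+)ab", input);
    }

    // Independent group: once a+ has matched, its choices are discarded.
    static boolean independentGroup(String input) {
        return Pattern.matches("(?>a+)ab", input);
    }

    public static void main(String[] args) {
        // a+ first takes "aaa", then backtracks to "aa" so "ab" can match.
        System.out.println(regularGroup("aaab"));     // true
        // a+ takes "aaa" and cannot give anything back, so "ab" fails.
        System.out.println(independentGroup("aaab")); // false
    }
}
```

With the independent group, “a+” always consumes every available “a”, so the literal “a” that follows can never match.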

4.1. Backtracking Inside the Group

Backtracking can occur within an independent non-capturing group. While the NFA engine is matching the group, the backtracking information has not been discarded. The backtracking information is not discarded until after the group matches successfully:

Pattern independentUrlPatternWithBacktracking
  = Pattern.compile("[^:]+://(?>(?:[.a-z]+/?)+/)ending-path");
Matcher independentMatcher
  = independentUrlPatternWithBacktracking.matcher("http://www.microsoft.com/ending-path");
    
Assertions.assertThat(independentMatcher.matches()).isTrue();

Now we have a non-capturing group within an independent non-capturing group. We still have a backtracking situation involving the slash in front of “ending-path”. However, we have enclosed the backtracking portion of the pattern, including that slash, inside the independent non-capturing group. The backtracking therefore occurs before the group has finished matching, while the NFA engine still has the information it needs, and the pattern matches the provided URL.
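As a side-by-side sketch, we can run three variants of the pattern against the same URL. The first variant, a plain non-capturing group followed by “/ending-path” with no flags, is our own addition for comparison. The class and method names are illustrative:

```java
import java.util.regex.Pattern;

class GroupComparisonSketch {

    static final String URL = "http://www.microsoft.com/ending-path";

    static boolean matchesUrl(String regex) {
        return Pattern.matches(regex, URL);
    }

    public static void main(String[] args) {
        // Plain non-capturing group: the engine may give the slash back.
        System.out.println(
          matchesUrl("[^:]+://(?:[.a-z]+/?)+/ending-path"));     // true

        // Independent group: the slash stays inside the group.
        System.out.println(
          matchesUrl("[^:]+://(?>[.a-z]+/?)+/ending-path"));     // false

        // Independent group that also contains the required slash:
        // backtracking happens before the group completes.
        System.out.println(
          matchesUrl("[^:]+://(?>(?:[.a-z]+/?)+/)ending-path")); // true
    }
}
```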

5. Conclusion

We’ve shown that non-capturing groups are different from capturing groups. However, they function as a single unit like their capturing counterparts. We have also shown that non-capturing groups can enable or disable the modifiers for the group instead of the pattern as a whole.

Similarly, we’ve shown how independent non-capturing groups discard backtracking information. Without this information, the NFA engine cannot explore alternatives to make a successful match. However, backtracking can occur within the group.

As always, the source code is available over on GitHub.
