When working with text data in Java, it’s often necessary to extract specific pieces of information using regular expressions, also known as Regex. However, simply matching the Regex pattern isn’t always enough. Sometimes, we need to extract the text that follows the Regex match.
In this tutorial, we’ll explore how to achieve this in Java.
2. Introduction to the Problem
First, let’s understand the problem quickly through an example. Let’s say we have a string variable INPUT1:
static String INPUT1 = "Some text, targetValue=Regex is cool";
Taking INPUT1 as the input, our target is to get the text after “targetValue=“, which is “Regex is cool“.
Therefore, in this example, if we write a Regex pattern to match “targetValue=“, we must extract everything after the match. However, the problem could have a variant. So, let’s see another input variable:
static String INPUT2 = "Some text. targetValue=Java is cool. some other text";
As shown in the INPUT2 example above, we still have “targetValue=” in the input text. However, we don’t want to obtain everything after the match this time. Instead, we want to extract “Java is cool” from the text after the match. In other words, we need the text after the match until the first period. In practice, the terminating delimiter isn’t necessarily a period; it could be any pattern.
Next, we’ll explore different approaches to solving the problem. Of course, we’ll cover both INPUT1 and INPUT2 cases.
We’ll use unit test assertions to verify whether a solution extracts the expected result. Also, for simplicity, we’ll skip the input validation part, such as examining whether the input string contains the Regex pattern.
So now, let’s see them in action.
3. Using the split() Method
The standard split() method allows us to split one string by a delimiter into multiple strings as an array. Moreover, the delimiter can be a Regex pattern.
So, to solve the INPUT1 problem, we can simply use “targetValue=” as the pattern to split the input string. Then, the second element in the result array will be the result:
"Some text, targetValue=Regex is cool" ---split by "targetValue="--> [ "Some text, ", "Regex is cool" ]
Now, let’s implement this idea and check if it works:
// split() returns an array; the text after the delimiter is the second element
String result1 = INPUT1.split("targetValue=")[1];
assertEquals("Regex is cool", result1);
The test passes if we give it a run. Therefore, “split and take” solves the INPUT1 problem.
In the test above, we access the array element directly without checking the length first. This is because we assume that our inputs are valid for simplicity, as we’ve mentioned earlier. However, in a real project, it’s good practice to check the length before accessing the array element to avoid an ArrayIndexOutOfBoundsException.
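As a minimal sketch of such a length check, the helper below wraps the “split and take” idea and returns an empty Optional when the delimiter is missing. The class name SplitExtractor and the method textAfter() are hypothetical names introduced only for this illustration:

```java
import java.util.Optional;

final class SplitExtractor {

    // Hypothetical helper: returns the text after the first match of delimiterRegex,
    // or an empty Optional when the delimiter does not occur in the input.
    static Optional<String> textAfter(String input, String delimiterRegex) {
        // A limit of 2 applies the pattern at most once, so the array holds
        // the text before the first match and everything after it.
        String[] parts = input.split(delimiterRegex, 2);
        return parts.length == 2 ? Optional.of(parts[1]) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(textAfter("Some text, targetValue=Regex is cool", "targetValue="));
        System.out.println(textAfter("no delimiter here", "targetValue="));
    }
}
```

Passing the limit also means that a second occurrence of the delimiter later in the text stays part of the extracted value instead of producing a third array element.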
Next, let’s have a look at the INPUT2 case. One idea that may come up to solve the problem is using “targetValue=” or the literal dot as the Regex pattern for the split() method. Then, we can still take the second element from the array result.
However, this idea won’t work for our INPUT2 since the input has another dot before “targetValue=“: INPUT2 = “Some text. targetValue=…”.
If we call “targetValue=” pattern1 and the “.” character pattern2, in the real world, we cannot predict how many pattern2 matches exist in the text before pattern1. Therefore, the simple “split and take” approach won’t work here.
However, we can split the input twice to get the target value:
"Some text. targetValue=Java is cool. some other text" Split by "targetValue=" -> [ "Some text. ", "Java is cool. some other text" ] Take the second element and split by "." -> [ "Java is cool", " some other text" ] The first element is the result
So next, let’s apply this approach in a test:
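// First split: take the text after "targetValue="
String afterFirstSplit = INPUT2.split("targetValue=")[1];
assertEquals("Java is cool. some other text", afterFirstSplit);
// Second split: take the text before the first literal period
String result2 = afterFirstSplit.split("[.]")[0];
assertEquals("Java is cool", result2);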
It’s worth mentioning that the period character has a special meaning in Regex (matching any character). Therefore, in the second split() call, afterFirstSplit.split(“[.]”), we must put the period character in a character class or escape it (“\\.“). Otherwise, every character becomes the delimiter of the split() method, and we’ll have an empty array:
// If we use the unescaped dot as the regex for splitting, the result array is empty
String[] splitByDot = INPUT2.split("targetValue=")[1].split(".");
assertEquals(0, splitByDot.length);
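When the delimiters are plain text rather than patterns, one way to avoid escaping mistakes is Pattern.quote(), which produces a Regex that matches its argument literally. The following is a sketch of the two-step split under that assumption; SplitTwice and between() are hypothetical names, and, as elsewhere in this article, the start delimiter is assumed to be present:

```java
import java.util.regex.Pattern;

final class SplitTwice {

    // Hypothetical helper: returns the text after the literal 'start'
    // up to (but excluding) the first literal 'end'.
    // Assumes 'start' occurs in the input; otherwise the [1] access throws.
    static String between(String input, String start, String end) {
        // Pattern.quote() makes "." match a period instead of any character.
        String afterStart = input.split(Pattern.quote(start), 2)[1];
        // If 'end' is absent, the whole remainder is returned.
        return afterStart.split(Pattern.quote(end), 2)[0];
    }

    public static void main(String[] args) {
        String input = "Some text. targetValue=Java is cool. some other text";
        System.out.println(between(input, "targetValue=", "."));
    }
}
```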
4. Using the replaceAll() Method
Like the split() method, the replaceAll() method supports Regex patterns, too. We can use replaceAll() to replace the text we don’t need with an empty string to get the expected result.
For example, to solve the INPUT1 problem, we can replace everything until “targetValue=” (inclusive) with an empty string:
String result1 = INPUT1.replaceAll(".*targetValue=", "");
assertEquals("Regex is cool", result1);
Similar to the split() solution, we can call the replaceAll() method twice to solve the INPUT2 problem:
String afterFirstReplace = INPUT2.replaceAll(".*targetValue=", "");
assertEquals("Java is cool. some other text", afterFirstReplace);
String result2 = afterFirstReplace.replaceAll("[.].*", "");
assertEquals("Java is cool", result2);
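The two replaceAll() calls can also be folded into one by capturing the wanted text and referring to it as “$1” in the replacement. This is only a sketch: ReplaceOnce and valueUpToPeriod() are hypothetical names, and the pattern assumes a single-line input, since “.” doesn’t match line terminators by default:

```java
final class ReplaceOnce {

    // Hypothetical helper; assumes a single-line input.
    static String valueUpToPeriod(String input) {
        // The pattern consumes the whole input and captures only the text
        // between "targetValue=" and the next period (or the end of the input).
        // "$1" in the replacement refers to that first capturing group.
        return input.replaceAll(".*targetValue=([^.]*).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(valueUpToPeriod("Some text. targetValue=Java is cool. some other text"));
    }
}
```

One caveat of this approach is that an input without “targetValue=” is returned unchanged, so the caller cannot distinguish “no match” from a genuine result without a separate check.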
5. Using Capturing Groups
The Java Regex API allows us to define capturing groups in the pattern. The Regex engine assigns index numbers to the capturing groups so that we can refer back to the groups using these indexes.
Next, let’s see how to solve the INPUT1 problem using capturing groups:
Pattern p1 = Pattern.compile("targetValue=(.*)");
Matcher m1 = p1.matcher(INPUT1);
assertTrue(m1.find());
String result1 = m1.group(1);
assertEquals("Regex is cool", result1);
As we can see in the test above, we’ve created the Regex pattern “targetValue=(.*)“. So, everything after “targetValue=” is in a capturing group. Further, since this is the first group in the pattern, it has the index number 1. Therefore, after a successful Matcher.find() call, we can get the text in the group by calling matcher.group(1).
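To make the group numbering concrete, the sketch below returns both group 0, which is always the entire match, and group 1, the first capturing group. GroupIndexDemo and groups() are hypothetical names used only here:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupIndexDemo {

    // Hypothetical helper: returns { whole match, first capturing group },
    // or an empty array when the pattern is not found.
    static String[] groups(String input) {
        Matcher matcher = Pattern.compile("targetValue=(.*)").matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        // group(0) is the entire match, including "targetValue=";
        // group(1) is only the text captured by the parentheses.
        return new String[] { matcher.group(0), matcher.group(1) };
    }

    public static void main(String[] args) {
        for (String group : groups("Some text, targetValue=Regex is cool")) {
            System.out.println(group);
        }
    }
}
```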
For the INPUT2 case, we won’t put everything after “targetValue=” into the group. Instead, we can make the group contain everything until the first period using a negated character class “[^.]*“. Next, let’s see it in action:
Pattern p2 = Pattern.compile("targetValue=([^.]*)");
Matcher m2 = p2.matcher(INPUT2);
assertTrue(m2.find());
String result2 = m2.group(1);
assertEquals("Java is cool", result2);
Alternatively, we can use the non-greedy quantifier ‘*?’ to achieve the same goal:
Pattern p3 = Pattern.compile("targetValue=(.*?)[.]");
Matcher m3 = p3.matcher(INPUT2);
assertTrue(m3.find());
String result3 = m3.group(1);
assertEquals("Java is cool", result3);
When we handle the INPUT2 case, split() and replaceAll() approaches need two steps to do the job. As we can see, using Regex’s capturing groups, we can solve the INPUT2 problem in one shot.
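Outside of unit tests, the same idea can be wrapped in a small reusable helper. The sketch below uses a named capturing group instead of an index and guards on the result of find(). The class GroupExtractor, the method extract(), and the group name “value” are assumptions made for this example:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupExtractor {

    // Compiled once and reused; "value" is a hypothetical group name.
    private static final Pattern TARGET = Pattern.compile("targetValue=(?<value>[^.]*)");

    // Returns the text after "targetValue=" up to the first period (or the end
    // of the input), or an empty Optional when "targetValue=" is absent.
    static Optional<String> extract(String input) {
        Matcher matcher = TARGET.matcher(input);
        // group() may only be called after a successful find().
        return matcher.find() ? Optional.of(matcher.group("value")) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extract("Some text. targetValue=Java is cool. some other text"));
        System.out.println(extract("Some text, targetValue=Regex is cool"));
    }
}
```

Because “[^.]*” also stops at the end of the input, this single pattern handles both the INPUT1 and INPUT2 shapes.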
6. Using Lookaround Assertions
Java Regex API supports lookaround assertions. Lookaround assertions are useful when we want to match a pattern based on its surrounding characters without actually including those characters in the match.
Next, let’s explore how to solve the INPUT1 case using lookaround assertions:
Pattern p1 = Pattern.compile("(?<=targetValue=).*");
Matcher m1 = p1.matcher(INPUT1);
assertTrue(m1.find());
String result1 = m1.group();
assertEquals("Regex is cool", result1);
As we can see in the code above, we’ve used a positive lookbehind assertion in the Regex pattern: “(?<=targetValue=).*“. It matches any character that appears after the string “targetValue=“.
Similarly, we can change “.” to the negated character class “[^.]” to solve the INPUT2 case:
Pattern p2 = Pattern.compile("(?<=targetValue=)[^.]*");
Matcher m2 = p2.matcher(INPUT2);
assertTrue(m2.find());
String result2 = m2.group();
assertEquals("Java is cool", result2);
Alternatively, we can use both a positive lookbehind assertion and a positive lookahead assertion to extract the text we need:
Pattern p3 = Pattern.compile("(?<=targetValue=).*(?=[.])");
Matcher m3 = p3.matcher(INPUT2);
assertTrue(m3.find());
String result3 = m3.group();
assertEquals("Java is cool", result3);
In the code above:
- (?<=targetValue=) is the positive lookbehind assertion that we’ve seen when solving the INPUT1 problem.
- (?=[.]) is the positive lookahead assertion.
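Therefore, “(?<=targetValue=).*(?=[.])” matches the characters between “targetValue=” and a period character, which is exactly the result we’re after. Note that “.*” is greedy, so the match extends to the last period that follows “targetValue=“. INPUT2 contains only one such period, so the result is the same here. If the text may contain several periods, the lazy quantifier “.*?” stops at the first one instead.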
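The difference between a greedy and a lazy quantifier inside the lookaround pattern only shows up when more than one period follows “targetValue=“. The sketch below compares both variants on a hypothetical input that is not part of the original examples; LookaroundGreediness and firstMatch() are likewise names made up for this illustration:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundGreediness {

    // Hypothetical helper: returns the first match of regex in input, if any.
    static Optional<String> firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical input with two periods after "targetValue="
        String input = "targetValue=Java is cool. some other text. The end";

        // Greedy: ".*" runs up to the last position that is followed by a period.
        System.out.println(firstMatch("(?<=targetValue=).*(?=[.])", input));

        // Lazy: ".*?" stops at the first position that is followed by a period.
        System.out.println(firstMatch("(?<=targetValue=).*?(?=[.])", input));
    }
}
```

With this input, the greedy pattern yields “Java is cool. some other text”, while the lazy one yields “Java is cool”.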
In this article, we’ve looked at two variations on the problem of extracting text that follows after a regex match. One returns everything after the matching regex, and the other returns everything after one regex match but before a second, different regex match.
Moreover, we’ve learned four different approaches to solving both of these scenarios through examples.
As usual, all code snippets presented here are available over on GitHub.