Last updated: March 18, 2024
Regular expressions, or regex, are a powerful tool for pattern matching and text manipulation. Text manipulation often takes many lines of ordinary code, whereas a well-chosen regular expression can accomplish the same task with a single pattern.
However, regular expressions can also be complex to write and understand, especially for more advanced patterns. So, in this tutorial, we’ll learn how to find n consecutive characters in text using regular expressions. We’ll start by reviewing the differences between Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Afterward, we’ll use standard UNIX utilities such as grep and egrep to find consecutive characters in the text.
Before we dive into the hands-on approach to finding n consecutive characters, we should get familiar with the BRE and ERE variants of regular expressions in the *nix ecosystem. Most of the utilities support both variants. However, some utilities support only one of the two.
BRE and ERE differ in terms of the available set of metacharacters. They also behave a bit differently.
The BRE syntax provides a simpler and more restricted set of pattern-matching capabilities compared to ERE. In BRE, most characters are literal. A few, such as the dot and the asterisk, are metacharacters on their own, while parentheses and braces become special only when escaped with a backslash.
Metacharacters have special meanings within patterns. For instance, a dot (.) matches any single character except a newline, and an asterisk (*) matches zero or more occurrences of the preceding character or group. In addition, there’s support for bracket expressions with ranges, character classes, interval quantifiers, and escape characters.
Although simple, BRE still allows for effective text searching and manipulation. This is evident from tools like grep, sed, and Vi, which default to BRE syntax.
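ERE is the expanded version of BRE. It adds the metacharacters + (one or more), ? (zero or one), and | (alternation), and it lets us write grouping parentheses and interval braces without backslashes. Non-capturing groups and look-around assertions aren’t part of ERE; they belong to Perl-compatible regular expressions (PCRE).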
While ERE offers more features and flexibility, it’s not supported by all Linux utilities. However, tools like grep and sed support ERE with the -E flag. Moreover, tools like awk and egrep use ERE by default.
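To make the difference concrete, here’s a minimal sketch that runs the same searches in both syntaxes. The three-line sample input and the variable names are assumptions made up for this illustration:

```shell
# Hypothetical sample input, not taken from the article's file
sample='aab
abc
a+b'

# BRE (grep's default): the interval needs escaped braces
bre_interval=$(printf '%s\n' "$sample" | grep 'a\{2\}')

# ERE (-E): the same interval is written without backslashes
ere_interval=$(printf '%s\n' "$sample" | grep -E 'a{2}')

# BRE: an unescaped + is an ordinary plus sign, so only "a+b" matches
bre_plus=$(printf '%s\n' "$sample" | grep 'a+')

# ERE: + means "one or more of the preceding item", so every line with an "a" matches
ere_plus=$(printf '%s\n' "$sample" | grep -E 'a+')

printf 'BRE interval: %s\n' "$bre_interval"
printf 'ERE interval: %s\n' "$ere_interval"
printf 'BRE plus: %s\n' "$bre_plus"
printf 'ERE plus:\n%s\n' "$ere_plus"
```

Both interval searches print only aab, while the two + searches differ: BRE finds the literal text a+ and ERE matches all three lines.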
In this section, we’ll use different tools to find consecutive characters in the text. We’ll break down the regular expressions used by the different tools. Additionally, we’ll make use of both BRE and ERE syntax.
For our example, we’ll use the Lorem ipsum placeholder text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Nam sed finibus orci.
Pellentesque vel nulla nec massa semper tincidunt.
Sed eleifenddiam sit amet finibus facilisis.
Integer in urna sit amet lorem cursus suscipit id ut est.
Aliquam lobortis magna nec mi vulputate, et elementum nunc fringilla.
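The commands that follow read this text from a file named lipsum.txt. One possible way to create that file, assuming we’re free to write to the current directory, is a quoted here-document:

```shell
# Write the sample text to lipsum.txt in the current directory.
# Quoting 'EOF' keeps the shell from expanding anything inside the text.
cat > lipsum.txt <<'EOF'
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Nam sed finibus orci.
Pellentesque vel nulla nec massa semper tincidunt.
Sed eleifenddiam sit amet finibus facilisis.
Integer in urna sit amet lorem cursus suscipit id ut est.
Aliquam lobortis magna nec mi vulputate, et elementum nunc fringilla.
EOF

# Sanity check: the file should contain seven lines
wc -l < lipsum.txt
```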
grep is a utility that we use to search files or input streams for lines that match a specified pattern. There are two variants of grep – the BSD grep and the GNU grep. They differ slightly, but our examples should work with both versions.
Let’s use grep with BRE syntax to match the lines with words that have double characters in them:
$ grep '.*\(.\)\1.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
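Let’s break this pattern down: the leading and trailing .* match any run of characters before and after the pair, \(.\) captures any single character into group 1, and \1 is a backreference that matches the same character again. Since grep searches for the pattern anywhere in a line, the .* on both sides is optional.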
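To see which characters the backreference actually matched, we can ask grep to print only the matching part. This is a minimal sketch: the -o option isn’t required by POSIX, although both GNU and BSD grep provide it, and the two-line input is made up for the illustration:

```shell
# Hypothetical input; the second line has no doubled character
input='commodo massa
orci'

# -o prints each match on its own line instead of the whole line.
# The surrounding .* is dropped so that only the doubled pair is printed.
pairs=$(printf '%s\n' "$input" | grep -o '\(.\)\1')

printf '%s\n' "$pairs"
# prints:
# mm
# ss
```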
Similarly, we can alter this pattern to work with a character that we specify:
$ grep '.*m\{2\}.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Here, we’re looking for lines with words that have “mm” in them. In this case, it’s “commodo”. In the pattern, m\{2\} specifies that the character “m” should occur two times consecutively. A longer run such as “mmm” also contains “mm”, so it matches as well. We can change the number to adjust the repetition count.
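Since the repetition count is just a number inside the pattern, we can pass it in from the shell. The find_repeats function below is a hypothetical helper written for this sketch, and it assumes the character we pass isn’t itself a regex metacharacter:

```shell
# Hypothetical helper: print the input lines in which character $1
# occurs at least $2 times in a row
find_repeats() {
    char=$1
    count=$2
    # Double quotes let the shell expand the variables, while \{ and \}
    # reach grep unchanged as the BRE interval delimiters
    grep "${char}\{${count}\}"
}

two=$(printf 'commodo\nhmmm\nmi\n' | find_repeats m 2)
three=$(printf 'commodo\nhmmm\nmi\n' | find_repeats m 3)

printf 'n=2:\n%s\n' "$two"
printf 'n=3:\n%s\n' "$three"
```

With n=2, both commodo and hmmm are printed, because a run of three also contains a run of two. With n=3, only hmmm remains.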
Moreover, we can also search for any character that’s repeated one or more times in a row, like “mm”, “mmm”, and so on:
$ grep '\(.\)\1\{1,\}' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
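Here’s the breakdown: \(.\) captures any single character, \1 refers back to that captured character, and \{1,\} requires the backreference to occur one or more times. Together, they match a run of two or more identical characters.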
In the examples above, we can see that we’re escaping the parentheses and braces. That’s because, in BRE, unescaped parentheses and braces are literal characters. So, we need to escape them to give them their special grouping and interval meanings.
On the other hand, the opposite is true for ERE. With the -E flag, unescaped parentheses and braces are already metacharacters, so we write them without backslashes. In addition, we can replace \{1,\} with the shorter + quantifier:
$ grep -E '(.)\1+' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
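Here’s what happens: (.) captures any single character, \1 matches that same character again, and + repeats the backreference one or more times.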
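The ERE pattern can also be combined with -o to print each complete run of repeated characters. This sketch rests on two assumptions: POSIX doesn’t define backreferences for ERE, although GNU and BSD grep accept them, and the input words are invented for the example:

```shell
# Hypothetical input with several doubled letters and one triple
input='bookkeeper
zzz'

# (.)\1+ matches a character followed by one or more copies of itself;
# -o prints every such run on its own line
runs=$(printf '%s\n' "$input" | grep -E -o '(.)\1+')

printf '%s\n' "$runs"
# prints:
# oo
# kk
# ee
# zzz
```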
In the next section, we’ll explore another variant of grep that relies on ERE.
egrep is essentially the same as grep -E: it enables Extended Regular Expressions by default, so we don’t have to supply the -E option. As a result, the examples given for grep work without escaping the curly braces and parentheses. We should note that GNU grep treats egrep as obsolescent and, since version 3.8, prints a warning that recommends grep -E instead.
Let’s see how we can find two consecutive characters in text using egrep:
$ egrep '.*(.)\1.*' lipsum.txt
Sed vestibulum volutpat ante, sed euismod mi commodo at.
Pellentesque vel nulla nec massa semper tincidunt.
...
As we can see, we don’t need to escape the parentheses.
In the same way, let’s use egrep to find words in which a character occurs three times in a row:
$ egrep '(.)\1{2}' lipsum.txt
The command prints nothing because the file doesn’t contain such words. So, we’ll pipe in some text that matches the pattern:
$ printf 'HELLOOO\nWORLD\n' | egrep '(.)\1{2}'
HELLOOO
Similarly, for three or more characters, we can use the following pattern:
$ egrep '(.)\1{2,}' lipsum.txt
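Here, we specify that the same character should occur at least three times in a row. The interval applies to the backreference, so “{2,}” means the captured character followed by at least two more copies. Inside the curly braces, we can also set an upper bound, as in “{2,10}”.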
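The effect of the lower and upper bounds is easier to see when we print only the matched runs. The following is a minimal sketch with made-up input. It uses grep -E in place of egrep and relies on the -o option, which GNU and BSD grep provide:

```shell
# Hypothetical input: runs of two, three, and five identical letters
input='HELLO
HELLOOO
HELLOOOOO'

# Lines that contain a run of three or more identical characters
lines=$(printf '%s\n' "$input" | grep -E '(.)\1{2,}')

# The runs themselves: the open-ended interval takes the whole run
runs=$(printf '%s\n' "$input" | grep -E -o '(.)\1{2,}')

# With an upper bound of 3 repeats, a match covers at most four characters
capped=$(printf '%s\n' "$input" | grep -E -o '(.)\1{2,3}')

printf 'lines:\n%s\n' "$lines"
printf 'runs:\n%s\n' "$runs"
printf 'capped:\n%s\n' "$capped"
```

HELLO never matches, since “LL” is only two characters long. For HELLOOOOO, the capped pattern prints OOOO, yet the line itself still matches, so an upper bound limits the length of a match rather than excluding lines with longer runs.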
In this article, we started by learning the basic differences between the BRE and ERE syntax. Then, we explored the different possibilities to find the words that contain n consecutive characters. For that purpose, we used the built-in grep and egrep utilities.