If you have a few years of experience in the Linux ecosystem, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

1. Overview

Regular expressions (Regex) are widely used in the Linux command line. Many common commands support Regex, such as grep, sed, and awk.

Some of us may have encountered a case where a particular Regex doesn’t work with Linux commands – for instance, a pattern containing \d – however, the same Regex works well with Java or Python. This may confuse us.

In this tutorial, let’s take a closer look at this sort of problem and explain why it can happen.

2. Introduction to the Problem

As usual, let’s understand the problem through an example. First, let’s create a text file as our input:

$ cat input.txt
Linux is awesome!
This server is running the Linux kernel 5.16.5-arch1-1.
It has many powerful commands.

The input.txt file contains three lines.

We know the Regex [0-9] matches a single digit. So, the command grep '[0-9]' input.txt should match the second line in the input.txt file:

$ grep '[0-9]' input.txt 
This server is running the Linux kernel 5.16.5-arch1-1.

Further, we may have learned that “\d is the short form of [0-9].” So, let’s replace the Regex in the grep command with \d and try again:

$ grep '\d' input.txt
It has many powerful commands.

As the output above shows, grep doesn’t recognize \d as [0-9]. Instead, it treats \d as the literal letter ‘d’. Therefore, only the last line, the only one containing a ‘d’, is matched.

If we test the same Regex with sed or awk, we can get the same result:

$ sed -n '/\d/p' input.txt 
It has many powerful commands.

$ awk '/\d/' input.txt
awk: cmd. line:1: warning: regexp escape sequence `\d' is not a known regexp operator
It has many powerful commands.

Moreover, the awk command explicitly prints a warning saying that ‘\d’ isn’t a known Regex operator.

However, we can get the expected output if we test the same Regex and the input file in Java, Python, or PHP.

So, why isn’t \d supported by Linux commands? Next, let’s figure it out.

3. BRE, ERE, and PCRE

To answer the question, we should understand the different Regex flavors. There are three commonly used Regex syntaxes — BRE, ERE, and PCRE:

  • BRE – Basic Regular Expressions
  • ERE – Extended Regular Expressions
  • PCRE – Perl Compatible Regular Expressions

BRE came earliest. It has limited features and expressiveness. Then, BRE was extended to ERE. Later, PCRE joined the Regex party with a rich set of powerful features.

We won’t dive into each Regex syntax and make this a complete Regex tutorial. Instead, we’ll discuss some differences between BRE, ERE, and PCRE through some examples.

3.1. BRE

As we’ve mentioned earlier, BRE is the oldest Regex syntax. As its name implies, it supports only pretty basic features. For instance, the following features are not supported by the standard POSIX BRE:

  • ‘|’ – alternation
  • ‘?’ – 0 or 1
  • ‘+’ – 1 or more
  • ‘\s’ – shorthand for whitespace

Also, we need to escape “{m,n}” (interval quantifiers) and “(…)” (grouping) to give them special meaning. For example, “[0-9]\{2,4\}” matches two, three, or four digits.

After ERE was introduced, many implementations, such as GNU’s, added extensions to BRE, for example, the shorthand ‘\s’. Further, GNU BRE supports |, ?, and + as well. However, we need to escape them to give them special meaning. For example, the GNU BRE “a\|b” matches a or b.
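As a minimal sketch of the escaping rules above, the following commands run a BRE interval and a BRE group against a made-up sample line. The sample text and the variable names are only for illustration, and we assume a grep that supports the -o option (GNU grep does), which prints just the matched part:

```shell
# Sample line, made up for this illustration
sample='kernel 5.16.5 abab'

# BRE interval: the braces must be escaped to act as a quantifier
interval=$(printf '%s\n' "$sample" | grep -o '[0-9]\{2,4\}')
echo "$interval"

# BRE grouping: escaped parentheses group "ab", and \{2\} repeats the group
group=$(printf '%s\n' "$sample" | grep -o '\(ab\)\{2\}')
echo "$group"

# Without the backslashes, BRE treats the braces as literal characters,
# so nothing in the sample matches and the result stays empty
literal=$(printf '%s\n' "$sample" | grep -o '[0-9]{2,4}' || true)
echo "unescaped braces matched: '$literal'"
```

Only “16” has at least two consecutive digits, so the interval matches that part alone, while the unescaped variant looks for a digit followed by the literal text “{2,4}”.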

3.2. ERE

ERE has extended BRE. With ERE, we don’t need to escape |, ?, +, ( ), and { } to give them special meaning. For example, “a|b” matches a or b, and “[0-9]{2,4}” matches two, three, or four digits.

However, if we want to match those characters literally, we need to escape them. For instance, “a\|b” matches the literal string “a|b”.
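To make the contrast concrete, here is a small sketch that feeds three made-up lines to grep in ERE mode. It assumes GNU grep with the -E and -o options; the sample data and variable names are only for illustration:

```shell
# Three made-up sample lines: "a", "b", and the literal string "a|b"
samples=$(printf 'a\nb\na|b\n')

# Unescaped | is alternation in ERE, so every line containing a or b matches
alternation=$(printf '%s\n' "$samples" | grep -E 'a|b')
echo "$alternation"

# Escaped \| is a literal pipe in ERE, so only the line "a|b" matches
literal=$(printf '%s\n' "$samples" | grep -E 'a\|b')
echo "$literal"

# Unescaped braces form an interval quantifier in ERE
interval=$(echo 'kernel 5.16.5' | grep -oE '[0-9]{2,4}')
echo "$interval"
```

The first command prints all three lines, whereas the second prints only the line that really contains a pipe character.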

3.3. PCRE

In the beginning, PCRE was a library that implemented Perl-style Regex. Later, since Perl popularized Regex, its syntax became a popular Regex flavor. Many other utilities and programming languages, for instance, Java, Python, and PHP, have Regex engines whose syntax is largely compatible with PCRE.

PCRE’s syntax is much more powerful and flexible than BRE and ERE. Let’s have a look at a few features that PCRE offers but standard BRE and ERE don’t:

  • Look-around – Positive and negative look-ahead/look-behind
  • Non-greedy matching – *?, +?, and {m,}?
  • Case-sensitive/insensitive matching – (?i) and (?-i)
  • Shorthand for matching a digit or non-digit character – \d and \D

Now, we know that we’re using PCRE syntax when we use ‘\d’. Only PCRE-compatible Regex engines interpret it as a digit.

Next, let’s take a look at the Linux commands and which Regex flavors they support.

4. Regex Flavor of grep, sed, and awk

In this section, we’ll take the widely used GNU grep, GNU sed, and GNU awk as examples.

4.1. GNU grep

By default, grep runs in GNU BRE matching mode. That is to say, if we don’t set an option, it interprets patterns as BRE. For example, we can match a line containing either “awesome” or “powerful”:

$ grep 'awesome\|powerful' input.txt 
Linux is awesome!
It has many powerful commands.

As we’ve seen in the command above, we’ve escaped the ‘|’ character to give it special meaning.

grep allows us to use the -E option to interpret patterns as ERE. Let’s do the same test with the -E option:

$ grep -E 'awesome|powerful' input.txt 
Linux is awesome!
It has many powerful commands.

Note that we shouldn’t escape the ‘|’ when we pass the -E option to grep. Otherwise, grep will search for the literal ‘|’ character.

GNU grep supports the -P option to interpret patterns as PCRE. Therefore, if we want the grep command to understand PCRE syntax, for instance, “\d”, we should use the -P option:

$ grep -P '\d' input.txt 
This server is running the Linux kernel 5.16.5-arch1-1.

As we can see, grep supports “\d”, but we must use the right option.
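The -P option also unlocks the other PCRE features listed earlier. The following sketch tries look-ahead, non-greedy matching, and an inline modifier on the second line of input.txt. It assumes a GNU grep built with PCRE support, and the variable names are only for illustration:

```shell
# The second line of input.txt from this article
line='This server is running the Linux kernel 5.16.5-arch1-1.'

# Look-ahead: digits only when "-arch" follows immediately
before_arch=$(printf '%s\n' "$line" | grep -oP '\d+(?=-arch)')
echo "$before_arch"

# Greedy: from the first digit to the last digit on the line
greedy=$(printf '%s\n' "$line" | grep -oP '\d.*\d')
echo "$greedy"

# Non-greedy: from a digit to the nearest following digit (first match only)
lazy=$(printf '%s\n' "$line" | grep -oP '\d.*?\d' | head -n 1)
echo "$lazy"

# Inline modifier: (?i) makes the rest of the pattern case-insensitive
insensitive=$(printf '%s\n' "$line" | grep -oP '(?i)LINUX')
echo "$insensitive"
```

If grep reports that -P isn’t supported, the binary was most likely built without PCRE, and these patterns won’t work with it.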

4.2. GNU sed and GNU awk

As is the case with grep, sed uses BRE by default. Additionally, we can pass the -r option (or the equivalent -E) to tell sed to use GNU ERE for pattern matching:

$ sed -n '/awesome\|powerful/p' input.txt
Linux is awesome!
It has many powerful commands.

$ sed -nr '/awesome|powerful/p' input.txt
Linux is awesome!
It has many powerful commands.

However, sed doesn’t support PCRE. Therefore, sed cannot interpret “\d”.

GNU awk, on the other hand, uses GNU ERE by default. However, like sed, awk doesn’t support PCRE.
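As a quick sketch of awk’s ERE behavior, the commands below recreate input.txt in a temporary directory and run two patterns without any escaping. The temporary directory and the variable names are only for illustration:

```shell
# Recreate the article's input.txt in a temporary directory
workdir=$(mktemp -d)
cat > "$workdir/input.txt" <<'EOF'
Linux is awesome!
This server is running the Linux kernel 5.16.5-arch1-1.
It has many powerful commands.
EOF

# awk patterns are ERE, so | works as alternation without a backslash
alternation=$(awk '/awesome|powerful/' "$workdir/input.txt")
echo "$alternation"

# + means "one or more" without a backslash; \. is a literal dot
version_line=$(awk '/[0-9]+\.[0-9]+/' "$workdir/input.txt")
echo "$version_line"

# Clean up the temporary directory
rm -rf "$workdir"
```

The first pattern prints the first and third lines, and the second prints the line containing the kernel version.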

Consequently, we cannot use PCRE-unique features with sed and awk.
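When \d isn’t available, a bracket expression is the usual substitute. The sketch below matches digits with the POSIX character class [[:digit:]] in grep and sed and with the [0-9] range from earlier in this article in awk. The variable names are only for illustration:

```shell
# The three lines of input.txt, kept in a variable for this sketch
digit_line='This server is running the Linux kernel 5.16.5-arch1-1.'
sample=$(printf 'Linux is awesome!\n%s\nIt has many powerful commands.\n' "$digit_line")

# POSIX character class inside a bracket expression, default BRE mode
grep_out=$(printf '%s\n' "$sample" | grep '[[:digit:]]')
echo "$grep_out"

# The same bracket expression in sed
sed_out=$(printf '%s\n' "$sample" | sed -n '/[[:digit:]]/p')
echo "$sed_out"

# A plain digit range in awk
awk_out=$(printf '%s\n' "$sample" | awk '/[0-9]/')
echo "$awk_out"
```

All three commands print only the line containing the kernel version, which is what we originally expected from \d.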

5. Conclusion

In this article, first, through an example, we’ve introduced the question that confused us: Why isn’t Regex \d supported by Linux commands, such as grep and sed?

Then, on the journey of seeking the answer to the question, we’ve discussed the three Regex flavors: BRE, ERE, and PCRE.

Further, we’ve talked about Regex compatibilities of common Linux commands such as grep, sed, and awk. Also, we’ve found the answer to the question.
