What Are the Differences Between [0-9], [[:digit:]] and \d

1. Overview

Regular expressions (regex) play a crucial role in pattern matching and text processing. In Linux environments, tools such as sed, awk, and grep depend heavily on regex for search capabilities.

One common use case of regex is matching digits. When performing this task, it’s not uncommon for us to use either [0-9], [[:digit:]], or \d. These regular expressions may seem identical since they all appear to match digits. However, especially when used with Linux and POSIX-compliant tools, they display important differences.

In this article, we’ll explore these expressions, compare them, discuss how they behave differently under various conditions, and provide practical Linux examples to show their differences.

2. Key Concepts

Here are some key concepts we use throughout the article.

2.1. ASCII Digit Characters

ASCII digits are the standard numeric characters specified in the ASCII character set, 0 1 2 3 4 5 6 7 8 9. These characters are recognized universally by all text-processing tools that utilize basic regex and correspond to code points 48 through 57 in the ASCII table.
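As a quick check of those code points, here’s a minimal sketch. It relies on the POSIX printf convention that a numeric argument starting with a quote yields the code of the character that follows. The function name digit_codes is ours and purely illustrative:

```shell
# digit_codes is a hypothetical helper name used only for this illustration.
digit_codes() {
  for d in 0 1 2 3 4 5 6 7 8 9; do
    # A leading single quote makes printf print the numeric value of the character.
    printf '%s -> %d\n' "$d" "'$d"
  done
}

digit_codes
```

The first line printed is 0 -> 48 and the last is 9 -> 57.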

2.2. Non-ASCII Digit Characters

Non-ASCII digits are numeric characters outside the standard ASCII range, like digits from non-Latin writing systems, such as Devanagari digits and Arabic-Indic digits.
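To make the difference concrete, we can look at the bytes behind a few digits. The sketch below assumes the script and terminal use UTF-8, and utf8_bytes is a hypothetical helper name:

```shell
# utf8_bytes is a hypothetical helper: it prints the bytes of its argument in hex.
utf8_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

utf8_bytes '7'    # ASCII digit seven: 37
utf8_bytes '٣'    # Arabic-Indic digit three (U+0663): d9a3
utf8_bytes '５'   # Fullwidth digit five (U+FF15): efbc95
```

The ASCII digit occupies a single byte, while the other two are multibyte sequences, which is why byte-oriented and character-oriented tools can disagree about them.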

2.3. Locale

A locale in Linux defines regional and cultural settings, including language, character encoding, and classifications. It influences how text is interpreted and what qualifies as a digit. For instance, locale can influence [[:digit:]] by expanding it to include non-ASCII digits, based on language and system settings.

We can check our locale with:

$ echo $LC_CTYPE

Above, the command displays the current value of the environment variable LC_CTYPE in the shell.

In our system’s locale settings, LC_CTYPE influences:

Which characters are treated as digits, letters, and whitespace
How to convert characters between uppercase and lowercase
Whether non-ASCII characters are recognized

For instance, if LC_CTYPE is en_US.UTF-8:

$ echo $LC_CTYPE
en_US.UTF-8

then the output indicates that our system uses the English (United States) locale with UTF-8 encoding, enabling support for a wide range of Unicode characters.

If LC_CTYPE is unset or empty:

$ echo $LC_CTYPE

then the system falls back to the value of the LANG environment variable, provided that LC_ALL, which overrides both, isn’t set:

$ echo $LANG
en_GB.UTF-8

Here, the system would use the British English locale with UTF-8 encoding.
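The lookup order can be sketched as a small shell function. This is only an illustration of the usual precedence (LC_ALL first, then LC_CTYPE, then LANG, and finally the POSIX locale); the name effective_ctype is hypothetical, and the standard locale command remains the authoritative way to see the values in effect:

```shell
# effective_ctype is a hypothetical helper that mimics the usual lookup order.
effective_ctype() {
  # Empty variables are treated like unset ones, hence the ":-" expansions.
  printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-POSIX}}}"
}

effective_ctype

# For comparison, the locale utility prints the categories the system actually uses.
locale 2>/dev/null | grep '^LC_CTYPE='
```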

2.4. Unicode Awareness

Unicode awareness is the ability of a regex engine to identify characters beyond the ASCII range, such as letters and symbols from writing systems around the world. Thus, regex engines with Unicode awareness can match characters such as Arabic-Indic or fullwidth digits and emojis.

2.5. Compatibility

By compatibility, we mean whether a regex syntax is supported across different tools and environments.

2.6. POSIX Character Classes

POSIX character classes are named sets of characters written as [:name:] inside a bracket expression, for example [[:digit:]]. POSIX-compliant tools like grep, sed, and awk support these classes:

[[:digit:]] – matches digit characters
[[:space:]] – matches whitespace characters
[[:alpha:]] – matches alphabetic characters

POSIX character classes provide a portable way to describe character groups by name, without listing every character explicitly.
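Here’s a short sketch of the three classes in action. The sample string is made up for the illustration, and LC_ALL=C pins the locale so that the result doesn’t depend on system settings:

```shell
sample='ab 12'

# LC_ALL=C pins the locale so the result doesn't depend on system settings.
printf '%s\n' "$sample" | LC_ALL=C grep -o '[[:alpha:]]'    # prints a and b on separate lines
printf '%s\n' "$sample" | LC_ALL=C grep -o '[[:digit:]]'    # prints 1 and 2 on separate lines
printf '%s\n' "$sample" | LC_ALL=C grep -c '[[:space:]]'    # prints 1 (one line contains whitespace)
```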

2.7. Perl-Compatible Regular Expressions (PCRE)

The regex library PCRE implements syntax similar to that of the Perl programming language. It supports advanced regex features such as Unicode awareness, lookahead/lookbehind, and extended escape sequences (\d, \w, \s).

3. Understanding Each Expression

In this section, we take a closer look at the patterns [0-9], [[:digit:]], and \d. The behavior of these patterns can differ based on the tool and locale settings. To demonstrate these differences, let’s use practical Linux examples.

3.1. [0-9]

The character class [0-9] matches any character in the ASCII range from 0 to 9:

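Locale – intended to be locale-insensitive since it’s meant to match only ASCII digits, although POSIX leaves the behavior of range expressions unspecified outside the C (POSIX) locale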
Unicode awareness – it’s not Unicode aware since it’s not meant to match non-ASCII digits
Compatibility – supported in most regex engines and tools

[0-9] matches only one digit character in the ASCII range at a time:

$ echo "abc7x8yz" | grep -o '[0-9]'
7
8

The command above matches each ASCII digit. Because of the -o option, grep prints only the matched parts, each on its own line.

However, when UTF-8 and Unicode characters are involved, this behavior may slightly vary depending on the grep version:

$ echo "abc7٣５yz" | grep -o '[0-9]'
7
٣
５

Above, the output includes the Arabic-Indic digit three (٣) and the fullwidth digit five (５). Some versions of grep, when running under a UTF-8 locale, can appear to match non-ASCII digits with [0-9]. This can be misleading, since we usually expect [0-9] to match only ASCII digits. The cause isn’t that [0-9] is meant to match them: POSIX defines the meaning of a range expression only for the C (POSIX) locale, so in other locales the result depends on how the regex engine and the locale’s collation rules interpret multibyte characters. For this reason, the behavior may vary depending on whether we’re using BSD grep, GNU grep, or BusyBox grep:

$ grep --version
grep (GNU grep) 3.7
...

GNU grep is multibyte-aware and interprets bracket ranges according to the current locale, which may explain the unexpected behavior.

To strictly match only ASCII digits, we can rely on a tool like Perl:

$ echo "abc7٣５yz" | perl -CSD -nE 'say for /([0-9])/g'
7

Here, -CSD tells Perl to treat the standard streams and files as UTF-8. Irrespective of the locale setting, [0-9] in Perl strictly matches ASCII digits.
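If we’d rather stay with grep, one common workaround is to force the C locale for that single command, so that the range is evaluated byte by byte. This is a sketch that assumes a reasonably recent grep with -o support; ascii_digits is a hypothetical helper name:

```shell
# ascii_digits is a hypothetical helper name used only for this illustration.
ascii_digits() {
  # LC_ALL=C applies to this grep invocation only; in the C locale,
  # [0-9] covers exactly the bytes 0x30 through 0x39.
  LC_ALL=C grep -o '[0-9]'
}

echo "abc7٣５yz" | ascii_digits    # prints 7
```

The bytes that make up ٣ and ５ all lie above the ASCII range, so they can’t fall inside [0-9] in the C locale.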

3.2. [[:digit:]]

The POSIX character class [[:digit:]] matches digit characters as defined by the current locale settings:

Locale – locale-sensitive
Unicode awareness – depends on the tool and locale settings
Compatibility – POSIX-compliant tools such as grep, sed, and awk support it

Let’s see how it works under a UTF-8 locale with grep:

$ echo "abc7٣５yz" | grep -o '[[:digit:]]'
7

Above, the grep command only matches the ASCII digit 7. Even though [[:digit:]] is locale-sensitive in principle, in practice the locale definitions used by POSIX-compliant tools like awk or grep usually limit the digit class to 0 through 9, so non-ASCII digits aren’t included. Perl, on the other hand, extends the behavior of [[:digit:]]:

$ echo "abc7٣５yz" | perl -CSD -nE 'say for /([[:digit:]])/g'
7
٣
５

So, Perl applies Unicode rules to the decoded input. For this reason, it recognizes all three digits, including the non-ASCII Unicode digits.
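Since the class is also available in sed, a typical use is masking digits in a stream. The following is a minimal sketch with a hypothetical helper name, mask_digits, and LC_ALL=C set to keep the result independent of the system locale:

```shell
# mask_digits is a hypothetical helper: it replaces every digit with '#'.
mask_digits() {
  LC_ALL=C sed 's/[[:digit:]]/#/g'
}

echo "abc7x8yz" | mask_digits    # prints abc#x#yz
```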

3.3. \d

The escape sequence \d, part of Perl-Compatible Regular Expressions (PCRE), helps to match digit characters:

Locale – locale-insensitive
Unicode awareness – it’s Unicode-aware by default in Perl; in PCRE-based tools, it depends on whether Unicode property support is enabled
Compatibility – supported in tools that offer PCRE, for instance, grep -P and perl, and not in basic grep

So if we use the same ASCII-only input as before, this time with -P for PCRE:

$ echo "abc7x8yz" | grep -oP '\d'
7
8

we see that the command matches ASCII digits.

Now, if we use perl:

$ echo "abc7٣５yz" | perl -CSD -nE 'say for /(\d)/g'
7
٣
５

\d matches the non-ASCII digits Arabic-Indic three (٣) and fullwidth five (５) alongside the ASCII digit 7. By default, \d is Unicode-aware in Perl when it works on decoded UTF-8 input.

Let’s use \d in basic grep:

$ echo "abc7٣５yz" | grep -o '\d'

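Without -P, we get no output: \d isn’t part of POSIX basic regular expressions, so basic grep doesn’t treat it as a digit class. Newer GNU grep versions may also print a warning about the stray backslash.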

4. Conclusion

In this article, we explored differences between [0-9], [[:digit:]], and \d.

Although [0-9], [[:digit:]], and \d may appear to work the same, their behavior varies significantly depending on the locale, Unicode support, and the regex engine in use. So, we can use [0-9] to strictly match ASCII digits and ensure broad compatibility. Meanwhile, we can utilize [[:digit:]] if we’re relying on POSIX tools and need locale-sensitive matching. However, when using it, support for Unicode is limited in tools such as grep. Finally, we can use \d with Perl or PCRE-enabled tools (perl, grep -P) for Unicode-aware digit matching.

Knowing these differences helps us choose the right pattern for each tool and locale.
