Grep – How to Output Only the Content of a Capturing Group

1. Overview

grep is a command-line tool for searching text in files or an input stream. It prints the lines that match a given pattern, which makes it useful for filtering large amounts of text.

When we need to extract specific information using grep, we use regular expressions, which enable us to define complex patterns for searching and extracting specific information from text. In regular expressions, we can use capturing groups to isolate and extract specific parts of a matched pattern.

In this tutorial, we’ll discuss how to output only the content of a capturing group using grep.

2. What Is a Capturing Group?

In regular expressions, a capturing group is a part of the pattern enclosed in parentheses (). Capturing groups enable us to capture and extract specific parts of a matched pattern.

For example, let’s consider the following regular expression:

([0-9]+)-([a-zA-Z]+)

The above expression contains two capturing groups:

([0-9]+) – matches one or more digits in a sequence
([a-zA-Z]+) – matches one or more alphabetical characters, both lowercase and uppercase

The above expression matches strings consisting of one or more digits followed by a hyphen and one or more alphabetic characters, for example, “59-oranges” or “10-cars”.
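As a quick illustration of what each group captures, the following sketch feeds two sample strings to sed, which can refer to the groups by position as \1 and \2. The input strings and the output layout are made up for this example:

```shell
# Hypothetical sample input; \1 is the digit group and \2 is the letter group
swapped=$(printf '59-oranges\n10-cars\n' | sed -E 's/([0-9]+)-([a-zA-Z]+)/\2: \1/')
echo "$swapped"
```

Here, the first group holds the digits and the second holds the letters, so the two lines are printed as oranges: 59 and cars: 10.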

Now, let’s explore how to use grep to output only the content of a capturing group.

3. Using grep to Extract the Content of a Capturing Group

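grep doesn’t have a direct option to output only the content of a capturing group. However, the -o option prints only the text that the pattern matches, and the -P option enables Perl-compatible regular expressions. Note that -o prints the whole match, so the output equals the content of a capturing group only when the group encloses the entire pattern. To print just a part of a larger pattern, we need PCRE features such as \K and lookarounds, or a second tool that supports backreferences. The general syntax is: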

$ grep -oP 'pattern_with_groups' input_file

Let’s understand the above syntax:

-o – instructs grep to print only the matched parts of the matching line instead of the entire line
-P – enables the use of Perl-compatible regular expressions (PCRE) for pattern matching; this option is available in GNU grep builds with PCRE support and isn’t part of POSIX grep
pattern_with_groups – represents the pattern grep will search for in the input, including capturing groups enclosed in parentheses
input_file – represents the name of the file that grep will process
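To see what -o actually prints, here is a minimal sketch on a made-up input line (the id=... text is purely illustrative). It assumes a grep that supports -P:

```shell
# Made-up input line that contains two matches
matches=$(printf 'id=42 name=x id=7\n' | grep -oP 'id=(\d+)')
echo "$matches"
```

Each match is printed on its own line as id=42 and id=7. The parentheses don't change the output: -o prints the whole match, not just the digits inside the group.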

To illustrate, we’ll use a sample file named logs.txt:

$ cat logs.txt 
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.
2022-06-13 08:30:02 [INFO] User "bob" updated profile information.
2022-07-22 08:35:19 [ERROR] File not found: /var/www/html/index.html.
2022-09-27 08:40:57 [INFO] User "charlie" created a new document.
2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.
2023-01-18 08:50:09 [ERROR] Authentication failed for user "david".
2023-01-23 08:12:01 [INFO] User 'john_doe' logged in from IP: 190.132.1.100
2023-02-13 08:15:32 [ERROR] Database connection failed: Timeout error
2023-02-20 08:20:45 [WARNING] Too many login attempts from IP: 203.0.113.50
2023-02-29 08:25:19 [INFO] User "jane_smith" updated profile information
2023-02-30 08:30:00 [ERROR] Server overload detected, please investigate
2023-03-02 08:35:12 [INFO] User "david_johnson" uploaded file: report.pdf
2023-03-10 08:40:58 [ERROR] Access denied for user 'emily_brown' from IP: 198.51.100.25
2023-03-22 08:45:23 [INFO] User "michael_lee" downloaded file: presentation.pptx

Above, we use cat to display the contents of the logs.txt file.

3.1. Extracting the Date

Here, let’s extract the dates from the sample file using a capturing group:

$ grep -oP '(\d{4}-\d{2}-\d{2})' logs.txt
2022-01-24
2022-03-26
...
2023-03-10
2023-03-22

Now, let’s explain the above command:

(\d{4}-\d{2}-\d{2}) – represents a capturing group that matches date strings in the format YYYY-MM-DD; \d{4}- matches exactly four digits, capturing the year followed by a hyphen; \d{2}- matches two digits, capturing the month followed by a hyphen; \d{2} matches two digits, capturing the day
logs.txt – represents the input file grep processes

The above command searches logs.txt for lines that contain a date in the format YYYY-MM-DD. Once a match is found, grep prints only the matched part of the line. Because the capturing group encloses the whole pattern, the printed text is exactly the content of the group.

3.2. Extracting the Time

Let’s extract the time information from the sample file:

$ grep -oP '\b(\d{2}:\d{2}:\d{2})\b' logs.txt
08:15:23
08:20:11
...
08:40:58
08:45:23

Let’s examine the above command:

\b – represents a word boundary, which ensures the time doesn’t start in the middle of a longer run of digits or letters
(\d{2}:\d{2}:\d{2}) – represents a capturing group that matches the time in HH:MM:SS format; \d{2} matches two digits, and we use it three times to capture hours, minutes, and seconds while : matches the literal colon character
\b – the closing word boundary ensures the seconds aren’t followed directly by another digit or letter
logs.txt – represents the input file

Above, we search the logs.txt file and extract the time information for each log entry using a capturing group.
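The effect of the word boundaries is easier to see on input that contains a longer number. The sketch below uses a made-up line (it isn't taken from logs.txt) and runs the pattern with and without \b:

```shell
# Made-up input: 112:34:567 is not a time, but it contains time-like text
line='ref 112:34:567 at 08:15:23'
with_boundaries=$(printf '%s\n' "$line" | grep -oP '\b(\d{2}:\d{2}:\d{2})\b')
without_boundaries=$(printf '%s\n' "$line" | grep -oP '(\d{2}:\d{2}:\d{2})')
echo "with \\b:    $with_boundaries"
echo "without \\b: $without_boundaries"
```

With the boundaries, only 08:15:23 is reported. Without them, grep also reports 12:34:56, which is cut out of the middle of the longer number.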

3.3. Extracting the Log Levels

Log levels such as INFO, ERROR, and WARNING in log entries are enclosed within square brackets. Here, we’ll use a capturing group to extract these log levels:

$ grep -oP '\[([A-Z]+)\]' logs.txt
[INFO]
[ERROR]
...
[ERROR]
[INFO]

Let’s break down the command:

\[ – matches an opening square bracket
([A-Z]+) – represents a capturing group that matches one or more uppercase letters; the + quantifier matches one or more occurrences of the preceding character class
\] – matches the closing bracket
logs.txt – represents the input file

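We use the above command to extract all the log levels from the logs.txt file. Because -o prints the whole match, the output includes the square brackets and not just the content of the capturing group.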
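To print only the text between the brackets, one option in PCRE is \K, which drops everything matched so far from the reported match, combined with a lookahead, which checks for the closing bracket without including it. The sketch below swaps the capturing group for these two constructs. It uses two lines copied from logs.txt as stand-in input and assumes a grep that supports -P:

```shell
# Two lines in the format of logs.txt, used here as stand-in input
levels=$(printf '%s\n' \
  '2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.' \
  '2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.' |
  grep -oP '\[\K[A-Z]+(?=\])')   # \K discards the "[", (?=\]) requires "]" but doesn't print it
echo "$levels"
```

This prints INFO and ERROR without the brackets, which is the same text the capturing group ([A-Z]+) would hold.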

3.4. Extracting the IP Addresses

IPv4 addresses follow a specific pattern of four sets of digits separated by periods. Using a capturing group, let’s extract the IP addresses in the logs.txt file:

$ grep -oP '\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b' logs.txt
192.168.1.100
10.0.0.1
190.132.1.100
203.0.113.50
198.51.100.25

Let’s understand the above command:

\b – represents a word boundary that ensures we only match complete IP addresses and not parts of other words
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) – represents a capturing group that matches an IP address; \d{1,3} matches one to three digits, which matches each part of the IP address; \. matches a dot that separates the parts of the IP address
logs.txt – represents the input file grep will process

Above, we use a capturing group to capture and extract all the IP addresses in the logs.txt file.
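One caveat is that \d{1,3} accepts any one to three digits, so the pattern also matches out-of-range values such as 999.1.1.1. If that matters, a stricter pattern can limit each part to 0-255. The following is only a sketch under that assumption; the second input line is invented to show a rejected value:

```shell
# Each part must be 0-255: 250-255, 200-249, 100-199, or 0-99
octet='(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)'
valid_ips=$(printf '%s\n' \
  '2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.' \
  'made-up line with an out-of-range address 999.1.1.1' \
  '2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.' |
  grep -oP "\b(?:${octet}\.){3}${octet}\b")
echo "$valid_ips"
```

Only 192.168.1.100 and 10.0.0.1 are printed, while 999.1.1.1 is skipped.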

4. Handling Multiple Capturing Groups

We can extract multiple pieces of information from the logs.txt file using multiple capturing groups. To illustrate, let’s extract the date, time, and log level:

$ grep -oP '(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[([A-Z]+)\]' logs.txt
2022-01-24 08:15:23 [INFO]
2022-03-26 08:20:11 [ERROR]
...
2023-03-10 08:40:58 [ERROR]
2023-03-22 08:45:23 [INFO]

Let’s examine this command:

(\d{4}-\d{2}-\d{2}) – matches and captures the date
(\d{2}:\d{2}:\d{2}) – used to match and capture the time
\[([A-Z]+)\] – matches and captures a string enclosed in square brackets that consists of one or more uppercase letters
logs.txt – the input file

Using the above command, we search the logs.txt file for lines containing a date in YYYY-MM-DD format, a time in HH:MM:SS format, and a log level enclosed in square brackets, all separated by spaces. Since -o prints the whole match, the output contains the three captured values together with the spaces and square brackets between them.
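grep has no option to print the groups individually or to pick a subset of them. When that's needed, one possibility is a tool with backreferences, such as sed. The sketch below keeps only the first and third groups; the input is a single line copied from logs.txt, and the choice of groups is just for illustration:

```shell
# One line in the format of logs.txt, used here as stand-in input
picked=$(printf '%s\n' '2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.' |
  sed -nE 's/^([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) \[([A-Z]+)\].*/\1 \3/p')
echo "$picked"
```

This prints 2022-01-24 INFO: \1 is the date, \3 is the log level, and the time in \2 is left out. The -n option together with the p flag prints only the lines where the substitution succeeded.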

5. Combining Grep With Other Commands

In this section, we’ll combine grep with other commands like awk and sed. These tools help us manipulate grep’s output and isolate specific parts of a match.

5.1. Using awk

awk is a command-line tool that enables us to perform complex pattern matching and manipulation. Here, we’ll use it with grep to extract and format content captured by a capturing group:

$ grep -oP '\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b' logs.txt | awk -F '.' '{print $1}'
192
10
190
203
198

Let’s break down the command:

(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) – represents a capturing group that matches an IP address
| – pipes the output of grep into the awk command as its input
-F '.' – sets the field separator to a period, treating each part of the IP address as a separate field
'{print $1}' – instructs awk to print the first field of each line; since we set the field separator to a period, $1 represents the first set of numbers in each IP address

In the above command, we use grep to search the logs.txt file and extract the IP addresses. Next, we pipe the output to the awk command, which isolates and prints the first set of numbers from each IP address.
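The same approach extends to more than one field. As a small variation, the sketch below prints the first two parts of each address joined by a period. It uses two lines copied from logs.txt as stand-in input:

```shell
# Two lines in the format of logs.txt, used here as stand-in input
prefixes=$(printf '%s\n' \
  '2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.' \
  '2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.' |
  grep -oP '\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b' |
  awk -F '.' '{print $1 "." $2}')   # join fields 1 and 2 with a literal period
echo "$prefixes"
```

This prints 192.168 and 10.0.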

5.2. Using sed

sed is a command-line tool used to perform various operations, such as searching and replacing, on an input stream or files. In this case, we’ll combine it with grep to process and manipulate the capturing group output.

To demonstrate, let’s extract the year part from a date:

$ grep -oP '\b(\d{4})-(\d{2})-(\d{2})\b' logs.txt | sed 's/\([0-9]\{4\}\)-.*/\1/'
2022
2022
...
2023
2023

Let’s understand the command:

(\d{4})-(\d{2})-(\d{2}) – represents capturing groups that match date strings in the format YYYY-MM-DD
| – pipes the output of the grep command into the sed command as its input
s/ – indicates a substitution operation
\([0-9]\{4\}\) – represents a capturing group that matches and captures four consecutive digits, representing the year
-.* – matches a hyphen followed by any other characters
/\1/ – replaces the entire matched pattern with the contents of the capturing group

Above, we use grep to search the logs.txt file and extract dates in the format YYYY-MM-DD. Next, we pipe the output to the sed command, which extracts only the year from those dates and prints them out.
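For a simple case like this one, grep can also produce the year by itself if the rest of the date is moved into a lookahead, so it's required but not part of the reported match. This is an alternative sketch, not a replacement for the sed approach, and it assumes a grep that supports -P. The input is two lines copied from logs.txt:

```shell
# Two lines in the format of logs.txt, used here as stand-in input
years=$(printf '%s\n' \
  '2022-09-27 08:40:57 [INFO] User "charlie" created a new document.' \
  '2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.' |
  grep -oP '\b\d{4}(?=-\d{2}-\d{2}\b)')   # four digits, only if followed by -MM-DD
echo "$years"
```

This prints 2022 and 2023, the same text that the first capturing group (\d{4}) holds in the original pattern.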

6. Conclusion

In this article, we explored how to use the grep command to output the content of a capturing group. Using capturing groups within regular expressions and grep’s -o and -P options, we extracted some information from the sample file. Lastly, we combined grep with the awk and sed commands.
