Last updated: April 14, 2024
grep is a command-line tool for searching text in files or an input stream. It prints the lines that match a given pattern.
To extract specific information with grep, we use regular expressions, which let us define complex search patterns. Within a regular expression, we can use capturing groups to isolate specific parts of a matched pattern.
In this tutorial, we’ll discuss how to output only the content of a capturing group using grep.
In regular expressions, a capturing group is a part of the pattern enclosed in parentheses (). Capturing groups enable us to capture and extract specific parts of a matched pattern.
For example, let’s consider the following regular expression:
([0-9]+)-([a-zA-Z]+)
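The above expression contains two capturing groups: ([0-9]+), which captures one or more digits, and ([a-zA-Z]+), which captures one or more alphabetic characters.
The expression as a whole matches strings consisting of one or more digits followed by a hyphen and one or more alphabetic characters, for example, “59-oranges” or “10-cars”.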
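As a quick check of this expression, the following sketch feeds three made-up strings to grep -E and keeps those that match. The sample strings and the ^ and $ anchors are assumptions added for illustration:

```shell
# Hypothetical sample input: two matching strings and one without leading digits
samples=$(printf '%s\n' '59-oranges' '10-cars' 'no-digits')

# -E enables extended regular expressions, so ( ) and + need no escaping.
# The ^ and $ anchors (added here) require the whole line to match.
matched=$(printf '%s\n' "$samples" | grep -E '^([0-9]+)-([a-zA-Z]+)$')

printf '%s\n' "$matched"
```

Only the first two strings should be printed, since “no-digits” doesn’t start with a digit.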
Now, let’s explore how to use grep to output only the content of a capturing group.
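grep doesn’t have a direct option to output only the content of a capturing group. However, by combining the -o and -P options, we can get close. Note that -o prints the entire match rather than an individual group, so the output equals a group’s content only when that group spans the whole pattern: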
$ grep -oP 'pattern_with_groups' input_file
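Let’s understand the above syntax: -o prints only the matched part of each line instead of the whole line, -P interprets the pattern as a Perl-compatible regular expression (PCRE), pattern_with_groups is the regular expression containing our capturing groups, and input_file is the file to search.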
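To see what -o changes, here is a minimal sketch that runs the same pattern with and without it. It assumes GNU grep built with PCRE support, and the input line is made up for illustration:

```shell
# Hypothetical one-line input
line='order id=42 shipped'

# Without -o, grep prints the whole matching line
whole=$(printf '%s\n' "$line" | grep -P '\d+')

# With -o, grep prints only the text that matched the pattern
only=$(printf '%s\n' "$line" | grep -oP '\d+')

printf 'without -o: %s\n' "$whole"
printf 'with -o:    %s\n' "$only"
```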
To illustrate, we’ll use a sample file named logs.txt:
$ cat logs.txt
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.
2022-06-13 08:30:02 [INFO] User "bob" updated profile information.
2022-07-22 08:35:19 [ERROR] File not found: /var/www/html/index.html.
2022-09-27 08:40:57 [INFO] User "charlie" created a new document.
2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.
2023-01-18 08:50:09 [ERROR] Authentication failed for user "david".
2023-01-23 08:12:01 [INFO] User 'john_doe' logged in from IP: 190.132.1.100
2023-02-13 08:15:32 [ERROR] Database connection failed: Timeout error
2023-02-20 08:20:45 [WARNING] Too many login attempts from IP: 203.0.113.50
2023-02-29 08:25:19 [INFO] User "jane_smith" updated profile information
2023-02-30 08:30:00 [ERROR] Server overload detected, please investigate
2023-03-02 08:35:12 [INFO] User "david_johnson" uploaded file: report.pdf
2023-03-10 08:40:58 [ERROR] Access denied for user 'emily_brown' from IP: 198.51.100.25
2023-03-22 08:45:23 [INFO] User "michael_lee" downloaded file: presentation.pptx
Above, we use cat to display the contents of the logs.txt file.
Here, let’s extract the dates from the sample file using a capturing group:
$ grep -oP '(\d{4}-\d{2}-\d{2})' logs.txt
2022-01-24
2022-03-26
...
2023-03-10
2023-03-22
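Now, let’s explain the above command: the -o option prints only the matched text, and -P enables PCRE so that \d matches a digit. The pattern \d{4}-\d{2}-\d{2} matches four digits, a hyphen, two digits, another hyphen, and two more digits, and the parentheses wrap it in a single capturing group.
The above command searches logs.txt for lines that contain a date in the format YYYY-MM-DD. Once a match is found, grep prints only the matched part of the line, which here is exactly the content of the capturing group.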
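It’s worth checking how the parentheses affect what -o prints. The sketch below, assuming GNU grep with PCRE support, runs three variants of the date pattern against two lines copied from logs.txt. The temporary file and variable names are only for illustration:

```shell
# Two lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
EOF

# Group around the whole date
grouped=$(grep -oP '(\d{4}-\d{2}-\d{2})' "$log")
# No group at all
ungrouped=$(grep -oP '\d{4}-\d{2}-\d{2}' "$log")
# Group around the year only: -o still prints the full match
year_group=$(grep -oP '(\d{4})-\d{2}-\d{2}' "$log")

printf '%s\n' "$grouped" "$ungrouped" "$year_group"
rm -f "$log"
```

All three variants print the full dates, which shows that -o reports the whole match regardless of where the groups are.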
Let’s extract the time information from the sample file:
$ grep -oP '\b(\d{2}:\d{2}:\d{2})\b' logs.txt
08:15:23
08:20:11
...
08:40:58
08:45:23
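Let’s examine the above command: \b asserts a word boundary on each side of the pattern, and \d{2}:\d{2}:\d{2} matches three pairs of digits separated by colons. The capturing group wraps the whole time value.
Above, we search the logs.txt file and extract the time in HH:MM:SS format from each log entry.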
Log levels such as INFO, ERROR, and WARNING in log entries are enclosed within square brackets. Here, we’ll use a capturing group to extract these log levels:
$ grep -oP '\[([A-Z]+)\]' logs.txt
[INFO]
[ERROR]
...
[ERROR]
[INFO]
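Let’s break down the command: \[ and \] match literal square brackets, and ([A-Z]+) captures one or more uppercase letters between them.
The above command extracts all the log levels from the logs.txt file. Because -o prints the whole match, the output includes the square brackets and not just the content of the capturing group.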
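If we want the log level without the brackets, one option is to keep the brackets out of the reported match using PCRE’s \K and a lookahead instead of a capturing group. This is a sketch of that alternative, assuming GNU grep with PCRE support, and it uses three lines copied from logs.txt:

```shell
# Three lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.
EOF

# \[      match the opening bracket ...
# \K      ... then drop everything matched so far from the reported match
# [A-Z]+  the part we want to print
# (?=\])  require a closing bracket without including it in the match
levels=$(grep -oP '\[\K[A-Z]+(?=\])' "$log")

printf '%s\n' "$levels"
rm -f "$log"
```

Here, grep prints INFO, ERROR, and WARNING on separate lines, without the surrounding brackets.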
IP addresses follow a specific pattern: four sets of digits separated by periods. Using a capturing group, let’s extract the IP addresses in the logs.txt file:
$ grep -oP '\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b' logs.txt
192.168.1.100
10.0.0.1
190.132.1.100
203.0.113.50
198.51.100.25
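Let’s understand the above command: \d{1,3} matches one to three digits, \. matches a literal period, and the \b anchors prevent matches that start or end in the middle of a longer run of word characters. The pattern doesn’t check that each number falls in the 0-255 range.
Above, we use a capturing group to capture and extract all the IP addresses in the logs.txt file.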
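Sometimes only the addresses in a particular context are of interest, for instance those that follow the literal text “IP: ”. As a sketch, assuming GNU grep with PCRE support, we can match that prefix and then discard it with \K. The three input lines are copied from logs.txt:

```shell
# Three lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2023-01-12 08:45:33 [DEBUG] Processing request from IP address 10.0.0.1.
2023-01-23 08:12:01 [INFO] User 'john_doe' logged in from IP: 190.132.1.100
2023-02-20 08:20:45 [WARNING] Too many login attempts from IP: 203.0.113.50
EOF

# 'IP: ' must precede the address, but \K keeps it out of the printed match.
# (?:...) is a non-capturing group repeated three times for the last octets.
ips=$(grep -oP 'IP: \K\d{1,3}(?:\.\d{1,3}){3}\b' "$log")

printf '%s\n' "$ips"
rm -f "$log"
```

The first line is skipped because its address follows “IP address” rather than “IP: ”.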
We can extract multiple pieces of information from the logs.txt file using multiple capturing groups. To illustrate, let’s extract the date, time, and log level:
$ grep -oP '(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) \[([A-Z]+)\]' logs.txt
2022-01-24 08:15:23 [INFO]
2022-03-26 08:20:11 [ERROR]
...
2023-03-10 08:40:58 [ERROR]
2023-03-22 08:45:23 [INFO]
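Let’s examine this command: the first group captures the date, the second captures the time, and the third captures the log level between literal square brackets. A single space separates the groups.
Using the above command, we search the logs.txt file for lines containing a date in YYYY-MM-DD format, a time in HH:MM:SS format, and a log level enclosed in square brackets, all separated by spaces. Since -o prints the whole match, the output contains the three captured parts together with the spaces and brackets between them.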
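grep itself can’t print individual groups from a multi-group match, so picking, say, only the date and the log level needs a tool with back-references. The following sketch swaps grep for sed -E (extended regular expressions, supported by GNU and BSD sed) and runs on two lines copied from logs.txt:

```shell
# Two lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
EOF

# -n suppresses the default output, -E enables extended regex, and the
# trailing p prints only lines where the substitution succeeded.
# \1 is the date and \3 is the log level; \2 (the time) is left out.
picked=$(sed -nE 's/^([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2}) \[([A-Z]+)\].*/\1 \3/p' "$log")

printf '%s\n' "$picked"
rm -f "$log"
```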
In this section, we’ll combine grep with other commands such as awk and sed, which help us further process the output and isolate specific parts of a match.
awk is a command-line tool that enables us to perform complex pattern matching and manipulation. Here, we’ll use it with grep to extract and format content captured by a capturing group:
$ grep -oP '\b(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\b' logs.txt | awk -F '.' '{print $1}'
192
10
190
203
198
Let’s break down the command: -F '.' sets the awk field separator to a period, and {print $1} prints the first field of each input line.
In the above command, we use grep to search the logs.txt file and extract the IP addresses. Next, we pipe the output to the awk command, which isolates and prints the first set of numbers from each IP address.
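awk can also clean up and summarize what grep matched. As a rough sketch, assuming GNU grep with PCRE support, the pipeline below takes the bracketed log levels, strips the brackets with substr, and counts how often each level occurs. The four input lines are copied from logs.txt, and the sort at the end is only there to make the output order predictable:

```shell
# Four lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
2022-05-29 08:25:45 [WARNING] Invalid input received from IP address 192.168.1.100.
2022-06-13 08:30:02 [INFO] User "bob" updated profile information.
EOF

# grep prints matches such as [INFO]; awk drops the first and last character
# of each match and tallies the remaining text in an associative array.
counts=$(grep -oP '\[([A-Z]+)\]' "$log" |
  awk '{ level = substr($0, 2, length($0) - 2); count[level]++ }
       END { for (l in count) print l, count[l] }' |
  sort)

printf '%s\n' "$counts"
rm -f "$log"
```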
sed is a command-line tool that performs operations such as search and replace on files or an input stream. In this case, we’ll combine it with grep to process and manipulate the capturing group output.
To demonstrate, let’s extract the year part from a date:
$ grep -oP '\b(\d{4})-(\d{2})-(\d{2})\b' logs.txt | sed 's/\([0-9]\{4\}\)-.*/\1/'
2022
2022
...
2023
2023
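Let’s understand the command: in the sed expression, \([0-9]\{4\}\) captures four digits using basic regular expression syntax, -.* matches the rest of the date, and \1 replaces the whole match with the content of the first group.
Above, we use grep to search the logs.txt file and extract dates in the format YYYY-MM-DD. Next, we pipe the output to the sed command, which extracts only the year from those dates and prints them out.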
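Back-references also let us rearrange several groups at once. The sketch below, assuming GNU grep with PCRE support and a sed that accepts -E, rewrites each date as DD/MM/YYYY. The target format and the | delimiter are choices made for this example, and the input lines are copied from logs.txt:

```shell
# Two lines copied from logs.txt, written to a temporary file
log=$(mktemp)
cat > "$log" <<'EOF'
2022-01-24 08:15:23 [INFO] User "alice" logged in successfully.
2022-03-26 08:20:11 [ERROR] Database connection failed: Connection timed out.
EOF

# grep extracts the leading date; sed captures year, month, and day as
# groups 1-3 and prints them in reverse order. Using | as the s-command
# delimiter avoids escaping the slashes in the replacement.
reordered=$(grep -oP '^\d{4}-\d{2}-\d{2}' "$log" |
  sed -E 's|([0-9]{4})-([0-9]{2})-([0-9]{2})|\3/\2/\1|')

printf '%s\n' "$reordered"
rm -f "$log"
```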
In this article, we explored how to use the grep command to output the content of a capturing group. Using capturing groups within regular expressions and grep’s -o and -P options, we extracted some information from the sample file. Lastly, we combined grep with the awk and sed commands.