1. Overview

In this tutorial, we’ll cover how we can search for text that matches a multi-line pattern in a file. We’ll use various tools such as grep and awk, which are easily accessible in Linux.

Moreover, we’ll be using the contents of the journald log as an example throughout the tutorial.

2. Using grep

The grep utility finds patterns in a text file or in standard input. Most Linux distros come with grep pre-installed. If it’s missing, we can install it from the distro’s official repository.

The general syntax of grep is pretty straightforward:

grep [OPTIONS] PATTERNS [FILE]

As an example, let’s search for “dhcp6” in the log file generated by journald:

$ grep dhcp6 log
Dec 23 23:25:51 hey NetworkManager[481]: <info>  [1640283951.9930] dhcp6 (enp5s0): activation: beginning transaction (timeout in 45 seconds)
Dec 23 23:26:37 hey NetworkManager[481]: <warn>  [1640283997.0312] dhcp6 (enp5s0): request timed out
...

There are different variants of grep that are built for different purposes. First, we’ll take a look at the standard grep to find multi-line patterns in a file, and then, we’ll move on to use pcregrep.

2.1. Using grep with the -P Option

The problem with grep’s default regular expressions is that grep matches one line at a time, so a pattern can’t span several lines. While it’s possible to chain several grep calls to achieve the required result, it’s more convenient to use the -P or --perl-regexp option. The -P option makes grep interpret the pattern as a Perl-Compatible Regular Expression (PCRE). It’s a GNU grep feature and is only available when grep is built with PCRE support. Combined with the -z option, it lets us write patterns that contain newlines.

Let’s say we want to find the last entry of December 24 and the first entry of December 25 in the journald log file. We can do so with grep:

$ grep -Pzo 'Dec 24.*\nDec 25.*\n' log
Dec 24 23:53:31 hey rtkit-daemon[447]: Supervising 8 threads of 5 processes of 1 users.
Dec 25 00:00:31 hey systemd[1]: Starting Daily man-db regeneration...

Let’s break this command down:

  • The -P option makes grep interpret the pattern as a PCRE
  • The -z option makes grep treat input and output as NUL-terminated instead of newline-terminated, so a file without NUL characters is read as a single record and the pattern can contain \n
  • The -o option makes grep print only the matched text instead of the whole record
  • The pattern starts with the literal text Dec 24
  • .* matches any run of characters except a newline, so it stops at the end of the line
  • .*\n in our case matches the rest of the Dec 24 line, including its newline

Now, when we combine this regular expression with Dec 25.*\n, we direct grep to match a line that contains Dec 24, through to the end of that line, immediately followed by a line that starts with Dec 25 and ends with a newline. Only the last Dec 24 entry is directly followed by a Dec 25 entry, so the output contains exactly those two entries from the log file.
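To make the behavior of -z and -o easier to see, here’s a minimal sketch that runs the same pattern against a small, made-up log. The sample entries and the variable names log and match are assumptions for illustration, not part of the journald log used above. Because -z ends each match with a NUL character, the sketch pipes the output through tr to turn it into a newline:

```shell
# Made-up sample log (not the real journald output from this tutorial).
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 24 23:10:00 hey app[1]: an earlier entry
Dec 24 23:59:00 hey app[1]: last entry of the 24th
Dec 25 00:00:05 hey app[1]: first entry of the 25th
Dec 25 00:01:00 hey app[1]: a later entry
EOF

# -z makes the whole file one NUL-terminated record, so \n can appear in
# the pattern; tr converts the NUL that grep appends to the match.
match=$(grep -Pzo 'Dec 24.*\nDec 25.*\n' "$log" | tr '\0' '\n')
printf '%s\n' "$match"
rm -f "$log"
```

This should print only the 23:59:00 and 00:00:05 entries, since the earlier Dec 24 line is followed by another Dec 24 line and therefore can’t match.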

2.2. Using pcregrep

The pcregrep utility is a grep variant built on the PCRE library (libpcre), which implements regular expressions with the same syntax and semantics as Perl 5. It’s usually shipped as a separate package. The advantage of using pcregrep is that multi-line matching is built in, so we don’t have to refine the matched text with the -Pzo options. It may also be slightly faster than grep -P for some patterns.

We can enable the multi-line search pattern in pcregrep by providing the -M option. As an example, let’s find the entries of December 25 between 17:10:16 and 17:44:39 in the log file:

$ pcregrep -M 'Dec 25 17:10:16.*(\n|.)*Dec 25 17:44:39' log
Dec 25 17:10:16 hey NetworkManager[259]: <warn>  [1640434216.0663] dhcp6 (enp5s0): request timed out
Dec 25 17:10:16 hey NetworkManager[259]: <info>  [1640434216.0664] dhcp6 (enp5s0): state changed unknown -> timeout
...
Dec 25 17:44:39 hey NetworkManager[259]: <info>  [1640436279.6494] manager: NetworkManager state is now CONNECTED_GLOBAL

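As we can see, the pattern is pretty much the same except for the (\n|.)* part, which matches every character, including newlines, between the two given times. Since the quantifier is greedy, the match extends to the last line that contains the end timestamp.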
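If pcregrep isn’t installed, a similar range match can be sketched with GNU grep alone. This is an alternative under stated assumptions, not the pcregrep method itself: it relies on the PCRE inline flag (?s), which lets . match newlines, and on a lazy .*? so that the match stops at the first line with the end timestamp. The sample entries and the variable names log and window are made up for illustration:

```shell
# Made-up sample log (not the real journald output from this tutorial).
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 25 17:10:16 hey nm[1]: request timed out
Dec 25 17:10:16 hey nm[1]: state changed
Dec 25 17:30:00 hey app[2]: an entry in between
Dec 25 17:44:39 hey nm[1]: connected
Dec 25 17:50:00 hey app[2]: after the window
EOF

# (?s)      lets . match newlines as well
# .*?       lazily consumes text up to the first end timestamp
# [^\n]*\n  finishes the line that contains the end timestamp
window=$(grep -Pzo '(?s)Dec 25 17:10:16.*?Dec 25 17:44:39[^\n]*\n' "$log" | tr '\0' '\n')
printf '%s\n' "$window"
rm -f "$log"
```

With this sample, the output should be the first four entries, and the 17:50:00 entry is left out.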

3. Using awk

AWK is a domain-specific language (DSL) for text processing. It’s a standard UNIX utility that comes with most Linux distros. We choose awk over grep when we need more than plain matching, because it’s a complete language with fields, variables, and actions, and for some tasks, it can also be faster. So, when something isn’t easily doable with grep, we move on to awk.

The basic usage of the awk command is:

awk [OPTIONS] '/pattern/ { action $COLUMN }' [FILE]

We match text based on a specific pattern and then take further processing actions on the matched text. For instance, we can extract the total physical memory from the free command:

$ free
      total   used    free    shared  buff/cache  available
Mem:  8058024 3414968 1713536 61564   2929520     4273328
$ free | awk '/Mem:/ { print $2 }'
8058024

Here, awk matched the pattern Mem: and then printed the second whitespace-separated field of the matched line.
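As a small variation, an action can combine several fields of the matched line. The sketch below feeds awk the Mem: line copied from the output above through printf, so it doesn’t depend on the free command being available. The variable names mem_line and summary are assumptions for illustration:

```shell
# Sample line copied from the free output shown above.
mem_line='Mem:  8058024 3414968 1713536 61564   2929520     4273328'

# $2 is the total and $3 is the used memory; the comma in print
# separates the two values with a single space.
summary=$(printf '%s\n' "$mem_line" | awk '/^Mem:/ { print "total=" $2, "used=" $3 }')
printf '%s\n' "$summary"
```

This prints total=8058024 used=3414968.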

So, how do we search for text that matches a multi-line pattern? Well, we can use the same syntax with a range pattern, which consists of a starting and an ending pattern:

awk [OPTIONS] '/start/,/end/ { action $COLUMN }' [FILE]

Mind the comma that separates the two patterns. Now, let’s print all the kernel logs from December 25 between 17:00:00 and 18:00:00:

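$ awk '/Dec 25 17:[0-9][0-9]:[0-9][0-9]/,/Dec 25 18:[0-9][0-9]:[0-9][0-9]/' log | grep kernel
Dec 25 17:09:19 hey kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
...
Dec 25 19:30:39 hey kernel: perf: interrupt took too long (3966 > 3952), lowering kernel.perf_event_max_sample_rate to 50400

Each [0-9] in the regular expression matches exactly one digit. In awk’s regular expressions, ? is a quantifier that makes the preceding item optional. It isn’t a single-character wildcard as in shell globs, so we can’t use it as a placeholder here. Since awk reads the file line by line, the range opens at the first line that matches the start pattern and closes at the next line that matches the end pattern, and both of these lines are included. If no later line matches the end pattern, the range stays open until the end of the file, which would explain entries after 18:00 such as the 19:30:39 line above. We then hand over the output to the grep command, which prints the lines with the keyword “kernel” in them. Of course, we could have used an awk action to do the job of grep, but it makes more sense to use grep to avoid making the expression more complex.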
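For comparison, here’s a minimal sketch of folding the grep step into the awk action. It assumes the journald line layout shown above, where the fifth whitespace-separated field is the process name (for example, kernel:). The sample entries and the variable names log and kernel_lines are made up for illustration:

```shell
# Made-up sample log (not the real journald output from this tutorial).
log=$(mktemp)
cat > "$log" <<'EOF'
Dec 25 16:59:58 hey kernel: before the window
Dec 25 17:09:19 hey kernel: inside the window
Dec 25 17:20:00 hey systemd[1]: inside, but not from the kernel
Dec 25 17:45:10 hey kernel: still inside
Dec 25 18:00:02 hey kernel: first line matching the end pattern
Dec 25 18:30:00 hey kernel: after the window
EOF

# The range selects the lines; the action keeps only those whose fifth
# field is "kernel:". The line that matches the end pattern is included.
kernel_lines=$(awk '/Dec 25 17:[0-9][0-9]:[0-9][0-9]/,/Dec 25 18:[0-9][0-9]:[0-9][0-9]/ { if ($5 == "kernel:") print }' "$log")
printf '%s\n' "$kernel_lines"
rm -f "$log"
```

With this sample, the sketch prints the 17:09:19, 17:45:10, and 18:00:02 entries. The 18:30:00 entry is skipped because the range has already closed and that line doesn’t match the start pattern.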

4. Conclusion

In this tutorial, we saw how we could use common Linux command-line tools to carry out multi-line pattern searching. We covered GNU grep with the -P option, pcregrep, and awk for that purpose.
