1. Introduction

Text processing tasks often involve finding and extracting substrings that follow a specific pattern from files containing structured or unstructured data. The substring may appear inside or outside quotes.

In this tutorial, we’ll learn different ways to use the grep command to find a substring between quotes.

2. Sample Dataset

To illustrate the different ways to extract the substring between quotes, we’ll use the following file named sample.txt:

$ cat sample.txt 
"a lazy dog"
a lazy dog
"the energetic dog"
the energetic dog 
"When zombies arrive"
When zombies arrive

3. Find a Substring Anywhere Between the Quotes

We’ll start by finding a substring between the quotes. Let’s begin with a basic regular expression that describes our target substring:

$ grep '".*lazy.*"' sample.txt
"a lazy dog" 

In this example, we’re looking for the word lazy present between the quotes. Let’s take a closer look at the command:

  • grep is the command used to find patterns in files
  • the pattern to match is enclosed in single quotes (' ')
  • each double quote (") in the pattern matches a literal double quote character, one before and one after the substring
  • .* matches any character except newline zero or more times
  • sample.txt is the file we want to search for patterns

We can also use Perl-compatible regular expressions to find matches for our target substring. This is illustrated in the following example:

$ grep -P '"[^"]+lazy[^"]+"' sample.txt
"a lazy dog" 

Both commands print the same line here, but the second pattern is more specific: [^"]+ matches one or more characters that are not a double quote ("), so the match cannot run past a quote character. The -P option enables Perl-compatible regular expressions. Thus, in this example, we search for the substring lazy between double quotes, surrounded on each side by one or more characters that are not a double quote.
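To make the difference concrete, here is a minimal sketch, assuming GNU grep built with -P support. The file edge.txt and its contents are hypothetical and not part of sample.txt. It counts matching lines with -c for a case where the quoted text is exactly the word lazy:

```shell
# Hypothetical input: in the first line, nothing surrounds the word inside its quotes.
tmp=$(mktemp -d)
printf '%s\n' '"lazy"' '"a lazy dog"' > "$tmp/edge.txt"

# .* may match zero characters, so both lines match.
loose=$(grep -c '".*lazy.*"' "$tmp/edge.txt")

# [^"]+ needs at least one non-quote character on each side,
# so only the second line matches.
strict=$(grep -cP '"[^"]+lazy[^"]+"' "$tmp/edge.txt")

echo "loose=$loose strict=$strict"
```

If a quote that contains only the word itself should also count, replacing each + with * in the second pattern allows zero surrounding characters.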

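We can remove the + after the bracket expression to match exactly one character that isn’t a double quote. Let’s try an example with multiple characters before our substring and just one character after it:

$ grep -P '"[^"]+arriv[^"]"' sample.txt
"When zombies arrive"

Here, the only character between arriv and the closing quote is e. A pattern such as '"[^"]+zombies[^"]"' finds nothing in sample.txt, because zombies is followed by more than one character before the closing quote.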

The grep command, by default, prints the complete lines containing the substring. But we can use the -o option to print only parts of lines that match our pattern.
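As a rough illustration of -o, the sketch below assumes GNU grep with -P support and uses a hypothetical line that contains two quoted phrases (the file two.txt is not part of our dataset). With -o, the difference between the greedy .* pattern and the [^"]+ pattern becomes visible in the output:

```shell
# Hypothetical input: one line with two quoted phrases.
tmp=$(mktemp -d)
printf '%s\n' '"a lazy dog" and "a cat"' > "$tmp/two.txt"

# Greedy .* runs from the first quote to the last quote on the line.
greedy=$(grep -o '".*lazy.*"' "$tmp/two.txt")

# [^"]+ cannot cross a quote, so the match stops at the closing quote.
bounded=$(grep -oP '"[^"]+lazy[^"]+"' "$tmp/two.txt")

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The first command prints the whole line, while the second prints only "a lazy dog".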

4. Find a Substring at the Beginning or End of a Quote

We can use regular expressions to find the substring from the start or end of a quote. Below is an example of this:

$ grep -P '"[^"]+arrive"' sample.txt
"When zombies arrive"

Here, we used the regular expression to find a substring at the end of a quote. Similarly, if we remove the [^"]+ from the beginning of our substring and add it afterward, the command will search for the substring at the beginning of a quote.
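The mirrored case can be sketched as follows, assuming GNU grep with -P support. The snippet recreates sample.txt in a temporary directory so it can run on its own, and the choice of When as the target word is ours:

```shell
# Recreate the sample dataset in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
"a lazy dog"
a lazy dog
"the energetic dog"
the energetic dog
"When zombies arrive"
When zombies arrive
EOF

# The substring directly follows the opening quote;
# [^"]+ covers the rest of the quoted text.
start_match=$(grep -P '"When[^"]+"' "$tmp/sample.txt")

echo "$start_match"
```

Only the quoted line is printed, since the unquoted When zombies arrive has no opening quote before When.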

5. Positive Lookbehind and Lookahead Assertions in grep

In grep -P, the \K escape works like a positive lookbehind: it resets the start of the match, so everything matched before it is left out of the output. Let’s understand this with an example:

$ grep -oP '"a \Klazy[^"]+"' sample.txt
lazy dog"

This command finds the substring a lazy within the quotes. The \K placed just before the word lazy resets the start of the match, so the part of the pattern that precedes it ("a ) is required to be present but is excluded from the output.

Let’s take a closer look at the new option used in this command:

  • -o displays only the matching part of the line.
  • \K acts like a lookbehind assertion: it resets the start of the match, excluding the preceding pattern from the final output.
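For comparison, PCRE also offers an explicit positive lookbehind, written (?<=...). The sketch below, assuming GNU grep with -P support, recreates sample.txt and shows that both forms give the same output for a fixed prefix. As far as we know, the text inside (?<=...) has to have a bounded length, so \K is the usual choice when the prefix is something open-ended like [^"]+:

```shell
# Recreate the sample dataset in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
"a lazy dog"
a lazy dog
"the energetic dog"
the energetic dog
"When zombies arrive"
When zombies arrive
EOF

# Explicit lookbehind: the fixed prefix "a  must precede lazy.
with_lookbehind=$(grep -oP '(?<="a )lazy[^"]+"' "$tmp/sample.txt")

# \K: match the same prefix, then drop it from the reported match.
with_k=$(grep -oP '"a \Klazy[^"]+"' "$tmp/sample.txt")

# \K after a variable-length prefix: any quoted text ending in " dog".
variable=$(grep -oP '"[^"]+ \Kdog"' "$tmp/sample.txt")

printf '%s\n' "$with_lookbehind" "$with_k" "$variable"
```

The first two commands each print lazy dog", and the third prints dog" once for each of the two quoted dog lines.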

Similarly, we can also use the lookahead assertion to exclude the succeeding pattern from the match. Below is an example showing the use of lookahead assertion in the grep command:

$ grep -oP '"[^"]+lazy (?=dog")' sample.txt
"a lazy 

The lookahead assertion in this example, (?=dog"), matches only if the next characters are dog", without including them in the match itself. Thus, the final output of this command excludes dog".

These lookbehind and lookahead assertions play an important role when we need to distinguish between two substrings, like a lazy dog and the energetic dog. Let’s say we want to print only dog" from all occurrences of the energetic dog:

$ grep -oP '"the energetic \Kdog"' sample.txt
dog"

This will print only dog" from all the quoted occurrences of the energetic dog in the sample.txt file.
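The two techniques can also be combined. As a minimal sketch, assuming GNU grep with -P support, the following recreates sample.txt and uses \K after the opening quote plus a lookahead for the closing quote, so that only the text between the quotes is printed:

```shell
# Recreate the sample dataset in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/sample.txt" <<'EOF'
"a lazy dog"
a lazy dog
"the energetic dog"
the energetic dog
"When zombies arrive"
When zombies arrive
EOF

# "\K      require an opening quote, then drop it from the match
# [^"]+    the quoted text itself
# (?=")    require a closing quote without including it
inner=$(grep -oP '"\K[^"]+(?=")' "$tmp/sample.txt")

echo "$inner"
```

For our dataset, this prints a lazy dog, the energetic dog, and When zombies arrive, one per line, without the surrounding quotes. The unquoted lines are skipped because they contain no opening quote.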

6. Conclusion

In this article, we learned how to find substrings within quotes using the grep command. Moreover, we discussed the usage of positive lookbehind and lookahead assertions in the grep command to exclude the start and end of the substring, respectively.
