How to Find and Extract Whole Words Containing a Substring

1. Overview

In Linux, working with text data is a common task for administrators and other users. For instance, it’s common practice to extract a specific piece of information from a text file. One such task involves extracting whole words containing a specific substring, which is very useful when analyzing or manipulating data.

In this tutorial, we’ll explore different methods we can use to find and extract whole words containing a specific substring from a file. To demonstrate, we’ll make use of grep, awk, sed, perl, and Python in the command line.

2. Sample File

To illustrate, we’ll use a sample file named Nature.txt. First, let’s check its contents with cat:

$ cat Nature.txt 
The autumn leaves floated in the crisp air, 
their vibrant hues a symphony of orange, red, and gold. 
They swirled and dipped, carried by the playful breeze, 
whispering secrets to the ancient trees.

In the following sections, we’ll extract whole words containing a specified substring from this file.

3. Using the grep Command

grep is a command-line utility that searches files for lines matching a pattern. Here, we’ll combine it with regular expressions to extract words containing a specific substring.

Now, let’s use grep to extract all words containing the substring “oot”:

$ grep -o '\w*oot\w*' Nature.txt 
underfoot
hooting

The -o option instructs grep to only output the matched parts of the line instead of the entire line.

Next, \w*oot\w* is a regular expression that matches the substring “oot” together with any word characters (letters, digits, and underscores) immediately before and after it, so each match is a whole word.
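As a small, self-contained check of this kind of pattern, the sketch below pipes the four sample lines from Section 2 into grep instead of reading a file. The sample variable and the substring “is” are chosen here purely for illustration, and \w is assumed to be available, as it is in GNU grep.

```shell
# The four lines of Nature.txt shown in Section 2, kept in a variable
# so the sketch does not depend on a file (illustrative setup).
sample='The autumn leaves floated in the crisp air,
their vibrant hues a symphony of orange, red, and gold.
They swirled and dipped, carried by the playful breeze,
whispering secrets to the ancient trees.'

# -o prints every match on its own line; \w* on both sides of "is"
# extends each match to the surrounding word characters.
is_words=$(printf '%s\n' "$sample" | grep -o '\w*is\w*')
printf '%s\n' "$is_words"
```

With these four lines, the command prints crisp and whispering, each on its own line.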

We can also use an alternative approach:

$ grep -o '\<[[:alpha:]]*ance[[:alpha:]]*\>' Nature.txt 
danced
distance

Let’s examine this command:

\< and \> – match the beginning and end of a word, ensuring we get the whole word
[[:alpha:]]* – matches zero or more alphabetic characters (letters) before and after the substring
ance – the specific substring we’re looking for
Nature.txt – the name of the file in which we’re searching for the specified pattern

The above command extracts all the words that contain the “ance” substring in our input file.
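The two patterns aren’t interchangeable: \w also matches digits and underscores, whereas [[:alpha:]] matches letters only. The following sketch uses a made-up input line rather than Nature.txt to show how that changes the result for the substring “name”. It assumes a grep that supports \w, \<, and \>, such as GNU grep.

```shell
# Made-up input (not from Nature.txt) mixing letters, digits, and underscores.
line='file_name2 rename name'

# \w* accepts letters, digits, and underscores around the substring.
with_w=$(printf '%s\n' "$line" | grep -o '\w*name\w*')

# [[:alpha:]]* accepts letters only, and \< \> require the match to span
# the whole word, so file_name2 is skipped.
with_alpha=$(printf '%s\n' "$line" | grep -o '\<[[:alpha:]]*name[[:alpha:]]*\>')

printf 'word-character pattern:\n%s\n' "$with_w"
printf 'letters-only pattern:\n%s\n' "$with_alpha"
```

The first pattern reports file_name2, rename, and name, while the second reports only rename and name. Which one is preferable depends on whether identifiers with digits or underscores should count as words.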

4. Using the awk Command

awk is a powerful tool used for pattern scanning and extracting text. It allows us to specify patterns and actions to perform on the input data and processes text line by line.

Now, let’s extract all words containing the substring “ai”:

$ awk '{ for (i=1; i<=NF; i++) if ($i ~ /ai/) print $i }' Nature.txt
air,
tail
against

Let’s understand the above command:

for (i=1; i<=NF; i++) – represents a loop that iterates through each whitespace-separated field (word) in a line, where NF holds the number of fields
if ($i ~ /ai/) – represents a condition that checks if the current word ($i) contains the substring “ai”
print $i – used to print out the current word if the condition is true
Nature.txt – represents the input file that awk is processing

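Above, we extract all the words that contain the substring “ai” in the Nature.txt file. Because awk splits each line on whitespace, any punctuation attached to a word, such as the comma in “air,”, stays in the output.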
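The /ai/ test treats the substring as a regular expression. As a sketch of an alternative, awk’s index() function performs a literal substring test, which avoids surprises if the substring contains characters such as . or *. The sample variable, the needle variable name, and the substring “ed” are illustrative choices, not part of the commands above.

```shell
# The four lines of Nature.txt shown in Section 2 (illustrative setup).
sample='The autumn leaves floated in the crisp air,
their vibrant hues a symphony of orange, red, and gold.
They swirled and dipped, carried by the playful breeze,
whispering secrets to the ancient trees.'

# -v passes the substring into awk as the variable "needle";
# index() returns a nonzero position when needle occurs in the field.
ed_words=$(printf '%s\n' "$sample" |
    awk -v needle='ed' '{ for (i = 1; i <= NF; i++) if (index($i, needle)) print $i }')
printf '%s\n' "$ed_words"
```

For the four sample lines, this prints floated, red, swirled, dipped, and carried, with the commas after red and dipped still attached.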

Furthermore, we can sort the output in ascending or descending order:

$ awk '{ for (i=1; i<=NF; i++) if ($i ~ /ai/) print $i }' Nature.txt | sort
against
air,
mosaic

In the above example, we pipe the output of awk to the sort command, which then sorts the output in ascending order. To sort in descending order instead, we can add the -r option to sort.

5. Combining grep With the awk Command

Here, we’ll use grep to select the lines containing the substring and awk to extract the matching words from those lines:

$ grep 'ie' Nature.txt | awk '{for(i=1;i<=NF;i++) if($i ~ /ie/) print $i}'
carried
ancient

First, we use the grep 'ie' Nature.txt command to select only the lines in the Nature.txt file that contain the substring “ie”.

Then, we pipe the output of the grep command as input to awk.

Next, the awk '{for(i=1;i<=NF;i++) if($i ~ /ie/) print $i}' command iterates through each word in each line and prints only the words that contain the substring “ie”.

6. Using the sed Command

sed is a powerful command-line tool used to parse and perform text transformations on an input stream. It allows us to substitute and manipulate text. In this example, we’ll use it to extract whole words containing the substring “ir”:

$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p'
air,
their

Let’s break down the above command:

-e – specifies the script to be executed
's/ /\n/g' – represents the substitution expression used to replace each space with a newline, so that every word ends up on its own line
Nature.txt – this is the input file
| – pipes the output of the first sed command to the next command
-n – suppresses sed’s automatic printing, so only the lines we print explicitly appear in the output
/ir/ – represents a regular expression pattern that selects lines containing the substring “ir”
p – prints each line selected by the pattern

Using the above command, we print out all the words in our input file that contain the substring “ir”.
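Interpreting \n as a newline in the replacement part of s/ /\n/g is GNU sed behavior, and other sed implementations may treat it differently. Where portability matters, one possible alternative, sketched here rather than taken from the commands above, is to let tr split the words and keep sed only for filtering. The sample variable and the substring “is” are illustrative.

```shell
# The four lines of Nature.txt shown in Section 2 (illustrative setup).
sample='The autumn leaves floated in the crisp air,
their vibrant hues a symphony of orange, red, and gold.
They swirled and dipped, carried by the playful breeze,
whispering secrets to the ancient trees.'

# tr turns each space into a newline (-s squeezes repeated newlines),
# then sed -n with the p command prints only the lines containing "is".
is_lines=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | sed -n '/is/p')
printf '%s\n' "$is_lines"
```

For the four sample lines, this prints crisp and whispering.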

Furthermore, we can arrange the results in ascending or descending order:

$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p' | sort
air,
squirrel
swirled

Above, we pipe the sed results to the sort command, which sorts the results in ascending order.

Additionally, we can save the results to an output file:

$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p' | sort > extracted_words.txt

Here, we use the > operator to redirect our results to a file named extracted_words.txt.

7. Combining the grep, awk, and sed Commands

We’ll use piping to combine the commands:

$ grep 'own' Nature.txt | awk '{for(i=1;i<=NF;i++) if($i ~ /own/) print $i}' | sed 's/[^a-zA-Z0-9]//g'
crowned
down

Above, we use grep to select the lines that contain the substring “own” in our input file.

We then pipe the output of grep to awk, which processes each line of the output and prints out the words that match the pattern “own”.

Finally, we pipe the output of the awk command to sed, which deletes every character that isn’t alphanumeric, stripping punctuation such as commas and periods from the extracted words.
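When the same pipeline is needed for several substrings, it can be wrapped in a small shell function. The sketch below is one way to do that, under a few assumptions: the words_containing name is hypothetical, a temporary file created with mktemp stands in for Nature.txt, and the sort -u step is an addition that removes duplicate words.

```shell
# Hypothetical helper: print the cleaned, unique words in file $2 that contain $1.
# Note that $1 is treated as a regular expression by both grep and awk.
words_containing() {
    grep -- "$1" "$2" |
        awk -v pat="$1" '{ for (i = 1; i <= NF; i++) if ($i ~ pat) print $i }' |
        sed 's/[^a-zA-Z0-9]//g' |
        LC_ALL=C sort -u
}

# Temporary stand-in for Nature.txt holding the four lines from Section 2.
tmp=$(mktemp)
printf '%s\n' \
    'The autumn leaves floated in the crisp air,' \
    'their vibrant hues a symphony of orange, red, and gold.' \
    'They swirled and dipped, carried by the playful breeze,' \
    'whispering secrets to the ancient trees.' > "$tmp"

clean_words=$(words_containing ed "$tmp")
printf '%s\n' "$clean_words"
rm -f "$tmp"
```

For the four sample lines and the substring “ed”, the function prints carried, dipped, floated, red, and swirled in sorted order, without the trailing commas.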

8. Using perl

perl enables us to run scripts from the command line and perform tasks such as complex text manipulation. Its built-in support for regular expressions lets us use complex patterns to extract text.

Now, let’s explore how to use perl to extract whole words containing the substring “nt”:

$ perl -nle 'print for /\b\w*nt\w*\b/g' Nature.txt
vibrant
ancient
testament

Let’s understand the above command:

-n – tells perl to process the input file line by line
-l – strips the trailing newline from each input line and adds a newline character to each print statement
-e – allows us to specify the code directly on the command line
print for – prints each match found by the regular expression
/\b\w*nt\w*\b/g – represents the regular expression we’re using to look for words containing the substring “nt”, where \b marks a word boundary and the g modifier returns every match on a line
Nature.txt – represents the file we want to process using the perl command

Above, we use a regular expression to match whole words containing the substring “nt” in our input file and then print them out in our terminal.

In addition, we can sort the output alphabetically and save it to an output file:

$ perl -nle 'print for /\b\w*nt\w*\b/g' Nature.txt | sort > extracted_words.txt

Here, we use the sort command to arrange the output alphabetically while > extracted_words.txt redirects the sorted output to a file named extracted_words.txt.

9. Using Python

Python is a high-level programming language used to write scripts and automate tasks. In this case, we’ll use its re module to extract words containing a substring. For this to work, we need Python 3 installed on our system.

On Ubuntu/Debian-based distributions, we’ll use the apt package manager. First, we’ll update the package list to get the latest information about the available packages:

$ sudo apt update

Next, let’s install Python 3:

$ sudo apt install python3

Alternatively, on Fedora, we’ll use dnf:

$ sudo dnf install python3

Lastly, on Arch Linux, we’ll use pacman:

$ sudo pacman -S python

Once Python is installed, we can go ahead and extract words containing the substring “an”:

$ python3 -c "import re; print('\n'.join(re.findall(r'\b\w*an\w*\b', open('Nature.txt').read())))"
danced
vibrant

Let’s take a look at each part of our Python script:

python3 – executes the Python 3 interpreter
-c – allows us to provide a command directly as a string instead of writing it as a separate script
import re – used to import the regular expression (regex) module
open('Nature.txt').read() – opens the Nature.txt file and reads its content
re.findall(r'\b\w*an\w*\b', …) – searches for all words containing the substring “an” in the text using a regular expression pattern
'\n'.join(…) – joins the matched words into a single string, with each word on a new line
print(…) – prints out the matched words

Above, we utilize python3 to extract all words that contain the substring “an” in our input file and print them on our terminal.

10. Conclusion

In this article, we discussed extracting whole words containing a specific substring from a file in Linux using grep, awk, sed, perl, and Python. All of these methods rely on regular expressions to identify the words that match a specific pattern, so we can choose whichever approach suits our preference.
