Last updated: February 20, 2024
In Linux, working with text data is a common task for administrators and other users. For instance, it’s common practice to extract a specific piece of information from a text file. One such task involves extracting whole words containing a specific substring, which is very useful when analyzing or manipulating data.
In this tutorial, we’ll explore different methods we can use to find and extract whole words containing a specific substring from a file. To demonstrate, we’ll make use of grep, awk, sed, perl, and Python in the command line.
We’ll use a sample file named Nature.txt throughout. First, let’s check its contents with cat:
$ cat Nature.txt
The autumn leaves floated in the crisp air,
their vibrant hues a symphony of orange, red, and gold.
They swirled and dipped, carried by the playful breeze,
whispering secrets to the ancient trees.
In the following sections, we’ll extract whole words containing a given substring from this file.
grep is a command-line utility that searches files for lines matching a pattern. Here, we’ll combine it with regular expressions to extract words containing a specific substring.
Now, let’s use grep to extract all words containing the substring “oot”:
$ grep -o '\w*oot\w*' Nature.txt
underfoot
hooting
The -o option instructs grep to only output the matched parts of the line instead of the entire line.
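Next, \w*oot\w* is a regular expression that matches the substring “oot” together with any word characters (letters, digits, and underscores) immediately before and after it. Since \w* matches zero or more characters, the substring may sit at the start, middle, or end of the word.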
We can also use an alternative approach:
$ grep -o '\<[[:alpha:]]*ance[[:alpha:]]*\>' Nature.txt
danced
distance
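Let’s examine this command:
\< and \> match the empty string at the start and end of a word, [[:alpha:]]* matches zero or more alphabetic characters, and ance is the literal substring we’re looking for. As before, -o prints only the matched text.
Together, these extract all the words that contain the “ance” substring in our input file.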
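To see how the two patterns differ, here is a minimal sketch on a made-up line of text (the input, including the token dance_2, is hypothetical and not part of Nature.txt). It assumes GNU grep, which supports \w, \<, and \>:

```shell
# Hypothetical one-line input; not taken from Nature.txt
line='we danced at a distance from dance_2'

# \w also matches digits and underscores, so dance_2 is reported
with_w=$(printf '%s\n' "$line" | grep -o '\w*ance\w*')

# [[:alpha:]] accepts letters only, and \> rejects dance_2 because
# the underscore that follows "dance" is still a word character
with_alpha=$(printf '%s\n' "$line" | grep -o '\<[[:alpha:]]*ance[[:alpha:]]*\>')

echo "$with_w"
echo '--'
echo "$with_alpha"
```

The first pattern should report danced, distance, and dance_2, while the second should report only danced and distance, so the bracket-expression form is the stricter choice when identifiers or numbers are mixed into the text.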
awk is a powerful tool used for pattern scanning and extracting text. It allows us to specify patterns and actions to perform on the input data and processes text line by line.
Now, let’s extract all words containing the substring “ai”:
$ awk '{ for (i=1; i<=NF; i++) if ($i ~ /ai/) print $i }' Nature.txt
air,
tail
against
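Let’s understand the above command:
NF holds the number of whitespace-separated fields in the current line, the for loop visits each field $i in turn, and $i ~ /ai/ tests whether that field contains “ai”. Every matching field is printed on its own line.
Above, we extract all the words that contain the substring “ai” in the Nature.txt file. Because awk splits on whitespace, punctuation attached to a word, such as the comma in “air,”, stays in the output.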
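The output above includes “air,” with its trailing comma. One possible way to drop such punctuation inside awk itself is sketched below; the input text and the variable names are illustrative assumptions, not part of the original example:

```shell
# Hypothetical input; not taken from Nature.txt
text='The crisp air, a long tail. Rain fell.'

# Copy each matching field, strip non-alphanumeric characters, then print it
words=$(printf '%s\n' "$text" | awk '{
  for (i = 1; i <= NF; i++)
    if ($i ~ /ai/) {
      w = $i
      gsub(/[^a-zA-Z0-9]/, "", w)
      print w
    }
}')

echo "$words"
```

Working on the copy w rather than on $i leaves the original fields untouched, and the explicit a-zA-Z0-9 range avoids relying on POSIX character classes, which some older awk implementations don’t support.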
Furthermore, we can sort the output in ascending or descending order:
$ awk '{ for (i=1; i<=NF; i++) if ($i ~ /ai/) print $i }' Nature.txt | sort
against
air,
mosaic
In the above example, we pipe the output of awk to the sort command, which sorts in ascending order by default. Adding the -r option to sort reverses the order.
Here, we’ll use grep to select the lines containing the substring and awk to extract the matching words from those lines:
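As a small illustration of the ordering options, the sketch below sorts a hypothetical word list (standing in for the awk output) in descending order with -r and removes duplicates with -u:

```shell
# Hypothetical word list standing in for the awk output
words=$(printf '%s\n' tail against air tail)

# -r reverses the comparison, giving descending order
descending=$(printf '%s\n' "$words" | sort -r)

# -u drops duplicate lines while sorting
unique=$(printf '%s\n' "$words" | sort -u)

echo "$descending"
echo '--'
echo "$unique"
```

Both options can be appended to any of the pipelines in this tutorial, since sort only sees one word per line.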
$ grep 'ie' Nature.txt | awk '{for(i=1;i<=NF;i++) if($i ~ /ie/) print $i}'
carried
ancient
First, grep 'ie' Nature.txt selects only the lines in Nature.txt that contain the substring “ie”.
Then, we pipe those lines to awk.
Finally, awk '{for(i=1;i<=NF;i++) if($i ~ /ie/) print $i}' iterates over the words in each line and prints only those that contain the substring “ie”.
sed is a powerful command-line tool used to parse and perform text transformations on an input stream. It allows us to substitute and manipulate text. In this example, we’ll use it to extract whole words containing the substring “ir”:
$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p'
air,
their
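Let’s break down the above command:
The first sed replaces every space with a newline, so each word lands on its own line (GNU sed interprets \n in the replacement as a newline). In the second sed, -n suppresses the default output and /ir/p prints only the lines that contain “ir”.
Using the above command, we print out all the words in our input file that contain the substring “ir”.
As with awk, we can sort the results: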
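Using \n in the replacement part of s/ /\n/g relies on GNU sed behavior. Where that isn’t available, one alternative (a swapped-in technique, not the command used above) is to let tr do the splitting. The input below is hypothetical:

```shell
# Hypothetical input with irregular spacing; not taken from Nature.txt
text='their squirrel   ran by air,'

# tr turns every whitespace character into a newline and, with -s,
# squeezes runs of newlines so no empty lines are produced
words=$(printf '%s\n' "$text" | tr -s '[:space:]' '\n' | sed -n '/ir/p')

echo "$words"
```

This variant also treats tabs and repeated spaces as separators, which the single-space substitution doesn’t.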
$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p' | sort
air,
squirrel
swirled
Above, we pipe the sed results to the sort command, which sorts the results in ascending order.
Additionally, we can save the results to an output file:
$ sed -e 's/ /\n/g' Nature.txt | sed -n '/ir/p' | sort > extracted_words.txt
Here, we use the > operator to redirect our results to a file named extracted_words.txt.
We can also combine grep, awk, and sed in a single pipeline:
$ grep 'own' Nature.txt | awk '{for(i=1;i<=NF;i++) if($i ~ /own/) print $i}' | sed 's/[^a-zA-Z0-9]//g'
crowned
down
Above, we use grep to select the lines that contain the substring “own” in our input file.
We then pipe the output of grep to awk, which processes each line and prints the words that match the pattern “own”.
Finally, we pipe the output of awk to sed, which deletes every non-alphanumeric character and thereby strips attached punctuation such as commas and periods.
perl enables us to run scripts from the command line and perform tasks such as complex text manipulation. Its strong regular expression support makes it well suited to extracting text that matches a pattern.
Now, let’s explore how to use perl to extract whole words containing the substring “nt”:
$ perl -nle 'print for /\b\w*nt\w*\b/g' Nature.txt
vibrant
ancient
testament
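Let’s understand the above command:
The -n option loops over the input line by line without printing it, -l removes the trailing newline from each line and adds one back on every print, and -e supplies the code to run. In the pattern, \b marks a word boundary and \w* matches zero or more word characters on either side of “nt”. With the /g modifier in list context, the match returns every occurrence on the line, and print for prints each one.
Above, we use a regular expression to match whole words containing the substring “nt” in our input file and then print them out in our terminal.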
In addition, we can sort the output alphabetically and save it to an output file:
$ perl -nle 'print for /\b\w*nt\w*\b/g' Nature.txt | sort > extracted_words.txt
Here, we use the sort command to arrange the output alphabetically while > extracted_words.txt redirects the sorted output to a file named extracted_words.txt.
Python is a high-level programming language used to write scripts and automate tasks. In this case, we’ll use its re module to extract words containing a substring. For this to work, we need Python 3 installed on our system.
On Ubuntu and other Debian-based distributions, we’ll use the apt package manager. First, we’ll update the package list to get the latest information about the available packages:
$ sudo apt update
Next, let’s install Python 3:
$ sudo apt install python3
Alternatively, on Fedora, we’ll use dnf:
$ sudo dnf install python3
Lastly, on Arch Linux, we’ll use pacman:
$ sudo pacman -S python
Once Python is installed, we can go ahead and extract words containing the substring “an”:
$ python3 -c "import re; print('\n'.join(re.findall(r'\b\w*an\w*\b', open('Nature.txt').read())))"
danced
vibrant
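Let’s take a look at each part of our Python script:
The -c option runs the code passed as a string. open('Nature.txt').read() reads the whole file into a single string, re.findall() returns a list of all non-overlapping matches of the pattern \b\w*an\w*\b, and '\n'.join() combines them so that print() shows one word per line.
Above, we utilize python3 to extract all words that contain the substring “an” in our input file and print them on our terminal.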
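For anything longer than a one-liner, the same logic can be easier to read when the script is fed to python3 through a here-document. The sketch below is one possible layout, not the article’s method: the temporary file, its contents, and the variable names are assumptions made for illustration:

```shell
# Hypothetical input written to a temporary file; not the article's Nature.txt
sample=$(mktemp)
printf '%s\n' 'The leaves danced in vibrant light.' > "$sample"

# "an" and the file path reach the script as sys.argv[1] and sys.argv[2]
words=$(python3 - an "$sample" <<'EOF'
import re
import sys

substring, path = sys.argv[1], sys.argv[2]

# re.escape keeps characters such as "." or "*" in the substring literal
pattern = r"\b\w*" + re.escape(substring) + r"\w*\b"

with open(path) as handle:
    print("\n".join(re.findall(pattern, handle.read())))
EOF
)

echo "$words"
rm -f "$sample"
```

Passing the substring as an argument avoids editing the pattern for every search, and the with block closes the file explicitly.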
In this article, we discussed extracting whole words containing a specific substring from a file in Linux using grep, awk, sed, perl, and Python. All of these methods rely on regular expressions to identify the matching words, so we can choose whichever tool best fits our workflow.