Last updated: October 1, 2024
Text processing is part of many common tasks for Linux system administrators and users. These tasks may include analyzing log files, filtering data, or searching through text. Sometimes, the text contains runs of multiple spaces, which makes it harder to search for and extract data. This can happen because of data entry errors or inconsistent formatting.
In this tutorial, we’ll explore matching words and ignoring multiple spaces using the grep and tr commands.
To demonstrate, we use adventure.txt as the sample file:
$ cat adventure.txt
A long time ago, there lived an old owl named lily.
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
Every night, the owl would hoot softly, sharing stories with the moon and stars.
Villagers from nearby would sometimes visit the owl to seek wisdom.
Despite the darkness, the owl's eyes would shine brightly, illuminating the path.
One such night, a young traveler lost his way in the forest.
...
In the upcoming sections, the above file serves to illustrate how to match specific words and ignore multiple spaces.
tr is a command-line utility that translates or deletes characters from an input stream and writes the result to an output stream. We can also use it to replace multiple spaces with a single space, normalizing the spacing within a file.
The command follows a basic syntax:
$ tr -s ' ' < inputfile > outputfile
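Let’s break it down: the -s option squeezes each run of a repeated character from the given set into a single occurrence, ' ' is that set (here, a single space), < inputfile feeds the file to the standard input of tr, and > outputfile writes the result to a new file.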
Let’s use the above syntax to replace every run of multiple spaces in the sample file with a single space:
$ tr -s ' ' < adventure.txt > adventure_normalized.txt
Using this command, we read the content of the adventure.txt file and replace any sequences of multiple spaces with a single space. We then save the result into a new file named adventure_normalized.txt.
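To see the squeeze on its own, here is a minimal sketch that runs tr on a single string instead of a file. The sample sentence and the variable names are made up for illustration and aren't taken from adventure.txt:

```shell
# Hypothetical sample line with irregular spacing
sample='The  owl   resided in    an ancient oak tree.'

# -s squeezes each run of spaces into one space
normalized=$(printf '%s\n' "$sample" | tr -s ' ')

printf '%s\n' "$normalized"
# prints: The owl resided in an ancient oak tree.
```

Only runs of the listed character are affected, so single spaces and all other characters pass through unchanged.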
Normalizing the spaces often makes it easier to search for a pattern using grep.
We can also normalize tabs along with spaces:
$ tr -s '\t ' ' ' < adventure.txt > adventure_normalized.txt
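The above command converts every tab in the adventure.txt file to a space and then squeezes each resulting run of spaces into a single space. As a result, any mix of tabs and spaces, including a lone tab, becomes one space.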
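As a quick check of the tab handling, the sketch below feeds tr a made-up string that mixes tabs and spaces. It assumes GNU tr, which pads the shorter second set by repeating its last character, so both the tab and the space map to a space:

```shell
# Hypothetical input mixing tabs and spaces
mixed=$(printf 'old\towl\t\t  named \t lily')

# '\t ' is the set to translate, ' ' is the replacement;
# -s then squeezes the resulting runs of spaces
flattened=$(printf '%s\n' "$mixed" | tr -s '\t ' ' ')

printf '%s\n' "$flattened"
# prints: old owl named lily
```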
grep is a command-line utility used to search for text using patterns. It scans files or an input stream and prints lines matching a specific pattern. Additionally, grep supports regular expressions, which makes it easier to perform complex searches.
Now, let’s search for a specific word in the now-normalized text:
$ grep "oak" adventure_normalized.txt
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
Hearing the owl's gentle hoot, he followed the sound until he reached the oak tree.
Here, we search through the adventure_normalized.txt file and print all lines that contain the word oak.
Since grep is case-sensitive by default, we can perform a case-insensitive search using the -i option:
$ grep -i "oak" adventure_normalized.txt
Above, we perform a case-insensitive search for oak in the adventure_normalized.txt file, matching words such as oak, Oak, or OAK.
Furthermore, we can search for multiple words in the normalized text. For instance, let’s search for lines containing either the word oak or owl:
$ grep -E 'oak|owl' adventure_normalized.txt
A long time ago, there lived an old owl named lily.
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
Every night, the owl would hoot softly, sharing stories with the moon and stars.
...
In the example above, we search for lines in the adventure_normalized.txt file that contain one of two alternative words. Specifically, the -E option enables extended regular expressions, so we can use the pipe symbol | as an alternation operator to specify multiple patterns or words.
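The alternation can be tried on a small throwaway file. In this sketch, the three sample lines and the variable names are hypothetical. It also shows an equivalent form that passes one -e option per pattern:

```shell
# Hypothetical three-line sample written to a temporary file
story=$(mktemp)
printf '%s\n' 'The owl slept.' 'An oak fell.' 'The moon rose.' > "$story"

# Extended regular expression with alternation
matches=$(grep -E 'oak|owl' "$story")

# Equivalent form: one -e option per pattern
matches_e=$(grep -e 'oak' -e 'owl' "$story")

rm -f "$story"
printf '%s\n' "$matches"
# prints:
# The owl slept.
# An oak fell.
```

Both forms print the same two lines in file order, regardless of the order in which the patterns are listed.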
We can also display the line numbers along with the matching lines using the -n option:
$ grep -n 'wise' adventure_normalized.txt
8:There, under the starry sky, he found guidance and comfort, forever grateful to the wise old owl.
Thus, the above command displays all lines containing the word wise along with their line numbers.
To match whole words only, we use the -w option:
$ grep -w 'hoot' adventure_normalized.txt
Every night, the owl would hoot softly, sharing stories with the moon and stars.
Hearing the owl's gentle hoot, he followed the sound until he reached the oak tree.
In the example above, we match hoot as an entire word, not as part of another word.
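To make the difference visible, the following sketch counts matching lines with and without -w. The two sample lines are invented for this comparison:

```shell
# Hypothetical sample: the second line contains "hooting" but not the word "hoot"
words=$(printf '%s\n' 'The owl would hoot softly.' 'The owls were hooting all night.')

# Without -w, hoot also matches inside hooting
substring_hits=$(printf '%s\n' "$words" | grep -c 'hoot')

# With -w, only the standalone word counts
whole_word_hits=$(printf '%s\n' "$words" | grep -wc 'hoot')

printf '%s %s\n' "$substring_hits" "$whole_word_hits"
# prints: 2 1
```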
To search for complex patterns, we can use regular expressions. For example, let’s match any line containing the word owl followed by a specific pattern:
$ grep 'owl *resided' adventure_normalized.txt
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
Using the above command, we search for lines in the adventure_normalized.txt file where the word owl is followed by zero or more spaces and then the word resided.
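A pattern like this can also tolerate extra spaces in text that hasn't been normalized. The sketch below uses a made-up line and the POSIX class [[:blank:]] (space or tab) with -E. It also illustrates that ' *' accepts zero spaces, whereas ' +' requires at least one:

```shell
# Hypothetical line with a run of spaces between the two words
raw='The owl     resided in an ancient oak tree.'

# [[:blank:]]+ requires at least one space or tab between the words
hit=$(printf '%s\n' "$raw" | grep -E 'owl[[:blank:]]+resided')
printf '%s\n' "$hit"

# ' *' also accepts zero spaces, so it matches the glued string; ' +' does not
# (|| true keeps a non-matching grep from aborting scripts that use set -e)
star_count=$(printf 'owlresided\n' | grep -c 'owl *resided' || true)
plus_count=$(printf 'owlresided\n' | grep -Ec 'owl +resided' || true)

printf '%s %s\n' "$star_count" "$plus_count"
# prints: 1 0
```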
Instead of displaying the lines where a specific word appears, we might want to know how many lines contain the matched word. To count the number of matching lines, we use the -c option. This can be useful for analysis or summarizing search results.
To illustrate, let’s count how many lines contain the word oak in the sample file:
$ grep -c 'oak' adventure_normalized.txt
2
In the above command, the output of 2 indicates that oak matches on two different lines. Note that -c counts matching lines, not occurrences, so a word that appears several times on a single line is counted once.
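If the number of occurrences matters rather than the number of lines, one common approach is to print each match on its own line with -o and count those lines with wc. This is a sketch on an invented line. Note that -o isn't part of POSIX grep, although GNU and BSD grep both provide it:

```shell
# Hypothetical line containing the word twice
line='the oak stood beside another oak'

# -c counts matching lines
line_count=$(printf '%s\n' "$line" | grep -c 'oak')

# -o prints every match on its own line, and wc -l counts them
occurrence_count=$(printf '%s\n' "$line" | grep -o 'oak' | wc -l)

printf '%s\n' "$line_count"
# prints: 1
```

Here, occurrence_count holds 2, since the pattern appears twice on the single matching line.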
Sometimes we may want to exclude lines that contain a specific word or pattern. By using the -v option, we can invert the match, showing only lines that don’t contain the matched word or pattern:
$ grep -v 'hoot' adventure_normalized.txt
A long time ago, there lived an old owl named lily.
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
Villagers from nearby would sometimes visit the owl to seek wisdom.
...
In this example, we search through the adventure_normalized.txt file and print out all lines that don’t contain the word hoot.
Ultimately, we can combine the grep and tr commands to match words while ignoring multiple spaces effectively:
$ cat adventure.txt | tr -s ' ' | grep "forest"
The owl resided in an ancient oak tree, surrounded by a peaceful forest.
One such night, a young traveler lost his way in the forest.
Here, we use the cat command to read the contents of the adventure.txt file and write them to the standard output. Next, we pipe the output as input to the tr command, which replaces each run of multiple spaces with a single space.
Lastly, we pipe the output from tr to the grep command, which searches for lines containing the word forest and prints them.
This way, we can use any of the grep options discussed earlier to apply a comprehensive filter that ignores multiple spaces.
In this article, we discussed matching words and ignoring multiple spaces using the grep and tr commands.
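For repeated use, the pipeline can be wrapped in a small shell function. The sketch below is one possible shape, not a standard tool. The name grep_squeezed and the sample lines are hypothetical, and the function redirects the file into tr directly instead of going through cat:

```shell
# Hypothetical helper: squeeze spaces in a file, then pass the remaining
# arguments (options and pattern) to grep
grep_squeezed() {
  file=$1
  shift
  tr -s ' ' < "$file" | grep "$@"
}

# Hypothetical sample with irregular spacing
tmp=$(mktemp)
printf '%s\n' 'a  peaceful   forest' 'an old    owl' > "$tmp"

# -n is forwarded to grep, so the match is shown with its line number
result=$(grep_squeezed "$tmp" -n 'peaceful forest')

rm -f "$tmp"
printf '%s\n' "$result"
# prints: 1:a peaceful forest
```

Because the pattern runs against the squeezed text, a single space in the pattern is enough to match any run of spaces in the original file.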
First, we replaced every multi-space occurrence with a single space in a specific file using tr. After that, we utilized grep and some of its options to match specific words.
Finally, we combined the grep and tr commands for quick processing and searches.