Last updated: March 18, 2024
When analyzing collected data, we often need to count how many times a given text pattern occurs in it.
In this tutorial, we’ll learn how to search and count the text pattern occurrences within the content of files. Then, we’ll look at how to display the number of files that have a specific text pattern within them.
Throughout this tutorial, we’ll use the following pipeline to search for and count the pattern occurrences within the files under a given directory:
$ find . -name "*.txt" | xargs grep <options> "<pattern>" | wc <options>
As we can see, the above one-liner chains the find, grep, and wc commands, with xargs connecting find to grep. Depending on our goal, we can pass different options to grep and wc.
Let’s have a more detailed look at each part of this pipeline.
First, the find command locates files under a given directory. For example, the first part of our pipeline searches for all files whose names end with .txt in the current directory and, recursively, in its subdirectories:
$ find . -name "*.txt"
./file1.txt
./file2.txt
./file3.txt
We can see that in our current directory, there are three text files named file1.txt, file2.txt, and file3.txt.
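Because find descends into subdirectories by default, the set of files it reports can be larger than what sits directly in the starting directory. The following is a minimal sketch of that behavior, assuming a throwaway directory with hypothetical file names (sub/nested.txt and notes.md are not part of the tutorial's dataset). The -maxdepth test is a GNU and BSD extension rather than part of POSIX find:

```shell
# Build a scratch directory tree (hypothetical names) to see how find descends.
workdir=$(mktemp -d)
mkdir "$workdir/sub"
touch "$workdir/file1.txt" "$workdir/sub/nested.txt" "$workdir/notes.md"

# Default behavior: every *.txt at any depth is listed.
# tr -d ' ' strips the padding that some wc implementations add.
all_txt=$(find "$workdir" -name "*.txt" | wc -l | tr -d ' ')

# -maxdepth 1 (GNU and BSD find, not POSIX) stays in the top-level directory.
top_txt=$(find "$workdir" -maxdepth 1 -name "*.txt" | wc -l | tr -d ' ')

echo "all levels: $all_txt, top level only: $top_txt"
rm -r "$workdir"
```

With this layout, the first count should include the nested file while the second should not.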
The next part of the pipeline runs grep via the xargs command:
$ xargs grep <options> <pattern>
It starts with xargs, which reads the file names printed by find from its standard input and passes them as arguments to grep.
In other words, the grep command now searches for the pattern in every text file that find located.
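One caveat worth illustrating: by default, xargs splits its input on whitespace, so a file name containing a space would reach grep as two separate arguments. A common workaround, sketched here under the assumption that our find and xargs support the -print0 and -0 extensions (GNU and BSD versions do), is to separate names with NUL bytes. The file name "my notes.txt" is hypothetical:

```shell
# Hypothetical dataset: one file name contains a space.
workdir=$(mktemp -d)
printf 'hi abc\n' > "$workdir/my notes.txt"
printf 'hi\n' > "$workdir/file3.txt"

# -print0 and -0 (GNU and BSD extensions) separate names with NUL bytes,
# so "my notes.txt" reaches grep as a single argument.
matching_files=$(find "$workdir" -name "*.txt" -print0 | xargs -0 grep -l abc | wc -l | tr -d ' ')

echo "files containing abc: $matching_files"
rm -r "$workdir"
```

Only the file with the space in its name contains abc here, so a result of 1 indicates that the name was passed through intact.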
Finally, the wc command reduces the text output of grep to a single number: a count of lines, words, or bytes, depending on the option provided.
For example, here the ls command lists the files in a directory, one per line when its output is piped, and wc -l counts those lines:
$ ls | wc -l
3
In our case, the output is 3 because the directory contains three files.
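To see how the choice of wc option changes the result, here is a small sketch that feeds the same two-line sample text to wc with -l, -w, and -c. The sample variable is an assumption made for illustration:

```shell
# Sample text: two lines, four words, fifteen bytes (including both newlines).
sample='hi abc abc
abc'

# tr -d ' ' strips the padding that some wc implementations add.
lines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$sample" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$sample" | wc -c | tr -d ' ')

echo "lines=$lines words=$words bytes=$bytes"
```

The same input therefore yields three different numbers, which is why the wc option has to match what we actually want to count.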
Now that we’re familiar with the pipeline, we can count the exact text matches in a given dataset. For that, let’s put together some text files and inspect their contents first.
To that end, we’ll print the text files using the cat command:
$ cat file1.txt
hi abc
$ cat file2.txt
hi abc abc
abc
$ cat file3.txt
hi
We can see that file1.txt has one occurrence of abc, file2.txt has three occurrences spread over two lines, and file3.txt has none. So, we have four occurrences of the abc pattern in a total of three files.
Using the dataset from above, let’s use the following command to count the matches of the abc pattern in our text files:
$ find . -name "*.txt" | xargs grep -oh abc | wc -w
4
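Indeed, there are four matches in total, which agrees with our manual count.
The grep command uses the options -o and -h. Option -o prints only the matched part of each line, with every match on its own output line, while option -h suppresses the filename prefix that grep otherwise adds when it searches multiple files.
On the other hand, the wc command uses -w to count the words in the grep output. Since the pattern abc contains no whitespace, each match is exactly one word, so the word count equals the number of matches.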
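As a hedged illustration of when wc -w and wc -l agree, the sketch below recreates the three sample files in a scratch directory and counts matches both ways. It then repeats the comparison with a two-word pattern, "hi abc", which is an assumption added here and not part of the original walkthrough:

```shell
# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'hi abc\n' > "$workdir/file1.txt"
printf 'hi abc abc\nabc\n' > "$workdir/file2.txt"
printf 'hi\n' > "$workdir/file3.txt"

# Single-word pattern: word count and line count agree.
by_words=$(find "$workdir" -name "*.txt" | xargs grep -oh abc | wc -w | tr -d ' ')
by_lines=$(find "$workdir" -name "*.txt" | xargs grep -oh abc | wc -l | tr -d ' ')

# Two-word pattern: each match is one output line but two words.
phrase_words=$(find "$workdir" -name "*.txt" | xargs grep -oh "hi abc" | wc -w | tr -d ' ')
phrase_lines=$(find "$workdir" -name "*.txt" | xargs grep -oh "hi abc" | wc -l | tr -d ' ')

echo "abc: $by_words (wc -w), $by_lines (wc -l)"
echo "hi abc: $phrase_words (wc -w), $phrase_lines (wc -l)"
rm -r "$workdir"
```

For patterns that may contain spaces, counting lines with wc -l is the safer choice, because grep -o always emits one line per match.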
Likewise, using the same dataset, we can find the number of files that have the text pattern abc in the current directory:
$ find . -name "*.txt" | xargs grep -l abc | wc -l
2
The result is 2 because only two files (file1.txt and file2.txt) out of the three contain the abc pattern.
The grep command uses the -l option, which prints only the name of each file that contains at least one match, once per file. In turn, the wc command uses option -l to count the number of lines (matching files) in the grep output.
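The file count can be reproduced end to end with the sketch below, which rebuilds the sample dataset in a scratch directory. It also shows one possible alternative that avoids xargs by letting find run grep directly through -exec ... {} +. That variant is offered as an assumption about what might suit some setups, not as the tutorial's method:

```shell
# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'hi abc\n' > "$workdir/file1.txt"
printf 'hi abc abc\nabc\n' > "$workdir/file2.txt"
printf 'hi\n' > "$workdir/file3.txt"

# Pipeline from the tutorial: one output line per matching file.
via_xargs=$(find "$workdir" -name "*.txt" | xargs grep -l abc | wc -l | tr -d ' ')

# Alternative sketch: find passes the file names to grep itself.
via_exec=$(find "$workdir" -name "*.txt" -exec grep -l abc {} + | wc -l | tr -d ' ')

echo "matching files: $via_xargs (xargs), $via_exec (-exec)"
rm -r "$workdir"
```

Both forms should report the same number of matching files for this dataset.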
In this article, we learned how to find the total number of occurrences of a text pattern in the files under a directory. Then, we looked at how to count the number of files that contain a specific text pattern. For that, we used a pipeline with the find, xargs, grep, and wc commands.