1. Introduction

In this tutorial, we’ll show how to count words in a file. We can do this using tools such as wc (word count), sed (stream editor), and vim (Vi IMproved, a text editor). The commands discussed in this article are compatible with the major Linux shells: bash, sh, csh, ksh, and zsh.

2. What Is a Word?

One definition we can use is a non-empty sequence of characters delimited by spaces, tabs, or newlines. However, not all natural-language words fit this definition.

Let’s take this sentence as an example:

Mr. Jones--a kind man--helped us considerably.

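By our definition, it has six words, because “Jones--a” and “man--helped” each count as a single word. However, an English reader would find eight words there: Mr., Jones, a, kind, man, helped, us, and considerably. So, we need to pay special attention to this and similar exceptions when counting words.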
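To make the difference concrete, here is a minimal sketch that counts the sample sentence both ways. The function names count_tokens and count_letter_runs are hypothetical helpers introduced only for this illustration, and treating every run of letters as a word is a rough approximation (it would split “e.g.” in two, for example):

```shell
sentence='Mr. Jones--a kind man--helped us considerably.'

# Whitespace-delimited tokens (the definition used in this tutorial).
# tr -d ' ' strips the padding that some wc implementations add.
count_tokens() {
    printf '%s\n' "$1" | wc -w | tr -d ' '
}

# Runs of letters: turn every non-letter into a newline, squeeze repeats,
# and count the non-empty lines that remain.
count_letter_runs() {
    printf '%s\n' "$1" | tr -cs '[:alpha:]' '\n' | grep -c .
}

count_tokens "$sentence"        # prints 6
count_letter_runs "$sentence"   # prints 8
```

Neither count is right in general. Which one we want depends on how we decide to treat punctuation.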

3. The wc Command

The wc command reports the number of lines, words, and bytes or characters in its input.

For instance, let’s suppose we have a file example1.txt:

$ cat example1.txt
Some flowers, e.g. the one right there, are poisonous.

If we feed this file’s content to wc, we’ll get:

$ wc < example1.txt
 1 9 55

We see that wc outputs the number of lines (1), words (9), and bytes (55). Because the file contains only ASCII text, the byte count equals the character count. As we’re only interested in the number of words, we can use the -w option:

$ wc -w < example1.txt
9

This is the expected answer, as long as we consider “e.g.” to be one word.
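As a quick sketch, we can recreate the sample file and ask wc for each count separately. The printf line is our own way of producing example1.txt and assumes the file is a single line ending in a newline:

```shell
# Recreate the sample file (one line, terminated by a newline).
printf '%s\n' 'Some flowers, e.g. the one right there, are poisonous.' > example1.txt

wc -l < example1.txt   # lines: 1
wc -w < example1.txt   # words: 9
wc -c < example1.txt   # bytes: 55
wc -m < example1.txt   # characters (same as bytes for ASCII-only text)
```

Reading from standard input with < keeps the file name out of the output, which is convenient when we want to use the number in a script.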

Let’s now look at a slightly different file:

$ cat example2.txt
Some flowers , e.g. the one right there, are poisonous.

When we apply wc, we get:

$ wc -w < example2.txt
10

The answer is off by one. That’s because the comma is now separated from “flowers” by a space, so wc counts it as a word of its own.
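When a count looks wrong, it can help to list the tokens that wc sees. The following sketch recreates example2.txt and prints each whitespace-delimited token on its own numbered line. Using tr and nl for this is our own choice of diagnostic, not something wc requires:

```shell
# Recreate the sample file with the stray space before the comma.
printf '%s\n' 'Some flowers , e.g. the one right there, are poisonous.' > example2.txt

# Replace every run of whitespace with a single newline, then number the lines.
tr -s '[:space:]' '\n' < example2.txt | nl
```

The third line of the output is the lone comma, which accounts for the extra word.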

4. Using sed to Account for Punctuation

We can get correct results by combining wc with sed. The latter takes input from a file, applies one or more commands to it, and outputs the result.

The idea is to use substitution commands to account for problematic characters. In our example, the command will be 's/ *,/,/g', which deletes any spaces before a comma.

Also, we use the -e flag to inform sed that a command follows. If we have several commands, we’ll need to precede each by -e.

After substitution, we pipe sed’s output to wc -w:

$ sed -e 's/ *,/,/g' example2.txt | wc -w
9

Now, we get the correct result.
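To see what wc actually receives, we can run the sed command on its own. This sketch also checks that the file itself is left untouched, since sed writes to standard output unless told otherwise:

```shell
# Recreate the sample file with the stray space before the comma.
printf '%s\n' 'Some flowers , e.g. the one right there, are poisonous.' > example2.txt

# Print the corrected line; example2.txt is not modified.
sed -e 's/ *,/,/g' example2.txt

# The file on disk still has the stray comma, so this still prints 10.
wc -w < example2.txt
```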

This example shows how to cover a single exception to our definition of a word. To cover multiple exceptions, we’ll use a more complex regex and/or multiple substitution commands.
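As an illustration of several substitution commands, the sketch below handles both the stray comma and the double hyphen from the earlier sentence. The file example3.txt, its content, and the helper name count_words_normalized are hypothetical and exist only for this example:

```shell
# Hypothetical helper: normalize punctuation, then count words.
#   1st command: remove spaces before a comma
#   2nd command: turn a double hyphen into a space
count_words_normalized() {
    sed -e 's/ *,/,/g' -e 's/--/ /g' "$1" | wc -w | tr -d ' '
}

# Hypothetical sample combining both exceptions.
printf '%s\n' 'Mr. Jones--a kind man--grows flowers , e.g. roses.' > example3.txt

wc -w < example3.txt                  # raw count: 8
count_words_normalized example3.txt   # prints 9
```

Here the raw count is too low rather than too high: the stray comma adds one word, but the two double hyphens hide two. Each new exception typically needs its own rule, so it’s worth checking the intermediate sed output on real data before trusting the count.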

5. Using the vim Editor

Finally, we can count words with the vim editor. Let’s open the file example2.txt in vim and press g, then Ctrl+g. This is what we get:

[Screenshot: counting words with vim]

In this case, the word count is 10, which is incorrect because of the punctuation.

If we run the following command in the same vim session:

:%s/ *,/,/g

and then press g followed by Ctrl+g again, we get:

[Screenshot: counting words with vim after editing]

This gives us the correct word count of 9.

We can restore the original text by pressing u (undo).

6. Conclusion

In this article, we showed how to count the total number of words in a file. Word counting is complicated by punctuation, which we may or may not want to eliminate.
