1. Overview

Sometimes, we want to work with only text files. On some operating systems, such as Windows, we can usually determine the type of a file by looking at the extension in its filename. For example, files with the “.txt” extension are typically plain text files. But this is not the case for Linux, where the extension is optional and doesn't determine the file type.

Linux treats a variety of files as text, so to determine whether a file is a text file, it's necessary to examine its content, and the file command is the right tool for the job. The file command classifies files broadly as text, executable, or data files. In this tutorial, we'll learn how to use the file command to identify all the text files in a directory.

2. Types of Text Files

A text file is any file that consists of printable characters and a few common control characters, such as newline ‘\n’, carriage return ‘\r’, and tab ‘\t’. On the other hand, non-text files, like most data files or executable binaries, also contain non-printable characters.

Now, let’s see how to use the file command to list the types of all the files in a directory:

$ file *
filename1: ASCII text 
Dirname1: directory 
filename3.perf: data 
filename4.py: Python script, ASCII text executable

The command lists all the files in the directory, each with an informative description of its content. Two types of files have “text” in their description.

2.1. Plain Text Files

A file containing only printable ASCII characters or Unicode UTF-8 characters with some common control characters is a plain text file. The file command classifies such files as “ASCII text” or “UTF-8 Unicode text” files.

Files encoded as UTF-16 or UTF-32 also consist of printable text, but they require translation before printing. Therefore, depending on the version of the file command, their description may contain the phrase “character data” instead of “text”.

2.2. Source Code or Script Files

Source code files of many programming or scripting languages are also text files. For example, a shell script, a Python script, a C source file, and a Java source file are all text files. Such source code files contain printable ASCII text with some common control characters.
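
To see these categories in practice, here's a minimal sketch that creates a few sample files in a scratch directory and runs the file command on them. The file names and contents are made up for illustration, and the exact wording of each description depends on the installed version of file:

```shell
# Create a scratch directory with a few sample files (names are illustrative).
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/plain.txt"        # printable ASCII only
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"         # "café" encoded as UTF-8
printf '#!/bin/sh\necho hi\n' > "$tmpdir/script.sh" # a small shell script
printf '\000\001\002\003' > "$tmpdir/blob.bin"      # non-printable bytes

# Print the description of every sample file.
file "$tmpdir"/*
```

We'd expect the first three descriptions to contain the word “text”, while the last one shouldn't.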

3. Find All Text Files

There are several scenarios in which we need to find only the text files in a directory, including the files in its sub-directories. One such scenario is when we want to copy all the source files of a project, but not the compiled intermediate and executable files. This keeps the volume of data much more manageable than copying everything.

Another possible scenario is when we want to search for a certain text phrase in all source or script files.

The file command comes in handy in such situations.

3.1. In the Current Directory

If we're only interested in the files in the current directory, then the command is simple. We can quickly list all the text files in the current directory:

file * | grep ":.* text"

The file * command prints the name of each file in the current directory, followed by a description of its type. The grep command then keeps only the lines whose description contains the word “text”, such as “ASCII text” or “UTF-8 Unicode text”. The expression “:.*” ensures that the sub-string “ text” is searched for only in the file type description, not in the filename. In other words, the pattern means “find a ‘:’ in the file command output, followed by any number of characters (‘.*’), and then the sub-string ‘ text’”.
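
The effect of the “:.*” prefix can be checked without touching any real files. The sketch below feeds grep two made-up lines in the same “name: description” format that file prints:

```shell
# Two sample lines in the format that file prints (made up for illustration).
sample='my text notes.bin: data
readme: ASCII text'

# Without the ":.*" prefix, the filename "my text notes.bin" matches too.
printf '%s\n' "$sample" | grep ' text'

# With it, only lines whose description contains " text" are kept.
printf '%s\n' "$sample" | grep ':.* text'
```

The first grep prints both lines, whereas the second prints only the “readme” line.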

In the rare case in which some of the files contain the character “:” as part of their name, we can use the -F argument of the file command to use some character other than the “:” character as the separator. The grep command needs to be modified accordingly, putting the other character used as a separator in place of “:”.
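
As a sketch of this variant, the helper below uses “|” as the separator. The function name list_text_files_sep is hypothetical, and the choice of “|” assumes that no filename in the directory contains that character:

```shell
# list_text_files_sep is a hypothetical helper name used for illustration.
# $1: directory to scan (defaults to the current directory)
list_text_files_sep() {
    # Run in a subshell so the caller's working directory stays unchanged.
    # -F '|' makes file print "name| description" instead of "name: description".
    ( cd "${1:-.}" && file -F '|' * | grep '|.* text' )
}
```

Calling list_text_files_sep without arguments scans the current directory, so a file named “a:b.txt” is still listed correctly.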

The output file list includes plain text, scripts, formatted text, and source code files.

3.2. In the Current Directory and Its Sub-directories

If we want to find all text files in the current directory, including its sub-directories, then we have to augment the command using the Linux find command:

find . -type f -exec file {} \; | grep ":.* ASCII text"

Here, the argument “.” tells find to start in the current directory, from where it descends into all the sub-directories. The argument -type f makes find consider only regular files, not directories. The argument -exec file {} \; executes the file command on each file that find locates. The output is then piped to the grep command, which keeps only the files that have “ASCII text” in their description. Note that this pattern skips UTF-8 text files. To include them as well, we can use the broader pattern “:.* text” from the previous section.
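
Building on this, here's a minimal sketch that addresses the phrase-search scenario mentioned earlier. The function names find_text_files and grep_text_files are hypothetical. The sketch uses the broader “:.* text” pattern, runs file on batches of files with “{} +” instead of once per file, and assumes that no filename contains a “:” or a newline:

```shell
# Print the names of all text files under a directory (hypothetical helper).
# $1: starting directory (defaults to the current directory)
find_text_files() {
    # "{} +" passes many filenames to each file invocation.
    # cut keeps only the part before the first ':', which is the filename.
    find "${1:-.}" -type f -exec file {} + | grep ':.* text' | cut -d: -f1
}

# Print the names of the text files that contain a phrase (hypothetical helper).
# $1: phrase to search for; $2: starting directory (defaults to ".")
grep_text_files() {
    find_text_files "${2:-.}" | while IFS= read -r f; do
        grep -l -- "$1" "$f"
    done
}
```

For example, grep_text_files TODO . would list the text files under the current directory that contain the phrase “TODO”, without ever reading the binary files.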

3.3. Find Only Plain Text Files

At times, we're interested only in finding plain text files. In other words, we want to exclude files containing source code, scripts, or formatted text (such as XML or HTML). The command for that is:

$ file -i * | grep " text/plain;" 
file1:      text/plain; charset=us-ascii
file2:      text/plain; charset=utf-8

Due to the -i flag, the file command prints the file type descriptions as MIME type strings. Source code and script files get other MIME types (for example, a C source file is reported as “text/x-c”), so the string “text/plain;” doesn't appear in their output. Therefore, filtering on it selects only the plain text files.
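
The same filter can be combined with find to cover sub-directories. The following is a minimal sketch rather than a definitive implementation: the function name find_plain_text_files is hypothetical, it assumes the Linux file command (where -i prints MIME types), and it assumes that no filename contains a “:” or a newline:

```shell
# Print the names of all plain text files under a directory (hypothetical helper).
# $1: starting directory (defaults to the current directory)
find_plain_text_files() {
    # file -i prints "name: text/plain; charset=..." for plain text files.
    # The pattern allows any number of padding spaces after the ':'.
    find "${1:-.}" -type f -exec file -i {} + | grep ': *text/plain;' | cut -d: -f1
}
```

Running find_plain_text_files in a project directory lists files such as notes and READMEs, while skipping binaries and files that file recognizes as source code or scripts.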

4. Conclusion

In this article, we learned how to use the file command together with grep and find to list the text files in a directory and its sub-directories, and how to narrow the results down to plain text files only.