1. Introduction

Officially, only two characters can’t be part of a Linux path element name:

  • / forward slash separator
  • \0 NUL potential terminator

Thus, implementations should allow for any other character to be present in elements of paths.
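To see this in practice, here is a minimal sketch that creates names containing a tab, a newline, and a non-ASCII character. It assumes a Linux filesystem that accepts arbitrary bytes in names; the demo_dir variable and the file names are hypothetical:

```shell
# Scratch directory for the experiment (hypothetical name)
demo_dir=$(mktemp -d)

# A plain name for comparison
touch "$demo_dir/plain"
# A name with a tab character inside
touch "$demo_dir/$(printf 'tab\there')"
# A name with a newline inside (only trailing newlines are stripped by $(...))
touch "$demo_dir/$(printf 'new\nline')"
# A name ending in Cyrillic а, written as its UTF-8 bytes in octal
touch "$demo_dir/$(printf 'cyrillic-\320\260')"

# Count the files without parsing their names
created=$(find "$demo_dir" -type f -exec printf x \; | wc -c | tr -d ' ')
printf 'created %s files\n' "$created"
rm -r "$demo_dir"
```

Counting one x per file, rather than counting output lines, keeps the total correct even when a name contains a newline.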

In this tutorial, we explore ways to find path elements with names that contain special characters, i.e., non-printable and non-ASCII characters. First, we discuss locale settings in relation to character code tables. After that, we reason about our method of choice. Finally, we show an efficient way to handle the problem.

Although whitespace characters other than the space are also technically printable, we include them in the non-printable range, since they usually need special handling.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. Locale Settings

When dealing with encoding and character code tables, it’s common to talk about locales within the context of the current terminal and shell.

In general, a locale defines character code tables. These allow us to tell apart similar-looking but different characters like Cyrillic а (U+0430) and Latin a (U+0061):

$ printf %x "'а"
$ printf %x "'a"

Importantly, the current locale settings depend on many factors, most commonly the permanent values of the locale environment variables, which we can list with the locale command.

On the other hand, Bash enables us to temporarily change locale environment variables during a given command:

$ LC_ALL=C printf %x "'а"
$ printf %x "'а"

Here, we use the C locale configuration. This affects the code table, often resulting in tools and applications interpreting the same character differently.
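One way to see why the interpretation changes is to look at the bytes behind a character. The following sketch assumes UTF-8 encoding for the Cyrillic а and GNU-style od, tr, and wc; the variable names are hypothetical:

```shell
# UTF-8 bytes of Cyrillic а (U+0430), written as octal escapes
char=$(printf '\320\260')

# Dump the raw bytes in hex and strip the padding that od adds
bytes=$(printf '%s' "$char" | od -An -tx1 | tr -d ' \n')
printf 'bytes: %s\n' "$bytes"

# In the C locale, wc -m treats each byte as one character
# (GNU coreutils behavior assumed)
count=$(printf '%s' "$char" | LC_ALL=C wc -m | tr -d ' ')
printf 'characters in the C locale: %s\n' "$count"
```

Since the C locale works with single bytes, a multibyte character appears to tools as a sequence of separate bytes outside the ASCII range.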

3. Methods

Naturally, there are many ways to approach the problem of finding files and directories with special character names.

For example, we can first list the contents of the path we want to search with tools like ls or tree:

$ tree
├── subdir1
│   └── file
└── subdir2

2 directories, 1 file

After getting the list, we can pipe it through a utility like grep, sed, or awk to find non-ASCII or non-printable characters.

However, there are major drawbacks to this and similar approaches:

  • not all listings contain one object per line
  • most listings use (non-printable) newlines outside object names
  • requires a complex command with pipes
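As a quick illustration of the first two drawbacks, the sketch below creates a single file with a newline in its name and counts the lines that ls produces. It assumes an ls that writes names unmodified when its output is a pipe, as GNU ls does; list_dir is a hypothetical name:

```shell
# Scratch directory with exactly one file, whose name contains a newline
list_dir=$(mktemp -d)
touch "$list_dir/$(printf 'one\nfile')"

# A line-based count sees the single object as two entries
lines=$(ls "$list_dir" | wc -l | tr -d ' ')
printf 'objects: 1, lines from ls: %s\n' "$lines"
rm -r "$list_dir"
```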

Since all methods require external tools, let’s use a standard one that does it all in one go.

4. Using find

As usual, leveraging the common POSIX-standard find tool is perhaps the best way to search for filesystem elements of any kind.

Of course, we can search for an object in the filesystem by name:

$ find / -name 'os-release'

In this case, we do a default recursive search of the root directory for any file or directory with the name os-release.

Better yet, we can match their names against a shell pattern (glob), since -name doesn’t accept regular expressions:

$ find / -name '*-release'

This allows us to use [] character groups and a-b ranges.
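Because the -name operand follows the shell’s pattern matching notation, we can try out the same bracket syntax with a case statement. This is only a sketch; check_name is a hypothetical helper, and we fix the C locale so the range has a predictable order:

```shell
# Use the C locale so that a-c covers exactly the bytes a, b, and c
LC_ALL=C
export LC_ALL

# Hypothetical helper: report whether a name matches '[a-c]*'
check_name() {
  case $1 in
    [a-c]*) echo 'match' ;;
    *) echo 'no match' ;;
  esac
}

check_name banana   # first letter is inside the a-c range
check_name date     # first letter is outside the a-c range
```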

4.3. Find Any Non-printable or Non-ASCII Character

So, let’s standardize the locale and search the current working directory for names that include at least one non-printable or non-ASCII character:

$ LC_ALL=C find . -name '*[! -~]*'

The range from space to tilde includes all printable ASCII characters. Thus, negating it with an ! exclamation point within a [] character group matches only characters outside that range. Also, the * asterisk ensures we match any other characters around this one.
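To double-check the boundaries of the range, we can print the numeric codes of its first and last characters. This small sketch relies on the leading-quote syntax of printf; the low and high variable names are hypothetical:

```shell
# Numeric codes of the first and last characters in the ' -~' range
low=$(printf '%d' "' ")
high=$(printf '%d' "'~")
printf 'range: %s to %s\n' "$low" "$high"
```

The result covers codes 32 through 126, so control characters (below 32), DEL (127), and all bytes above 127 fall outside the range.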

Alternatively, we can use the [:print:] character class instead of the range above:

$ LC_ALL=C find . -name '*[![:print:]]*'
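As a sanity check, the sketch below runs both patterns against the same scratch directory and compares the number of matches. It assumes a find implementation that supports character classes in -name patterns, as GNU find does; cmp_dir and count_matches are hypothetical names:

```shell
# Scratch directory with one plain name and two special ones
cmp_dir=$(mktemp -d)
touch "$cmp_dir/plain"
touch "$cmp_dir/$(printf 'tab\there')"
touch "$cmp_dir/$(printf 'cyrillic-\320\260')"

# Hypothetical helper: count files whose names match a pattern in the C locale
count_matches() {
  LC_ALL=C find "$cmp_dir" -type f -name "$1" -exec printf x \; | wc -c | tr -d ' '
}

range_count=$(count_matches '*[! -~]*')
class_count=$(count_matches '*[![:print:]]*')
printf 'range: %s, class: %s\n' "$range_count" "$class_count"
rm -r "$cmp_dir"
```

With the whole locale set to C, both patterns should select the same two files.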

Our choice depends mainly on whether we prefer the brevity of the range or the explicitness of the named class.

4.4. Output Handling

Importantly, the resulting names might include a ? question mark in place of non-printable and non-ASCII characters. While this can be beneficial, it’s not always desirable. To avoid hiding the special characters, we can set only the less general $LC_COLLATE locale environment variable instead of $LC_ALL:

$ LC_COLLATE=C find . -name '*[! -~]*'

This way, the output should read as expected.
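When we need to see exactly which characters a name contains, regardless of how the terminal renders them, one option is to dump each match with od. This is a sketch outside the method above rather than part of it; out_dir is a hypothetical name:

```shell
# Scratch directory with one file whose name contains a tab
out_dir=$(mktemp -d)
touch "$out_dir/$(printf 'a\tb')"

# Pass each matching path to od -c, which prints escapes such as \t
dump=$(cd "$out_dir" && LC_ALL=C find . -name '*[! -~]*' \
  -exec sh -c 'printf "%s" "$1" | od -An -c' sh {} \;)
printf '%s\n' "$dump"
rm -r "$out_dir"
```

Here, od -c shows the tab as \t and any non-ASCII bytes as octal codes, so nothing is hidden behind a placeholder.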

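However, we won’t usually get the desired results if we replace the [! -~] character range above with [![:print:]]. The reason is that range expressions follow the collation order, i.e., $LC_COLLATE, while predefined classes like [:print:] follow the character classification, i.e., $LC_CTYPE. The latter keeps its current value unless we set it separately or override the full locale set via $LC_ALL.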

5. Summary

In this article, we talked about finding filesystem objects with non-printable and non-ASCII characters in their names.

In conclusion, although we have different ways to tackle the problem, a single standard tool can do it efficiently.
