1. Introduction

File deletion is a big part of Linux administration. Whether manually or with scripts, we delete files as part of upgrades, log rotation, backups, and many other activities. Since directories can contain large numbers of files, knowing how to handle them optimally can save a lot of time.

In this tutorial, we explore how to efficiently delete a large directory in Linux. First, we discuss file deletion in general. After that, we show when, how, and why large directories come about. Next, we test several tools in terms of their functionality and performance when dealing with many files.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. Most of it should work in any POSIX-compliant environment, although some of the options we use are GNU-specific.

2. File Deletion

Under Linux, files are inodes. An inode stores file metadata, including where file contents are. On the other hand, directories are lists of names pointing to inodes.
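
For instance, we can display the inode number behind a given name (the number differs per system):

$ ls --inode /etc/hostname
393218 /etc/hostname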

Because of this, there are different ways to delete files.

2.1. Unlinking

Once there is no hard link or handle left to a file, its inode becomes available. When that happens, the kernel marks the inode number as free:

$ touch /file.ext
$ tail --follow /file.ext &
[1] 667
$ lsof /file.ext
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
tail    667 root    3r   REG   8,16        0  666 /file.ext
$ rm /file.ext

First, we create a file and open it with tail --follow, so a background process keeps a handle to it. After that, we use the lsof (List Open Files) command to confirm that a handle to the file exists. Finally, we remove the file.

As a result, only the inode lingers, kept alive by the open handle. Killing the background tail process releases the handle, letting the kernel free the inode.
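
We can verify this with the +L1 switch of lsof, which lists open files with a link count below one, i.e., deleted but still open (continuing the session above):

$ lsof +L1
COMMAND PID USER   FD   TYPE DEVICE SIZE/OFF NLINK NODE NAME
tail    667 root    3r   REG   8,16        0     0  666 /file.ext (deleted)
$ kill %1
$ lsof +L1

After the kill, lsof prints nothing: no process holds the file open anymore, so the kernel has freed the inode.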

2.2. Purging

Importantly, file metadata and contents can remain intact on the storage until overwritten, i.e., purged. This behavior varies between older and newer ext filesystems. It’s like selling a house with everything from the last owner still in it. What does this mean for us?

We won’t need to bother calling the movers. Indeed, there are two main reasons it’s costly to rewrite segments of storage with data: slowness and wear.
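
When we explicitly want contents gone, the shred utility from GNU coreutils overwrites a file in place before optionally removing it (note that journaling and copy-on-write filesystems may still keep copies elsewhere):

$ shred --iterations=3 --zero --remove /file.ext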

Since inodes are kilobytes at most, ext3 and later versions do zero them out, but they don’t bother to purge file contents. How does this behavior relate to directories, the containers of files?

3. Create a Large Directory

From the above, we can deduce that the most efficient way to remove a directory is to simply drop all references, both to the directory and to its contents. In practice, that means size is not so much the issue as object quantity.

File stores with thousands or millions of entries exist for many reasons:

  • log rotations
  • database files
  • distributed filesystems
  • specific use cases

Importantly, how well the kernel deals with many files depends strongly on the filesystem type. For example, XFS might be slow with many small files, while ReiserFS was made specifically for handling them.
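
To check which filesystem we’re dealing with, we can query it for a given path (the output below assumes an ext4 root):

$ df --output=fstype /
Type
ext4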

Now, let’s create a directory with 1 million files:

$ mkdir /dir1m
$ for f in {1..1000000}; do touch "/dir1m/$f.ext"; done
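
Notably, running touch a million times is slow in itself. As a sketch of a faster alternative, we can batch the calls through xargs; printf is a shell builtin, so it isn’t subject to the argument length limit we’ll encounter shortly:

$ printf '/dir1m/%s.ext\n' {1..1000000} | xargs touch

Since the names contain no whitespace, newline separation is safe here, and xargs runs touch once per maximum-size batch of arguments.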

We’ll test /dir1m against several tools for deleting. Using time, we’ll measure how fast each operation runs.

4. Delete a Large Directory With rm (Remove)

The classic rm does indeed only unlink files and doesn’t purge them.

However, there are a couple of ways to delete whole directories with rm, which we’ll look at.

4.1. Wildcards

Combining rm with globbing, we might experience issues:

$ rm --force /dir1m/*.ext
/bin/rm: cannot execute [Argument list too long]

The problem here is that wildcard expansion happens before execution: all 1 million filenames become arguments to rm. Consequently, the command line exceeds the kernel’s argument length limit (ARG_MAX), and the shell refuses to execute the command.
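
We can check this limit, in bytes, via getconf (the exact value depends on the stack size limit):

$ getconf ARG_MAX
2097152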

However, we’ve no reason to use this syntax if we want the whole directory removed.

4.2. Recursion

The --recursive (-r) flag is best when dealing with many files. In fact, recursion is necessary to delete a directory or subdirectory at all:

$ time rm --recursive --force /dir1m
real    13.57s
user    1.04s
sys     8.11s
cpu     67%

This is our first real result: it took around 14 seconds to delete 1 million files. So what alternatives do we have to the standard rm?

5. Finding and Deleting Files With find

Of course, we can use the find command to remove files. However, its classic form spawns a separate rm process for every single file, which uses far more resources and takes much more time to complete.
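
For illustration, this is the per-file form we’d want to avoid with a million files, together with the + variant that batches arguments much like xargs does:

$ find /dir1m -type f -exec rm {} \;
$ find /dir1m -type f -exec rm --force {} +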

One improvement would be to use the GNU -delete switch to find:

$ time find /dir1m -delete
real    29.93s
user    1.11s
sys     8.40s
cpu     31%

Doing this avoids spawning rm processes altogether. Additionally, we can get better performance via xargs:

$ time find /dir1m -print0 | xargs --null --no-run-if-empty rm --recursive --force
real    12.80s
user    1.16s
sys     8.62s
cpu     76%

Basically, we just output NUL-separated file paths and pass them to xargs, which runs rm. For a single directory, the resulting performance is about the same as calling rm --recursive directly.

Except for the last one, all of these options are slow mainly because they forgo the internal iteration of rm with --recursive and needlessly go through each file one by one. Doing that only makes sense when we filter what gets deleted.
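
For example, with hypothetical criteria, we could restrict the deletion to files matching a name pattern that haven’t been modified in over 30 days:

$ find /dir1m -name '*.ext' -mtime +30 -delete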

6. Deleting a Large Directory With rsync

An unlikely option for efficient deletion is the rsync command:

$ mkdir /void
$ time rsync --archive --delete /void/ /dir1m/
real    15.74s
user    1.50s
sys     12.47s
cpu     88%
$ rm --recursive --force /void /dir1m

First, we create an empty directory: /void. Next, we make /dir1m mirror the empty /void via the --archive and --delete flags, which removes everything inside it. Finally, we remove the leftover empty directories.

Similar to rm, rsync uses the unlink() system call. Unlike rm, rsync doesn’t do much else.
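
As a sketch, assuming we recreate the directories from before, we can confirm this with strace by counting the relevant system calls (-f follows the forked rsync workers, -c prints a summary on exit):

$ strace -f -c -e trace=unlink,unlinkat rsync --archive --delete /void/ /dir1m/

On exit, strace reports how many times each of the two calls occurred.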

There is another option that works the same way.

7. Using perl to Delete Directory Contents

In fact, perl is useful not only for text processing but also for file operations. Written in C, it’s also suitable for low-level system calls:

$ cd /dir1m
$ time perl -e 'for(<*>){((stat)[9]<(unlink))}'
real    17.05s
user    2.57s
sys     13.36s
cpu     93%

Here, we use -e (execute) to run a one-liner that iterates over all files in the current directory via the <*> glob and calls unlink() on each; the stat call is incidental and doesn’t affect the result.
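
A more readable equivalent, as a sketch without the incidental stat, run from inside the directory:

$ perl -e 'unlink glob "*"'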

Due to the overhead of a scripting language and its interpreter, this method is slightly slower than rsync and rm. Still, perl provides options for precise filtering, should we require that.

8. Summary

In this article, we discussed methods for efficiently deleting a directory in Linux.

The clear winner in our tests is the rm command. However, if we want to have some control over what we remove, then find and perl are viable alternatives.
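
For reference, here are the real times we measured above, fastest first:

find | xargs rm --recursive --force    12.80s
rm --recursive --force                 13.57s
rsync --archive --delete               15.74s
perl unlink one-liner                  17.05s
find -delete                           29.93s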

In conclusion, we should always define exactly what needs to be done before choosing the most efficient way to do it.
