1. Introduction

In this tutorial, we’re going to take a look at some different ways of finding duplicate files in Unix systems.

2. File Structure

First, let’s have a quick look at the file structure we’ll use for our examples:

.
+--baeldung
|  +--folder1
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 1"
|  |  +--unique-file-1
|  |  |  Content: "Some unique content 1\nI am a very long line!"
|  +--folder2
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 2"
|  |  +--unique-file-2
|  |  |  Content: "Some unique content 2! \n I am a short line."
|  +--folder3
|  |  +--text-file-1
|  |  |  Content: "I am not unique"
|  |  +--text-file-2
|  |  |  Content: "Some random content 3"
|  |  +--unique-file-3
|  |  |  Content: "Some unique content 3\nI am an extreme long line............"

The baeldung directory will be our test directory. Inside, we have three folders: folder1, folder2, and folder3. Each of them contains a text-file-1 file with the same content in every folder and a text-file-2 file whose content differs from folder to folder. Each folder also contains a unique-file-x file, which has both a unique name and unique content.
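To follow along, the tree above can be recreated with a short script. This is only a sketch: it assumes every file ends with a trailing newline (which is consistent with the 16-byte and 22-byte sizes shown in the size section), and the variable name base is illustrative.

```shell
#!/usr/bin/env bash
# Recreate the example tree in ./baeldung (sketch; assumes trailing newlines).
base="baeldung"

for i in 1 2 3; do
  mkdir -p "$base/folder$i"
  # Identical content in every folder
  printf 'I am not unique\n' > "$base/folder$i/text-file-1"
  # Same length, but different content per folder
  printf 'Some random content %s\n' "$i" > "$base/folder$i/text-file-2"
done

# Files that are unique in both name and content
printf 'Some unique content 1\nI am a very long line!\n' > "$base/folder1/unique-file-1"
printf 'Some unique content 2! \n I am a short line.\n' > "$base/folder2/unique-file-2"
printf 'Some unique content 3\nI am an extreme long line............\n' > "$base/folder3/unique-file-3"
```

After running it, we can change into the baeldung directory and try the commands from the following sections.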

3. Find Duplicate Files by Name

A simple way of finding duplicate files is to search by file name. We can do this using a script:

awk -F'/' '{
  f = $NF
  a[f] = f in a? a[f] RS $0 : $0
  b[f]++ } 
  END{for(x in b)
        if(b[x]>1)
          printf "Duplicate Filename: %s\n%s\n",x,a[x] }' <(find . -type f)

Running it in the baeldung directory should list all files with non-unique names:

Duplicate Filename: text-file-1
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Filename: text-file-2
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2

Now, let’s go through the script and explain what it does.

  • <(find . -type f) – First, we use process substitution so that the awk command can read the output of the find command
  • find . -type f – The find command searches for all regular files under the current directory (.)
  • awk -F'/' – We use '/' as the field separator (FS) of the awk command. This makes extracting the filename easier, since the last field is the filename
  • f = $NF – We save the filename in a variable f
  • a[f] = f in a? a[f] RS $0 : $0 – If the filename doesn’t exist in the associative array a[], we create an entry that maps the filename to the full path. Otherwise, we append the record separator RS (a newline) and the full path to a[f]
  • b[f]++ – We use another array b[] to record how many times a filename f has been found
  • END{for(x in b) – Finally, in the END block, we go through all entries in the array b[]
  • if(b[x]>1) – We check whether the filename x has been seen more than once, that is, whether several files share this filename
  • printf "Duplicate Filename: %s\n%s\n",x,a[x] – If so, we print the duplicated filename x, followed by all full paths with this filename: a[x]
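When only the repeated names are of interest, and not the paths, the same counting idea can be sketched as a shorter pipeline built on sort and uniq. The function name dup_names is hypothetical, and the sketch assumes that file names don't contain newline characters.

```shell
# List only the file names that occur more than once below a directory.
# dup_names is a hypothetical helper name used for this sketch.
dup_names() {
  find "${1:-.}" -type f |
    awk -F'/' '{ print $NF }' |   # keep only the last path component
    sort |                        # bring equal names next to each other
    uniq -d                       # print one copy of each repeated line
}

dup_names .
```

Unlike the awk script, this variant doesn't show where the files are located, so it's mostly useful as a quick check.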

Note that in this example, we’re only searching for duplicate file names. In the next sections, we’ll discover different methods of finding duplicate files by their content.

4. Find Duplicate Files by MD5 Checksum

The MD5 message-digest algorithm is a widely used hash function that produces a 128-bit hash value based on the file content. It was initially designed to be used as a cryptographic hash function. Although it’s no longer considered secure for that purpose, it’s still widely used as a checksum to verify data integrity.

In Linux, we can use the md5sum command to get the MD5 hash of a file.
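As a quick illustration, the following sketch hashes three small files in a scratch directory. The helper name md5_of and the file names a, b, and c are made up for this example. The first two files have identical content, so their digests should match, while the third should differ.

```shell
# md5_of is a hypothetical helper that prints only the digest of one file.
md5_of() {
  md5sum < "$1" | awk '{ print $1 }'
}

demo=$(mktemp -d)                        # scratch directory for the demo
printf 'I am not unique\n' > "$demo/a"
printf 'I am not unique\n' > "$demo/b"   # same content as a
printf 'Some random content 1\n' > "$demo/c"

# Print one "digest  name" line per file
for f in a b c; do
  printf '%s  %s\n' "$(md5_of "$demo/$f")" "$f"
done

rm -r "$demo"
```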

Because MD5 is generated from the file content, we can use it to find duplicate files:

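awk '{
  md5 = $1
  # md5sum prints the 32-character hash, two separator characters, then the path;
  # taking the rest of the line keeps paths that contain spaces intact
  path = substr($0, 35)
  a[md5] = md5 in a ? a[md5] RS path : path
  b[md5]++ }
  END{for(x in b)
        if(b[x]>1)
          printf "Duplicate Files (MD5:%s):\n%s\n",x,a[x] }' <(find . -type f -exec md5sum {} +)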

As we can see, it’s quite similar to the previous script, where we searched by file name. However, we additionally generate an MD5 hash for every file using the -exec md5sum {} + action added to the find command.

Let’s run it in our test directory and check the output:

Duplicate Files (MD5:1d65953b527afb4bd9bc0986fd0b9547):
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1

As we can see, although we have three files named text-file-2, they don’t appear in the search by MD5 hash because their content differs.

5. Find Duplicate Files by Size

When there is a large number of files to check, calculating the hash of each one could take a long time. In such situations, we can start by finding files with the same size and then apply a hash check only to them. This speeds up the search because duplicate files must have the same file size.

We can use the du command to get the size of a file. With GNU du, the -b option reports the apparent size in bytes instead of the disk usage.

Let’s write a script to find files with the same size:

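awk -F'\t' '{
  # du separates the size from the path with a tab, so splitting on tabs
  # keeps paths that contain spaces intact
  size = $1
  a[size]=size in a ? a[size] RS $2 : $2
  b[size]++ }
  END{for(x in b)
        if(b[x]>1)
          printf "Duplicate Files By Size: %d Bytes\n%s\n",x,a[x] }' <(find . -type f -exec du -b {} +)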

In this example, we add the -exec du -b {} + parameter to the find command to pass the size of each file to the awk command.

Executing it in the baeldung/ directory will produce the output:

Duplicate Files By Size: 16 Bytes
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Files By Size: 22 Bytes
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2

These results are not correct in terms of content duplication because every text-file-2 has different content, even though they all have the same size.

However, we can then use this output as the input for other duplication checks on a smaller scale.
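One way to chain the two checks is sketched below: the first stage keeps only files whose size occurs more than once, and the second stage hashes just those candidates and groups them by digest. The function name dupes_by_size_then_md5 is hypothetical. The sketch relies on GNU options (du -b, xargs -r -d, uniq -w and --all-repeated) and assumes that paths contain neither tabs nor newlines.

```shell
# Two-stage duplicate search: filter by size first, then hash the candidates.
# dupes_by_size_then_md5 is a hypothetical helper name used for this sketch.
dupes_by_size_then_md5() {
  find "${1:-.}" -type f -exec du -b {} + |
    awk -F'\t' '
      # Collect the paths for every size (du prints "size<TAB>path")
      { count[$1]++; paths[$1] = paths[$1] $2 "\n" }
      # Emit only the paths whose size was seen more than once
      END { for (s in count) if (count[s] > 1) printf "%s", paths[s] }' |
    xargs -r -d '\n' md5sum |              # hash the remaining candidates only
    sort |                                 # equal digests become adjacent
    uniq -w32 --all-repeated=separate      # keep lines whose first 32 chars repeat
}

dupes_by_size_then_md5 .
```

Each group of lines in the output shares one MD5 digest, and groups are separated by an empty line. Files that merely have the same size, such as the three text-file-2 files, are hashed but then dropped by the last step.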

6. Find Duplicate Files Using fdupes and jdupes

There are a lot of ready-to-use programs that combine several methods of finding duplicate files, such as checking the file size and MD5 signatures.

One popular tool is fdupes. It works by comparing files by size and MD5 signature. If these are equal, it follows up with a byte-by-byte comparison.
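The final byte-by-byte step can be illustrated with the standard cmp command. This is not how fdupes is implemented internally, just a sketch of the same idea. The helper name same_content is hypothetical, and the paths in the usage lines assume we're inside the baeldung test directory.

```shell
# same_content is a hypothetical helper: it succeeds only if both files
# are byte-for-byte identical. cmp -s compares silently and reports the
# result through its exit status.
same_content() {
  cmp -s -- "$1" "$2"
}

if same_content folder1/text-file-1 folder2/text-file-1; then
  echo "identical"
else
  echo "different (or unreadable)"
fi
```

Such a comparison rules out the rare case in which two different files share both size and hash.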

jdupes is considered an enhanced fork of fdupes. In testing on various data sets, jdupes seems to be much faster than fdupes on average.

To search for duplicate files using fdupes, we type:

fdupes -r .

And to search for duplicates with jdupes:

jdupes -r .

Both of these commands will result in the same output:

./folder1/text-file-1
./folder2/text-file-1
./folder3/text-file-1

Note that although jdupes is very similar to fdupes, from which it was initially derived, it isn’t developed to be a compatible replacement for fdupes.

7. Conclusion

In this tutorial, we’ve learned how to find duplicate files in Unix systems using the file name, checksum, file size, fdupes, and jdupes.

3 Comments
rustycode
1 month ago

There is also a much faster modern alternative to fdupes and jdupes: fclones. It searches for files in parallel and uses a much faster hash function than md5.

Loredana Crusoveanu
1 month ago
Reply to  rustycode

Hi,
Thanks for your inputs, this looks like a great library.

Jody
1 month ago
Reply to  rustycode

I noticed that fclones does not do the byte-for-byte safety check that jdupes (and fdupes) does. It also relies exclusively on a non-cryptographic hash for comparisons. It is unsafe to rely on a non-cryptographic hash as a substitute for the file data, and comparisons between duplicate finders running in full-file comparison mode vs. running in hash-and-compare mode are not appropriate. The benchmark on the fclones page ran jdupes 1.14 without the -Q option that disables the final byte-for-byte confirmation, so there is a lot of extra work for the purpose of avoiding potential data loss being done by jdupes and…