
Last updated: May 11, 2022
In this tutorial, we’re going to take a look at some different ways of finding duplicate files in Unix systems.
First, let’s have a quick look at the file structure we’ll use for our examples:
.
+--baeldung
| +--folder1
| | +--text-file-1
| | | Content: "I am not unique"
| | +--text-file-2
| | | Content: "Some random content 1"
| | +--unique-file-1
| | | Content: "Some unique content 1\nI am a very long line!"
| +--folder2
| | +--text-file-1
| | | Content: "I am not unique"
| | +--text-file-2
| | | Content: "Some random content 2"
| | +--unique-file-2
| | | Content: "Some unique content 2! \n I am a short line."
| +--folder3
| | +--text-file-1
| | | Content: "I am not unique"
| | +--text-file-2
| | | Content: "Some random content 3"
| | +--unique-file-3
| | | Content: "Some unique content 3\nI am an extreme long line............"
The baeldung directory will be our test directory. Inside, we have three folders: folder1, folder2, and folder3. Each of them contains a text-file-1 file with the same content and a text-file-2 file whose content differs from folder to folder. Each folder also contains a unique-file-x file, which has both a unique name and unique content.
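To follow along, the tree above can be recreated with a few commands. This is a minimal sketch: the function name make_sample_tree is hypothetical, and it assumes every file ends with a trailing newline. Under that assumption, text-file-1 is 16 bytes and text-file-2 is 22 bytes, which matches the sizes reported in the size-based search later on.

```shell
# Build the sample tree under the given root (default: ./baeldung).
# Assumption: every file ends with a trailing newline.
make_sample_tree() {
    root=${1:-baeldung}
    for i in 1 2 3; do
        mkdir -p "$root/folder$i"
        # Same content in every folder
        printf 'I am not unique\n' > "$root/folder$i/text-file-1"
        # Same name, different content per folder
        printf 'Some random content %s\n' "$i" > "$root/folder$i/text-file-2"
    done
    printf 'Some unique content 1\nI am a very long line!\n' > "$root/folder1/unique-file-1"
    printf 'Some unique content 2! \n I am a short line.\n' > "$root/folder2/unique-file-2"
    printf 'Some unique content 3\nI am an extreme long line............\n' > "$root/folder3/unique-file-3"
}
```

Calling make_sample_tree without arguments creates the baeldung directory in the current working directory.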
A simple way of finding duplicate files is to search by file name. We can do this using a short script, which relies on process substitution (<(...)) and therefore needs a shell such as Bash:
awk -F'/' '{
    f = $NF
    a[f] = f in a ? a[f] RS $0 : $0
    b[f]++ }
END { for (x in b)
        if (b[x] > 1)
            printf "Duplicate Filename: %s\n%s\n", x, a[x] }' <(find . -type f)
Running it in the baeldung directory should list all files with non-unique names:
Duplicate Filename: text-file-1
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Filename: text-file-2
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2
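Now, let’s go through the script and explain what it does. The find . -type f command lists every regular file under the current directory. The awk command splits each path on the / character, so $NF holds the file name. The array a collects the full paths seen for each name, separated by newlines (RS), while the array b counts how many times each name occurs. Finally, the END block prints every name whose count is greater than one, followed by its paths.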
Note that in this example, we’re only searching for duplicate file names. In the next sections, we’ll discover different methods of finding duplicate files by their content.
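When only the repeated names are of interest, and not their locations, a shorter pipeline built on sort and uniq is enough. The following is a sketch rather than a replacement for the script above. The function name duplicate_names is hypothetical, and it assumes file names contain no newline characters.

```shell
# List file names that occur more than once under a directory (default: .).
# Assumption: file names contain no newline characters.
duplicate_names() {
    # 1. list regular files
    # 2. keep only the last path component (the file name)
    # 3. sort so identical names become adjacent
    # 4. print one copy of each name that is repeated
    find "${1:-.}" -type f |
    awk -F'/' '{ print $NF }' |
    sort |
    uniq -d
}
```

Run against the test directory, this prints text-file-1 and text-file-2, one per line.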
The MD5 message-digest algorithm is a widely used hash function that produces a 128-bit hash value based on the file content. It was initially designed as a cryptographic hash function. Although it’s no longer considered secure for that purpose, it’s still widely used as a checksum to verify data integrity.
In Linux, we can use the md5sum command to get the MD5 hash of a file.
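As a small illustration, two files can be compared by their MD5 hashes alone. This is a sketch under the assumption that md5sum from GNU coreutils is installed, and the helper name same_md5 is hypothetical.

```shell
# Succeed (exit status 0) when two files have the same MD5 hash.
same_md5() {
    # Reading from stdin keeps the file path out of the md5sum output;
    # awk then keeps only the first field, which is the hash itself.
    h1=$(md5sum < "$1" | awk '{ print $1 }')
    h2=$(md5sum < "$2" | awk '{ print $1 }')
    [ "$h1" = "$h2" ]
}
```

For example, same_md5 folder1/text-file-1 folder2/text-file-1 succeeds in the test directory, while comparing the two text-file-2 files fails.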
Because MD5 is generated from the file content, we can use it to find duplicate files:
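awk '{
    md5 = $1
    # Everything after the hash is the path, which may contain spaces
    name = $0
    sub(/^[^ \t]+[ \t]+/, "", name)
    a[md5] = md5 in a ? a[md5] RS name : name
    b[md5]++ }
END { for (x in b)
        if (b[x] > 1)
            printf "Duplicate Files (MD5:%s):\n%s\n", x, a[x] }' <(find . -type f -exec md5sum {} +)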
As we can see, the script is quite similar to the previous one, where we searched by file name. However, we additionally generate an MD5 hash for every file using the -exec md5sum {} + parameter added to the find command.
Let’s run it in our test directory and check the output:
Duplicate Files (MD5:1d65953b527afb4bd9bc0986fd0b9547):
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
As we can see, although we have three files named text-file-2, they will not appear in the search by MD5 hash because their content is unique.
When there is a large number of files to check, calculating the hash of each one could take a long time. In such situations, we could start by finding files with the same size and then apply a hash check only to them. This speeds up the search because duplicate files always have the same size, so any file with a unique size can be skipped.
We can use the du command to calculate the size of a file.
Let’s write a script to find files with the same size:
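awk '{
    size = $1
    # Everything after the size is the path, which may contain spaces
    name = $0
    sub(/^[^ \t]+[ \t]+/, "", name)
    a[size] = size in a ? a[size] RS name : name
    b[size]++ }
END { for (x in b)
        if (b[x] > 1)
            printf "Duplicate Files By Size: %d Bytes\n%s\n", x, a[x] }' <(find . -type f -exec du -b {} +)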
In this example, we add the -exec du -b {} + parameter to the find command to pass the size of each file, in bytes, to the awk command. Note that -b is a GNU du option that reports the apparent size in bytes.
Executing it in the baeldung/ directory will produce the output:
Duplicate Files By Size: 16 Bytes
./folder3/text-file-1
./folder2/text-file-1
./folder1/text-file-1
Duplicate Files By Size: 22 Bytes
./folder3/text-file-2
./folder2/text-file-2
./folder1/text-file-2
These results are not correct in terms of content duplication because every text-file-2 has different content, even though they all have the same size.
However, we can then use this input to perform other duplication checks on a smaller scale.
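One way to chain the two checks is to hash only the files whose size occurs more than once. The function below is a rough sketch of that idea, not a hardened tool. Its name, dupes_by_size_then_md5, is hypothetical, and it assumes GNU du and md5sum as well as file names without tabs or newlines.

```shell
# Print groups of files that share both their size and their MD5 hash.
# Assumptions: GNU du and md5sum; file names contain no tabs or newlines.
dupes_by_size_then_md5() {
    find "${1:-.}" -type f -exec du -b {} + |
    awk -F'\t' '
        # Bucket the paths by size; du separates size and path with a tab
        { n[$1]++; paths[$1] = paths[$1] $2 "\n" }
        # Keep only the paths whose size occurs more than once
        END { for (s in n) if (n[s] > 1) printf "%s", paths[s] }' |
    while IFS= read -r path; do
        # Hash the remaining candidates only
        md5sum "$path"
    done |
    sort |
    awk '{
        hash = $1
        path = $0
        sub(/^[^ \t]+[ \t]+/, "", path)
        if (hash == prev) {
            # Second file with this hash: print the header and the first path
            if (!shown) { print "Duplicate Files (MD5:" hash "):\n" first; shown = 1 }
            print path
        } else {
            prev = hash; first = path; shown = 0
        }
    }'
}
```

In the test directory, all six text-file-1 and text-file-2 files pass the size filter, but only the three text-file-1 files are reported, because the hashes of the text-file-2 files differ. The three unique-file-x files are never hashed at all.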
There are a lot of ready-to-use programs that combine many methods of finding duplicate files like checking the file size and MD5 signatures.
One popular tool is fdupes. It compares files by size and then by MD5 signature. If both match, it follows up with a byte-by-byte comparison.
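The same staged idea can be imitated for a single pair of files with standard tools. The helper below is only an illustration of the three stages and is not how fdupes is implemented internally. The name is_duplicate is hypothetical, and md5sum is assumed to be available.

```shell
# Succeed when two files are byte-for-byte identical, checking the
# cheapest criteria first.
is_duplicate() {
    # Stage 1: files of different sizes can never be duplicates
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # Stage 2: different MD5 hashes rule out identical content
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # Stage 3: confirm with a byte-by-byte comparison
    cmp -s -- "$1" "$2"
}
```

The final cmp step guards against the unlikely case of two different files that happen to share both size and hash.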
jdupes is considered an enhanced fork of fdupes. In testing on various data sets, jdupes seems to be much faster than fdupes on average.
To search for duplicate files using fdupes, we type:
fdupes -r .
And to search duplicates with jdupes:
jdupes -r .
Both of these commands will result in the same output:
./folder1/text-file-1
./folder2/text-file-1
./folder3/text-file-1
Beware: although jdupes is very similar to fdupes, from which it was initially derived, it is not developed as a compatible replacement for fdupes.
In this tutorial, we’ve learned how to find duplicate files in Unix systems using the file name, checksum, file size, fdupes, and jdupes.