1. Overview

Sometimes we need to find out whether we have duplicate files on our Linux filesystem.

In this short tutorial, we’ll explore several tools for checking whether two files have identical content.

We’ll also benchmark our tools to find the fastest.

2. Basic Comparison

We can see if two files have the same content by calculating their hash values.

Let’s create three files and compare their SHA-1 hashes with sha1sum:

$ echo baeldung > /tmp/file1
$ echo baeldunq > /tmp/file2
$ echo baeldung > /tmp/file3
$ sha1sum /tmp/file1
3a5602d2d404f45f0d3d9591d41a74b6be59e507  /tmp/file1
$ sha1sum /tmp/file2
afd85322f7a10ed08177b4b03a33cf0fce5a0ef9  /tmp/file2
$ sha1sum /tmp/file3
3a5602d2d404f45f0d3d9591d41a74b6be59e507  /tmp/file3

As we can see, the hashes of file1 and file3 match, so the two files have the same content. The hash of file2 is different, so its content differs.
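Comparing hashes by eye works for a handful of files, but a script can do the comparison for us. The following is a minimal sketch, assuming a POSIX shell; the same_hash helper and the scratch directory created with mktemp are our own illustration, not part of sha1sum:

```shell
# Hypothetical scratch directory so the example doesn't touch existing files.
dir=$(mktemp -d)
echo baeldung > "$dir/file1"
echo baeldunq > "$dir/file2"
echo baeldung > "$dir/file3"

# Hypothetical helper: succeeds when both files have the same SHA-1 hash.
# Reading from stdin makes sha1sum print "-" instead of the file name,
# so the two output lines can be compared as plain strings.
same_hash() {
    [ "$(sha1sum < "$1")" = "$(sha1sum < "$2")" ]
}

if same_hash "$dir/file1" "$dir/file3"; then
    result_1_3="same"
else
    result_1_3="different"
fi

if same_hash "$dir/file1" "$dir/file2"; then
    result_1_2="same"
else
    result_1_2="different"
fi

echo "file1 vs file3: $result_1_3"
echo "file1 vs file2: $result_1_2"

rm -r "$dir"
```

Redirecting each file into sha1sum is a design choice: it avoids having to cut the file name out of the output before comparing.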

3. Comparison Using diff

GNU diff can compare two files line by line and report all the differences.

We can run diff with the flags -sq:

-q, --brief
    report only when files differ
-s, --report-identical-files
    report when two files are the same

Let’s verify if file1, file2, and file3 have the same contents using diff:

$ diff -sq /tmp/file1 /tmp/file2
Files /tmp/file1 and /tmp/file2 differ
$ diff -sq /tmp/file1 /tmp/file3
Files /tmp/file1 and /tmp/file3 are identical

The output indicates that file1 and file3 are the same, and file2 has different contents.
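In a script, we usually care about the exit status of diff rather than its message: GNU diff exits with 0 when the files match, 1 when they differ, and 2 when something goes wrong. Here is a small sketch under that assumption; the diff_verdict helper and the file names are hypothetical:

```shell
# Hypothetical helper: translate diff's exit status into a word.
diff_verdict() {
    status=0
    # -q keeps diff from printing the differences themselves; the redirect
    # hides the one-line summary and any error message, so only the exit
    # status is left for us to inspect.
    diff -q "$1" "$2" > /dev/null 2>&1 || status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}

dir=$(mktemp -d)
echo baeldung > "$dir/file1"
echo baeldunq > "$dir/file2"
echo baeldung > "$dir/file3"

v12=$(diff_verdict "$dir/file1" "$dir/file2")
v13=$(diff_verdict "$dir/file1" "$dir/file3")
# A file that doesn't exist should land in the "error" branch.
vmissing=$(diff_verdict "$dir/file1" "$dir/no-such-file")

echo "file1 vs file2: $v12"
echo "file1 vs file3: $v13"
echo "file1 vs missing file: $vmissing"

rm -r "$dir"
```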

4. Comparison Using cmp

GNU cmp compares two files byte by byte and prints the location of the first difference. We can pass the -s flag to find out if the files have the same content.

Since -s suppresses all output, we need to print the exit status to learn the result. Exit status 0 means the file contents are the same, 1 means they differ, and 2 indicates an error such as a missing file:

$ cmp -s /tmp/file1 /tmp/file2
$ echo $?
1
$ cmp -s /tmp/file1 /tmp/file3
$ echo $?
0

Here, $? is a special shell variable that holds the exit status of the most recently executed command. Since the contents of file1 and file2 are different, the first cmp exited with status 1. The second one exited with status 0, which tells us that file1 and file3 have the same content.
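Because cmp -s reports its result only through the exit status, it fits naturally into an if statement. As a rough sketch of how this could be used to spot duplicates of one file in a directory (the directory layout and variable names are assumptions made for this example):

```shell
dir=$(mktemp -d)
echo baeldung > "$dir/file1"
echo baeldunq > "$dir/file2"
echo baeldung > "$dir/file3"

reference="$dir/file1"
duplicates=""

for candidate in "$dir"/*; do
    # Comparing the reference with itself would always match, so skip it.
    if [ "$candidate" = "$reference" ]; then
        continue
    fi
    # cmp -s prints nothing; the if statement reacts to its exit status.
    if cmp -s "$reference" "$candidate"; then
        # ${candidate##*/} strips the directory part, leaving the file name.
        duplicates="$duplicates ${candidate##*/}"
    fi
done

echo "Duplicates of file1:$duplicates"

rm -r "$dir"
```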

5. Benchmark

To see how fast these commands are on big files, let’s create a large file of random content using dd. We need to try this out on a volume with plenty of free space, as the file takes about 1 GB and we’ll make a copy of it in the next step:

$ dd if=/dev/urandom of=/tmp/file1 count=1K bs=1MB
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 39.7297 s, 25.8 MB/s

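Next, we’ll copy file1 to file2, and append a different character to each file. This way, the two files have the same size but differ at the very end, so a comparison has to read both files completely: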

$ cp /tmp/file1 /tmp/file2
$ echo 1 >> /tmp/file1
$ echo 2 >> /tmp/file2
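Before timing anything, it can be worth confirming that the test files really have equal sizes and different content. The sketch below repeats the setup on a much smaller scale; the 1,000,000-byte size and the scratch directory are assumptions chosen to keep the check quick:

```shell
dir=$(mktemp -d)

# A small stand-in for the 1 GB file: 1,000,000 random bytes.
head -c 1000000 /dev/urandom > "$dir/file1"
cp "$dir/file1" "$dir/file2"

# Each echo appends two bytes: the digit and a newline.
echo 1 >> "$dir/file1"
echo 2 >> "$dir/file2"

size1=$(wc -c < "$dir/file1")
size2=$(wc -c < "$dir/file2")

# Record the exit status of cmp -s without stopping on a non-zero value.
cmp_status=0
cmp -s "$dir/file1" "$dir/file2" || cmp_status=$?

echo "sizes: $size1 and $size2 bytes, cmp exit status: $cmp_status"

rm -r "$dir"
```

If the sizes match and cmp reports a difference, the files are set up the way the benchmark needs them.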

Now, we’ll use the time command to measure the time taken by each of the three tools:

$ time cmp -s /tmp/file1 /tmp/file2
cmp -s /tmp/file1 /tmp/file2  0.19s user 0.44s system 99% cpu 0.636 total
$ time diff -sq /tmp/file1 /tmp/file2
Files /tmp/file1 and /tmp/file2 differ
diff -sq /tmp/file1 /tmp/file2  0.20s user 0.44s system 99% cpu 0.640 total
$ time sha1sum /tmp/file1
f998e30863cbbe56bdd897997a79e333400a5369  /tmp/file1
sha1sum /tmp/file1  1.07s user 0.17s system 99% cpu 1.246 total
$ time sha1sum /tmp/file2
05cbc671d3d38102f4c90c01153a58aaacec69af  /tmp/file2
sha1sum /tmp/file2  1.10s user 0.14s system 49% cpu 2.481 total

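Evidently, cmp -s and diff -sq take almost the same time (about 0.64 seconds) to verify whether two files have the same content. Comparison by hashing is noticeably slower: each sha1sum run takes over a second on its own, and we need one run per file.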

We should note that when we pass the -sq flags, diff first checks the file sizes and instantly reports a mismatch if the sizes are different.

Similarly, diff -sq instantly reports a match without checking the contents if we compare a file with itself. This produces very fast results.

The command cmp -s also performs these prechecks.
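To make the two shortcuts concrete, here they are written out by hand in a shell function. This is only an illustration of the idea, not how diff or cmp implement it internally; the quick_compare name is hypothetical, and the -ef test (true when both names refer to the same file) is assumed to be available, as it is in bash and dash:

```shell
# Hypothetical helper: apply the cheap checks first, then compare the bytes.
quick_compare() {
    # Shortcut 1: both names refer to the same file, so the content must match.
    if [ "$1" -ef "$2" ]; then
        echo "same file"
        return 0
    fi
    # Shortcut 2: files of different sizes cannot have the same content.
    if [ "$(wc -c < "$1")" -ne "$(wc -c < "$2")" ]; then
        echo "sizes differ"
        return 1
    fi
    # No shortcut applies: fall back to a full byte-by-byte comparison.
    if cmp -s "$1" "$2"; then
        echo "content matches"
        return 0
    fi
    echo "content differs"
    return 1
}

dir=$(mktemp -d)
echo baeldung > "$dir/a"
echo "baeldung plus more" > "$dir/b"
echo baeldunq > "$dir/c"

# Only the printed message is kept here, so the exit status is ignored.
r_self=$(quick_compare "$dir/a" "$dir/a") || true
r_size=$(quick_compare "$dir/a" "$dir/b") || true
r_full=$(quick_compare "$dir/a" "$dir/c") || true

echo "a vs a: $r_self"
echo "a vs b: $r_size"
echo "a vs c: $r_full"

rm -r "$dir"
```

Only the last pair reaches the byte-by-byte comparison; the first two are decided from file metadata alone, which is why such cases finish so quickly.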

6. Conclusion

In this article, we explored three tools to verify whether two given files have the same content.

We compared their execution times and found that hashing is the slowest.
