1. Overview
Sometimes we need to find out whether we have duplicate files on our Linux filesystem.
In this short tutorial, we’ll explore several tools that check whether two files have identical content.
We’ll also benchmark our tools to find the fastest.
2. Basic Comparison
We can see if two files have the same content by calculating their hash values.
Let’s create three files and compare their SHA1 hashes with sha1sum:
$ echo baeldung > /tmp/file1
$ echo baeldunq > /tmp/file2
$ echo baeldung > /tmp/file3
$ sha1sum /tmp/file1
3a5602d2d404f45f0d3d9591d41a74b6be59e507 /tmp/file1
$ sha1sum /tmp/file2
afd85322f7a10ed08177b4b03a33cf0fce5a0ef9 /tmp/file2
$ sha1sum /tmp/file3
3a5602d2d404f45f0d3d9591d41a74b6be59e507 /tmp/file3
As we can see, the hashes of file1 and file3 match, so the two files have the same content, whereas file2 is different.
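Comparing the digests by eye works for a handful of files, but a script can do the check for us. The following is a minimal sketch, not a definitive implementation: the function name same_hash and the temporary directory are assumptions made for illustration only.

```shell
#!/bin/sh
# same_hash is a hypothetical helper name, not part of any standard tool.
# It succeeds (exit status 0) when both files produce the same SHA-1 digest.
same_hash() {
    # Reading each file from stdin keeps the file name out of sha1sum's
    # output, so the two strings differ only if the digests differ.
    [ "$(sha1sum < "$1")" = "$(sha1sum < "$2")" ]
}

# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
echo baeldung > "$workdir/file1"
echo baeldunq > "$workdir/file2"
echo baeldung > "$workdir/file3"

if same_hash "$workdir/file1" "$workdir/file3"; then
    result_1_3=same
else
    result_1_3=different
fi

if same_hash "$workdir/file1" "$workdir/file2"; then
    result_1_2=same
else
    result_1_2=different
fi

echo "file1 vs file3: $result_1_3"
echo "file1 vs file2: $result_1_2"

rm -r "$workdir"
```

Because the helper reports its verdict through the exit status, it can be used directly in if statements or chained with && and ||.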
3. Comparison Using diff
GNU diff can compare two files line by line and report all the differences.
We can run diff with the flags -sq:
-q, --brief                    report only when files differ
-s, --report-identical-files   report when two files are the same
Let’s verify if file1, file2, and file3 have the same contents using diff:
$ diff -sq /tmp/file1 /tmp/file2
Files /tmp/file1 and /tmp/file2 differ
$ diff -sq /tmp/file1 /tmp/file3
Files /tmp/file1 and /tmp/file3 are identical
The output indicates that file1 and file3 are the same, and file2 has different contents.
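diff also reports its verdict through the exit status (0 when the files are the same, 1 when they differ), so we can use it in a loop to look for copies of a reference file. This is a small sketch under stated assumptions: the scratch directory, the variable names reference and matches, and the choice of file1 as the reference are illustrative only.

```shell
#!/bin/sh
# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
echo baeldung > "$workdir/file1"
echo baeldunq > "$workdir/file2"
echo baeldung > "$workdir/file3"

reference="$workdir/file1"
matches=""

for candidate in "$workdir"/*; do
    # Skip the reference file itself.
    if [ "$candidate" = "$reference" ]; then
        continue
    fi
    # diff -q exits with 0 when the files are the same and 1 when they differ.
    # Its one-line report is discarded because only the exit status is needed.
    if diff -q "$reference" "$candidate" > /dev/null; then
        matches="$matches $(basename "$candidate")"
    fi
done

echo "Same content as file1:$matches"

rm -r "$workdir"
```

Here we drop the -s flag, since the "are identical" message would be thrown away anyway.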
4. Comparison Using cmp
GNU cmp compares two files byte by byte and prints the location of the first difference. We can pass the -s flag to find out if the files have the same content.
Since -s suppresses all output, we should echo the exit code to know the result. Exit code 0 means the file contents are the same, 1 means they are different, and a value greater than 1 means that an error occurred:
$ cmp -s /tmp/file1 /tmp/file2
$ echo $?
1
$ cmp -s /tmp/file1 /tmp/file3
$ echo $?
0
Here, $? is a special variable that holds the exit status of the most recently executed command. Since the contents of file1 and file2 are different, cmp exited with status 1. For file1 and file3, the exit code 0 indicates that they have the same content.
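In a script, we usually want to turn these exit codes into a readable verdict. The sketch below shows one way to do that. The function name compare_files and the wording of the verdicts are assumptions chosen for illustration.

```shell
#!/bin/sh
# compare_files is a hypothetical helper name.
# It prints a one-word verdict based on the exit status of cmp -s.
compare_files() {
    cmp_status=0
    # "|| cmp_status=$?" records a non-zero status without aborting
    # scripts that run under "set -e".
    cmp -s "$1" "$2" 2> /dev/null || cmp_status=$?
    case $cmp_status in
        0) echo identical ;;
        1) echo different ;;
        *) echo error ;;      # e.g. a missing or unreadable file
    esac
}

# Recreate the three sample files in a scratch directory.
workdir=$(mktemp -d)
echo baeldung > "$workdir/file1"
echo baeldunq > "$workdir/file2"
echo baeldung > "$workdir/file3"

verdict_1_2=$(compare_files "$workdir/file1" "$workdir/file2")
verdict_1_3=$(compare_files "$workdir/file1" "$workdir/file3")
verdict_missing=$(compare_files "$workdir/file1" "$workdir/no-such-file")

echo "file1 vs file2: $verdict_1_2"
echo "file1 vs file3: $verdict_1_3"
echo "file1 vs missing file: $verdict_missing"

rm -r "$workdir"
```

Handling the error case separately matters because a missing file would otherwise be easy to mistake for a content mismatch.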
5. Speed Comparison
To see how fast these commands are on big files, let’s create a large file of random content using dd. We need to try this out on a volume with plenty of free space: dd writes a 1 GB file, and the copy we make in the next step doubles that to about 2 GB:
$ dd if=/dev/urandom of=/tmp/file1 count=1K bs=1MB
1024+0 records in
1024+0 records out
1024000000 bytes (1.0 GB, 977 MiB) copied, 39.7297 s, 25.8 MB/s
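Next, we’ll copy file1 to file2 and append a different character to each file. The two files then have the same size and differ only at the very end, so a comparison has to read them completely: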
$ cp /tmp/file1 /tmp/file2
$ echo 1 >> /tmp/file1
$ echo 2 >> /tmp/file2
Now, we’ll use the time command to measure the time taken by each of the three tools:
$ time cmp -s /tmp/file1 /tmp/file2
cmp -s /tmp/file1 /tmp/file2 0.19s user 0.44s system 99% cpu 0.636 total
$ time diff -sq /tmp/file1 /tmp/file2
Files /tmp/file1 and /tmp/file2 differ
diff -sq /tmp/file1 /tmp/file2 0.20s user 0.44s system 99% cpu 0.640 total
$ time sha1sum /tmp/file1
f998e30863cbbe56bdd897997a79e333400a5369 /tmp/file1
sha1sum /tmp/file1 1.07s user 0.17s system 99% cpu 1.246 total
$ time sha1sum /tmp/file2
05cbc671d3d38102f4c90c01153a58aaacec69af /tmp/file2
sha1sum /tmp/file2 1.10s user 0.14s system 49% cpu 2.481 total
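Evidently, cmp -s and diff -sq take almost the same time, about 0.64 seconds each, to verify whether two files have the same content. Comparison by hashing is noticeably slower: sha1sum has to run once per file, and the two runs add up to roughly 3.7 seconds of wall-clock time.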
We should note that with the -sq flags, diff first checks the file sizes and instantly reports a mismatch if the sizes are different.
Similarly, diff -sq instantly reports a match without checking the contents if we compare a file with itself. In both cases, the result is available without reading any file content.
The command cmp -s also performs these prechecks.
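To make the idea behind these prechecks concrete, here is a rough shell sketch of the same logic. It is only an illustration and not how diff or cmp are implemented: the function name quick_same is hypothetical, and the sizes are obtained with wc -c, which some implementations compute by reading the file, whereas the real tools rely on file metadata.

```shell
#!/bin/sh
# quick_same is a hypothetical helper that mimics the two shortcuts:
#   1. the same file reached through two names is trivially identical
#   2. files of different sizes cannot have the same content
# Only when both shortcuts fail do we compare the bytes with cmp -s.
quick_same() {
    # -ef is true when both names refer to the same device and inode.
    if [ "$1" -ef "$2" ]; then
        return 0
    fi
    size1=$(wc -c < "$1") || return 2
    size2=$(wc -c < "$2") || return 2
    if [ "$size1" -ne "$size2" ]; then
        return 1
    fi
    cmp -s "$1" "$2"
}

workdir=$(mktemp -d)
printf 'abc\n'  > "$workdir/short"
printf 'abcd\n' > "$workdir/longer"

# Different sizes: the verdict is reached without a byte-by-byte comparison.
if quick_same "$workdir/short" "$workdir/longer"; then
    size_verdict=same
else
    size_verdict=different
fi

# A file compared with itself: the verdict is reached by the -ef shortcut.
if quick_same "$workdir/short" "$workdir/short"; then
    self_verdict=same
else
    self_verdict=different
fi

echo "short vs longer: $size_verdict"
echo "short vs itself: $self_verdict"

rm -r "$workdir"
```

The function keeps the exit code convention of cmp: 0 for the same content, 1 for different content, and a larger value when something went wrong.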
6. Conclusion
In this article, we explored three tools to verify whether two given files have the same content.
We compared their execution times and found that hashing is the slowest.