Verifying MD5 Hash Values for a Large Number of Files

1. Overview

The MD5 checksum is a cryptographic hash useful for verifying file integrity. For example, we can compute the MD5 checksum of a file we’ve downloaded from a trusted source and compare it against the published hash value. If the two hashes match, the integrity is verified. Otherwise, the file content may be corrupt or it may have been tampered with.

In this tutorial, we’ll explore how to verify the MD5 checksum for a large number of files in Bash.

2. Sample Task

Let’s create several files in the current directory with different content in each:

$ for i in {1..5}; do echo "${i}" > "file${i}"; done

Here, we use a for loop to iterate over integers from 1 to 5 and write each integer to a separate file.

We can list the files using the ls command:

$ ls
file1  file2  file3  file4  file5

To compute the MD5 checksum for a file such as file1, we can use the md5sum command:

$ md5sum file1
b026324c6904b2a9cb4b88d6d61c81d1  file1

Notably, the command prints the MD5 hash of the file followed by the filename. In this case, there’s no published hash to compare against: we created the file ourselves, so we’re sure of its integrity and can treat the computed value as the reference.

On the other hand, if we had obtained file1 from an external source, we should compare the computed hash value with that published by the source, if available.

An MD5 checksum always has a fixed length of 128 bits, which md5sum prints as 32 hexadecimal characters. Comparing two such values by eye is tedious and error-prone, especially when we have to repeat this for a large number of files.
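As a quick illustration of the fixed length, the following Python sketch hashes the same bytes that our for loop wrote to file1, namely the digit 1 followed by a newline. It relies only on the standard hashlib module, and the variable names are our own:

```python
import hashlib

# The bytes that `echo "1" > file1` writes: the digit 1 plus a trailing newline
content = b"1\n"

# hexdigest() returns the 128-bit MD5 value as a lowercase hexadecimal string
digest = hashlib.md5(content).hexdigest()

print(digest)
print(len(digest))  # 32 hexadecimal characters encode 16 bytes
```

The first printed line should be identical to the hash that md5sum reported for file1.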

Let’s explore how we can automatically verify the MD5 checksum both for a single file and for several files.

3. Verify MD5 Hash for a Single File

We can use the -c option of md5sum to check if the MD5 hash of a file matches its declared value:

$ md5sum -c <<< "b026324c6904b2a9cb4b88d6d61c81d1  file1"
file1: OK

In this case, we use a here-string to provide the expected hash value followed by the filename. The md5sum -c command computes the MD5 hash of file1 and compares it against the provided value. Then, it returns an OK message since the two hashes match.

Notably, it’s standard convention to separate the filename from the expected hash value using two space characters. However, the md5sum -c command is flexible and can handle variations, such as having a single space instead of two, or a space character followed by a * symbol.

Moreover, we don’t need to distinguish between text and binary files when using the md5sum command in Linux.
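To make these separator variations concrete, here is a minimal Python sketch that splits a checksum line into its two fields. The helper name parse_checksum_line is hypothetical, and the function only mirrors the variations described above; it is not the parser that md5sum itself uses:

```python
def parse_checksum_line(line):
    """Split a checksum line into (expected_hash, filename)."""
    # The hash ends at the first space; the rest belongs to the filename field
    expected_hash, rest = line.rstrip("\n").split(" ", 1)
    # A second space or a "*" right after the separator is a marker,
    # not part of the filename
    if rest[:1] in (" ", "*"):
        rest = rest[1:]
    return expected_hash, rest


for sample in (
    "b026324c6904b2a9cb4b88d6d61c81d1  file1",  # two spaces
    "b026324c6904b2a9cb4b88d6d61c81d1 file1",   # single space
    "b026324c6904b2a9cb4b88d6d61c81d1 *file1",  # space followed by *
):
    print(parse_checksum_line(sample))
```

All three samples yield the same hash and filename pair. Because the line is split only once, a filename that contains spaces stays intact.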

Another way to pass input to the md5sum -c command is across a pipe:

$ echo "b026324c6904b2a9cb4b88d6d61c81d1  file1" | md5sum -c -
file1: OK

In this case, we use echo to print the input in the conventional format. Then, we pipe the result to the md5sum -c command via stdin, as indicated by the hyphen (-) appearing at the end of the command.
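A script of our own can follow the same convention of treating a hyphen as standard input. The sketch below assumes a hypothetical helper named read_checksum_lines and uses an in-memory stream in place of a real pipe, so that it runs without any files:

```python
import io
import sys


def read_checksum_lines(source, stdin=None):
    """Return non-blank checksum lines from a path, or from stdin if source is "-"."""
    if source == "-":
        # Fall back to the real standard input unless a stream is supplied
        stream = stdin if stdin is not None else sys.stdin
        return [line.rstrip("\n") for line in stream if line.strip()]
    with open(source, "r") as file:
        return [line.rstrip("\n") for line in file if line.strip()]


# An in-memory stream stands in for the data that echo sends through the pipe
piped = io.StringIO("b026324c6904b2a9cb4b88d6d61c81d1  file1\n")
lines = read_checksum_lines("-", stdin=piped)
print(lines)
```

Passing a regular path instead of "-" reads the same kind of lines from a file.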

Now, let’s test against a different MD5 hash for file1:

$ echo "ef7ab26f9a3b2cbd35aa3e7e69aad86c  file1" | md5sum -c -
file1: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

As expected, we get a message that the computed hash and the provided one don’t match.

4. Verify MD5 Hashes for Multiple Files

One way to verify MD5 checksums for multiple files is by using the md5sum -c command. Another approach is to use the hashlib module in Python to carry out the verification. Let’s delve into both approaches.

4.1. Using md5sum -c

We can use the md5sum command to verify MD5 hashes for multiple files. In particular, the -c option of md5sum can accept a text file as input. Each line of the text file should contain an MD5 hash followed by two space characters and a filename.

For each line of the text file, the command computes the hash of the named file and compares it against the listed value.

As an example, let’s use md5sum to compute the hash values of the files in our current directory and save the result to a file named md5sum.txt:

$ md5sum file* | tee md5sum.txt
b026324c6904b2a9cb4b88d6d61c81d1  file1
26ab0db90d72e28ad0ba1e22ee510510  file2
6d7fce9fee471194aa8b5b6e47267f03  file3
48a24b70a0b376535542b996af517398  file4
1dcca23355272056f04fe8bf20edfce0  file5

Here, we use the tee command to simultaneously display the output on stdout and write it to the md5sum.txt file. We match file* rather than *, so that md5sum.txt itself can’t end up in the list, for example when we rerun the command.
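For comparison, the same checksum list can be produced from Python. The following is only a sketch under a few assumptions: checksum_line is a hypothetical helper, and the files are recreated in a temporary directory so that the example doesn’t touch real data:

```python
import hashlib
import os
import tempfile


def checksum_line(directory, filename):
    """Return 'hash, two spaces, filename' for one file in directory."""
    with open(os.path.join(directory, filename), "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return f"{digest}  {filename}"


# Work in a throwaway directory so the sketch does not touch real files
with tempfile.TemporaryDirectory() as workdir:
    # Recreate file1..file5 with the same content as the for loop in section 2
    for i in range(1, 6):
        with open(os.path.join(workdir, f"file{i}"), "w", newline="\n") as f:
            f.write(f"{i}\n")

    names = sorted(os.listdir(workdir))
    lines = [checksum_line(workdir, name) for name in names]

    # Save the list in the same layout as md5sum.txt
    with open(os.path.join(workdir, "md5sum.txt"), "w") as out:
        out.write("\n".join(lines) + "\n")

print("\n".join(lines))
```

The list of names is collected before md5sum.txt is created, so the checksum file never hashes itself.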

Next, we run the md5sum -c command with the md5sum.txt file as argument:

$ md5sum -c md5sum.txt
file1: OK
file2: OK
file3: OK
file4: OK
file5: OK

As expected, the computed hash values match the ones provided in the md5sum.txt file.

Let’s now modify the content of file2 by appending a new line to it:

$ echo "new line" >> file2

Since the content of file2 has changed, we expect its MD5 checksum to change as well.

So, let’s run the md5sum -c command once more over the md5sum.txt file:

$ md5sum -c md5sum.txt
file1: OK
file2: FAILED
file3: OK
file4: OK
file5: OK
md5sum: WARNING: 1 computed checksum did NOT match

This time, the output shows that the newly computed checksum for file2 doesn’t match the value found in md5sum.txt. The mismatch occurs since the content of file2 has changed, whereas its MD5 checksum in md5sum.txt still refers to the older version of the file.

4.2. Using Python

Alternatively, we can use a Python script to verify every hash and filename pair listed in a text file such as md5sum.txt.

To demonstrate, let’s retain the changes made to file2 and write a Python script named verify_hashes.py to process the md5sum.txt file:

$ cat verify_hashes.py
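#!/usr/bin/env python3
import sys
import hashlib

# Path to the checksum list, such as md5sum.txt
path_to_file = sys.argv[1]
with open(path_to_file, "r") as file:
    for line in file:
        # Skip blank lines, which would otherwise break the unpacking below
        if not line.strip():
            continue
        # Split only once so that filenames containing spaces stay intact
        expected_hash, filename = line.strip().split(maxsplit=1)
        # Read in binary mode so the hash covers the exact bytes on disk
        with open(filename, "rb") as file_to_check:
            data = file_to_check.read()
        computed_hash = hashlib.md5(data).hexdigest()
        if computed_hash == expected_hash:
            print(f"{filename}:  OK")
        else:
            print(f"{filename}:  FAILED")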

The script starts with a Python shebang directive and implements several steps:

import the sys and hashlib modules for reading command-line arguments and computing MD5 checksums, respectively
save the first command-line argument of the script in a variable named path_to_file and open that file
read each line of the file, strip any surrounding whitespace, and split the line using whitespace as a delimiter
save the first part of the split line in the expected_hash variable and the second part in the filename variable
open the file specified by the filename variable for reading in binary mode, and save its content in a variable named data
use hashlib.md5() together with the hexdigest() method to compute the MD5 hash of the content stored in the data variable as a hexadecimal string
print an OK message if the computed hash matches the value of the expected_hash variable; otherwise, print a FAILED message
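One point worth noting in the steps above is that read() loads each file fully into memory, which may be costly for very large files. A possible alternative, sketched here with the hypothetical names md5_of_file and chunk_size, feeds the hash object one chunk at a time:

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    md5 = hashlib.md5()
    with open(path, "rb") as file_to_check:
        # iter() with a sentinel calls read() until it returns b"" at end of file
        for chunk in iter(lambda: file_to_check.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


# Demonstration on a temporary file holding the same bytes as file1
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "file1")
    with open(sample, "wb") as f:
        f.write(b"1\n")
    # A deliberately tiny chunk size forces several update() calls
    chunked_digest = md5_of_file(sample, chunk_size=1)

print(chunked_digest == hashlib.md5(b"1\n").hexdigest())
```

Repeated update() calls produce the same digest as hashing all the bytes at once, so the chunk size affects only memory use, not the result.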

Let’s grant the script execute permissions using chmod:

$ chmod u+x verify_hashes.py

Finally, let’s run the script with the md5sum.txt file as argument:

$ ./verify_hashes.py md5sum.txt
file1:  OK
file2:  FAILED
file3:  OK
file4:  OK
file5:  OK

This way, we obtain an output similar to that of the md5sum -c command.
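Unlike md5sum -c, the script above prints no summary, and it stops with an exception if a listed file is missing. As a possible extension, the sketch below collects the failing filenames and derives an exit status from them. The function name verify_all and its directory parameter are hypothetical, and the demonstration runs in a temporary directory:

```python
import hashlib
import os
import tempfile


def verify_all(checksum_lines, directory="."):
    """Return the filenames whose computed MD5 differs from the expected one."""
    failed = []
    for line in checksum_lines:
        if not line.strip():
            continue  # ignore blank lines
        expected_hash, filename = line.strip().split(maxsplit=1)
        try:
            with open(os.path.join(directory, filename), "rb") as file_to_check:
                computed_hash = hashlib.md5(file_to_check.read()).hexdigest()
        except OSError:
            computed_hash = None  # a missing or unreadable file counts as a failure
        if computed_hash != expected_hash:
            failed.append(filename)
    return failed


# Demonstration: file1 is intact, file2 has an extra line, file3 is missing
checksums = [
    "b026324c6904b2a9cb4b88d6d61c81d1  file1",
    "26ab0db90d72e28ad0ba1e22ee510510  file2",
    "6d7fce9fee471194aa8b5b6e47267f03  file3",
]
with tempfile.TemporaryDirectory() as workdir:
    with open(os.path.join(workdir, "file1"), "wb") as f:
        f.write(b"1\n")
    with open(os.path.join(workdir, "file2"), "wb") as f:
        f.write(b"2\nnew line\n")
    failed = verify_all(checksums, directory=workdir)

print(failed)
# A script could pass this value to sys.exit() to report failure to the shell
exit_code = 1 if failed else 0
```

Returning a non-zero status on any mismatch lets other shell commands react to a failed verification, much as they can with md5sum -c.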

5. Conclusion

In this article, we explored how to verify the MD5 checksum both for a single file and for a large number of files. In particular, we used the -c option of the md5sum command to check computed hash values against published ones. The command can read the list of hash values and filenames from a here-string, from a pipe, or from a text file.

Alternatively, we can use the hashlib module in a Python script to achieve the same result. The script parses the expected hash values and filenames from a text file. Then, it compares the calculated hash values to the expected ones.
