
1. Overview

In this article, we’ll talk about different ways to compare binary files in Linux. We may need this when investigating different files for data recovery, reverse engineering, and other programming problems.

2. Problem Statement

To illustrate the problem, we first need two binary files to compare. We can generate them by running the echo command with two options. The -n flag suppresses the trailing newline, and the -e flag enables the interpretation of backslash escapes, so we can specify each byte by its hexadecimal value (\xHH):

$ echo -n -e \\x41\\x2e\\x42\\x2e\\x43\\x2e\\x44\\x2e > binary_file_1.bin
$ echo -n -e \\x41\\x2e\\x41\\x2e\\x42\\x2e\\x42\\x2e > binary_file_2.bin

Because the arguments are unquoted, we need the double backslash so that the shell passes a literal backslash on to echo.

With the previous commands, we’ve created two binary files, binary_file_1.bin and binary_file_2.bin. In the first one, we’ve stored the string “A.B.C.D.” and in the second one, we have the string “A.A.B.B.”. Thus, they differ in three characters: “BCD” versus “ABB”.

In this example, knowing the content of the files will help the discussion and ease the understanding of the tools. However, in real cases, the content is unknown: It is indeed what we want to discover!
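As a hedged alternative, the same two files can be produced with printf and octal escapes, which behave the same way across POSIX shells, whereas the echo flags vary between shells. The scratch directory and the variable names below are assumptions made for illustration only:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Octal escapes: \101 = A, \102 = B, \103 = C, \104 = D, \056 = "."
printf '\101\056\102\056\103\056\104\056' > binary_file_1.bin
printf '\101\056\101\056\102\056\102\056' > binary_file_2.bin

# Each file should hold exactly 8 bytes, with no trailing newline.
size_1=$(wc -c < binary_file_1.bin | tr -d ' ')
size_2=$(wc -c < binary_file_2.bin | tr -d ' ')
echo "binary_file_1.bin: $size_1 bytes"
echo "binary_file_2.bin: $size_2 bytes"

# od -c prints the bytes as characters for a quick sanity check.
od -An -c binary_file_1.bin
od -An -c binary_file_2.bin
```

Checking the size with wc -c is a quick way to confirm that no stray newline ended up in the files.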

3. Solutions

In general, we can split the problem into two parts. The first part consists of converting the binary information to something meaningful that we can compare. For this, we’ll use tools such as od, xxd, or hexdump.

The second part of the problem is actually comparing the information that we obtained before. There are multiple tools that we can use for that such as diff or vimdiff. We show these tools in the first three methods, although we can use them in other combinations to further customize the output and formatting.

The fourth and last solution inverts this process. It first compares the files and then converts the binary information to strings. The tools for this solution are cmp and gawk.
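The two-step idea (convert first, then compare) can be sketched as a small shell function. The name hexdiff, the temporary files, and the sample file names are hypothetical and only serve as an illustration of the pattern, not as a definitive implementation:

```shell
# hexdiff FILE1 FILE2 (hypothetical helper): dump both files as hexadecimal
# text, compare the dumps, and return the exit status of diff.
hexdiff() {
  hexdiff_dump_1=$(mktemp) && hexdiff_dump_2=$(mktemp) || return 2
  # Step 1: convert the binary data to comparable text.
  od -An -v -tx1 "$1" > "$hexdiff_dump_1"
  od -An -v -tx1 "$2" > "$hexdiff_dump_2"
  # Step 2: compare the text (diff returns 0 if equal, 1 if different).
  diff "$hexdiff_dump_1" "$hexdiff_dump_2"
  hexdiff_status=$?
  rm -f "$hexdiff_dump_1" "$hexdiff_dump_2"
  return "$hexdiff_status"
}

# Usage on two small sample files in a scratch directory.
workdir=$(mktemp -d)
printf 'A.B.C.D.' > "$workdir/one.bin"
printf 'A.A.B.B.' > "$workdir/two.bin"

if hexdiff "$workdir/one.bin" "$workdir/two.bin"; then
  result="identical"
else
  result="different"
fi
echo "The files are $result"
```

Calling the function inside an if statement keeps the non-zero exit status of diff from aborting scripts that run with set -e.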

3.1. od with diff

The od command (which stands for octal dump) can be used to convert a binary file to a hexadecimal file. Thus, we can proceed with our two files as:

$ od -tx1 -v binary_file_1.bin > hexa_file_1.hd 
$ od -tx1 -v binary_file_2.bin > hexa_file_2.hd 

We now have two hexadecimal files, created from our binary ones. By default, od collapses consecutive identical output lines into a single asterisk (*) line. The -v flag disables this suppression, which could otherwise confuse diff.
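To see the asterisk suppression that -v turns off, the following sketch dumps a scratch file of 64 zero bytes with and without the flag. The temporary file and the variable names are illustrative assumptions:

```shell
zeros=$(mktemp)
# 64 NUL bytes fill four identical 16-byte rows of the dump.
dd if=/dev/zero of="$zeros" bs=16 count=4 2>/dev/null

echo "Without -v (repeated rows collapse into a single '*' line):"
od -tx1 "$zeros"

echo "With -v (every row is printed):"
od -tx1 -v "$zeros"

# Count the '*' marker lines in each dump; grep -c exits with status 1
# when it finds no match, hence the || true guard.
stars_default=$(od -tx1 "$zeros" | grep -c '^\*$' || true)
stars_full=$(od -tx1 -v "$zeros" | grep -c '^\*$' || true)
echo "marker lines: default=$stars_default, with -v=$stars_full"
rm -f "$zeros"
```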

The od command is used with the -tx1 flag to specify the format (requested with -t) of hexadecimal (x) with just one (1) byte per block. Other common options are -tx2, which outputs two bytes per block in hexadecimal format and, on little-endian machines, shows the bytes of each block in swapped order, and -to1, which outputs one byte per block in octal format.

If we expect bytes to be added and/or removed, there are other useful flags that we can use for improved formatting. These include -An, which removes the address column, and -w1 (a GNU extension), which puts just one byte per line, so that an inserted or removed byte doesn’t shift the content of every following line when using diff.
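As a rough illustration of why one byte per line helps, the sketch below inserts a single extra byte (X) into a copy of the data and compares the two dumps. It assumes GNU od, since -w isn't part of POSIX, and the file names are made up for the example:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'A.B.C.D.' > original.bin
printf 'A.XB.C.D.' > inserted.bin   # same data with one extra byte (X)

# One byte per line and no address column, so lines stay comparable
# even after the insertion shifts every following offset.
od -An -v -tx1 -w1 original.bin > original.txt
od -An -v -tx1 -w1 inserted.bin > inserted.txt

# diff exits with status 1 when the inputs differ, hence the || true guard.
changes=$(diff original.txt inserted.txt || true)
echo "$changes"
```

With this layout, diff reports only the inserted byte instead of flagging every line after the insertion point.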

Once we have our hexadecimal files, we can compare them with diff as:

$ diff hexa_file_1.hd hexa_file_2.hd
1c1
< 0000000 41 2e 42 2e 43 2e 44 2e
---
> 0000000 41 2e 41 2e 42 2e 42 2e

We could’ve run both commands without creating new files by using process substitution, which lets diff read the output of each od command as if it were a file. To illustrate this and also show the -tx2 flag, we have another example where both features are included:

$ diff <(od -tx2 -v binary_file_1.bin) <(od -tx2 -v binary_file_2.bin)
1c1
< 0000000 2e41 2e42 2e43 2e44
---
> 0000000 2e41 2e41 2e42 2e42

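Note that, although we get the same information, the two bytes within each block appear swapped (2e41 instead of 41 2e). This is because od prints each two-byte block as a 16-bit number in the byte order of the host, which is little-endian on x86 machines.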
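Whether the two-byte blocks appear swapped depends on the byte order of the machine. A minimal sketch to check this feeds od two known bytes and inspects how they are grouped (the variable name is arbitrary):

```shell
# Feed the bytes 0x01 0x02 to od and read them back as one 16-bit block.
word=$(printf '\001\002' | od -An -tx2 | tr -d ' \n')

case "$word" in
  0201) echo "little-endian host: od -tx2 shows each byte pair swapped" ;;
  0102) echo "big-endian host: od -tx2 shows the bytes in file order" ;;
  *)    echo "unexpected output: $word" ;;
esac
```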

3.2. hexdump with diff

Following the last snippet where we used process substitution, we can combine diff with hexdump to compare the binary files:

$ diff <(hexdump binary_file_1.bin) <(hexdump binary_file_2.bin)
1c1
< 0000000 2e41 2e42 2e43 2e44 
---
> 0000000 2e41 2e41 2e42 2e42 

The result is the same as with the previous solution, displaying the consistency of this approach.

A common diff option is the -y flag (or the longer version --side-by-side), which displays both files next to each other, making multiple-line files easier to read:

$ diff -y <(hexdump binary_file_1.bin) <(hexdump binary_file_2.bin)
0000000 2e41 2e42 2e43 2e44 | 0000000 2e41 2e41 2e42 2e42 
0000008                       0000008

3.3. xxd with diff

We can also replace hexdump (or od) with xxd:

$ diff -y <(xxd binary_file_1.bin) <(xxd binary_file_2.bin)
00000000: 412e 422e 432e 442e A.B.C.D. | 00000000: 412e 412e 422e 422e A.A.B.B.

The output format differs again. By inspection, we can see that the byte order within each block of two bytes is swapped: with hexdump, we have 2e41, while with xxd, we have 412e. This is a matter of endianness: by default, hexdump displays the data as 16-bit words in the byte order of the host (little-endian on x86), whereas xxd prints the bytes in the order in which they appear in the file.

We can see that after each string of hexadecimal, we’re also obtaining the ASCII representation of the text with which we began. This is something that can be achieved with the -C flag in hexdump.

We could replace diff with vimdiff to navigate the file for easier comparison (in this and all previous solutions):

$ vimdiff <(xxd binary_file_1.bin) <(xxd binary_file_2.bin)

3.4. cmp with gawk

In this last solution, we first compare the two binary files byte by byte with the cmp tool:

$ cmp -l binary_file_1.bin binary_file_2.bin
3 102 101
5 103 102
7 104 102

As we were expecting, there are three locations where the two files differ. We’ve used the -l flag to display where the files are different because the basic mode of cmp is to just state whether the files differ or not.
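For scripts, the basic mode is often all we need. The sketch below relies on the exit status of cmp (0 when the files are identical, 1 when they differ, and greater than 1 on errors) and then counts the differing bytes reported by -l. The scratch directory and the variable names are assumptions for the example:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'A.B.C.D.' > binary_file_1.bin
printf 'A.A.B.B.' > binary_file_2.bin

# -s silences all output, so only the exit status is used.
if cmp -s binary_file_1.bin binary_file_2.bin; then
  verdict="identical"
else
  verdict="different"
fi
echo "The files are $verdict"

# cmp -l prints one line per differing byte, so counting lines counts them.
mismatches=$(cmp -l binary_file_1.bin binary_file_2.bin | wc -l | tr -d ' ')
echo "Differing bytes: $mismatches"
```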

However, cmp -l prints the byte offset in decimal and the differing byte values in octal, and hexadecimal values are usually easier to deal with. Thus, we can use the gawk command to convert the second and third columns to hexadecimal:

$ cmp -l binary_file_1.bin binary_file_2.bin | gawk '{printf "%08X %02X %02X\n", $1, strtonum(0$2), strtonum(0$3)}'
00000003 42 41
00000005 43 42
00000007 44 42
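The strtonum function is specific to gawk. If only a POSIX awk is available, a possible workaround is to convert the octal values by hand. The oct2dec helper below is a hypothetical name introduced for this sketch:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'A.B.C.D.' > binary_file_1.bin
printf 'A.A.B.B.' > binary_file_2.bin

report=$(cmp -l binary_file_1.bin binary_file_2.bin | awk '
  # oct2dec (hypothetical helper): turn a string of octal digits into a number.
  function oct2dec(s,    i, n) {
    n = 0
    for (i = 1; i <= length(s); i++)
      n = n * 8 + substr(s, i, 1)
    return n
  }
  # Column 1 is the decimal offset; columns 2 and 3 are octal byte values.
  { printf "%08X %02X %02X\n", $1, oct2dec($2), oct2dec($3) }
')
echo "$report"
```

The output should match the three lines produced by the gawk version.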

3.5. Which Solution to Choose?

As we’ve seen, the different solutions report the same differences and vary mainly in how they present them. Therefore, we should choose based on two main factors: whether the required tools are installed on our system and the output format we want.
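To check the first factor, a short loop can report which of the tools from this article are present on the current system. This is only a sketch, and the variable names are arbitrary:

```shell
available=""
missing=""
for tool in od hexdump xxd diff vimdiff cmp gawk; do
  # command -v succeeds only if the shell can find the tool.
  if command -v "$tool" > /dev/null 2>&1; then
    available="$available $tool"
  else
    missing="$missing $tool"
  fi
done
echo "available:${available:- none}"
echo "missing:${missing:- none}"
```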

4. Conclusion

In this article, we’ve discussed different solutions to compare the content of two binary files. After defining the example case, we’ve presented four solutions with their commands and how to interpret their output.
