1. Introduction

Binary files and non-ASCII files can be challenging to work with because they can contain data that isn’t in a human-readable format. However, we might need to search for specific strings or patterns in these types of files, particularly in software development or system administration processes.

Luckily, there are several powerful command-line tools available in Linux that can help us. In this tutorial, we’ll discuss how to use some of these tools.

2. Preparing the Binary File

The first thing we’ll do is prepare a binary file that will serve as our working file throughout this tutorial.

To build an executable binary, let’s create a simple C program, baeldung.c, that prints the string BAELDUNG!!!, and compile it with GCC:

$ echo '#include <stdio.h>
 int main(void) {
     puts("BAELDUNG!!!");
     return 0;
 }' > baeldung.c && gcc -o baeldung baeldung.c

This produces a binary file named baeldung. Let’s take a look at it with cat to see what it includes:

$ cat baeldung
��GNU��e�m= ``�-�=�=X`�-�=�=�888 XXXDDS�td888 P�td   DDQ�tdR�td�-�=�=HH/lib64/ld-linux-x86-64.so.2GNU�GNU��Q�/�{I��7C
            Y h "libc.so.6puts__cxa_finalize__libc_start_mainGLIBC_2.2.5_ITM_deregisterTMCloneTable__gmon_start___ITM_registerTMCloneTableu�i	1�@�?�?�?�?�?�?��H�H��/H�H�=��R/��H�=y/H�r/H9�tH�./H��t%�����H�=I/H�5B/H)�H��H��?H��H�H��tH�/H����fD�����=/u+UH�=�.H��t
                                                                                                                           ����/D$4���� FJ
D�h���eF�I�E �E(�D0�H8�G@n8A0A(B BB�����@
 �?��������o���o���o����o�=@GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.08X|���	

As we can see, most of the output isn’t human-readable because it consists largely of non-printable, non-ASCII bytes, although a few text fragments such as library and symbol names are visible.

3. Using the strings Command to Extract the Text

One of the most common ways to extract human-readable text from a binary file is the strings command-line utility, which prints the sequences of printable characters it finds in the file:

$ strings baeldung
GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0


The strings command has several useful options we can utilize:

  • -a scans the whole file instead of only specific sections of object files; this is normally the default behavior of GNU strings, so it makes no difference for our executable binary
  • -f prints the file name before each string
  • -n sets the minimum length of a character sequence to be printed (the default is 4 characters)
  • -t prints the offset of each string within the file in the given radix: o for octal, d for decimal, or x for hexadecimal

Let’s get only the strings that are at least 10 characters long in our binary and add the filename and offset data before each string:

$ strings -t d -f -n 10 baeldung
baeldung:     792 /lib64/ld-linux-x86-64.so.2
baeldung:    1152 __cxa_finalize
baeldung:    1167 __libc_start_main
baeldung:    1185 GLIBC_2.2.5
baeldung:    1197 _ITM_deregisterTMCloneTable
baeldung:    1225 __gmon_start__
baeldung:    1240 _ITM_registerTMCloneTable
baeldung:    4554 []A\A]A^A_
baeldung:    8196 BAELDUNG!!!
baeldung:   12304 GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
baeldung:   13913 crtstuff.c
baeldung:   13924 deregister_tm_clones
baeldung:   13945 __do_global_dtors_aux
baeldung:   13967 completed.8061
baeldung:   13982 __do_global_dtors_aux_fini_array_entry
baeldung:   14021 frame_dummy
baeldung:   14033 __frame_dummy_init_array_entry
baeldung:   14064 baeldung.c
baeldung:   14075 __FRAME_END__


Now, we can find a specific string with the help of grep:

$ strings -t d baeldung | grep -i baeldung
   8196 BAELDUNG!!!
  14064 baeldung.c

As we can see, the matches are the string that our program prints and the name of the C source file.
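
To get a feel for what strings does, the following sketch approximates it with tr and grep: every byte that isn’t printable ASCII (or a tab) is turned into a newline, and only runs of at least four characters are kept. The sample file and its contents are made up for illustration, and the pipeline is only a rough stand-in for strings, not its actual implementation:

```shell
# Hypothetical sample file: two readable strings surrounded by binary noise
sample=$(mktemp)
printf '\001\002abc\000BAELDUNG!!!\000\377hi\000baeldung.c\000' > "$sample"

# Replace every byte that is not printable ASCII (or a tab) with a newline,
# then keep only the runs that are at least 4 characters long
found=$(LC_ALL=C tr -c '[:print:]\t' '\n' < "$sample" | grep -E '.{4,}')
printf '%s\n' "$found"

rm -f "$sample"
```

Here, the short runs abc and hi are dropped, while BAELDUNG!!! and baeldung.c are printed. Changing {4,} to {10,} roughly mimics strings -n 10.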

4. Using the od Command

Similarly, we can use the od command to extract strings from a binary file. By default, od dumps the contents of a file in octal, and it can also display them in decimal, hexadecimal, or ASCII character format.

With the -S option, however, od prints only the strings that contain at least the specified number of characters:

$ od -S 10 baeldung
0001430 /lib64/ld-linux-x86-64.so.2
0002200 __cxa_finalize
0002217 __libc_start_main
0002241 GLIBC_2.2.5
0002255 _ITM_deregisterTMCloneTable
0002311 __gmon_start__
0002330 _ITM_registerTMCloneTable
0020004 BAELDUNG!!!
0030020 GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
0033131 crtstuff.c
0033144 deregister_tm_clones
0033171 __do_global_dtors_aux
0033217 completed.8061
0033236 __do_global_dtors_aux_fini_array_entry
0033305 frame_dummy
0033321 __frame_dummy_init_array_entry
0033360 baeldung.c
0033373 __FRAME_END__
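
Compared with the strings output, the []A\A]A^A_ entry is missing here. GNU od only reports a sequence with -S when it’s terminated by a NUL byte, whereas strings accepts any run of printable characters. A minimal sketch of the difference, using a made-up sample file:

```shell
# Hypothetical sample file: the first run ends with byte 0xff,
# the second one ends with a NUL byte
sample=$(mktemp)
printf 'plain_run_of_text\377nul_terminated\000' > "$sample"

# od -S prints only the NUL-terminated run; -A d shows its offset in decimal
od_out=$(LC_ALL=C od -S 10 -A d "$sample")
printf '%s\n' "$od_out"

rm -f "$sample"
```

Only nul_terminated, at offset 18, is reported, even though plain_run_of_text is longer than the 10-character minimum.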


Like before, we can pipe the output to grep to search for a specific string:

$ od -S 10 baeldung | grep -i baeldung
0020004 BAELDUNG!!!
0033360 baeldung.c

We get the same matches, this time with the offsets in octal, which is od’s default. To match the format of our strings example, we can switch the offsets to decimal with the -A d option:

$ od -S 10 -A d baeldung | grep -i baeldung
0008196 BAELDUNG!!!
0014064 baeldung.c

As expected, the offsets 8196 and 14064 are the same as those reported by strings.
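
When we compare offsets across tools, the shell itself can convert between radixes. As a small sketch, printf treats a numeric argument with a leading 0 as octal and one with a leading 0x as hexadecimal, so we can check the offsets from the outputs above against each other (the variable names are only illustrative):

```shell
# Octal offset reported by od, converted to decimal
dec_from_octal=$(printf '%d' 0020004)
# Decimal offset reported by strings, converted to hexadecimal
hex_from_dec=$(printf '%x' 8196)

printf '%s %s\n' "$dec_from_octal" "$hex_from_dec"
```

This prints 8196 2004, so the octal offset 0020004, the decimal offset 8196, and the hexadecimal offset 0x2004 all refer to the same position in the file.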

5. Invoking GNU grep Alone

To search for specific strings or patterns in binary files, we can use the grep command with the -a flag. This instructs grep to process the binary file as if it were text, even if it includes non-ASCII characters. Without -a, GNU grep only reports that the binary file matches instead of printing the matching text.

Let’s find the same key string in our binary using grep:

$ grep -i -a -o 'baeldung' baeldung
BAELDUNG
baeldung

Here, we included the -i option to enable case-insensitive search and -o to print only the matched parts.
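
GNU grep can also report where each match is located. As a sketch, the -b option prefixes every match with its byte offset in the file, which pairs well with -o. The sample file below is made up so that the example is self-contained:

```shell
# Hypothetical sample file: two occurrences separated by binary bytes and a newline
sample=$(mktemp)
printf '\001\000\002\000BAELDUNG!!!\000\n\377\376baeldung.c\000' > "$sample"

# -a: treat the file as text, -b: print the byte offset,
# -o: print only the matched part, -i: ignore case
offsets=$(LC_ALL=C grep -a -b -o -i 'baeldung' "$sample")
printf '%s\n' "$offsets"

rm -f "$sample"
```

For this sample, grep prints 4:BAELDUNG and 19:baeldung, which are the decimal byte offsets of the two matches.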

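We can also count the matching lines in our binary file with the -c option:

$ grep -i -a -c 'baeldung' baeldung
2

The count is 2, so the string baeldung appears on two lines of the binary file. Notably, -c counts matching lines rather than individual matches, and in a binary file, a line is simply whatever lies between two newline bytes.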
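
Since grep -c counts matching lines, it can undercount when several occurrences share a line. A minimal sketch of the difference, assuming a made-up sample file with no newline byte between the two occurrences:

```shell
# Hypothetical sample file: both occurrences sit on the same "line"
sample=$(mktemp)
printf 'BAELDUNG!!!\000baeldung.c\000\n' > "$sample"

# -c counts matching lines
line_count=$(LC_ALL=C grep -a -i -c 'baeldung' "$sample")
# -o prints every match on its own line, so wc -l counts the matches
match_count=$(LC_ALL=C grep -a -i -o 'baeldung' "$sample" | wc -l)

printf 'lines: %s, matches: %s\n' "$line_count" "$match_count"

rm -f "$sample"
```

Here, -c reports one matching line, while piping the -o output to wc -l counts both matches.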

6. Combining hexdump and grep

To find the instances of a particular byte sequence in a binary file, we can combine hexdump and grep.

We first use hexdump to produce a hexadecimal dump of the binary and then pipe the output to grep. Since grep searches the text of the dump, we add the -C option, which appends an ASCII column to each row, so that we can search for a readable string:

$ hexdump -C baeldung | grep -i baeldung
00002000  01 00 02 00 42 41 45 4c  44 55 4e 47 21 21 21 00  |....BAELDUNG!!!.|
000036f0  62 61 65 6c 64 75 6e 67  2e 63 00 5f 5f 46 52 41  |baeldung.c.__FRA|

From the output above, it’s clear that the string baeldung occurs twice in the executable file.

However, hexdump prints 16 bytes per row, so a sequence that straddles a row boundary is split across two output lines, and grep won’t match it.
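
One way around the wrapping problem is to search a single continuous hex string instead of a row-based dump. The sketch below uses GNU od and tr for that, with a made-up sample file in which the string crosses the 16-byte row boundary. Because the search runs on hex digits, a hit at an odd position wouldn’t be byte-aligned, so the example checks for that:

```shell
# Hypothetical sample file: 14 filler bytes push "BAELDUNG!!!" across the row boundary
sample=$(mktemp)
printf 'xxxxxxxxxxxxxxBAELDUNG!!!\000' > "$sample"

# A row-based dump (hex bytes plus printable characters) splits the string,
# so grep counts 0 matching rows
row_hits=$(od -A x -t x1z -v "$sample" | grep -c 'BAELDUNG' || true)

# Dump the file as one continuous hex string and look for the bytes of "BAELDUNG"
hex=$(od -An -v -t x1 "$sample" | tr -d ' \n')
needle=4241454c44554e47
pos=$(printf '%s\n' "$hex" | grep -b -o "$needle" | head -n 1 | cut -d: -f1)

# Two hex digits per byte: an even position means a byte-aligned match
if [ -n "$pos" ] && [ $((pos % 2)) -eq 0 ]; then
    byte_offset=$((pos / 2))
    printf 'row hits: %s, byte offset: %s\n' "$row_hits" "$byte_offset"
fi

rm -f "$sample"
```

For this sample, the row-based search finds nothing, while the continuous search locates the sequence at byte offset 14. The sketch only inspects the first hit, so a more careful version would loop over all of them.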

7. Using rabin2

rabin2 is a command-line utility that can analyze and extract information from binary files. It’s commonly used for reverse engineering, malware analysis, and forensic analysis.

The tool is part of the radare2 package, which might not be installed by default, so we can use a package manager to install it:

$ sudo apt install radare2

We can use the -zz flag to get the strings from the whole binary file:

$ rabin2 -zz baeldung
nth paddr      vaddr      len size section   type    string
0   0x00000034 0x00000034 4   10             utf16le @8\r@
1   0x00000318 0x00000318 27  28   .interp   ascii   /lib64/ld-linux-x86-64.so.2
2   0x00000471 0x00000471 9   10   .dynstr   ascii   libc.so.6
3   0x0000047b 0x0000047b 4   5    .dynstr   ascii   puts
4   0x00000480 0x00000480 14  15   .dynstr   ascii   __cxa_finalize
5   0x0000048f 0x0000048f 17  18   .dynstr   ascii   __libc_start_main
6   0x000004a1 0x000004a1 11  12   .dynstr   ascii   GLIBC_2.2.5
7   0x000004ad 0x000004ad 27  28   .dynstr   ascii   _ITM_deregisterTMCloneTable
8   0x000004c9 0x000004c9 14  15   .dynstr   ascii   __gmon_start__
9   0x000004d8 0x000004d8 25  26   .dynstr   ascii   _ITM_registerTMCloneTable
10  0x0000110b 0x0000110b 4   5    .text     ascii   u+UH
11  0x000011c9 0x000011c9 11  12   .text     ascii   \b[]A\A]A^A_
12  0x00002004 0x00002004 11  12   .rodata   ascii   BAELDUNG!!!
13  0x00002068 0x00002068 4   5    .eh_frame ascii   \e\f\a\b
14  0x000020a7 0x000020a7 5   6    .eh_frame ascii   :*3$
15  0x000020f9 0x000020f9 4   5    .eh_frame ascii   R\f\a\b
16  0x00003010 0x00000000 42  43   .comment  ascii   GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
17  0x00003659 0x00000001 10  11   .strtab   ascii   crtstuff.c
18  0x00003664 0x0000000c 20  21   .strtab   ascii   deregister_tm_clones
19  0x00003679 0x00000021 21  22   .strtab   ascii   __do_global_dtors_aux
20  0x0000368f 0x00000037 14  15   .strtab   ascii   completed.8061
21  0x0000369e 0x00000046 38  39   .strtab   ascii   __do_global_dtors_aux_fini_array_entry
22  0x000036c5 0x0000006d 11  12   .strtab   ascii   frame_dummy
23  0x000036d1 0x00000079 30  31   .strtab   ascii   __frame_dummy_init_array_entry
24  0x000036f0 0x00000098 10  11   .strtab   ascii   baeldung.c
25  0x000036fb 0x000000a3 13  14   .strtab   ascii   __FRAME_END__


Above, we can see that the tool prints each string along with handy details such as its physical and virtual address, its length and size, its encoding, and the section of the binary file it’s located in.

Furthermore, we can retrieve the strings from only the data section in the binary file using the -z option:

$ rabin2 -z baeldung
nth paddr      vaddr      len size section type  string
0   0x00002004 0x00002004 11  12   .rodata ascii BAELDUNG!!!

Now, it only outputs the string from the data section of the binary.

8. Conclusion

In this article, we learned how to extract human-readable text from binary files that contain non-ASCII characters. We started by preparing our binary executable file and then demonstrated how to use the strings utility, the de facto standard and most straightforward method for extracting text from binary files.

We also explored other tools such as the od command and discussed various options that can enhance the capabilities of these tools. Furthermore, we demonstrated how to use grep and hexdump to search for strings.

Finally, we discussed the use of the rabin2 reverse engineering tool, which can provide us with valuable information such as strings and metadata associated with the binary file.