Searching and Matching Base64 Strings via grep and Perl Regular Expressions

1. Introduction

Base64 is an encoding standard for representing non-ASCII data with ASCII characters. In particular, Base64 is defined in RFC 4648 – The Base16, Base32, and Base64 Data Encodings.

In this tutorial, we’ll talk about Base64 and techniques to find Base64-encoded strings within different types of data. First, we go over the Base64 basics. After that, we construct a way to match Base64 strings. Next, we cover finding such strings within structured data. Finally, we turn to more challenging unstructured data examples with different tools.

We tested the code in this tutorial on Debian 12 (Bookworm) with GNU Bash 5.2.15. It should work in most POSIX-compliant environments unless otherwise specified.

2. Base64 Basics

The Base64 encoding uses a limited set of characters from the ASCII table to represent any data sequence.

2.1. Encoding

To understand how Base64 encodes binary and other non-ASCII data, let’s use an example:

$ printf 'България' | base64
0JHRitC70LPQsNGA0LjRjw==

In this case, we employ the standard and ubiquitous base64 utility without any flags to encode the string България, piped from the output of printf to stdin. Consequently, we see that the Base64 representation of България is 0JHRitC70LPQsNGA0LjRjw==.

Importantly, the = equals sign characters at the end are padding. Base64 encodes each group of three (3) input bytes as four (4) ASCII output characters, so the length of a padded Base64 string is always divisible by 4. The minimum number of data characters required for an encoding is two (2), when we’re only encoding one byte of input, i.e., the smallest non-empty data chunk. In that case, two = signs complete the group of four.
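To see the padding rule in action, we can encode inputs of one to four bytes. This is only a minimal sketch with arbitrary ASCII sample strings that don't come from the examples above:

```shell
# Encode inputs of increasing length to see how padding changes
for input in 'a' 'ab' 'abc' 'abcd'; do
  printf '%s -> ' "$input"
  printf '%s' "$input" | base64
done
# a -> YQ==
# ab -> YWI=
# abc -> YWJj
# abcd -> YWJjZA==
```

Here, only inputs with a length divisible by three produce output without padding.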

On the other hand, longer output strings get wrapped at 76 characters per line, as per the MIME standard:

$ printf 'Това изречение е на български език' | base64
0KLQvtCy0LAg0LjQt9GA0LXRh9C10L3QuNC1INC1INC90LAg0LHRitC70LPQsNGA0YHQutC4INC1
0LfQuNC6

Importantly, newlines are removed and ignored when reading in a Base64 string for decoding but aren’t skipped in the input string when encoding. This is critical because it means the encoded string can have newlines, but they don’t represent actual data.
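As a quick check of this behavior, the sketch below encodes 60 bytes of ASCII filler, which yields 80 Base64 characters. It assumes the GNU coreutils version of base64, where -w 0 disables wrapping; other implementations may use a different flag or none at all:

```shell
# 60 bytes of input produce 80 Base64 characters, more than one 76-column line
input=$(printf 'A%.0s' {1..60})

wrapped=$(printf '%s' "$input" | base64)       # default: wrap at 76 columns
single=$(printf '%s' "$input" | base64 -w 0)   # GNU option: no wrapping

printf '%s\n' "$wrapped" | wc -l   # 2 lines
printf '%s\n' "$single" | wc -l    # 1 line

# The embedded newline is ignored when decoding
if [ "$(printf '%s\n' "$wrapped" | base64 --decode)" = "$input" ]; then
  echo 'round trip ok'
fi
```

The filler string and variable names are just placeholders for illustration.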

2.2. Decoding

To ensure the operation preserved the original data, let’s decode the result via base64 and its --decode (-d) flag:

$ printf '0JHRitC70LPQsNGA0LjRjw==' | base64 --decode
България
$ printf '0KLQvtCy0LAg0LjQt9GA0LXRh9C10L3QuNC1INC1INC90LAg0LHRitC70LPQsNGA0YHQutC4INC1
0LfQuNC6' | base64 --decode
Това изречение е на български език

Thus, we see the resulting strings match the original inputs from earlier.

Under the hood, encoding and decoding follow a fixed mapping between 8-bit bytes and 6-bit Base64 characters:

        1       2       3
8-bit:  111111112222222233333333
Base64: 111111222222333333444444
        1     2     3     4

Since the first 8-bit byte (11111111) spans the whole first Base64 character (111111) and part of the second (222222), we verify that two Base64 characters are the absolute minimum for encoding any data.
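To make the bit mapping more concrete, we can sketch the encoding of a single three-byte group with Bash arithmetic. The encode_triplet function is a hypothetical helper written only for illustration. It assumes exactly three ASCII characters of input and isn't how the base64 utility is implemented:

```shell
#!/usr/bin/env bash
# The 64 data characters of the standard Base64 alphabet, in index order
alphabet='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/'

# Hypothetical helper: encode exactly three ASCII characters
encode_triplet() {
  local s=$1 b1 b2 b3 n idx pos out=''
  # Get the numeric value of each byte
  b1=$(printf '%d' "'${s:0:1}")
  b2=$(printf '%d' "'${s:1:1}")
  b3=$(printf '%d' "'${s:2:1}")
  # Join the three bytes into one 24-bit number
  n=$(( (b1 << 16) | (b2 << 8) | b3 ))
  # Cut the 24 bits into four 6-bit indexes into the alphabet
  for pos in 18 12 6 0; do
    idx=$(( (n >> pos) & 63 ))
    out+=${alphabet:idx:1}
  done
  printf '%s\n' "$out"
}

encode_triplet 'Man'   # TWFu, the same as: printf 'Man' | base64
```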

2.3. Character Set

In practice, Base64 uses several character ranges for the translation:

A-Z
a-z
0-9
+
/
\n newline (only as a line separator, ignored when decoding)
= (only for padding at the end, when necessary)

We can call these the Base64-ASCII subset.
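As a simple sanity check, we can delete every character of this subset from some encoded output and count what remains. The sample string here is arbitrary:

```shell
# Remove all characters of the Base64-ASCII subset and count the leftovers
leftover=$(printf 'Hello, World!' | base64 | tr -d 'A-Za-z0-9+/=\n' | wc -c)
echo "$leftover"   # 0
```

A result of 0 means the output contains no characters outside the subset.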

3. Base64 String Matching

Now, armed with the syntax and format of Base64 strings and the way we encode and decode them, we can construct a regular expression (regex) that finds potential matches for a Base64 string.

3.1. Base64 Regular Expression

To build our regular expression, we use POSIX Extended Regular Expressions (ERE) syntax:

((([A-Za-z0-9+/]{4})*)([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==))

Let’s break this down:

[A-Za-z0-9+/] character group matches any Base64 character (in the PCRE variants below, we write \/ to escape the forward slash, which is only strictly needed when / also delimits the regular expression, as in Perl)
numbers within {} curly braces specify how many times to expect a match for the last group or character
* asterisk matches the last group or character zero or more times
() parentheses denote a grouping
| pipe symbol separates alternative matches within a group or the whole regular expression
the literal = equals character matches itself

In other words, we match zero or more groups of four (4) Base64 characters, followed by either one such group, a group with three (3) Base64 characters and an equals sign, or a group with two (2) Base64 characters and two equals signs. Barring newlines, this regular expression should provide a reliable way to match Base64 strings in unstructured data.
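To try the expression on whole strings, we can wrap it in a small check. The is_base64 function and the sample candidates below are hypothetical, and the -x flag of grep makes the pattern match the entire line:

```shell
b64_ere='(([A-Za-z0-9+/]{4})*)([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)'

# Hypothetical helper: succeed only if the whole argument looks like Base64
is_base64() {
  printf '%s\n' "$1" | grep -Eqx "$b64_ere"
}

for candidate in 'SGVsbG8sIFdvcmxkIQ==' 'YWJj' 'YWI' 'YQ=' 'not base64!'; do
  if is_base64 "$candidate"; then
    echo "match:    $candidate"
  else
    echo "no match: $candidate"
  fi
done
```

Only the first two candidates should match, since the others have a wrong length, wrong padding, or characters outside the alphabet.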

If our environment supports Perl Compatible Regular Expressions (PCRE), we can enhance the above with lookahead and lookbehind:

((?<![A-Za-z0-9+\/])(([A-Za-z0-9+\/]{4})*)([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)(?![=A-Za-z0-9+\/]))

Here, we add (?<![A-Za-z0-9+\/]) to prevent any Base64 character preceding our matched string. On the other hand, the (?![=A-Za-z0-9+\/]) group at the end avoids following the match with a Base64 character or =.
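To see what the lookaround groups change, let’s compare both variants on a seven-character word. This sketch assumes GNU grep built with PCRE support for the -P flag, and the input strings are made up for the comparison:

```shell
ere='(([A-Za-z0-9+/]{4})*)([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)'
pcre='(?<![A-Za-z0-9+\/])(([A-Za-z0-9+\/]{4})*)([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)(?![=A-Za-z0-9+\/])'

# Without lookarounds, the first four characters of a longer word match
printf 'abcdefg\n' | grep -oE "$ere"                       # abcd

# With lookarounds, the match has to cover the whole run of Base64 characters
printf 'abcdefg\n' | grep -oP "$pcre" || echo '(no match)'

# A properly delimited string still matches
printf 'x SGVsbG8sIFdvcmxkIQ== y\n' | grep -oP "$pcre"     # SGVsbG8sIFdvcmxkIQ==
```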

3.2. Newlines

Notably, with the regular expression above, we don’t consider newlines as connectors of a single Base64 string. Thus, we can break some valid matches, introducing false negatives. On the other hand, preprocessing by removing line breaks can lead to false positives.

How this is handled depends on the application, data, and implementation decisions.
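Both effects can be sketched with the tr command as the preprocessing step. The filler input and the two-word sample are assumptions chosen only to trigger each case:

```shell
b64_ere='(([A-Za-z0-9+/]{4})*)([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)'
input=$(printf 'A%.0s' {1..60})   # encodes to 80 characters, i.e., two lines

# Line by line, one wrapped string is reported as two separate matches
split_count=$(printf '%s' "$input" | base64 | grep -Ecx "$b64_ere")
echo "$split_count"    # 2

# After removing line breaks, it's a single match again
joined_count=$(printf '%s' "$input" | base64 | tr -d '\n' | grep -Ecx "$b64_ere")
echo "$joined_count"   # 1

# However, joining lines can also create matches that weren't there before:
# neither 'the' nor 'words' matches as a whole line, but 'thewords' does
printf 'the\nwords\n' | tr -d '\n' | grep -Eox "$b64_ere"   # thewords
```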

4. Detect Base64 Strings in Structured Data

When searching through data, the structure of that data is critical for picking the optimal method.

4.1. Mail Messages

Base64 strings often appear in e-mail messages with Multipurpose Internet Mail Extensions (MIME):

Delivered-To: [email protected]
Received: by 2002:abe:0eba:0:0:0:0:0 with SMTP id cp7csp6661001ecb;
        Mon, 20 Oct 2020 04:44:44 -0800 (PST)
X-Received: by 2002:a01:6660:1010:: with SMTP id c9mr346662191ox.55.1666130043186;
        Mon, 20 Oct 2020 04:44:44 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1697777084; cv=none;
        d=gerganov.com; s=arc-20160816;
        b=ogoVPQUiHTUr1LpEnBqsJxUxA5L666By+akUAuK769l5Kr67BQ78WwS6QV7P/i5SpR
         A2+hbfjXn/ZYU29kyJ8sic40yEmBgWkht8XkSGg6ANQpGEiUiEyLO5bfkW6yQzL8BlIO
         QW10lNE8SDm4S7rE5R666xxThPWosAurobGcWCM3IjBAJ5elhdxI0pXol+3L9p2xAidd
         BM5zUe9VA8f0HM3MX4eQilcyM6hmrdBqSSOLqE6663e5ioazcg1z6ZkTMPh5KD1tOrd0
         7n5E91c0NL320bxJfQwffWTz9PLy01t2Eexefl+j/qkemPgjVS5zp+o6zo7f2rLf2uYn
         M8Mw==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=gerganov.com; s=arc-20160816;
        h=to:from:subject:message-id:feedback-id:reply-to:date:mime-version
         :dkim-signature;
        bh=rq61IfQwvEPEGAkwWOqrPBDo1qwY2aEc5eqH02LPDUY=;
        b=h0HXqYzF65mV0YbEsGg6ANQpGEiUiEyLO2bmSY2ZDAi2mo1qwY2aEc5eSJmuVHaBVA
         cnoePTmwi/1c0NL320bxJvEyKp96mCtGnFDrOe75PypZNWoE7YMYgsmOcKYpPn/R+ZB3
         h9S9RmkDPgXnqtEL9iEK5nwGzbZz+v8+4IqCinWBCe5ZYAoeHv2TWruqU7vFjnUGWokn
         qDWBX+kCRPnoxJfQwffWTz9PLy01te3AqDHLmmDIBT7POwHxYVBdwuyDsYccacNZ420K
         IeJBqfCsiqRi5LJpl+STY1XeZe+N18rlexCmrm+u3gZkzAeBI/9fqQJi2TX2PHGRwBbJ
         GuDg==
[...]

So, we can see a clear pattern with field names followed by : colons, which may introduce further field = assignments such as b=, which introduces Base64 data. Knowing this, we can process accordingly:

remove unnecessary whitespace
extract each top-level field and value
extract each secondary field and value
match Base64 string
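The steps above can be sketched with standard tools. The sample message, the extract_b_tag function, and the decision to drop all folding whitespace are assumptions made for illustration, so this isn't a complete mail header parser:

```shell
# Hypothetical sample with a folded header, modeled on the structure above
mail_file=$(mktemp)
cat > "$mail_file" <<'EOF'
ARC-Seal: i=1; a=rsa-sha256; cv=none;
        d=example.com; s=arc-20160816;
        b=SGVsbG8sIFdv
         cmxkIQ==
Subject: sample
EOF

# Hypothetical helper: print the b= value of a given top-level field
extract_b_tag() {
  # 1. join continuation lines and drop their leading whitespace
  awk '
    /^[ \t]/ { sub(/^[ \t]+/, ""); buf = buf $0; next }
    { if (buf != "") print buf; buf = $0 }
    END { if (buf != "") print buf }
  ' "$1" |
    grep "^$2:" |          # 2. pick the top-level field
    tr ';' '\n' |          # 3. split it into secondary fields
    sed -n 's/^ *b=//p'    # 4. keep only the value of b=
}

extract_b_tag "$mail_file" 'ARC-Seal'   # SGVsbG8sIFdvcmxkIQ==
```

After that, we can pass the extracted value to a check based on the regular expression from earlier.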

Of course, in this case, we might not even need to detect the actual Base64 string via a regular expression.

4.2. HTML

Due to their textual nature, HTML and other Web languages commonly make use of Base64:

<img src="data:image/png;base64,
            iVBORw0KGgoAAAANSUhEUgAAAEAAAABACAIAAAAlC+aJAAAAAXNSR0IArs4c6QAAAARnQU1BAACx
            jwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAAAAjSURBVGhD7cEBDQAAAMKg909tDwcEAAAAAAAA
            AAAAAACcqwEwQAABDGBv9QAAAABJRU5ErkJggg==" alt="Void" />

In this case, we have a Base64-encoded PNG image within an <img> tag. Evidently, we can see that the actual Base64 string is within quotes as the value of the src attribute.

In summary, we don’t need special matching to extract this data and remove the whitespace from it, since the base64, prefix and the closing quote already delimit the string.

4.3. PDF

When dealing with PDF files, text can appear as operands of the Tj and TJ text-showing operators. Stripping these operators and their delimiters can be necessary when a Base64 string is split between several of them.

So, as long as we have good delimiters, i.e., structured data, Base64 should be easier to find and define. In these cases, we can just match against the regular expression to verify we’re indeed dealing with Base64.

5. Detect Base64 Strings in Unstructured Data

While Base64 often appears as part of structured information, different situations may present another context:

recovered data
mixed binary and text data
dumps
manually compiled data
metadata

In such instances, regular expressions are usually our best option. Although more universal, looking for Base64 strings this way is prone to more false positives.

Let’s see a basic example with an HTML dump of a Web page:

<html dir="ltr" lang="en-US" prefix="og: https://ogp.me/ns#" class="no-js cshppvwrw idc0_349"><head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
[...]
<div class="bd-anchor" id="1-encode"></div>
<p>In general, we can use the <a href="/linux/base64-encode-image#1-using-the-base64-utility"><em>base64</em></a> command to encode a <a href="/linux/bash-string-manipulation">string</a>:</p>
<pre class="hljs-copy-wrapper" style="--hljs-theme-background: rgb(250, 250, 250);"><code class="language-bash hljs">$ <span class="hljs-built_in">echo</span> -n <span class="hljs-string">'Hello, World!'</span> | <span class="hljs-built_in">base64</span>
SGVsbG8sIFdvcmxkIQ==</code><button class="hljs-copy-button" data-copied="false">Copy</button></pre>
<p>In this case, we <a href="/linux/anonymous-named-pipes#pipes">pipe</a> the result to the <em>base64</em> command which performs the encoding. Notably, <strong>we’ve used the <em>-n</em> flag with <em><a href="/linux/echo-command">echo</a></em> to prevent adding a trailing <a href="/linux/line-endings-configure-bin-sh-bad-interpreter#line-endings">newline</a> character to the string before performing the Base64 encoding</strong>. Alternatively, we can replace <em>echo</em> with <a href="/linux/printf-echo#printf"><em>printf</em></a> to get the same output without extra switches.</p><div class="code-block code-block-2" style="margin: 8px 0; clear: both;">
[...]
</html>

Even though HTML is a structured markup language, we can barely see the SGVsbG8sIFdvcmxkIQ== Base64 string.

5.1. Perl

Let’s see the result from applying our regular expression via Perl to this code snippet as stored in the base64.html file:

$ perl -ne 'while (/(?<![A-Za-z0-9+\/])(([A-Za-z0-9+\/]{4})*)([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)(?![=A-Za-z0-9+\/])/g) { print $&."\n"; }' base64.html
html
dir=
idc0
head
meta
charset=
[...]
SGVsbG8sIFdvcmxkIQ==
hljs
copy
data
Copy
/pre
this
case
/linux/anonymous
[...]

In this basic one-liner, we print all matches of our regular expression within the contents of base64.html, each on a separate line. Notably, there are many parts of the content, such as html, this, /linux/anonymous, and others that are obviously not Base64 strings. They match because they only contain characters from the Base64 alphabet and their length is a multiple of four.

5.2. grep

Let’s leverage grep directly with the same expression, using -P to enable PCRE and -o to print only the matching parts:

$ grep -Po '(?<![A-Za-z0-9+\/])(([A-Za-z0-9+\/]{4})*)([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)(?![=A-Za-z0-9+\/])' base64.html
html
dir=
idc0
head
meta
charset=
[...]
SGVsbG8sIFdvcmxkIQ==
hljs
copy
data
Copy
/pre
this
case
/linux/anonymous
[...]

In fact, we see the same results, since the nature of our search lies in the regular expression, which is equivalent.

Depending on whether we can tolerate more false positives or more false negatives, we can modify our regular expression and preprocessing. This can be based on different rules and conditions:

expecting certain surrounding characters like whitespace or angle brackets
expecting a certain minimum or maximum length of the Base64 string
removing whitespace
other characteristics of the data

As the case usually is with unstructured data, any piece of information can help narrow down our search, but how we get it depends on the case.
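As one example of such a rule, we can require a minimum length by replacing the * quantifier with {4,}, so that a match has at least 20 characters. The threshold and the two-line sample file below are assumptions, and a suitable limit depends on the data:

```shell
# Hypothetical sample with two lines similar to the HTML dump above
sample_file=$(mktemp)
cat > "$sample_file" <<'EOF'
<a href="/linux/anonymous-named-pipes#pipes">pipe</a>
SGVsbG8sIFdvcmxkIQ==</code>
EOF

# Same PCRE as before, but with at least four full groups before the last one
pcre_min='(?<![A-Za-z0-9+\/])([A-Za-z0-9+\/]{4}){4,}([A-Za-z0-9+\/]{4}|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{2}==)(?![=A-Za-z0-9+\/])'

grep -Po "$pcre_min" "$sample_file"   # SGVsbG8sIFdvcmxkIQ==
```

This way, short words and the 16-character /linux/anonymous path no longer match, at the cost of missing any Base64 strings shorter than the limit.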

6. Summary

In this article, we talked about Base64 and methods to find Base64 strings via different means.

In conclusion, although structured data often presents ways to extract certain kinds of information like a Base64 string, unstructured data may be challenging to sift through, even with regular expressions.
