1. Introduction

Extracting a URL from a string is a fairly common task when it comes to text processing in the shell. Still, due to the nature of text data, doing so doesn’t have guaranteed results. As usual with string matching, we can increase our chances of a positive match and decrease false positives and false negatives by recognizing the structure of the processed data.

In this tutorial, we explore ways to recognize a URL within HyperText Markup Language (HTML) files. First, we talk about the parsing conditions. Next, we delve into the structure of an HTML file. After that, we look at specific parts of such files where we might expect to find a URL. Finally, we consider how to ensure the extracted data is in fact a URL.

We tested the code in this tutorial on Debian 12 (Bookworm) with GNU Bash 5.2.15. It should work in most POSIX-compliant environments unless otherwise specified.

2. Matching Context

Notably, while we only consider the HTTP(S) protocol in our examples, they do apply to other URI formats as well. Still, we only search for an absolute URI with a protocol specification prefix.

Further, each regular expression (regex) we look at should be run with three modifier flags for best results:

  • m: match against multiple lines, equating ^ and $ to the start or end of any line
  • g: global match, searching for all instances
  • i: case-insensitive match (mainly for HTML names)

How these are applied depends on the regex flavor. For example, we use perl, whose regex flavor supports both the inline (?flags) and the trailing /flags syntax, although g is only available as a /flag:

$ cat file.html | perl -n0we 'foreach (/<REGEX>/gmi) { print; print "\n"; }'

Here, we pipe the contents of file.html to the perl interpreter, which [e]xecutes the code for the wh[0]le file at once with [n]o line printing but still issuing [w]arnings if necessary. Strictly speaking, a bare -0 splits the input on null bytes, which text files normally lack, so the file arrives as a single record. The code applies REGEX globally and prints each match followed by a new line.

3. HTML Files

The HyperText Markup Language (HTML) is designed to format content to be displayed in Web browsers. As such, it has a fairly basic syntax.

HTML comprises tags surrounded by <> angle brackets:

<tag>

Most tags come in pairs, with the ending tag having a / slash after the < opening angle bracket, although void elements like meta and link have no ending tag:

<tag></tag>

Such a pair of matching begin and end tags together with the data between them is also sometimes called an element.

In addition, a tag can have attributes in the form of key-value pairs:

<tag attribute1="value" attribute2="1">

Notably, we rely on the double quotes that surround the attribute values. Although HTML also permits single-quoted and unquoted values, double quotes are the classic way to write and, more importantly, generate HTML code.

Between each set of tags, we can have any type of content, including nested tags:

<tag>Content and <nested>tag data</nested>.</tag>

An HTML file or document has a relatively basic structure:

$ cat file.html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>title</title>
    <link rel="stylesheet" href="style.css">
    <script src="script.js"></script>
  </head>
  <body>
    <!-- page content -->
  </body>
</html>

Its first line can be a DOCTYPE declaration that indicates its type as html:

<!DOCTYPE html>

Normally, HTML files begin with a top-level html tag, which optionally has a [lang]uage attribute:

<html lang="en">

Under the html tag, we have the head tag pair, which surrounds a number of declarations, followed by the body tags, which contain the main page data.

4. HTML Tags With URL Attributes

Of course, like any other text file, HTML documents can have a URL at any point in the data stream between the structuring tags. That’s not much different than the regular URL-within-string search that we can do with many tools.

Yet, knowing about HTML tags, we can have fair expectations about the location of a URL within the metastructure.

4.1. href: a, link

Naturally, two of the most obvious places to look for a URL are the [a]nchor and link tags. In particular, we look for their href attribute:

<a href="https://gerganov.com">x</a>

So, let’s specialize our URL match regex by first finding href instances:

<(?:a|link) [^>]*href="(.*?)"

Here, we use an (?:) non-capture group with | alternatives to list the tag names we target and () capture group to extract the entire contents of their respective attribute, href in this case.

Yet, what we extract here can be one of many link targets and URL schemes supported by the browser:

  • page sections with document fragments
  • text fragments
  • media files and media fragments
  • tel: telephone numbers
  • mailto: email addresses
  • sms: SMS

Not all of them are full HTTP(S) addresses.
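If only absolute HTTP(S) addresses are of interest, one possible approach is to filter the extracted values by their prefix. The following is just a sketch: the only_http name and the sample values are assumptions, and the check covers the scheme alone rather than the validity of the whole URL:

```shell
# only_http: keep the lines that start with an http:// or https:// prefix
# (hypothetical helper; checks the scheme only, case-insensitively)
only_http() {
  grep -iE '^https?://'
}

# sample attribute values, one per line
printf '%s\n' '#top' 'mailto:user@example.com' 'tel:+15550100' \
  'https://gerganov.com' 'HTTP://example.com/page' | only_http
```

Here, only the last two sample values pass the filter.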

4.2. action: form

Similarly to href, the action attribute of the form element can contain a URL:

<(?:form) [^>]*action="(.*?)"

Also like href, action doesn’t always use a full HTTP(S) URL. In fact, while it can, it rarely does.

4.3. src: img, iframe, script, source, audio, video, input

Now, let’s turn to the src attribute that many tags support:

  • img: URL of image
  • iframe: URL of page to embed or about:blank and similar
  • script: URL of script
  • source: URL of media resource
  • audio: URL of audio
  • video: URL of video
  • input: URL of image for image button

Again, we can construct the respective regular expression:

<(?:img|iframe|script|source|audio|video|input) [^>]*src="(.*?)"

Although most values of src should be valid as a URL, it’s not guaranteed.

4.4. Other Tags

Naturally, there are more tags that support URL attributes. In fact, since it’s up to the Web browser, HTML parser, and developer, any attribute could contain a URL. As long as we know the specific ones we might be interested in as well as the attribute, we can narrow down the search as we did above:

<(?:tag1|tag2|...|tagN) [^>]*attribute="(.*?)"

For example, the source and img tags also have a srcset attribute, which can contain not just one but multiple URL strings, separated by commas and each optionally followed by a width or pixel density descriptor. Of course, this might require more specialized processing outside the scope of this article.
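Just to illustrate the idea, a naive split of an already extracted srcset value could look like the sketch below. It assumes that no URL in the value contains a comma or whitespace, which real-world data doesn't guarantee, and the split_srcset name and the sample value are hypothetical:

```shell
# split_srcset: read srcset values from standard input and print only the
# URL of each candidate, dropping descriptors such as 2x or 480w
# (naive sketch: assumes URLs contain no commas or whitespace)
split_srcset() {
  tr ',' '\n' | awk 'NF { print $1 }'
}

# sample srcset value with two candidates
printf '%s\n' 'small.jpg 480w, https://example.com/large.jpg 1080w' | split_srcset
```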

Still, let’s explore the basic case of recognizing a single URL within the captured data from any of the regular expressions above.

5. Proper HTML Parsing

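Although we’ve shown that regular expression matching within HTML content is possible and could deliver results, it’s a strongly discouraged practice, since regular expressions can’t reliably handle nesting, comments, or the alternative quoting styles of attributes.

Because of this, it’s usually a better idea to employ a dedicated HTML parser, such as the Mojo::DOM module for Perl: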

$ perl -MMojo::DOM -e '
  local $/;                         # undefine the separator to slurp all input
  my $html = <>;
  my $dom = Mojo::DOM->new($html);
  print $dom->at("a")->attr("href"), "\n";
' <<< '
  <a href="https://gerganov.com">Link Text</a>
'

In this case, we use the Mojo::DOM [M]odule to process the here-string that represents HTML content with only a single a element. After parsing the data, we select the first a element and get its href [attr]ibute. This way, we avoid regular expressions and get potential URL data in a more robust way.
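Since at only returns the first matching element, a document with several links needs a slightly different call. As a sketch, assuming the Mojolicious distribution is installed, we can combine the find method of Mojo::DOM with the map and each methods of the collection it returns. The all_hrefs function name is hypothetical:

```shell
# all_hrefs: print the href of every a element that has one, using Mojo::DOM
# (requires the Mojolicious distribution; helper name is hypothetical)
all_hrefs() {
  perl -MMojo::DOM -0777 -ne '
    my $dom = Mojo::DOM->new($_);       # $_ holds the whole input (-0777)
    # a[href] selects only the a elements that carry an href attribute
    print "$_\n" for $dom->find("a[href]")->map(attr => "href")->each;
  '
}
```

For example, we can run it as all_hrefs < file.html to list one href value per line.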

6. Extract URL From String

At this point, let’s come up with an effective way to match a URL within any type of string.

As usual, one of the best methods to do so involves a regular expression:

https?:\/\/(www\.)?([a-zA-Z0-9%][-a-zA-Z0-9@:%._\+~#=]{0,256}|)[a-zA-Z0-9]\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)

Let’s break this down:

+---------------------------------+---------------------------------------------------------------------------------------------+
| Regex Part                      | Function                                                                                    |
|---------------------------------+---------------------------------------------------------------------------------------------|
| https?:\/\/                     | match starts with either http:// or https://                                                |
|---------------------------------+---------------------------------------------------------------------------------------------|
| (www\.)?                        | allow for both classic World Wide Web (WWW) and other subdomains                            |
|---------------------------------+---------------------------------------------------------------------------------------------|
| ([a-zA-Z0-9%]                   | first alternative of the domain group: expects the first domain name character              |
|                                 | to be alphanumeric or % (percent-encoded), followed by                                      |
| [-a-zA-Z0-9@:%._\+~#=]{0,256}   | at most 256 dash, alphanumeric, @ credential, : port separator, % percent-encoding,         |
|                                 | . subdomain, query parameter, and anchor characters                                         |
|---------------------------------+---------------------------------------------------------------------------------------------|
| |)                              | second and last alternative of the domain group: allows the group to match nothing          |
|---------------------------------+---------------------------------------------------------------------------------------------|
| [a-zA-Z0-9]                     | forces the domain to end with an alphanumeric character                                     |
|---------------------------------+---------------------------------------------------------------------------------------------|
| \.                              | literal dot separator between the domain and top-level domain (TLD) parts of the URL        |
|---------------------------------+---------------------------------------------------------------------------------------------|
| [a-zA-Z0-9()]{1,6}              | expects the TLD to consist of 1-6 alphanumeric characters or parentheses                    |
|---------------------------------+---------------------------------------------------------------------------------------------|
| \b                              | word boundary separator between the TLD and the following path                              |
|---------------------------------+---------------------------------------------------------------------------------------------|
| ([-a-zA-Z0-9()@:%_\+.~#?&//=]*) | right end of the URL, allowing for a number of characters in a greedy match                 |
+---------------------------------+---------------------------------------------------------------------------------------------+

Although comprehensive, there are a couple of obvious pitfalls to this way of matching:

  • main anchor point is http[s]://
  • no defined start or end

This means that we might match something very unexpected like binary or incomplete data. At the same time, the greedy match at the end could extend the match well beyond the URL itself, for example, by swallowing a closing parenthesis or a trailing period.

Because of this, it’s usually good practice to include anything we expect from the formatting of the matched data.

7. Summary

In this article, we talked about extracting a URL from an HTML file.

In conclusion, finding and matching a URL in structured data such as an HTML file is generally more reliable than doing the same within an arbitrary string of data.
