1. Introduction

The HyperText Markup Language (HTML) uses textual start tags and end tags to delimit elements.

In this tutorial, given a pair of start and end HTML tags, we’ll discuss methods to extract the data between them. First, we talk about HTML preprocessing. After that, we go over general parsing considerations. Next, we go over HTML parsing with standard shell tools. Then, we focus on interpreters for the task at hand. Finally, we briefly mention a purpose-built tool for HTML parsing.

For brevity and clarity, we use process substitution for supplying the input text.
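
As a brief illustration of the technique, process substitution exposes a command's output as a file-like path that another command can then read:

```shell
# <(...) runs the command in a subshell and presents its output as a pseudo-file
cat <(echo 'sample input')
# prints: sample input
```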

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. HTML Tidy

Some languages are flexible and enable us to convey identical meanings via different constructs. Others have the equivalent of the Perl strict pragma, so they can enforce both loose and more rigorous rules.

For these and other reasons, a number of languages benefit from syntax standardization. For instance, with HTML, the de facto standard is HTML Tidy with its tidy tool:

$ tidy --tidy-mark no --doctype omit <(echo '<html><head></head></html>') 2>/dev/null
<html>
<head>
<title></title>
</head>
<body>
</body>
</html>

Here, the --tidy-mark no option prevents the addition of a generator meta tag, while --doctype omit drops the <!DOCTYPE> declaration. In addition, we redirect stderr to /dev/null so that only the resulting HTML is printed.

By passing our HTML through tidy, we ensure a level of standardization for uniform parsing.

3. Parsing Considerations

When parsing any type of text structure, it’s critical to consider potentially problematic constructs. Let’s informally discuss some common ones.

3.1. Newline and NULL

In general, the presence or absence of a newline within a given string matters in many contexts. For example, newlines are legal in Linux paths, but processing paths that contain them can present challenges.

This is one of the main reasons a number of standard NULL, i.e., \0, switches exist:

+-----------+-----------------------+--------------+------------------------------------------+
|  Command  |        Option         | Input/Output |                 Function                 |
+-----------+-----------------------+--------------+------------------------------------------+
| find      | -print0               | output       | append NULL to each path                 |
| sort      | -z, --zero-terminated | input/output | use NULL as the line delimiter           |
| bash      | IFS=$'\x00'           | input/output | use NULL as the internal field separator |
| perl      | -0                    | input/output | use NULL as the record separator         |
| sed       | -z, --null-data       | input/output | use NULL as the line separator           |
| grep      | -Z, --null            | output       | append NULL instead of newline           |
|           | -z, --null-data       | input        | use NULL as the input separator          |
| awk       | BEGIN{FS="\x00"}      | input        | use NULL as the field separator          |
| xargs     | -0, --null            | input        | use NULL as the only input separator     |
| read      | -d ''                 | input        | use NULL as the delimiter                |
| readarray | -d ''                 | input        | use NULL as the delimiter                |
| mapfile   | -d ''                 | input        | use NULL as the delimiter                |
| cut       | -z, --zero-terminated | input/output | use NULL as the line delimiter           |
+-----------+-----------------------+--------------+------------------------------------------+

While this is a non-comprehensive list, it provides an idea about the mechanism. Further, the same options can be used for other delimiters, not only newlines.
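
To see the mechanism in action, we can sketch a small pipeline that combines several of the options from the table; the file names below are arbitrary examples:

```shell
# create sample files whose names contain spaces (hypothetical names)
cd "$(mktemp -d)"
touch 'a file.txt' 'b file.txt'

# -print0 terminates each path with NULL, while sort -z and xargs -0 split
# on NULL, so paths with spaces (or even newlines) survive intact
find . -name '*.txt' -print0 | sort -z | xargs -0 ls -1
# prints:
# ./a file.txt
# ./b file.txt
```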

3.2. Nesting

Often, certain parts of a text may require nesting. For example, there are the C-style curly braces, which surround blocks of code:

int main() {
  if (1 == 1) {
    printf("Truth.");
  }
  
  return 0;
}

Notably, one set of curly braces surrounds the body of the main() function, while a further set delimits the if block that is also part of main().

If we want to extract a syntactically correct code block, we need a mechanism to discern matching symbols that can nest. For instance, extracting from the first { to the first } would be incorrect for most purposes:

{
  if (1 == 1) {
    printf("Truth.");
  }

Another situation might be caused by nested HTML tags. When attempting to extract data between specific tags, how do we specify which ones we want, and how do we match the correct ones?

Obviously, considering such cases is critical.

3.3. Parsers

Finally and most importantly, many text formats have purpose-built parsers. This means that we may not need to reinvent a way to handle a given syntax.

On the other hand, such tools can be third-party applications and might not be built into our Linux distribution. Still, best practices dictate purpose-built parsers as the optimal way to handle a given type of syntax.

For example, there are methods that may not be recommended for parsing HTML in most circumstances. Still, let's go over some of them next.

4. Using Standard Shell Tools

The ubiquitous grep command comes with the --only-matching or -o switch, which prints only the matching parts of lines:

$ TAG='td'; grep --only-matching '<'$TAG'>.*</'$TAG'>' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>'
)
<td>text1</td>
<td>text2</td>

Now, we can simply remove the tags themselves with sed:

$ TAG='td'; grep --only-matching '<'$TAG'>.*</'$TAG'>' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>'
) | sed 's/\(<'$TAG'>\|<\/'$TAG'>\)//g'
text1
text2

Yet, an HTML element may not begin and end on the same line. In addition, there may be other tags on a given line.
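
If our GNU grep build includes PCRE support, one workaround (a sketch, not a robust parser) is to treat the whole input as a single record via -z and match lazily across lines:

```shell
# -z makes grep read NULL-delimited records (here, the whole input is one record),
# -P enables (?s) so . also matches newlines, and .*? matches lazily;
# tr converts the NULL output separators back to newlines
TAG='td'
grep -Pzo "(?s)<$TAG>.*?</$TAG>" <(printf '<td>line1\nline2</td>\n<td>text2</td>\n') | tr '\0' '\n'
# prints:
# <td>line1
# line2</td>
# <td>text2</td>
```

Notably, this still breaks down for nested tags of the same name.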

On the other hand, we can also have a go at the extraction with awk:

$ TAG='td'; awk 'BEGIN{open=0;}
{
  if($0 ~ /'"<$TAG>"'/){open=open+1;}
  if(open>0){print $0;}
  if($0 ~ /'"<\/$TAG>"'/){open=open-1;}
}' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>
<'$TAG'>
<othertag>matching</othertag>
paragraph
</'$TAG'>'
)
<td>text1</td>
<td>text2</td>
<td>
<othertag>matching</othertag>
paragraph
</td>

This awk code acts similarly to grep with --only-matching but also considers tags opened on previous lines, so matching works across more than one line.

Notably, none of these solutions is optimal, especially due to their use of a regular expression (regex). Let’s turn to actual HTML parsing and scraping.

5. Using General Interpreters

Many programming language interpreters that are a default part of major Linux distributions have HTML-parsing modules either built-in or available to install.

Let’s explore some one-liners that leverage them.

5.1. Perl

With the perl interpreter, we can employ a number of CPAN modules.

For our purposes, let's install Mojo::DOM:

$ cpan install Mojo::DOM

Now, we use it by [-e]xecuting a [-n]on-printing one-liner with the Mojo::DOM [-M]odule pre-imported, while -0777 slurps the whole input at once:

$ TAG='td'; perl -MMojo::DOM -0777 -ne '
  my $dom = Mojo::DOM->new($_);
  $dom->find("'$TAG'")->each(sub { print $_->text."\n"; });
' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>'
)
text1
text2

First, we get the contents of the HTML. Next, we parse them. After that, we find each instance of the tag we need and print its inner contents as text.

5.2. Python

As usual, python also offers ways to parse HTML. A common one is the BeautifulSoup library.

First, we install it with the bs4 or beautifulsoup4 package via pip:

$ pip3 install bs4

Now, we can run a Python one-liner [-c]ommand:

$ TAG='td'; python3 -c '
import sys
from bs4 import BeautifulSoup

html = open(sys.argv[1]).read()
dom = BeautifulSoup(html, "html.parser")

for tag in dom.find_all("'$TAG'"):
  print(tag.text)
' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>'
)
text1
text2

Just like with Perl, we read the whole HTML content and pass it to the BeautifulSoup() parser. After that, we print the text of each found tag.
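
If installing packages via pip isn't an option, a minimal sketch with the built-in html.parser module can achieve similar extraction; here, we hardcode the td tag for simplicity:

```shell
python3 -c '
from html.parser import HTMLParser

# track how many matching tags are currently open and print text inside them
class TagText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.depth += 1
    def handle_endtag(self, tag):
        if tag == "td":
            self.depth -= 1
    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            print(data.strip())

TagText().feed("<td>text1</td><othertag>nonmatching</othertag><td>text2</td>")
'
# prints:
# text1
# text2
```

Unlike BeautifulSoup, this event-based approach doesn't build a tree, but the depth counter still handles tags that span multiple lines.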

5.3. Ruby

The ruby interpreter has an HTML parser gem called nokogiri. It’s commonly used for tasks such as scraping and data extraction.

Let’s install nokogiri via the gem tool:

$ gem install nokogiri

At this point, we can [-e]xecute a Ruby one-liner with the nokogiri [-r]equirement preloaded:

$ TAG='td'; ruby -rnokogiri -e '
html = readlines.join
dom = Nokogiri::HTML(html)

puts dom.xpath("//'$TAG'").map { |e| e.content }
' <(echo '
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>'
)
text1
text2

In this case, we read the whole file and pass it to the Nokogiri::HTML parser. Next, we use an XPath expression to get all instances of the tag and map to get their content.
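
Alternatively, Nokogiri also supports CSS selectors, which can be more concise than XPath for simple tag lookups; this sketch assumes the gem is installed as above:

```shell
TAG='td'
ruby -rnokogiri -e '
# css() takes a CSS selector; extraction otherwise mirrors the xpath() version
html = readlines.join
dom = Nokogiri::HTML(html)
puts dom.css("'$TAG'").map { |e| e.content }
' <(echo '<'$TAG'>text1</'$TAG'>
<'$TAG'>text2</'$TAG'>')
# prints:
# text1
# text2
```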

6. Using pup and xpup

Both pup and xpup can parse HTML and extract data. They use CSS selectors to extract information.

We can install pup using wget:

$ sudo wget https://github.com/ericchiang/pup/releases/download/v0.4.0/pup_v0.4.0_linux_amd64.zip

This command downloads the pup tool as a zip file on our system. Once this is done, we need to extract the contents of the zip file and place them in the /usr/local/bin directory:

$ sudo unzip pup_v0.4.0_linux_amd64.zip -d /usr/local/bin

At this point, pup exists on our system.

Next, let’s install xpup using go:

$ go install github.com/ericchiang/xpup@latest

Above, we ensure that we install the latest version of xpup. After this, we add the path to the xpup tool in the .bashrc file:

export PATH=$PATH:~/go/bin

In this case, we extend PATH to make xpup accessible from any location in the terminal.

Since the two tools are strongly related, the XML-focused xpup has largely the same syntax as pup.

So, let’s show an example with pup:

$ TAG=td; pup --file <(echo '
<html><head></head><body>
<table>
<'$TAG'>text1</'$TAG'>
<othertag>nonmatching</othertag>
<'$TAG'>text2</'$TAG'>
</table>
</body></html>') $TAG
<td>
 text1
</td>
<td>
 text2
</td>

Here, we use the --file flag to supply our content. Critically, pup and xpup require properly formatted HTML instead of the snippets we used in earlier examples. In addition, we may need to apply further filtering to isolate just the contents, excluding the tags themselves.
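
To strip the tags in the same step, pup offers display filters such as text{}, which prints only the text nodes; this is a sketch assuming pup is installed as above:

```shell
# appending the text{} display filter to the selector drops the tags themselves
TAG='td'
pup --file <(echo '<html><head></head><body>
<table><'$TAG'>text1</'$TAG'><'$TAG'>text2</'$TAG'></table>
</body></html>') $TAG' text{}'
# prints:
# text1
# text2
```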

7. Summary

In this article, we explored ways to extract the content between two HTML tags in the shell.

In conclusion, although there are many options to do so, parsing HTML in the shell is best done with proper parsers and parsing libraries.
