1. Overview

In this tutorial, we’ll learn how to quickly extract values from XML (Extensible Markup Language) tags using the command line. We’ll go through a few handy utilities that make this process easier. Finally, we’ll use the Perl programming language for the job.

Throughout the tutorial, we’ll work with a simple RSS (Really Simple Syndication) XML document, saved as rss.xml, that contains several different tags:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
    <channel>
        <title>Baeldung on Linux</title>
        <link>http://baeldung.com/linux</link>
        <description>A simple RSS feed.</description>
        <language>en-us</language>
        <item>
            <title>Extract Values from XML Tags</title>
            <link>http://baeldung.com/linux/extract-xml-tags</link>
            <guid>7EE7D60F-95A5-48FB-A15F-3EF2CE7A4321</guid>
            <pubDate>Fri, 15 Dec 2023 00:00:00 GMT</pubDate>
            <author>Haidar Ali</author>
            <description>This article explains how to extract values from XML tags using various techniques.</description>
        </item>
    </channel>
</rss>

2. xmllint

xmllint is a utility that parses and validates XML documents. It can also pretty-print documents to make them more readable.
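As a quick sketch of both capabilities, assuming xmllint is already installed (see section 2.2), we can check a document for well-formedness with --noout and pretty-print it with --format. The file name feed-check.xml and the trimmed feed inside it are only placeholders for illustration:

```shell
# Work in a scratch directory; feed-check.xml is a hypothetical file name
workdir=$(mktemp -d)
cat > "$workdir/feed-check.xml" <<'EOF'
<rss version="2.0"><channel><title>Baeldung on Linux</title></channel></rss>
EOF

# --noout suppresses the document output, so only parse errors are reported;
# the exit status is 0 when the document is well-formed
xmllint --noout "$workdir/feed-check.xml"
wellformed_status=$?
echo "exit status: $wellformed_status"

# --format re-indents the document to make it easier to read
formatted=$(xmllint --format "$workdir/feed-check.xml")
echo "$formatted"
```

Running the check before querying helps us tell an empty XPath result apart from a document that failed to parse.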

2.1. XPath Support

A neat feature of xmllint is the support for XPath. XPath is a query language that makes it easier to query XML documents to retrieve information like tag values and attributes.

It uses an expression that lets us navigate through the XML document, similar to navigating paths in UNIX. For instance, if we want to query the title tag inside head in an HTML document, we use /html/head/title. It returns the title node itself and not its text content.

2.2. Installing xmllint

By default, xmllint isn’t installed on most Linux distros. However, it’s available in the official package repositories:

# Debian, Ubuntu, and derivatives
sudo apt install -y libxml2-utils

# Fedora, Red Hat, and CentOS Stream
sudo dnf install -y libxml2

# OpenSUSE and derivatives
sudo zypper install --non-interactive libxml2-tools

# Arch Linux and derivatives
sudo pacman -S --noconfirm libxml2

Once installed, let’s verify it:

$ xmllint --version
xmllint: using libxml version 20914

In the next sections, we’ll use xmllint to extract values from the XML document.

2.3. Extract Values

We extract values and tags from XML documents by using the --xpath option:

$ xmllint --xpath '/rss/channel/item/title' rss.xml
<title>Extract Values from XML Tags</title>

--xpath requires an XPath string. As we can see, it prints the value as well as the containing element. However, we can also extract the text content inside the title element:

$ xmllint --xpath 'string(/rss/channel/item/title)' rss.xml
Extract Values from XML Tags

In the --xpath option, we specified the string function and wrapped the XPath in it. It effectively omits the element and prints out the actual text.

Moreover, if we omit the parent elements, we get an empty result:

$ xmllint --xpath 'string(/item/title)' rss.xml

The XPath expression needs to be accurate so that it reflects the actual structure of the XML document. Therefore, we can’t take shortcuts when specifying absolute paths. Conversely, if the document is too complex to spell out the full path, we can start the expression with //, which matches elements at any depth:

$ xmllint --xpath 'string(//item/title)' rss.xml
Extract Values from XML Tags

In the command, we prefixed the path with // so that item is matched anywhere in the document rather than only directly under the root. Furthermore, we can also query elements by attributes:

$ xmllint --xpath '//rss[@version="2.0"]/channel/title' rss.xml
<title>Baeldung on Linux</title>

In the expression, we select the rss element that has its version attribute set to 2.0, and then we provide the rest of the path.
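Predicates aren’t limited to attributes. The following sketch assumes a hypothetical two-item feed (feed-items.xml, with made-up titles and guid values) to show how count(), a positional predicate, and a predicate on a child element behave when an expression matches more than one node:

```shell
# Hypothetical feed with two items, used only for this illustration
workdir=$(mktemp -d)
cat > "$workdir/feed-items.xml" <<'EOF'
<rss version="2.0">
  <channel>
    <title>Baeldung on Linux</title>
    <item><title>First Article</title><guid>guid-1</guid></item>
    <item><title>Second Article</title><guid>guid-2</guid></item>
  </channel>
</rss>
EOF

# count() returns the number of matching nodes
item_count=$(xmllint --xpath 'count(//item)' "$workdir/feed-items.xml")

# string() applied to several matches uses only the first one in document order
first_title=$(xmllint --xpath 'string(//item/title)' "$workdir/feed-items.xml")

# A positional predicate selects the second item
second_title=$(xmllint --xpath 'string(//item[2]/title)' "$workdir/feed-items.xml")

# A predicate can also test the value of a child element
by_guid=$(xmllint --xpath 'string(//item[guid="guid-2"]/title)' "$workdir/feed-items.xml")

echo "items: $item_count"
echo "first: $first_title"
echo "second: $second_title"
echo "by guid: $by_guid"
```

Because string() silently keeps only the first match, a predicate is usually the safer choice when a feed contains several items.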

In the next section, we look at how to extract attribute values.

2.4. Extract Attributes

We extract attributes from XML elements by prefixing the attribute name with an @:

$ xmllint --xpath '//rss/@version' rss.xml 
 version="2.0"

In the expression, the final component of the path is the attribute name, which belongs to the top-level rss element. We can also see that it prints the attribute name. Again, we use the string function to omit that:

$ xmllint --xpath 'string(//rss/@version)' rss.xml 
2.0
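In scripts, we typically want the extracted value in a variable rather than on the terminal. A minimal sketch, assuming a trimmed copy of the feed in a hypothetical file named feed-version.xml, captures the attribute with command substitution and branches on it:

```shell
# Trimmed copy of the feed; feed-version.xml is a hypothetical file name
workdir=$(mktemp -d)
cat > "$workdir/feed-version.xml" <<'EOF'
<rss version="2.0"><channel><title>Baeldung on Linux</title></channel></rss>
EOF

# string() strips the attribute name, so the variable holds only the value
feed_version=$(xmllint --xpath 'string(//rss/@version)' "$workdir/feed-version.xml")

if [ "$feed_version" = "2.0" ]; then
  echo "RSS 2.0 feed detected"
else
  echo "Unexpected RSS version: $feed_version"
fi
```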

3. xmlstarlet

xmlstarlet is a comprehensive XML processor. Like xmllint, it also provides XPath support.

By default, it’s not installed on most Linux distributions. However, it’s available in most package repositories:

# Ubuntu, Debian, and derivatives
sudo apt-get install -y xmlstarlet

# Fedora, Red Hat, and CentOS Stream
sudo dnf install -y xmlstarlet

# OpenSUSE and derivatives
sudo zypper install --non-interactive xmlstarlet

# Arch Linux and derivatives
sudo pacman -S --noconfirm xmlstarlet

Once installed, let’s verify it:

$ xmlstarlet --version
1.6.1

3.1. Extract Values

xmlstarlet lets us query elements with an XPath:

$ xmlstarlet sel -t -v "//channel/title" rss.xml 
Baeldung on Linux

Let’s break this down:

  • sel specifies that we’re querying data from an XML document
  • -t starts a template, which is the sequence of output instructions that follows
  • -v prints out the value of the queried element
  • //channel/title is the relative XPath expression
  • rss.xml is the input document

Conversely, we print out the entire element with -c:

$ xmlstarlet sel -t -c "//channel/title" rss.xml 
<title>Baeldung on Linux</title>
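A template can combine several of these instructions. As a sketch, assuming a trimmed copy of the feed in a hypothetical file named feed-fields.xml, the command below uses -m to iterate over each item, two -v options to print fields relative to the matched item, and -o to emit a literal separator between them:

```shell
# Trimmed copy of the feed; feed-fields.xml is a hypothetical file name
workdir=$(mktemp -d)
cat > "$workdir/feed-fields.xml" <<'EOF'
<rss version="2.0">
  <channel>
    <title>Baeldung on Linux</title>
    <item>
      <title>Extract Values from XML Tags</title>
      <link>http://baeldung.com/linux/extract-xml-tags</link>
    </item>
  </channel>
</rss>
EOF

# -m matches each item; the -v expressions are evaluated relative to that item
# -o prints a literal string, and -n ends the line after each item
summary=$(xmlstarlet sel -t -m "//item" -v "title" -o " | " -v "link" -n \
  "$workdir/feed-fields.xml")
echo "$summary"
```

With the single item above, this prints the title and the link on one line, separated by " | ".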

3.2. Extract Attributes

In the same way, we extract attributes by prefixing the attribute with an @:

$ xmlstarlet sel -t -v "//rss/@version" rss.xml
2.0

Similarly, we can also print all attributes of an element:

$ xmlstarlet sel -t -m "//rss/@*" -v "concat(name(), '=', .)" -n rss.xml
version=2.0

Let’s dig into this:

  • -m "//rss/@*" matches every attribute of the rss element and applies the rest of the template to each match
  • -v "concat(name(), '=', .)" combines the attribute name and its value
    • name() is a function that returns the attribute name
    • = is the separator, which we can replace with any other string
    • . refers to the current node, so here it yields the attribute value
  • -n adds a newline after each attribute for readability

Since we have a single attribute for rss, it prints only that. However, if there are more attributes, it prints each of them:

$ echo '<rectangle width="12" height="28"></rectangle>' | \
   xmlstarlet sel -t -m "//rectangle/@*" -v "concat(name(), '=', .)" -n
width=12
height=28

4. Perl

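On most Linux distributions, Perl comes pre-installed. It’s well suited to text processing thanks to its concise syntax and built-in regular expression support. For XML, we’ll use the XML::Twig module, which isn’t part of core Perl, so we may need to install it first from CPAN or from the distribution’s repositories (for example, the libxml-twig-perl package on Debian and Ubuntu).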

Let’s extract the title element from our document:

$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
  $twig->parsefile("rss.xml");
  $_->print for $twig->findnodes("//item/title");'
<title>Extract Values from XML Tags</title>

Let’s break this down:

  • -MXML::Twig loads the XML Twig module for XML processing
  • -e enables us to write a Perl one-liner in a shell
  • my $twig=XML::Twig->new(); creates a new Twig instance and assigns it to $twig
  • $twig->parsefile("rss.xml"); parses the rss.xml file using the $twig instance
  • $_->print for $twig->findnodes("//item/title"); iterates over each matching title element and prints it out

Similarly, we can extract values as well:

$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
  $twig->parsefile("rss.xml");
  print $_->text for $twig->findnodes("//item/title");'
Extract Values from XML Tags

The last statement uses $_->text, which returns the text content of the title element, and print writes it out. In contrast, $_->print prints out the entire element.

In addition, we can also print the attribute values:

$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
  $twig->parsefile("rss.xml");
  print $_->att("version") for $twig->findnodes("//rss");'
2.0

Notably, the last statement prints out the attribute value of each rss element. For that, we’re using the $_->att() method. It expects an attribute name, which in this case is "version".
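Besides querying the tree after parsing, XML::Twig can also invoke a callback for each matching element while it parses, via the twig_handlers option. The following is a minimal sketch of that alternative, assuming XML::Twig is installed and using a trimmed copy of the feed in a hypothetical file named feed-twig.xml:

```shell
# Trimmed copy of the feed; feed-twig.xml is a hypothetical file name
workdir=$(mktemp -d)
cat > "$workdir/feed-twig.xml" <<'EOF'
<rss version="2.0">
  <channel>
    <title>Baeldung on Linux</title>
    <item><title>Extract Values from XML Tags</title></item>
  </channel>
</rss>
EOF

# The handler runs for every title element whose parent is an item element;
# it receives the twig and the matched element as its arguments
titles=$(perl -MXML::Twig -e '
  my $twig = XML::Twig->new(
    twig_handlers => {
      "item/title" => sub { my ($t, $elt) = @_; print $elt->text, "\n"; },
    },
  );
  $twig->parsefile($ARGV[0]);
' "$workdir/feed-twig.xml")
echo "$titles"
```

Here, the channel title is skipped because its parent isn’t an item, and each printed title ends with a newline, which is convenient when a feed holds several items.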

5. Conclusion

In this article, we discussed how to process an XML document to extract tag values and attributes. For that purpose, we made use of xmllint and xmlstarlet.

In addition, we also looked at Perl, which lets us be more flexible when it comes to text processing.