Last updated: March 18, 2024
In this tutorial, we’ll learn how to quickly extract values from XML (Extensible Markup Language) tags using the command line. We’ll go through a few handy utilities that make this process easier. Finally, we’ll use the Perl programming language for the job.
Moreover, we’ll be processing a simple RSS (Really Simple Syndication) XML document with several different tags:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>Baeldung on Linux</title>
<link>http://baeldung.com/linux</link>
<description>A simple RSS feed.</description>
<language>en-us</language>
<item>
<title>Extract Values from XML Tags</title>
<link>http://baeldung.com/linux/extract-xml-tags</link>
<guid>7EE7D60F-95A5-48FB-A15F-3EF2CE7A4321</guid>
<pubDate>Fri, 15 Dec 2023 00:00:00 GMT</pubDate>
<author>Haidar Ali</author>
<description>This article explains how to extract values from XML tags using various techniques.</description>
</item>
</channel>
</rss>
xmllint is a utility that parses and validates XML documents. In addition, it’s also capable of pretty-printing to make the documents more readable.
A neat feature of xmllint is the support for XPath. XPath is a query language that makes it easier to query XML documents to retrieve information like tag values and attributes.
An XPath expression navigates the document tree much like a path navigates a UNIX filesystem. For instance, in an XHTML document, /html/head/title selects the title element inside head. The expression returns the title node itself, not its text content.
By default, xmllint isn’t installed on most Linux distros. However, it’s available in the official package repositories:
# Debian, Ubuntu, and derivatives
sudo apt install -y libxml2-utils
# Fedora, Red Hat, and CentOS Stream
sudo dnf install -y libxml2
# OpenSUSE and derivatives
sudo zypper install --non-interactive libxml2-tools
# Arch Linux and derivatives
sudo pacman -S --noconfirm libxml2
Once installed, let’s verify it:
$ xmllint --version
xmllint: using libxml version 20914
In the next sections, we’ll use xmllint to extract values from the XML document.
We extract values and tags from XML documents by using the --xpath option:
$ xmllint --xpath '/rss/channel/item/title' rss.xml
<title>Extract Values from XML Tags</title>
--xpath requires an XPath expression. As we can see, it prints the containing element along with its value. However, we can also extract only the text content inside the title element:
$ xmllint --xpath 'string(/rss/channel/item/title)' rss.xml
Extract Values from XML Tags
In the --xpath option, we wrapped the path in the XPath string function. It converts the selected node to its text content, so the output omits the element tags.
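In a script, we usually want the extracted text in a variable rather than on the terminal. The following is a minimal sketch, assuming xmllint is installed. The temporary file, the shortened copy of our feed, and the variable names are illustrative:

```shell
#!/bin/sh
# Write a shortened copy of the sample feed to a temporary file (illustrative)
feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Baeldung on Linux</title>
    <item>
      <title>Extract Values from XML Tags</title>
      <link>http://baeldung.com/linux/extract-xml-tags</link>
    </item>
  </channel>
</rss>
EOF

# string() yields only the text content, so command substitution
# captures a clean value without any markup
item_title=$(xmllint --xpath 'string(/rss/channel/item/title)' "$feed")
item_link=$(xmllint --xpath 'string(/rss/channel/item/link)' "$feed")

printf '%s -> %s\n' "$item_title" "$item_link"
rm -f "$feed"
```

Command substitution strips the trailing newline that xmllint prints, so the variables hold just the text.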
Moreover, if we omit the parent elements, we get an empty result:
$ xmllint --xpath 'string(/item/title)' rss.xml
The XPath expression needs to reflect the actual structure of the XML document. Therefore, we can't skip levels when specifying an absolute path. Conversely, if the document is deeply nested, we can use the // abbreviation, which matches elements at any depth:
$ xmllint --xpath 'string(//item/title)' rss.xml
Extract Values from XML Tags
In the command, we prefixed the path with // so that item is matched at any depth below the document root, regardless of its ancestors. Furthermore, we can also query elements by attributes:
$ xmllint --xpath '//rss[@version="2.0"]/channel/title' rss.xml
<title>Baeldung on Linux</title>
In the expression, we select the rss element that has its version attribute set to 2.0, and then we provide the rest of the path.
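Predicates can also select by position, which helps when a feed has several items. Since string() applied to several matching nodes returns only the first one, a loop is one way to visit each item. The following is a rough sketch, assuming xmllint is installed. The two-item feed and the variable names are made up for illustration:

```shell
#!/bin/sh
# A hypothetical feed with two items, written to a temporary file
feed=$(mktemp)
cat > "$feed" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>First post</title></item>
    <item><title>Second post</title></item>
  </channel>
</rss>
EOF

# count() returns a number, which xmllint prints as plain text
total=$(xmllint --xpath 'count(/rss/channel/item)' "$feed")

i=1
while [ "$i" -le "$total" ]; do
  # item[N] is a positional predicate: the N-th item child of channel
  title=$(xmllint --xpath "string(/rss/channel/item[$i]/title)" "$feed")
  printf '%s: %s\n' "$i" "$title"
  i=$((i + 1))
done

rm -f "$feed"
```

This invokes xmllint once per item, which is fine for small feeds but slow for large ones.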
In the next section, we look at how to extract attribute values.
We extract attributes from XML elements by prefixing the attribute name with an @:
$ xmllint --xpath '//rss/@version' rss.xml
version="2.0"
In the expression, the final component of the path is the attribute name, which belongs to the top-level rss element. We can also see that it prints the attribute name. Again, we use the string function to omit that:
$ xmllint --xpath 'string(//rss/@version)' rss.xml
2.0
xmlstarlet is a comprehensive XML processor. Like xmllint, it also provides XPath support.
By default, it’s not installed on most Linux distributions. However, it’s available in most package repositories:
# Ubuntu, Debian, and derivatives
sudo apt-get install -y xmlstarlet
# Fedora, Red Hat, and CentOS Stream
sudo dnf install -y xmlstarlet
# OpenSUSE and derivatives
sudo zypper install --non-interactive xmlstarlet
# Arch Linux and derivatives
sudo pacman -S --noconfirm xmlstarlet
Once installed, let’s verify it:
$ xmlstarlet --version
1.6.1
xmlstarlet lets us query elements with an XPath:
$ xmlstarlet sel -t -v "//channel/title" rss.xml
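Baeldung on Linux
Let’s break this down: sel selects data from the document using XPath, -t starts an output template, and -v prints the value of the given XPath expression.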
Conversely, we print out the entire element with -c:
$ xmlstarlet sel -t -c "//channel/title" rss.xml
<title>Baeldung on Linux</title>
In the same way, we extract attributes by prefixing the attribute with an @:
$ xmlstarlet sel -t -v "//rss/@version" rss.xml
2.0
Similarly, we can also print all attributes of an element:
$ xmlstarlet sel -t -m "//rss/@*" -v "concat(name(), '=', .)" -n rss.xml
version=2.0
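Let’s dig into this: -m matches every node selected by //rss/@* and applies the rest of the template to each match, -v prints the result of concat(name(), '=', .), which joins the attribute name, an equals sign, and the attribute value, and -n prints a newline after each match.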
Since we have a single attribute for rss, it prints only that. However, if there are more attributes, it prints each of them:
$ echo '<rectangle width="12" height="28"></rectangle>' | \
xmlstarlet sel -t -m "//rectangle/@*" -v "concat(name(), '=', .)" -n
width=12
height=28
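On most Linux distributions, Perl comes pre-installed. Perl is well suited to text processing thanks to its concise syntax and built-in regular expression support. For XML, we’ll use the XML::Twig module, which isn’t part of the core Perl distribution and may need to be installed from CPAN or the distribution’s package repository.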
Let’s extract the title element from our document:
$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
$twig->parsefile("rss.xml");
$_->print for $twig->findnodes("//item/title");'
<title>Extract Values from XML Tags</title>
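Let’s break this down: -MXML::Twig loads the XML::Twig module, XML::Twig->new() creates a parser object, parsefile parses rss.xml, and findnodes returns the elements that match the XPath expression, each of which we print with $_->print.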
Similarly, we can extract values as well:
$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
$twig->parsefile("rss.xml");
print $_->text for $twig->findnodes("//item/title");'
Extract Values from XML Tags
The last statement uses $_->text, which returns the text content of the title element, and passes it to print. In contrast, $_->print prints out the entire element.
In addition, we can also print the attribute values:
$ perl -MXML::Twig -e 'my $twig=XML::Twig->new();
$twig->parsefile("rss.xml");
print $_->att("version") for $twig->findnodes("//rss");'
2.0
Notably, the last statement prints out the attribute value of each rss element. For that, we’re using the $_->att() method. It expects an attribute name, which in this case is "version".
In this article, we discussed how to process an XML document to extract tag values and attributes. For that purpose, we made use of xmllint and xmlstarlet.
In addition, we also looked at Perl, which lets us be more flexible when it comes to text processing.