Last updated: February 25, 2024
Both HTML and XML are markup languages, but they serve different purposes. While HTML focuses on structuring and presenting information on web pages, XML is used to store and transport data between different systems. Sometimes, we need to extract or analyze the text within these documents, which involves removing the tags.
In this tutorial, we’ll discuss removing tags from HTML/XML documents. To achieve this, we’ll use sed, awk, Perl, and Python on the command line.
sed is a command-line tool that performs text processing and pattern matching on an input stream. This makes it a natural fit for removing tags from an HTML or XML document.
Now, let’s use sed to remove the tags:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' index.html
My Blog Website
Hello New User, welcome to My Blog website where you can find anything
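Let’s understand the above command: :a defines a label named a, N appends the next input line to the pattern space, and $!ba jumps back to that label on every line except the last, so the whole file ends up in the pattern space. Then, s/<[^>]*>//g deletes every match of <[^>]*>, that is, a < followed by any number of characters other than > and then a closing >.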
Using the above command, we remove all the tags in the index.html document and print out the file’s content.
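To make the two stages of the sed expression easier to see, here is a minimal Python sketch of the same idea: first gather every line into one buffer, then delete the tags in a single pass. The function name and the sample input are hypothetical and only serve as an illustration; this is not how sed is implemented internally.

```python
import re

# Same pattern as in the sed expression: "<", any run of non-">" characters, ">".
TAG = re.compile(r"<[^>]*>")


def strip_tags_like_sed(lines):
    """Join all lines into one buffer, then delete every tag in a single pass."""
    pattern_space = ""
    for line in lines:
        # Counterpart of N in the :a;N;$!ba loop: append the next line,
        # keeping the newline that separates it from the previous one.
        pattern_space += line
    # Counterpart of s/<[^>]*>//g applied to the accumulated buffer.
    return TAG.sub("", pattern_space)


# Hypothetical sample input; the <p ...> tag is split across two lines on purpose.
sample_lines = [
    "<h1>My Blog Website</h1>\n",
    "<p\n",
    '  class="intro">Hello New User</p>\n',
]
print(strip_tags_like_sed(sample_lines), end="")
```

Because the substitution runs on the whole buffer, the tag that is split across two lines is removed as well.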
To remove tags from an XML document, we use the same command and only change the input file:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml
Gambardella, Matthew
XML Developer's Guide
Computer
Additionally, we can redirect the output to another file:
$ sed -e ':a;N;$!ba;s/<[^>]*>//g' names.xml > removed_xml_tags.txt
Above, after removing the tags in the names.xml file, we redirect the output of sed to a new file named removed_xml_tags.txt.
awk is a command-line tool that allows us to search and manipulate data in text files. We can also use it to remove tags from HTML or XML documents.
To demonstrate, we’ll begin by removing tags in an XML document using awk:
$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' names.xml > removed_xml_tags.txt
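Let’s understand this command: in the BEGIN block, RS="<[^>]+>" sets the record separator to a regular expression that matches a tag, so awk treats the text between tags as records. A regular expression as the record separator is an extension found in GNU awk and mawk rather than something POSIX guarantees. For each record, gsub(/[\t\n ]+/, " ") collapses runs of tabs, newlines, and spaces into a single space, and print writes the record out.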
The above command removes all the tags in the names.xml file and redirects the output to a file named removed_xml_tags.txt.
Likewise, to remove tags from an HTML document, we use the same command:
$ awk 'BEGIN {RS="<[^>]+>"} {gsub(/[\t\n ]+/, " "); print}' index.html > removed_html_tags.txt
Here, we remove tags from index.html and then redirect the output to a file named removed_html_tags.txt.
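The record-splitting idea behind the awk command can be sketched in Python as follows. The function name and the sample document are hypothetical, and the sketch makes one deliberate change that the awk command doesn't: it skips chunks that contain only whitespace, so the result has no blank entries.

```python
import re
from typing import List

TAG = re.compile(r"<[^>]+>")           # plays the role of RS in the awk command
WHITESPACE = re.compile(r"[\t\n ]+")   # same character class as in gsub()


def text_records(markup: str) -> List[str]:
    """Split the markup on tags and normalize whitespace in each chunk."""
    records = []
    for chunk in TAG.split(markup):
        collapsed = WHITESPACE.sub(" ", chunk)
        # Assumption of this sketch (not done by the awk command):
        # drop chunks that are empty or whitespace-only.
        if collapsed.strip():
            records.append(collapsed)
    return records


# Hypothetical sample input modeled on a small book catalog.
sample = (
    "<book>\n"
    "  <author>Gambardella, Matthew</author>\n"
    "  <title>XML Developer's Guide</title>\n"
    "</book>\n"
)
for record in text_records(sample):
    print(record)
```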
Perl is a programming language we can use to manipulate and process text. We can use it for complex string manipulation using regular expressions.
Now, let’s use Perl to remove tags:
$ perl -pe 's/<[^>]*>//g' names.xml
Gambardella, Matthew
XML Developer's Guide
Computer
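Let’s understand the command: -p makes Perl loop over the input line by line and print each line after running our code, -e supplies that code on the command line, and s/<[^>]*>//g deletes every tag found on the current line. Since the substitution sees one line at a time, a tag that spans several lines is left in place.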
Here, we remove all the tags in the names.xml file and print the text content.
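Since the -p switch applies the substitution to one line at a time, it may help to see how that differs from processing the whole file at once. The following Python sketch compares the two approaches on a hypothetical input in which a tag is split across two lines; both function names are made up for the illustration.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_per_line(markup: str) -> str:
    """Apply the substitution to each line separately, like perl -pe does."""
    return "".join(TAG.sub("", line) for line in markup.splitlines(keepends=True))


def strip_whole_document(markup: str) -> str:
    """Apply the substitution to the entire document in one pass."""
    return TAG.sub("", markup)


# Hypothetical sample input; the opening tag spans two lines.
sample = '<title\n  lang="en">XML Developer\'s Guide</title>\n'

print(repr(strip_per_line(sample)))        # opening tag survives
print(repr(strip_whole_document(sample)))  # all tags removed
```

With line-by-line processing, only the closing tag is removed here, because neither line contains the complete opening tag. Reading the whole file first, for example with Perl's -0777 switch, avoids this.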
Next, let’s remove tags from an HTML document. We’ll make use of the HTML::Strip Perl module.
First, we need to install it. On Ubuntu/Debian distributions we use apt:
$ sudo apt install libhtml-strip-perl
On Arch Linux, we use pacman:
$ sudo pacman -S perl-html-strip
Finally, on Fedora, we use dnf:
$ sudo dnf install perl-HTML-Strip
Now, let’s remove the tags:
$ perl -MHTML::Strip -0777 -pe '$_ = HTML::Strip->new()->parse($_)' index.html > removed_html_tags.txt
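Let’s break down the above command: -MHTML::Strip loads the HTML::Strip module, -0777 makes Perl read the whole file as a single record, and -pe runs our code on that record and prints the result. The code creates an HTML::Strip object, passes the document in $_ to its parse method, and assigns the returned tag-free text back to $_.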
Using the above command, we successfully remove all the tags in the index.html file. We then redirect the output to a new file named removed_html_tags.txt.
Python is a general-purpose programming language that’s well suited to parsing and processing text. To illustrate, we’ll use it to remove tags from HTML and XML documents.
First, we’ll remove tags from an HTML document using the Beautiful Soup library, which is distributed as the beautifulsoup4 package.
Let’s install it using pip:
$ pip3 install beautifulsoup4
Once installed, let’s go ahead and remove the tags:
$ python3 -c "from bs4 import BeautifulSoup; print(BeautifulSoup(open('index.html', 'r').read(), 'html.parser').get_text())"
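Let’s break down this command: -c runs the quoted string as a Python program. The program imports BeautifulSoup from the bs4 package, reads index.html, parses the content with Python’s built-in html.parser, and calls get_text() to return the text of the document without the tags.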
Using the above command, we remove tags in the index.html file and print the output to the terminal.
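If installing a third-party package isn't an option, a similar result can be approximated with the standard library alone. The sketch below swaps Beautiful Soup for the html.parser.HTMLParser class; the TextExtractor class and the html_to_text function are hypothetical helpers written for this illustration, and their output may differ from get_text() on malformed or unusual HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore the tags (hypothetical helper, not part of bs4)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by HTMLParser for the text found between tags.
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    """Return the text content of the markup with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample input.
sample = "<html><body><h1>My Blog Website</h1><p>Hello New User</p></body></html>"
print(html_to_text(sample))
```

To process a file, we could pass the result of reading index.html to html_to_text() instead of the sample string.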
Next, let’s remove tags from an XML document:
$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))"
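Let’s understand the above command: we import the standard library’s xml.etree.ElementTree module as ET, and ET.fromstring() parses the content of names.xml and returns the root element. The itertext() method then iterates over the text inside the root element and all its descendants in document order, and ''.join() concatenates those pieces into one string.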
Here, we remove tags from the names.xml file and print the output to the terminal.
Additionally, we can save the output in another file:
$ python3 -c "import xml.etree.ElementTree as ET; print(''.join(ET.fromstring(open('names.xml', 'r').read()).itertext()))" > removed_xml_tags.txt
Here, we use > to redirect the output to a file named removed_xml_tags.txt instead of printing it on the terminal.
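For anything longer than a one-liner, the same ElementTree approach is easier to read as a small script. The following is a minimal sketch that uses a hypothetical function name and sample document; as an extra step that the one-liner doesn't perform, it also drops the blank lines left behind by the indentation between elements.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_source: str) -> str:
    """Return the text of an XML document, one non-empty line per text node line."""
    root = ET.fromstring(xml_source)
    # itertext() yields the text of the root and all descendants in document order.
    text = "".join(root.itertext())
    # Extra step in this sketch: trim each line and skip the empty ones.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)


# Hypothetical sample input modeled on a small book catalog.
sample = (
    "<catalog>\n"
    "  <book>\n"
    "    <author>Gambardella, Matthew</author>\n"
    "    <title>XML Developer's Guide</title>\n"
    "    <genre>Computer</genre>\n"
    "  </book>\n"
    "</catalog>"
)
print(xml_to_text(sample))
```

Unlike the regex-based commands, ET.fromstring() raises an error for XML that isn't well-formed, so malformed input is reported rather than silently processed.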
In this article, we discussed different methods for removing tags from HTML and XML documents in Linux. To summarize, sed and awk are suitable for simple tag removal, while Perl and Python, with their parsing libraries, are better suited to more complex documents. We can choose among these methods according to our needs and preference.