1. Overview

In this tutorial, we’ll discuss several text-based web browsers and tools used for converting HTML to plain text. These tools are particularly useful for those who prefer a minimalist approach to web browsing or who need to convert web pages into a format that is easier to read or manipulate.

That said, the command line isn’t suited to complex web features such as dynamically generated content or rich input controls like select menus, date pickers, and color choosers.

Therefore, for our example, we’ll render a basic HTML page with the following source code:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>Baeldung</title>
  </head>
  <body>
    <div class="header">
      <h1>Baeldung</h1>
      <nav>
        <ul>
          <li><a href="#about">About</a></li>
          <li><a href="#tutorials">Tutorials</a></li>
          <li><a href="#contact">Contact</a></li>
        </ul>
      </nav>
    </div>
    <div class="main">
      <div id="about">
        <h2>About Baeldung</h2>
        <p>Baeldung is a website that offers a wide range of articles and tutorials on various Java-related topics.</p>
      </div>
      <div id="tutorials">
        <h2>Tutorials</h2>
        <p>Baeldung offers tutorials on topics such as Spring Framework, Hibernate, Linux, and many more.</p>
      </div>
      <div id="contact">
        <h2>Contact</h2>
        <p>You can contact Baeldung through their website or by email at [email protected]</p>
      </div>
    </div>
    <div>
      <p>&copy; 2023 Baeldung</p>
    </div>
  </body>
</html>

2. lynx

lynx is a versatile text-based web browser that allows users to browse the internet and access websites without the need for a graphical user interface.

By default, it doesn’t ship with most Linux distributions. However, it’s available on most official repositories. We can install it with a package manager like apt under the canonical name lynx:

$ sudo apt install lynx -y

Once installed, let’s verify it:

$ lynx --version
Lynx Version 2.8.9rel.1 (08 Jul 2018)
libwww-FM 2.14, SSL-MM 1.4.1, OpenSSL 3.1.0, ncurses 5.7.20081102

2.1. Rendering a Local HTML File

By default, running lynx opens the interactive browser, which we can use to navigate the web. However, its -dump option renders a given HTML page to standard output and then exits.

Let’s see its general syntax:

$ lynx [OPTIONS] -dump <FILE|URL>

-dump renders the HTML page in the command line:

$ lynx -dump index.html

Baeldung

     * [1]About
     * [2]Tutorials
     * [3]Contact

About Baeldung
       Baeldung is a website that offers a wide range of articles and
       tutorials on various Java-related topics.

Tutorials
       Baeldung offers tutorials on topics such as Spring Framework,
       Hibernate, Linux, and many more.

Contact
       You can contact Baeldung through their website or by email at
       [email protected]
       ? 2023 Baeldung

References

   1. file:///Users/himhaidar/Documents/index.html#about
   2. file:///Users/himhaidar/Documents/index.html#tutorials
   3. file:///Users/himhaidar/Documents/index.html#contact

It isn’t that pretty, but it gets the job done quickly.
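
If the trailing References list gets in the way, lynx can leave it out of the dump. The following is a minimal sketch, assuming a UTF-8 terminal; the dump_plain name is our own helper and not part of lynx, while -nolist, -width, and -display_charset are regular lynx options:

```shell
# dump_plain is a hypothetical helper name, not something lynx provides.
# Usage: dump_plain <FILE|URL> [WIDTH]
dump_plain() {
  # -nolist               disable the link list feature in dumps
  # -width=N              wrap the dumped text at N columns (80 if omitted)
  # -display_charset=...  ask for UTF-8 output, which may help characters
  #                       such as the copyright sign survive
  lynx -dump -nolist -width="${2:-80}" -display_charset=UTF-8 "$1"
}
```

For instance, we could render our page at 60 columns:

$ dump_plain index.html 60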

2.2. Fetching and Rendering a Page From the Web

In addition to that, we can also fetch and render a page from the web using curl:

$ curl -Ls "https://en.wikipedia.org/wiki/Cryptosystem" | lynx -dump -stdin
...
   In [64]cryptography, a cryptosystem is a suite of [65]cryptographic
   algorithms needed to implement a particular security service, such as
   confidentiality ([66]encryption).^[67][1]

   Typically, a cryptosystem consists of three algorithms: one for [68]key
...

The -stdin flag lets lynx read the page from standard input instead of a file. However, we can also pass the URL directly and dump the page:

$ lynx -dump "https://en.wikipedia.org/wiki/Cryptosystem"
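
When only the hyperlinks on a page are of interest, the -listonly option limits the dump to the link list. Here is a brief sketch; page_links is a hypothetical helper name chosen for illustration:

```shell
# page_links is a hypothetical helper name, not part of lynx.
# Usage: page_links <FILE|URL>
page_links() {
  # -listonly: together with -dump, print only the list of links on the page
  lynx -dump -listonly "$1"
}
```

We could then look at the first few links of the page:

$ page_links "https://en.wikipedia.org/wiki/Cryptosystem" | head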

3. w3m

w3m is a text-based UI (TUI) web browser that lets us render and view web pages efficiently. We can also integrate it with text editors like Vim and Emacs for quick browsing.

Like lynx, it’s not installed on most distributions. So, we’ll need to install it using its canonical name, w3m:

$ sudo apt install w3m -y

Once installed, we can verify it:

$ w3m -version
w3m version w3m/0.5.3+git20200502, options lang=en,m17n,color,ansi-color,mouse,menu,cookie,ssl,ssl-verify,external-uri-loader,w3mmailer,nntp,ipv6,alarm,mark

3.1. Rendering a Local HTML File

Here’s the usage syntax for w3m:

$ w3m [OPTIONS] <FILE|URL>

Similar to lynx, w3m also has a -dump option that lets us render an HTML file:

$ w3m -dump index.html
Baeldung

  • About
  • Tutorials
  • Contact
   
    About Baeldung

    Baeldung is a website that offers a wide range of articles and tutorials on
    various Java-related topics.

    Tutorials

    Baeldung offers tutorials on topics such as Spring Framework, Hibernate,
    Linux, and many more.

    Contact

    You can contact Baeldung through their website or by email at
    [email protected]

    © 2023 Baeldung

As we can see, there’s less clutter than in the lynx result.

3.2. Rendering a Page From the Web

We can render a page from the web by giving the URL as an input instead of a local file:

$ w3m -dump https://en.wikipedia.org/wiki/Cryptosystem
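
w3m can also read the page from standard input, which is handy when another tool such as curl does the fetching. The sketch below assumes the hypothetical helper name render_stdin; -T sets the content type of the input and -cols sets the width of the dump:

```shell
# render_stdin is a hypothetical helper name, not part of w3m.
# Usage: some-command | render_stdin [WIDTH]
render_stdin() {
  # -T text/html  treat standard input as HTML rather than plain text
  # -cols N       wrap the dumped text at N columns (80 if omitted)
  w3m -dump -T text/html -cols "${1:-80}"
}
```

For example, we could fetch the page with curl and render it at 72 columns:

$ curl -Ls "https://en.wikipedia.org/wiki/Cryptosystem" | render_stdin 72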

4. html2text

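html2text is a command-line utility that converts an HTML document to plain text. We can use it to render local HTML files on the system.

Like the other tools, html2text isn’t installed on most Linux distributions. Two unrelated programs share this name: a standalone utility packaged by most distributions, and a Python package on PyPI that emits Markdown. The version string and output shown in this section match the former, so we’ll install it with apt:

$ sudo apt install html2text -y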

Once installed, let’s verify it:

$ html2text -version
This is html2text, version 2.1.1

4.1. Converting HTML Documents to Plain Text

Like the other tools, html2text has a straightforward syntax:

$ html2text [OPTIONS] [FILE]

Notably, it can only render a file on the disk and cannot fetch a page from the web on its own. So, let’s go ahead and feed it our HTML file:

$ html2text index.html
****** Baeldung ******
    * About
    * Tutorials
    * Contact
***** About Baeldung *****
Baeldung is a website that offers a wide range of articles and tutorials on
various Java-related topics.
***** Tutorials *****
Baeldung offers tutorials on topics such as Spring Framework, Hibernate, Linux,
and many more.
***** Contact *****
You can contact Baeldung through their website or by email at
[email protected]
© 2023 Baeldung

4.2. Fetching a Page From the Web

We can use curl to fetch and input a page to html2text:

$ html2text <<< $(curl -Ls "https://en.wikipedia.org/wiki/Cryptosystem")

In the snippet, instead of passing a file name, we used a here string, which feeds the output of curl to html2text on its standard input.
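
Since html2text reads standard input when no file is given, an ordinary pipe achieves the same result. Here is a minimal sketch; fetch_as_text is a hypothetical helper name, and the extra -f flag for curl is our own addition:

```shell
# fetch_as_text is a hypothetical helper name chosen for illustration.
# Usage: fetch_as_text <URL>
fetch_as_text() {
  # -f  fail on HTTP errors instead of passing an error page along
  # -L  follow redirects
  # -s  hide the progress meter
  curl -fLs "$1" | html2text
}
```

We can then call it with the same URL as before:

$ fetch_as_text "https://en.wikipedia.org/wiki/Cryptosystem"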

5. Conclusion

In this article, we discussed how to render and view HTML pages on the command line. For that purpose, we used the lynx, w3m, and html2text tools.

While there are other tools available on the web, these are readily available for installation on most package repositories.
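
Because all three tools can read HTML from standard input, a script can fall back to whichever one is installed. The following is a sketch under that assumption; html_to_text is a hypothetical name, and the order of preference is arbitrary:

```shell
# html_to_text is a hypothetical helper name chosen for illustration.
# Usage: html_to_text < page.html
html_to_text() {
  # Use the first converter found; each one reads HTML on standard input.
  if command -v w3m >/dev/null 2>&1; then
    w3m -dump -T text/html
  elif command -v lynx >/dev/null 2>&1; then
    lynx -dump -stdin
  elif command -v html2text >/dev/null 2>&1; then
    html2text
  else
    echo "html_to_text: no converter found (tried w3m, lynx, html2text)" >&2
    return 1
  fi
}
```

With it, we can convert our sample page regardless of which tool is present:

$ html_to_text < index.html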
