Linux, like all Unix-based operating systems, comes with programmable shells. Using these shells, we can create scripts and small programs to automate our daily tasks.
In this tutorial, we’ll see different ways to get the contents of a webpage into a shell variable.
curl is a tool for transferring data to and from URLs. By default, curl writes the downloaded page to standard output without any extra information, which makes it well suited to scripts. It doesn’t require any parameters to download a webpage. However, some sites respond with an HTTP 3xx status code and a Location header, indicating that the client should redirect to a new address. To make curl follow these redirects, we need to add the -L parameter:
$ CONTENT=$(curl -L baeldung.com)
$ echo $CONTENT
<!doctype html> <html lang="en-US" class="no-js"> <head><meta charset="utf-8"> ...
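In a script, we usually also want to know whether the download succeeded, and we want to keep the page's line breaks when printing it. The following is a minimal sketch under those assumptions; fetch_page is a hypothetical helper name, and the SAMPLE string is a stand-in for a real page so the quoting difference can be seen without network access:

```shell
#!/bin/sh
# fetch_page is a hypothetical helper name used only for this sketch.
# -s hides the progress meter, -S still reports errors on stderr,
# -f returns a non-zero exit status on HTTP errors (400 and above),
# -L follows redirects.
fetch_page() {
    curl -sSfL "$1"
}

# Quoting matters when printing the variable. A literal string stands in
# for a downloaded page here so the difference is visible offline.
SAMPLE='<html>
  <body>hi</body>
</html>'
echo $SAMPLE      # unquoted: the shell collapses the text onto one line
echo "$SAMPLE"    # quoted: line breaks and indentation are kept
```

With network access, a call could look like `if CONTENT=$(fetch_page baeldung.com); then printf '%s\n' "$CONTENT"; fi`, so the script only uses the variable when curl reports success.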
Using curl has the following advantages:
- Can do parallel downloads
- Reads the proxy configuration from the http_proxy environment variable, so downloads can go through a proxy without extra parameters
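To illustrate the proxy point, here is a small sketch that sets the proxy variables for a single download only. The helper name fetch_via_proxy and the proxy address in the usage line are assumptions made for illustration. We also set https_proxy, since curl consults that variable for HTTPS URLs:

```shell
#!/bin/sh
# fetch_via_proxy is a hypothetical helper; the proxy address passed to it
# is whatever proxy is available in your environment.
# The function body is a subshell, so the exported variables do not leak
# into the calling script.
fetch_via_proxy() (
    http_proxy=$1
    https_proxy=$1
    export http_proxy https_proxy
    curl -sSfL "$2"
)
```

A call might look like `CONTENT=$(fetch_via_proxy "http://proxy.example:3128" baeldung.com)`, where proxy.example:3128 is a placeholder address.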
GNU wget is a free software package for retrieving files using HTTP, HTTPS, FTP, and FTPS protocols. It is a non-interactive tool, so it’s also a good fit for script usage.
By default, wget prints information about the download process and saves the result to a file:
$ wget baeldung.com
--2020-09-13 14:50:19-- http://baeldung.com/
Resolving baeldung.com (baeldung.com)... 22.214.171.124, 126.96.36.199, 188.8.131.52, ...
Connecting to baeldung.com (baeldung.com)|184.108.40.206|:80... connected
...
...
Saving to: ‘index.html’
To modify this behavior we can use:
- The -q (--quiet) parameter hides the download status output
- The -O parameter sets the file that wget writes to; “-” means stdout
By using these parameters, we can save output directly to a variable:
$ CONTENT=$(wget baeldung.com -q -O -)
$ echo $CONTENT
<!doctype html> <html lang="en-US" class="no-js"> <head><meta charset="utf-8"> ...
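For script usage, it can help to bound how long wget may wait and to check its exit status before using the variable. This is only a sketch: fetch_page_wget is a hypothetical helper name, and the timeout of 10 seconds and the limit of 2 tries are arbitrary values chosen for illustration:

```shell
#!/bin/sh
# fetch_page_wget is a hypothetical helper name for this sketch.
# -q silences the progress output, -O - writes the page to stdout,
# --timeout and --tries limit how long a script can hang on a bad host
# (10 seconds and 2 attempts are arbitrary example values).
fetch_page_wget() {
    wget -q -O - --timeout=10 --tries=2 "$1"
}
```

We could then write `if CONTENT=$(fetch_page_wget baeldung.com); then printf '%s\n' "$CONTENT"; else echo "download failed" >&2; fi`.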
As wget is quite similar to curl, let’s briefly compare them.
There are some key advantages of using wget instead of curl:
- It has recursive download capability
- It’s a more mature project
- It follows redirects by default, whereas curl needs the -L parameter
But curl has some advantages over wget:
- curl additionally supports Gopher, SCP, SFTP, TFTP, TELNET, and many other protocols
- It has more SSL options
- It’s slightly faster, which could be important when downloading large pages.
- It has SOCKS support
A more detailed comparison can be found at Linux Commands Comparison: curl vs wget.
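Since either tool may be missing on a given machine, a script can pick whichever one is installed. The wrapper below is a sketch of that idea, not a recommendation of one tool over the other; the function name download is hypothetical:

```shell
#!/bin/sh
# download is a hypothetical wrapper: it prefers curl and falls back to wget.
download() {
    if command -v curl >/dev/null 2>&1; then
        # -L is needed because curl does not follow redirects by default
        curl -sSfL "$1"
    elif command -v wget >/dev/null 2>&1; then
        # wget follows redirects on its own; -O - sends the page to stdout
        wget -q -O - "$1"
    else
        echo "neither curl nor wget is installed" >&2
        return 127
    fi
}
```

Usage stays the same regardless of the tool found: `CONTENT=$(download baeldung.com)`.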
curl and wget simply download the content from the target. Unlike them, lynx is a text-based full web browser. This means lynx works interactively by default, to allow users to surf the web. But with proper parameters, we can disable this interactive behavior and use it in our scripts.
In the earlier examples, curl and wget just download the source files belonging to the given website; they cannot parse the page’s source and produce a rendered page like the one we see in our browsers.
Lynx, being a web browser, can parse these files and reproduce most of the website as we see it in our browsers. But we must not forget that lynx is still a text-based browser, not a graphical browser like Firefox or Chrome, so it has many limitations. This is especially important when dealing with pages that rely heavily on images or scripting.
Let’s now see the first part of the Baeldung homepage as downloaded by curl:
$ CONTENT=$(curl -L baeldung.com | head)
$ echo $CONTENT
<!doctype html> <html lang="en-US" class="no-js"> <head><meta charset="utf-8"> ... ...
And the same part as downloaded and parsed by lynx:
$ CONTENT=$(lynx -dump baeldung.com | head)
$ echo $CONTENT
#alternate alternate alternate [tr?id=512471148948613&ev=PageView&noscript=1] The Baeldung logo * * [logo.svg] * Start Here * Courses ▼▲
As we can see, curl simply downloads the page source that the website produces, whereas lynx renders that source before we save it into a variable.
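The two behaviours can be wrapped in small helpers. This is a sketch with hypothetical function names (page_text and page_source); the options themselves are standard lynx command-line options:

```shell
#!/bin/sh
# page_text and page_source are hypothetical helper names.
# -dump renders the page as plain text and exits instead of opening the
# interactive interface; -nolist leaves out the list of links that lynx
# appends to a dump.
page_text() {
    lynx -dump -nolist "$1"
}

# -source prints the unrendered page source, comparable to what curl returns.
page_source() {
    lynx -source "$1"
}
```

For example, `CONTENT=$(page_text baeldung.com)` stores the rendered text, and `printf '%s\n' "$CONTENT"` prints it with its line breaks intact.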
In this tutorial, we’ve gone over how to get the contents of a webpage into a shell variable using three different tools: curl, wget, and lynx.