1. Introduction

Sometimes, we want to download a specific directory of files from a web server. At other times, we may want to crawl a website so that the directories we need are available locally.

In this article, we’ll get our hands dirty with wget and learn how to download directories and subdirectories from the web.

2. Mirroring the Whole Website

First, we’re going to look at how to download a whole website. wget gives us the ability to mirror everything with the --mirror (-m) option:

$ wget -m https://www.baeldung.com/
--2022-03-11 14:02:45--  https://www.baeldung.com/
Resolving www.baeldung.com (www.baeldung.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:2b08, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.baeldung.com/index.html’

www.baeldung.com/in     [ <=>                ] 137,01K  --.-KB/s    in 0,1s    

2022-03-11 14:02:45 (1,04 MB/s) - ‘www.baeldung.com/index.html’ saved [140303]

Loading robots.txt; please ignore errors.
--2022-03-11 14:02:45--  https://www.baeldung.com/robots.txt
Reusing existing connection to www.baeldung.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘www.baeldung.com/robots.txt’

www.baeldung.com/ro     [ <=>                ]      72  --.-KB/s    in 0s      

2022-03-11 14:02:45 (5,23 MB/s) - ‘www.baeldung.com/robots.txt’ saved [72]
...

Please note that this operation will take some time and disk space since it downloads the entire website.
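
According to the wget manual, -m is shorthand for a set of recursive options, so the command above behaves roughly like the following equivalent invocation (shown here only for illustration):

$ wget -r -N -l inf --no-remove-listing https://www.baeldung.com/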

3. Downloading Desired Directories Recursively

Mirroring the whole website as above is often not helpful because of its inflexibility. Usually, we want to fetch only specific directories. Fortunately, wget enables us to do that as well: we switch on recursive download with the --recursive (-r) option to retrieve the desired subdirectories. In the following subsections, we’ll combine this option with other wget options to achieve the results we want.
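
As a quick illustration (a minimal sketch using one of the category pages as an example URL), a plain recursive download recreates the full host-prefixed path locally:

$ wget -r -np https://www.baeldung.com/linux/category/web/

Here, the downloaded files end up under ./www.baeldung.com/linux/category/web/ on the local disk.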

3.1. wget With --no-host-directories and --cut-dirs Options

The first way to achieve our goal with wget is by combining the options --no-host-directories (-nH) and --cut-dirs. The -nH option disables the creation of the directory prefixed with the hostname. The second option, --cut-dirs, specifies how many leading directory components to ignore. With these two options, we can control the directory structure created during recursive retrieval.

For example, if we use only the -r option to download the subdirectories of www.baeldung.com/linux/category/web, we end up with four nested directories (www.baeldung.com/linux/category/web). When we add -nH, the local path becomes linux/category/web. Moreover, by setting --cut-dirs, we can trim this path further: with a value of 1, we get category/web; with a value of 2, we get just web/, and so on. Let’s see a complete command:

$ wget -r -np -nH --cut-dirs=1 https://www.baeldung.com/linux
--2022-03-11 15:26:49--  https://www.baeldung.com/linux
Resolving www.baeldung.com (www.baeldung.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:2b08, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.baeldung.com/linux/ [following]
--2022-03-11 15:26:49--  https://www.baeldung.com/linux/
Reusing existing connection to www.baeldung.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘linux’

linux                        [ <=>                            ] 109,52K  --.-KB/s    in 0,1s    

2022-03-11 15:26:49 (877 KB/s) - ‘linux’ saved [112148]

Loading robots.txt; please ignore errors.
--2022-03-11 15:26:49--  https://www.baeldung.com/robots.txt
Reusing existing connection to www.baeldung.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘robots.txt’

robots.txt                   [ <=>                            ]      72  --.-KB/s    in 0s      

2022-03-11 15:26:49 (5,48 MB/s) - ‘robots.txt’ saved [72]
...

Let’s make sure to keep the --no-parent (-np) option if downloading the parent directory is not desired.
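
For instance, to end up with just a local web/ directory (a sketch reusing the category URL from above), we can cut the two leading path components:

$ wget -r -np -nH --cut-dirs=2 https://www.baeldung.com/linux/category/web/

This way, the files are saved under ./web/ instead of ./linux/category/web/.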

3.2. wget With --level Option

The second way to reach our goal with wget is the --level (-l) option. This option restricts the depth of subdirectories that wget recurses into. For example, if we download the subdirectories of www.baeldung.com/linux with a level value of 1, wget retrieves only the first-level subdirectories under linux/, such as linux/category.

If we increase this level value to 2, then wget goes into other subdirectories under linux/category as well. Let’s look at an example:

$ wget -np -r -l 2 https://www.baeldung.com/linux/
--2022-03-11 16:17:38--  https://www.baeldung.com/linux/
Resolving www.baeldung.com (www.baeldung.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:28f8, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.baeldung.com/linux/index.html’

www.baeldung.com/linux/i     [ <=>                            ] 109,52K  --.-KB/s    in 0,1s    

2022-03-11 16:17:39 (903 KB/s) - ‘www.baeldung.com/linux/index.html’ saved [112148]
...

Note that we use the -np and -r options here again.

The default value of --level is 5, so if we don’t specify a value for this option, wget recurses up to five levels deep. Also, a value of 0 is equivalent to infinite depth.
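
For example, to remove the depth limit entirely (a sketch that can download a large amount of data), we can pass 0 or inf:

$ wget -r -np -l 0 https://www.baeldung.com/linux/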

4. Additional Features

wget is quite a powerful tool, and it provides many more features for us to use. In this section, we’re going to look at some of the other options we need most of the time.

4.1. Changing the User Agent

Sometimes, we need to define the user agent manually to troubleshoot problems, because some servers reject requests that identify themselves with wget’s default user agent. For example, wget might raise an error like this:

$ wget -r https://www.baeldung.com/linux/
--2022-03-11 16:39:18--  https://www.baeldung.com/linux/
Resolving www.baeldung.com (www.baeldung.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:2b08, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2022-03-11 16:39:19 ERROR 403: Forbidden.

We can often resolve these types of issues by changing the user agent with the --user-agent (-U) option. Let’s give it a try:

$ wget -r --user-agent="Mozilla" https://www.baeldung.com/linux/
--2022-03-11 16:45:17--  https://www.baeldung.com/linux/
Resolving www.baeldung.com (www.baeldung.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:28f8, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
...

4.2. Converting the Links

If we want to make the links suitable for local viewing, we can utilize the --convert-links option. This option rewrites the links in the downloaded files after the download finishes:

$ wget -r --no-parent --convert-links https://www.baeldung.com/linux/category/web
--2022-03-11 17:31:46--  https://www.baeldung.com/linux/category/web
Resolving www.baeldung.com (www.baeldung.com)... 172.66.43.8, 172.66.40.248, 2606:4700:3108::ac42:28f8, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.43.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
...
...
FINISHED --2022-03-11 17:01:46--
Total wall clock time: 3m 0s
Downloaded: 11 files, 452K in 0,7s (630 KB/s)
Converting links in www.baeldung.com/linux/category/networking... 24-29
Converting links in www.baeldung.com/linux/category/scripting... 24-46
Converting links in www.baeldung.com/linux/category/security... 24-22
Converting links in www.baeldung.com/linux/category/processes... 24-45
Converting links in www.baeldung.com/linux/category/files... 24-60
Converting links in www.baeldung.com/linux/category/administration... 24-43
Converting links in www.baeldung.com/linux/category/search... 24-21
Converting links in www.baeldung.com/linux/category/web... 24-18
Converting links in www.baeldung.com/linux/category/filesystems... 24-29
Converting links in www.baeldung.com/linux/category/installation... 24-17
Converted links in 10 files in 0,01 seconds.

4.3. Switching Off Robot Exclusion

wget follows the Robot Exclusion Standard, written by Martijn Koster et al. in 1994. According to this standard, a text file tells robots which directory paths to avoid when downloading. wget first requests this text file, robots.txt, and complies with the directives given by the web server’s administrators. This sometimes prevents us from retrieving the directories we want. Therefore, we can switch robot exclusion off:

$ wget -r --level=1 --no-parent --convert-links -e robots=off -U "Mozilla" https://www.baeldung.com/linux/
--2022-03-11 17:48:36--  https://www.baeldung.com/linux/
Resolving www.baeldung.com (www.baeldung.com)... 172.66.40.248, 172.66.43.8, 2606:4700:3108::ac42:28f8, ...
Connecting to www.baeldung.com (www.baeldung.com)|172.66.40.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
...
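
Alternatively, if we prefer a persistent setting (a sketch assuming a per-user configuration file is acceptable), we can place the equivalent directive in ~/.wgetrc instead of passing -e on every call:

# ~/.wgetrc
robots = off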

5. Conclusion

In this tutorial, we learned the essentials of downloading directories and subdirectories recursively from the web. We saw how to steer wget so that it fetches only the directories we need. Above all, wget is quite a powerful tool, so to learn more about its abilities, we encourage you to read the man page and try the commands we walked through in this article on your own.
