1. Introduction

The standard wget tool enables remote directory mirroring. By default, the local copy reproduces every directory in the provided path, regardless of depth, even when a directory contains nothing but a single subdirectory.

In this tutorial, we explain mirroring and how to skip creating a long path of unneeded directories when mirroring with wget. First, we discuss how cloning with wget usually works. After that, we explore switches for controlling directory creation while mirroring.

For brevity and clarity, all examples run against an HTTP web server (Apache) with directory listing and browsing enabled, and their output omits any index.html files.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments.

2. wget Mirroring

Normally, wget acts on one Uniform Resource Locator (URL):

$ wget https://gerganov.com/
[...]
HTTP request sent, awaiting response... 200 OK
Length: 1024 (1.0K) [text/html]
Saving to: ‘index.html’

index.html       100%[=========>]   1.00K  --.-KB/s    in 0s
[...]

The above results in a single file, either the main page or another file at that URL, as we can verify via ls:

$ ls
index.html

However, for cloning, wget supports two main options. One builds on top of the other.

2.1. --recursive (-r) Clone

The simple --recursive or -r flag of wget acts similarly to such flags in other commands like cp.

In essence, this option makes wget process the supplied URL and every linked path related to it. In other words, acting on https://xost/root/dir/ would attempt to download paths with trailing components removed, e.g., https://xost/root, as well as paths with that prefix, e.g., https://xost/root/dir/subdir:

$ wget --recursive https://gerganov.com/
[...]
Saving to: ‘gerganov.com/index.html’
[...]
Saving to: ‘gerganov.com/bg/index.html’
[...]
Saving to: ‘gerganov.com/de/index.html’
[...]
Downloaded: 3 files, 3K in 0s (100.0 MB/s)
$ ls ./gerganov.com/
bg  de  index.html

Evidently, using recursion in this case downloads all translations along with the main page, since they are in subdirectories. As ls verifies, --recursive results in a directory tree with a top-level parent directory of gerganov.com.

We can turn to a more complex option for more extensive and refined cloning.

2.2. --mirror (-m) Clone

The --mirror or -m option combines several flags into one:

  • --recursive (-r)
  • --timestamping (-N)
  • --level=inf (-l inf)
  • --no-remove-listing

Perhaps the most important change is to --level, which is otherwise 5 by default:

$ wget --recursive https://xost/
[...]
$ tree xost/
xost/
└── subdir1
    ├── file1a
    ├── file1b
    └── subdir2
        └── subdir3
            ├── file3
            └── subdir4
                └── subdir5

5 directories, 3 files
$ wget --mirror https://xost/
[...]
$ tree xost/
xost/
└── subdir1
    ├── file1a
    ├── file1b
    └── subdir2
        └── subdir3
            ├── file3
            └── subdir4
                └── subdir5
                    ├── file5
                    └── subdir6
                        └── file6

6 directories, 5 files

Here, we don’t get subdir6 with --recursive, but we do with --mirror, as tree shows.
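To see why the default --level of 5 stops short of subdir6, we can sketch the depth rule in a few lines of shell. This is only a simplified model, not how wget is implemented: it assumes each directory listing links just to its direct children, so the depth of a path equals the number of components below the start URL. The link_depth and within_level helpers are hypothetical names introduced for illustration:

```shell
#!/bin/sh
# Hypothetical helper: count how many path components a target lies
# below the start path (assumes listings link only to direct children).
link_depth() {
    start=${1%/}                 # start path without a trailing slash
    target=${2%/}                # target path without a trailing slash
    rest=${target#"$start"}      # keep only the part below the start path
    rest=${rest#/}
    depth=0
    while [ -n "$rest" ]; do
        depth=$((depth + 1))
        case $rest in
            */*) rest=${rest#*/} ;;   # drop the first component
            *)   rest= ;;             # last component consumed
        esac
    done
    printf '%s\n' "$depth"
}

# Hypothetical helper: succeed if the target is within the given level.
within_level() {
    [ "$(link_depth "$1" "$2")" -le "$3" ]
}

for p in /subdir1/subdir2/subdir3/file3 \
         /subdir1/subdir2/subdir3/subdir4/subdir5/file5; do
    if within_level / "$p" 5; then
        printf 'fetch %s\n' "$p"
    else
        printf 'skip  %s\n' "$p"
    fi
done
```

Under this model, file3 sits at depth 4 and is fetched, while file5 sits at depth 6 and is skipped, which matches the --recursive tree above.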

3. wget Mirror Directory Control

Even if we supply a longer path to wget, we’ll still end up locally with the full branch of paths from the root down to the leaf, along with all the data:

$ wget --mirror https://xost/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
[...]
$ tree xost
xost
└── subdir1
    ├── file1a
    ├── file1b
    └── subdir2
        └── subdir3
            ├── file3
            └── subdir4
                └── subdir5
                    ├── file5
                    └── subdir6
                        └── file6

6 directories, 5 files

Notably, the trailing slash (/) at the end of the URL in the mirroring command is significant. We’ll use the above structure in the examples below.

Regardless of whether we use --recursive or --mirror, wget provides several options for directory control. Let’s explore them.

3.1. Exclude Host Directory With --no-host-directories (-nH)

To begin with, the --no-host-directories or -nH flag prevents wget from creating the top-level root host directory:

$ wget --mirror --no-host-directories https://xost/
[...]
$ tree xost
xost [error opening dir]
[...]
$ tree
.
└── subdir1
    ├── file1a
    ├── file1b
    └── subdir2
        └── subdir3
            ├── file3
            └── subdir4
                └── subdir5
                    ├── file5
                    └── subdir6
                        └── file6

6 directories, 5 files

Here, we don’t get the xost hostname directory from earlier because of --no-host-directories.

3.2. Exclude Parent With --no-parent (-np)

Next up, the --no-parent or -np flag excludes the contents of any upper-level directories that wget would otherwise retrieve:

$ wget --recursive --no-parent https://xost/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
[...]
$ tree xost/
xost/
└── subdir1
    └── subdir2
        └── subdir3
            └── subdir4
                └── subdir5
                    └── subdir6
                        └── file6

6 directories, 1 file

In this case, we don’t get any data above subdir6. However, we still have the whole directory chain. Let’s see how we can control that behavior.
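The filtering idea behind --no-parent can be approximated with a simple prefix check. The sketch below is not wget's actual code. It assumes the start URL names a directory, and below_start is a hypothetical helper introduced only for this illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the candidate path lies at or
# below the start directory (assumes the start path is a directory).
below_start() {
    start_dir=${1%/}/            # normalize to exactly one trailing slash
    case $2 in
        "$start_dir"*) return 0 ;;   # candidate begins with the start directory
        *)             return 1 ;;   # candidate is a parent or a sibling
    esac
}

start=/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
for p in /subdir1/file1a /subdir1/subdir2/subdir3/file3 "${start}file6"; do
    if below_start "$start" "$p"; then
        printf 'keep %s\n' "$p"
    else
        printf 'skip %s\n' "$p"
    fi
done
```

Here, only file6 passes the check, in line with the single file in the tree above.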

3.3. Omitting Directories With --cut-dirs

When mirroring with wget, we can use --cut-dirs to skip part of the directories in a chain.

Continuing our earlier example, if we mirror a URL with both --no-parent and --no-host-directories, we’d have the chain of directories without the top level and with data only below the requested path:

$ wget --recursive --no-host-directories --no-parent https://xost/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
[...]
$ tree
.
└── subdir1
    └── subdir2
        └── subdir3
            └── subdir4
                └── subdir5
                    └── subdir6
                        └── file6

6 directories, 1 file

Using --cut-dirs, we can reduce the empty directories in the chain:

$ wget --recursive --no-host-directories --no-parent --cut-dirs=5 https://xost/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
[...]
$ tree
.
└── subdir6
    └── file6

1 directory, 1 file

Assuming the same switches as above, the number after --cut-dirs dictates how many leading directory components are dropped:

  • --cut-dirs=0: ./subdir1/subdir2/subdir3/subdir4/subdir5/subdir6
  • --cut-dirs=3: ./subdir4/subdir5/subdir6
  • --cut-dirs=5: ./subdir6
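To make the mapping in the list above concrete, we can emulate it with a small shell function. This is a rough sketch of the documented behavior rather than wget's implementation. It assumes the URL path names a file, and local_path, along with its yes/no last argument standing in for --no-host-directories, is a hypothetical construct for illustration:

```shell
#!/bin/sh
# Hypothetical helper: local_path HOST URL_PATH CUT NO_HOST
# Prints where a file would land for a given --cut-dirs value (CUT)
# and with or without --no-host-directories (NO_HOST=yes|no).
local_path() {
    host=$1
    path=${2#/}                  # URL path without the leading slash
    cut=$3
    no_host=$4
    file=${path##*/}             # last component is the file name
    case $path in
        */*) dirs=${path%/*} ;;  # everything before the file name
        *)   dirs= ;;
    esac
    # Drop up to $cut leading directory components.
    while [ "$cut" -gt 0 ] && [ -n "$dirs" ]; do
        case $dirs in
            */*) dirs=${dirs#*/} ;;
            *)   dirs= ;;
        esac
        cut=$((cut - 1))
    done
    result=$file
    if [ -n "$dirs" ]; then
        result=$dirs/$file
    fi
    if [ "$no_host" != yes ]; then
        result=$host/$result     # keep the host directory on top
    fi
    printf './%s\n' "$result"
}

url_path=/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/file6
local_path xost "$url_path" 0 yes
local_path xost "$url_path" 3 yes
local_path xost "$url_path" 5 yes
```

The three calls print ./subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/file6, ./subdir4/subdir5/subdir6/file6, and ./subdir6/file6, respectively.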

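A number equal to or greater than the count of directories in the path removes the whole chain, so the files at that level land directly in the current directory. Flattening everything, including any subdirectories below the requested path, is the job of another switch.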

3.4. Flatten Tree With --no-directories (-nd)

Similar to the tar command that can flatten nested directories, wget supports the --no-directories or -nd switch.

In essence, --no-directories removes any potential directories from the local wget mirror:

$ wget --recursive --no-directories http://xost/subdir1/subdir2/subdir3/subdir4/subdir5/subdir6/
$ tree
.
├── file3
├── file5
└── file6

0 directories, 3 files

As a result, all files end up in the same directory, usually the current one. Files with the same name get a number extension like .1, .2, and so on.
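The numbering of same-name files can be illustrated with a short loop. This sketch only imitates the naming pattern described above and makes no claim about wget's internals. The unique_name helper and the throwaway directory are assumptions made for the demonstration:

```shell
#!/bin/sh
# Hypothetical helper: return NAME if it is free in DIR,
# otherwise the first free name among NAME.1, NAME.2, and so on.
unique_name() {
    dir=$1
    name=$2
    candidate=$name
    n=0
    while [ -e "$dir/$candidate" ]; do
        n=$((n + 1))
        candidate=$name.$n
    done
    printf '%s\n' "$candidate"
}

# Simulate three same-name downloads in a throwaway directory.
demo_dir=$(mktemp -d)
for i in 1 2 3; do
    name=$(unique_name "$demo_dir" index.html)
    : > "$demo_dir/$name"        # create an empty placeholder file
    printf '%s\n' "$name"
done
rm -r "$demo_dir"
```

The loop prints index.html, index.html.1, and index.html.2, one per simulated download.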

4. Summary

In this article, we explored ways to handle directory creation while mirroring with wget.

In conclusion, while wget locally replicates any chain of directories, we can control this behavior with various switches, depending on our needs.
