
1. Introduction

cURL is a tool for interacting with web servers from the command line. In particular, it allows us to transfer data to or from a server using protocols like HTTP, HTTPS, and FTP.

It’s useful for tasks like downloading single files from a remote server. However, downloading all files from a directory on a server requires additional steps, since cURL doesn’t support downloading files recursively.

In this tutorial, we’ll explore how to download all files from a specific directory using cURL, including common workarounds for its limitations and alternative approaches. We’ll also cover solutions for cases like handling directory listings and recursive downloads.

2. Basic Usage of the curl Command

cURL is most commonly used to download individual files.

Let’s start by looking at a basic example of downloading a single file using the curl command:

$ curl -O https://example.com/example_file.txt

The -O flag instructs curl to save the file with its original name in the current directory.

However, cURL doesn’t have built-in support for downloading all files in a directory. Unlike FTP or some other protocols, HTTP doesn’t inherently provide a way to directly download all files in a directory unless the server is specifically configured for this.

3. Challenges with cURL and Directories

When working with web servers over HTTP, we can only see a directory's contents if the server is configured to generate a directory listing (an index page). Many servers disable such listings for security reasons, so there's usually no direct way to view or download an entire directory.

If the web server provides a directory listing, we can scrape the list of files and use that information to download them. If not, we’ll need other strategies, like FTP or WebDAV.

Next, we’ll explore a few strategies for downloading multiple files in a directory when an index or listing page is available.

4. Downloading All Files from a Directory Listing with cURL

If the server provides a directory listing as an HTML page, we can use cURL to fetch that page, parse it, and download the individual files listed. Let’s look at the step-by-step process of achieving this.

4.1. Fetching the Directory Listing

The first step is to download the HTML content of the directory listing page:

$ curl https://example.com/files_directory/ > directory-listing.html

This will save the directory’s index page as directory-listing.html.

4.2. Extracting the File Names

To extract the file names from the HTML page, we can use a tool like grep to parse the links. Typically, file links in directory listings are found inside <a> tags with the href attribute.

Let’s extract these links using grep:

$ grep -oP '(?<=href=")[^"]*' directory-listing.html

The -o option prints only the parts of each line that match the regular expression, while -P enables Perl-compatible regular expressions, which support the lookbehind. In this case, the regular expression (?<=href=")[^"]* matches the strings inside the href attributes, which correspond to the file links.

This command produces a list of file names on the directory listing page.
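
Depending on the server's listing format, the output might look something like this (the names here are purely illustrative):

../
file1.zip
file2.zip
notes.txt

Entries that end with a slash, such as the parent-directory link, aren't files, so we may want to filter them out first, for example by piping the output through grep -v '/$'.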

4.3. Downloading Each File with cURL

Once we have the list of file names, we can use xargs to download them in bulk using curl:

$ grep -oP '(?<=href=")[^"]*' directory-listing.html | xargs -n 1 -I {} curl -O https://example.com/files_directory/{}

In this command, grep extracts file names from the directory listing page and then pipes the filenames to xargs, which runs curl for each file. Finally, the -O option saves the files with their original names.
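
If the listing contains many files, we can also ask xargs to run several transfers in parallel. Here's a sketch, assuming an xargs that supports the -P option (GNU and BSD versions do):

$ grep -oP '(?<=href=")[^"]*' directory-listing.html | xargs -n 1 -P 4 -I {} curl -fsS -O "https://example.com/files_directory/{}"

The -P 4 value is just an example of the parallelism level, while -fsS makes curl fail on HTTP errors and stay quiet except when something goes wrong.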

4.4. Automating the Process with a Script

We can easily automate the process, especially if we regularly download files from directory listings. For instance, here’s a simple script to download all files from a directory listing page:

#!/bin/bash

# URL of the directory
URL="https://example.com/files_directory/"

# Download the directory listing
curl "$URL" > directory-listing.html

# Extract file links and download them
grep -oP '(?<=href=")[^"]*' directory-listing.html | while read -r file; do
    curl -O "${URL}${file}"
done

The script above downloads the directory listing using curl and extracts all file links using grep. Finally, it iterates over each file link and downloads it using curl.

We then save the script and make it executable:

$ chmod +x download_files.sh

We can also modify the script to download files from different directories.
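
As a minimal sketch, we can take the base URL from the first positional parameter and fall back to a default when none is given; the URLs below are just placeholders:

#!/bin/bash

# Base URL, taken from the first argument if provided
URL="${1:-https://example.com/files_directory/}"

# Download the directory listing
curl "$URL" > directory-listing.html

# Extract file links and download them
grep -oP '(?<=href=")[^"]*' directory-listing.html | while read -r file; do
    curl -O "${URL}${file}"
done

Then, we run the script with the directory we want:

$ ./download_files.sh https://example.com/another_directory/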

5. Handling Servers Without Directory Listings

If the server doesn't provide a directory listing, we can check whether it offers an FTP or WebDAV interface that lets us list and download files; curl can interact with both.

Let’s download files via FTP:

$ curl ftp://username:[email protected]/files/ -O
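
Similarly, if the server exposes a WebDAV interface, we can usually list a directory with a PROPFIND request before downloading individual files; the path and credentials below are placeholders:

$ curl -u username:password -X PROPFIND -H "Depth: 1" https://example.com/webdav/files/

The server answers with an XML multi-status document listing the directory entries, which we can then parse and download one by one with curl -O.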

Alternatively, we can use a different tool like wget. It’s an excellent alternative that supports recursive downloading and can handle directory structures more easily than curl.

Let’s look at a simple example of how to download files from a directory recursively:

$ wget -r -np -nH --cut-dirs=1 -R "index.html*" https://example.com/files/

Let’s look at the breakdown of the command above:

  • -r enables recursive downloading
  • -np ensures that wget doesn’t ascend to the parent directory
  • -nH prevents the creation of a host directory
  • --cut-dirs=1 removes a portion of the directory path
  • -R "index.html*" excludes index.html files from being downloaded

wget is a more robust tool for downloading entire directory structures, making it a useful alternative when curl falls short.

6. Handling Relative Links and Authentication

In some directory listings, file links might be relative rather than absolute. In such cases, we'll need to modify the script to handle this.

Let’s look at examples of relative links that we might find:

<a href="file1.zip">file1.zip</a>
<a href="subdirectory/file2.zip">file2.zip</a>

We can adjust the curl command to account for the base URL and relative paths. For example, here’s an updated version of the script that handles both absolute and relative links:

#!/bin/bash

# URL of the directory
URL="https://example.com/files_directory/"

# Download the directory listing
curl "$URL" > directory-listing.html

# Extract file links and download them
grep -oP '(?<=href=")[^"]*' directory-listing.html | while read -r file; do
    # Handle both relative and absolute links
    if [[ $file == http* ]]; then
        curl -O "$file"
    else
        curl -O "${URL}${file}"
    fi
done

We’re constructing the full URL for each file by appending the extracted file’s name to the base URL stored in the $URL variable.

Additionally, if the server requires authentication, we can include the -u option in the curl command:

$ curl -u username:password -O https://example.com/files_directory/file.zip

We can specify a username and password to download the file.
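
To avoid exposing the password on the command line, we can instead keep the credentials in a ~/.netrc file, which curl reads when given the -n (--netrc) option; the values below are placeholders:

machine example.com
login username
password password

Then, we download the file without typing the credentials:

$ curl -n -O https://example.com/files_directory/file.zip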

7. Conclusion

In this article, we’ve looked at how to download all files within a directory using cURL. Additionally, we wrote a script that automates the process while considering absolute and relative file paths.

Downloading all files in a directory with curl can be tricky since HTTP servers don’t always expose directory listings. Moreover, curl doesn’t have built-in support for recursive downloads. However, by combining curl with tools such as grep and xargs, we can automate the process for servers that provide directory listings.

In cases where the server doesn’t allow directory listings, or when we need recursive downloads, wget may be a better option.