1. Introduction

We can use wget to download files from the web. In particular, wget can handle a wide range of content, including images, documents, and entire websites. If we want to automate downloading images from a website, wget provides a flexible and efficient solution.

In this tutorial, we’ll explore how to use wget to download images from a website. For illustration, we’ll walk through practical examples of saving images directly into a desired folder, limiting the scope of downloads, and handling website restrictions such as those defined in the robots.txt file.

2. Basic wget Command to Download Images

Using wget is straightforward. However, to download images specifically, we need to adjust the command to suit our requirements. In particular, while the basic command retrieves files in general, incorporating filters and options makes it more effective at targeting image files. To download only images, we can tell wget to focus solely on the .jpg and .png formats, ignoring other types of content.

For example, let’s download all the images from the Flutterwave website:

$ wget -r -A jpg,png https://www.flutterwave.com
--2024-10-19 02:37:58--  https://www.flutterwave.com/
Resolving www.flutterwave.com (www.flutterwave.com)... 35.190.81.132
Connecting to www.flutterwave.com (www.flutterwave.com)|35.190.81.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 297876 (291K) [text/html]
Saving to: ‘www.flutterwave.com/index.html.tmp’

www.flutterwave.com/index.html.t 100%[==============================================>] 290.89K   432KB/s    in 0.7s

2024-10-19 02:38:02 (432 KB/s) - ‘www.flutterwave.com/index.html.tmp’ saved [297876/297876]

Loading robots.txt; please ignore errors.
...

In this command, the -r flag enables recursive downloading: wget scans through the site and retrieves all the images linked within. Additionally, the -A flag specifies the file formats we want, ensuring only images are downloaded.
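
The accept list is easy to extend if a site serves other formats. As a sketch, assuming the site also offers, say, .jpeg, .gif, or .webp files, we can add those extensions and include --ignore-case so that uppercase variants like .JPG match as well:

$ wget -r -A jpg,jpeg,png,gif,webp --ignore-case https://www.flutterwave.com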

3. Saving Images to a Folder

It’s important to keep downloaded images organized for accessibility. By default, wget stores downloaded files in the folder where the command is run. However, we can specify a destination folder when dealing with multiple files or large collections of images.

We can use the -P option with wget to specify a destination path for downloaded files. For example, let’s create a folder called images under ~/Downloads and direct wget to save all downloaded images there:

$ wget -nd -r -A jpg,png -P ~/Downloads/images https://flutterwave.com
--2024-11-02 13:24:40--  https://flutterwave.com/
Resolving flutterwave.com (flutterwave.com)... 13.248.168.217
Connecting to flutterwave.com (flutterwave.com)|13.248.168.217|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /ng/ [following]
--2024-11-02 13:24:40--  https://flutterwave.com/ng/
Reusing existing connection to flutterwave.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 602921 (589K) [text/html]
Saving to: ‘/home/kali/Downloads/images/index.html.tmp’

index.html.tmp               100%[==============================================>] 588.79K  1.87MB/s    in 0.3s

...

FINISHED --2024-11-02 13:24:42--
Total wall clock time: 2.8s
Downloaded: 13 files, 922K in 0.3s (2.74 MB/s)

Let’s break down what each of the options does in the command:

  • -nd: prevents the creation of a directory hierarchy
  • -r: downloads the files recursively, scanning through the site for any linked .jpg and .png images
  • -A jpg,png: restricts downloads to only these file types
  • -P ~/Downloads/images: directs wget to store all files in the specified images folder

To confirm the images were downloaded successfully, we can navigate to the created folder and use the ls command to list all the files available in the directory:

$ cd ~/Downloads/images
$ ls
apple-touch-icon.png  dots.png           flw-mobile-production.jpg  icon_64x64.776a2a.png
box.png               favicon-16x16.png  globe.png                  noise.png
checkout.png          favicon-32x32.png  icon_512x512.776a2a.png    send-app-production.jpg

The command shows all the image files downloaded from the site using wget.
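
To quickly verify how many files of each format were saved, we can pipe ls through wc; the numbers below simply reflect the listing above:

$ ls *.png | wc -l
10
$ ls *.jpg | wc -l
2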

4. Preventing Over-Downloading and Restricting Depth

When downloading images from a website, it’s important to avoid excessive downloads that can consume unnecessary bandwidth or storage space. The wget command offers several options to control the depth and extent of downloads, preventing overuse of resources and keeping downloads focused.
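
In addition to the options in the subsections below, wget also accepts an overall download quota via -Q (or --quota), which stops a recursive retrieval once the given amount of data has been fetched. As a minimal sketch, capping the crawl at 5 MB:

$ wget -nd -r -A jpg,png -Q 5m -P ~/Downloads/images https://flutterwave.com

Notably, the quota is checked between files, so wget always finishes the file it’s currently downloading before stopping.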

4.1. Limiting Download Depth

The -l option restricts the level of recursion, which defines how many links deep wget can access within a website. Therefore, setting a lower value for -l keeps the download within a manageable scope, while a higher value enables deeper retrieval but could include redundant or irrelevant files.

For example, let’s limit the download to just the first two levels of a website:

$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images https://flutterwave.com
--2024-11-02 14:54:29-- https://flutterwave.com/
Resolving flutterwave.com (flutterwave.com)... 13.248.168.217
Connecting to flutterwave.com (flutterwave.com)|13.248.168.217|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /ng/ [following]
--2024-11-02 14:54:30-- https://flutterwave.com/ng/
Reusing existing connection to flutterwave.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 602921 (589K) [text/html]
...

This command limits downloads to two levels from the main page, preventing the retrieval of excess files and data.
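
Similarly, when the images of interest live under a specific section of the site, the -np (--no-parent) option stops wget from ascending into the parent directory. As a sketch, assuming a hypothetical /blog/ path:

$ wget -nd -r -l 2 -np -A jpg,png -P ~/Downloads/images https://flutterwave.com/blog/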

4.2. Specifying Wait Intervals to Avoid Server Overload

Furthermore, it’s good practice to set intervals between requests so that we don’t overload the web server. We can use the --wait option to introduce a pause between downloads, thereby reducing the load on the server.

For example, let’s download all images from the website with a 1-second delay:

$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images --wait=1 https://flutterwave.com

Here, --wait=1 ensures there’s a 1-second delay between each request. Consequently, this makes the download process gentler on the server without sacrificing efficiency.
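
To spread requests out even further, we can combine a fixed delay with --random-wait, which varies the actual pause between roughly 0.5 and 1.5 times the --wait value, and --limit-rate, which caps the download speed:

$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images --wait=1 --random-wait --limit-rate=200k https://flutterwave.com

Randomized delays make the traffic pattern less bursty, while the rate limit (here 200 KB/s) keeps bandwidth usage predictable.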

5. Handling robots.txt

The robots.txt file on websites specifies which parts of a site are accessible to web crawlers. The wget command respects this file to avoid downloading content that the site owner wants to keep restricted. However, if we need to download images from pages blocked by robots.txt, we can override this setting. Nevertheless, we need to proceed with caution as it may go against the website’s policies.

To ignore robots.txt restrictions, we use the -e robots=off option, which allows wget to bypass the robots.txt rules for the site.

For example, let’s download all image files from the website while bypassing the restrictions set in the robots.txt file:

$ wget -nd -r -A jpg,png -e robots=off -P ~/Downloads/images https://flutterwave.com

This command uses the -e robots=off option to bypass the robots.txt restrictions, enabling wget to download images from the site that might be restricted.
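
Some sites also filter requests by the client’s user agent rather than via robots.txt alone. In such cases, setting a browser-like user agent with -U (--user-agent) may help; the string below is purely illustrative, and the same caution about site policies applies:

$ wget -nd -r -A jpg,png -e robots=off -U "Mozilla/5.0 (X11; Linux x86_64)" -P ~/Downloads/images https://flutterwave.com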

6. Conclusion

In this article, we’ve explored how to use wget to download images efficiently from websites. Furthermore, we’ve seen how to tailor wget to meet specific requirements while managing resources by customizing options like output directories, recursion depth, and download intervals.

Using all these techniques, wget can be a powerful tool for automating downloads in a structured way.