Last updated: November 5, 2024
We can use wget to download files from the web. In particular, wget can handle a wide range of content including images, documents, and entire websites. If we want to automate downloading images from a website, wget provides a flexible and efficient solution.
In this tutorial, we’ll explore how to use wget to download images from a website. For illustration, we’ll explore practical examples to demonstrate the use of wget for saving images directly into a desired folder, limiting the scope of downloads, and handling website restrictions such as those defined in the robots.txt file.
Using wget is straightforward. However, to customize it for downloading images, we need to adjust the command to suit our specific requirements. In particular, while the basic command retrieves files in general, adding specific filters and options makes it more effective at targeting image files. To download only images, we can tell wget to focus solely on the .jpg and .png formats, ignoring other types of content.
For example, let’s download all the images from the Flutterwave website:
$ wget -r -A jpg,png https://www.flutterwave.com
--2024-10-19 02:37:58-- https://www.flutterwave.com/
Resolving www.flutterwave.com (www.flutterwave.com)... 35.190.81.132
Connecting to www.flutterwave.com (www.flutterwave.com)|35.190.81.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 297876 (291K) [text/html]
Saving to: ‘www.flutterwave.com/index.html.tmp’
www.flutterwave.com/index.html.t 100%[==============================================>] 290.89K 432KB/s in 0.7s
2024-10-19 02:38:02 (432 KB/s) - ‘www.flutterwave.com/index.html.tmp’ saved [297876/297876]
Loading robots.txt; please ignore errors.
...
In this command, the -r flag enables recursive downloading, scanning through the site and retrieving all the images linked within it. Additionally, the -A flag specifies the file formats we want, ensuring only images are downloaded.
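If the site also serves images in other common formats, we can extend the -A list accordingly. As an illustrative variation, the command below additionally accepts jpeg, gif, webp, and svg files; the exact extensions worth including depend on what the target site actually uses:
$ wget -r -A jpg,jpeg,png,gif,webp,svg https://www.flutterwave.com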
It’s important to keep downloaded images organized for accessibility. The wget command stores downloaded files in the same folder where the command is run by default. However, we can specify a destination folder when dealing with multiple files or large collections of images.
We can use the -P option with wget to specify a destination path for downloaded files. For example, let’s direct wget to save all downloaded images into a folder called images inside the Downloads directory:
$ wget -nd -r -A jpg,png -P ~/Downloads/images https://flutterwave.com
--2024-11-02 13:24:40-- https://flutterwave.com/
Resolving flutterwave.com (flutterwave.com)... 13.248.168.217
Connecting to flutterwave.com (flutterwave.com)|13.248.168.217|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /ng/ [following]
--2024-11-02 13:24:40-- https://flutterwave.com/ng/
Reusing existing connection to flutterwave.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 602921 (589K) [text/html]
Saving to: ‘/home/kali/Downloads/images/index.html.tmp’
index.html.tmp 100%[==============================================>] 588.79K 1.87MB/s in 0.3s
...
FINISHED --2024-11-02 13:24:42--
Total wall clock time: 2.8s
Downloaded: 13 files, 922K in 0.3s (2.74 MB/s)
Let’s break down what each of the options does in the command:
- -nd: prevents wget from recreating the site’s directory structure, so all files land directly in the target folder
- -r: enables recursive downloading
- -A jpg,png: restricts downloads to files ending in .jpg or .png
- -P ~/Downloads/images: sets the directory prefix where downloaded files are saved
To confirm the images were downloaded successfully, we can navigate to the created folder and use the ls command to list all the files available in the directory:
$ cd ~/Downloads/images
$ ls
apple-touch-icon.png dots.png flw-mobile-production.jpg icon_64x64.776a2a.png
box.png favicon-16x16.png globe.png noise.png
checkout.png favicon-32x32.png icon_512x512.776a2a.png send-app-production.jpg
The command shows all the image files downloaded from the site using wget.
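As an optional sanity check, we can also count the files in the destination folder; the count should match the listing above:
$ ls ~/Downloads/images | wc -l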
When downloading images from a website, it’s important to avoid excessive downloads that can consume unnecessary bandwidth or storage space. The wget command offers several options to control the depth and extent of downloads, preventing overuse of resources and keeping downloads focused.
The -l option restricts the level of recursion, which defines how many links deep wget can access within a website. Therefore, setting a lower value for -l keeps the download within a manageable scope, while a higher value enables deeper retrieval but could include redundant or irrelevant files.
For example, let’s limit the download to just the first two levels of a website:
$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images https://flutterwave.com
--2024-11-02 14:54:29-- https://flutterwave.com/
Resolving flutterwave.com (flutterwave.com)... 13.248.168.217
Connecting to flutterwave.com (flutterwave.com)|13.248.168.217|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /ng/ [following]
--2024-11-02 14:54:30-- https://flutterwave.com/ng/
Reusing existing connection to flutterwave.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 602921 (589K) [text/html]
...
This command limits downloads to two levels from the main page, preventing the retrieval of excess files and data.
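If the concern is total download size rather than link depth, wget also offers the -Q (quota) option, which stops a recursive retrieval once the given amount of data has been downloaded. As a rough sketch, the 10m value below is an arbitrary 10 MB cap and should be adjusted to the use case:
$ wget -nd -r -l 2 -A jpg,png -Q 10m -P ~/Downloads/images https://flutterwave.com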
Furthermore, it’s good practice to set intervals between requests to avoid overloading the web server. We can use the --wait option to introduce a pause between downloads, thereby reducing the load on the server.
For example, let’s download all images from the website with a 1-second delay:
$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images --wait=1 https://flutterwave.com
Here, --wait=1 ensures there’s a 1-second delay between each request. Consequently, this makes the download process gentler on the server without sacrificing efficiency.
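Optionally, we can pair --wait with the --random-wait option, which varies the actual pause around the specified value so the requests arrive less uniformly:
$ wget -nd -r -l 2 -A jpg,png -P ~/Downloads/images --wait=1 --random-wait https://flutterwave.com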
The robots.txt file on websites specifies which parts of a site are accessible to web crawlers. The wget command respects this file to avoid downloading content that the site owner wants to keep restricted. However, if we need to download images from pages blocked by robots.txt, we can override this setting. Nevertheless, we need to proceed with caution as it may go against the website’s policies.
To ignore robots.txt restrictions, we use the -e robots=off option, which allows wget to bypass the robots.txt rules for the site.
For example, let’s download all image files from the website while bypassing the restrictions set in the robots.txt file:
$ wget -nd -r -A jpg,png -e robots=off -P ~/Downloads/images https://flutterwave.com
This command uses the -e robots=off option to bypass the robots.txt restrictions, enabling wget to download images from parts of the site that might otherwise be off-limits to crawlers.
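Even when bypassing robots.txt, it’s reasonable to keep the earlier rate-limiting options in place to stay gentle on the server; for instance, a combination such as the following, where the 1-second delay is only an example value:
$ wget -nd -r -A jpg,png -e robots=off --wait=1 -P ~/Downloads/images https://flutterwave.com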
In this article, we’ve explored how to use wget to download images from websites efficiently. Furthermore, we’ve seen how to tailor wget to meet specific requirements while managing resources by customizing options like output directories, recursion depth, and download intervals.
Using all these techniques, wget can be a powerful tool for automating downloads in a structured way.