1. Overview
In this tutorial, we’ll look at the different methods for downloading and extracting archives without saving the archive to disk.
2. Problem Statement
Obtaining the content of an archive from the internet typically involves two steps. First, we download the archive and save it to disk using an HTTP client command-line tool, such as wget or curl. Then, we extract the archive to obtain the files and directories within it.
In most cases, the archive is no longer needed after we have obtained its contents. Therefore, we usually delete the archive afterward to free up disk space.
Given that, ultimately, we want the content of the archive and not the archive itself, writing the archive to disk seems redundant. So the question is, how can we obtain just the content of the archive without first saving the whole archive file to disk?
Let’s look at how we can achieve that using the Linux pipe.
3. Unarchiving on the Fly With Pipe
The idea is to extract the archive piece by piece as it is downloaded. Specifically, we pipe the bytes of the archive that have arrived so far to the unarchiving process. This allows the unarchiving process to run concurrently with the download, so the extraction finishes shortly after the last byte arrives instead of only starting at that point.
Furthermore, we save ourselves the trouble of deleting the archive in a separate command. Finally, we avoid the disk I/O of writing the archive file and reading it back. This is especially important when we run the process on cloud-based resources that charge us by usage.
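Before involving the network, we can try the idea offline. The following is a minimal sketch, assuming a scratch directory created with mktemp (the workdir variable and the file names are ours, chosen for illustration): one tar process produces a compressed archive on its standard output, and a second tar process extracts it from its standard input, so the archive only ever exists as bytes in the pipe.

```shell
# Offline sketch: the archive exists only as bytes flowing through the pipe.
workdir=$(mktemp -d)                      # hypothetical scratch directory
mkdir -p "$workdir/src/data" "$workdir/dest"
printf 'hello\n' > "$workdir/src/data/a.txt"

# Producer: pack and gzip the "data" directory to standard output (-f -).
# Consumer: read the stream from standard input (-f -) and unpack it into dest.
tar -C "$workdir/src" -czf - data | tar -C "$workdir/dest" -xzf -

# Show what was extracted; no .tar.gz file was created along the way.
ls -R "$workdir/dest"
```

Replacing the producer with an HTTP client that writes to its standard output gives us the download pipelines in the following sections.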
Let’s take a look at some examples of how we can achieve this for common archive file types: .tar.gz and .zip.
3.1. Extracting a .tar.gz File
For demonstration’s sake, we’ll download the 1 billion word language modelling benchmark file, a 1.7 GB archive in the .tar.gz format.
We can use a one-liner to download and pipe to the tar command to extract the content of the archive:
$ wget -qO- https://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz | tar xvz
The one-liner is made up of two parts. First, we use the wget HTTP client command-line tool to download the .tar.gz file and, with the -O- option, write the downloaded bytes to the standard output stream instead of a file. Additionally, we pass the -q option to silence wget. Its progress messages go to the standard error stream, so they wouldn’t corrupt the archive bytes in the pipe, but they would clutter the terminal alongside the output of tar.
The second part of the command runs the tar command on the standard input stream, which contains the downloaded bytes from the first part of the command. GNU tar, as built on most Linux distributions, reads from the standard input when we don’t name an archive with the -f option; adding -f - makes this explicit. The tar command then extracts the archive and places the content in the current directory. The x and z options select extraction and gzip decompression, respectively, and the v option turns on verbose mode, which prints each file name as it is extracted.
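For repeated use, we might wrap the pipeline in a small shell function. The following is only a sketch: fetch_targz is a hypothetical helper name, and the explicit -f - and -C options are standard tar options that we add here to name the input stream and the target directory instead of relying on defaults.

```shell
# Sketch: download a .tar.gz and unpack it into a chosen directory
# without saving the archive. fetch_targz is a hypothetical helper name.
fetch_targz() {
    url=$1    # where to download the archive from
    dest=$2   # directory to extract into
    mkdir -p "$dest" || return 1
    # wget: -q suppresses the progress log, -O - writes the body to stdout.
    # tar:  -x extracts, -z gunzips, -f - reads the archive from stdin,
    #       -C unpacks into $dest instead of the current directory.
    wget -q -O - "$url" | tar -xzf - -C "$dest"
}

# Example call (not executed here):
# fetch_targz https://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz ./benchmark
```

Note that the exit status of the function is that of tar, the last command in the pipeline. In bash, running set -o pipefail beforehand makes the pipeline also report a failed download.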
We can also replace the wget invocation with curl:
$ curl -s -L https://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz | tar xvz
By default, the curl command writes the response body to the standard output stream. The -s option hides the progress meter and error messages, which curl prints on the standard error stream when its output is redirected. They wouldn’t end up in the pipe, but silencing them keeps the terminal output limited to what tar prints. Then, the -L option makes curl follow redirects.
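The same pipe also lets us inspect a remote archive without extracting it. The sketch below is a variation rather than part of the original one-liner: list_remote_targz is a hypothetical helper name, and we add the curl options -f and -S so that an HTTP error is reported instead of an error page being fed to tar.

```shell
# Sketch: list the entries of a remote .tar.gz without saving or extracting it.
# list_remote_targz is a hypothetical helper name.
list_remote_targz() {
    # curl: -f fails on HTTP errors instead of outputting the error page,
    #       -s hides the progress meter, -S still reports errors on stderr,
    #       -L follows redirects.
    # tar:  -t lists entry names, -z gunzips, -f - reads from stdin.
    curl -fsSL "$1" | tar -tzf -
}

# Example call (not executed here):
# list_remote_targz https://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
```

Listing still downloads the whole archive, since tar has to read through the stream to find every entry, but nothing is written to disk.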
Notice how there isn’t a .tar.gz file waiting to be cleaned up after the commands.
3.2. Extracting a .zip File
The .zip file format keeps an index, known as the central directory, at the end of the archive file. The central directory contains information about the contents of the archive and tells the decompressor where to find each file. This arrangement appears to pose a problem because a decompressor that relies on it needs the complete archive file before it can extract anything. Therefore, many .zip file decompressors do not accept input through the standard input.
However, the central directory is not the sole source of meta-information about the content of the archive. In fact, each file in the archive is preceded by a local file header that carries the meta-information about that particular file. Tools like the bsdtar command-line tool leverage this fact to unarchive .zip files through a pipe. Using these local file headers, bsdtar can extract files as they arrive from the pipe instead of requiring the archive file in its entirety.
For example, given a link to a .zip file, we can download it using the same HTTP client command-line tool and pipe the output to the bsdtar command:
$ wget -qO- http://mattmahoney.net/dc/enwik9.zip | bsdtar -xvf-
Option -f- of the bsdtar command means that the input for the decompression comes from the standard input. Then, the -x option specifies the unarchiving operation, and the -v enables verbose mode so we get diagnostic messages from the command.
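As with the .tar.gz case, we could wrap this pipeline in a function that also selects the target directory. This is a sketch under a few assumptions: fetch_zip is a hypothetical helper name, both wget and bsdtar are installed, and the -C option is our addition to the one-liner.

```shell
# Sketch: download a .zip and unpack it into a chosen directory
# without saving the archive. fetch_zip is a hypothetical helper name.
fetch_zip() {
    url=$1    # where to download the archive from
    dest=$2   # directory to extract into
    mkdir -p "$dest" || return 1
    # bsdtar recognizes the zip format from the stream itself.
    # -x extracts, -f - reads the archive from stdin, -C unpacks into $dest.
    wget -q -O - "$url" | bsdtar -xf - -C "$dest"
}

# Example call (not executed here):
# fetch_zip http://mattmahoney.net/dc/enwik9.zip ./enwik9
```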
Note that on Ubuntu Linux, the bsdtar binary has resided in the libarchive-tools package since version 20.04. Therefore, readers on Ubuntu 20.04 or later should install the libarchive-tools package to obtain the bsdtar binary, while readers on earlier versions can install the bsdtar package.
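Before running the pipeline, we can check whether bsdtar is already on the PATH. This is a small sketch; the bsdtar_status variable is ours, and the package names in the hint are the ones mentioned above.

```shell
# Sketch: report whether bsdtar is installed, with an install hint if it is not.
if command -v bsdtar >/dev/null 2>&1; then
    bsdtar_status="bsdtar is available at $(command -v bsdtar)"
else
    bsdtar_status="bsdtar not found; on Ubuntu 20.04 or later try: sudo apt install libarchive-tools (earlier versions: sudo apt install bsdtar)"
fi
echo "$bsdtar_status"
```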
4. Conclusion
In this article, we explored methods for downloading and extracting archives on the fly in Linux without saving them to disk. Specifically, we pipe the archive bytes to the decompressor as they arrive instead of waiting for the complete archive file. As a result, we save disk space, reduce the total processing time, and avoid unnecessary disk I/O.
First, we demonstrated that we can download .tar.gz files using HTTP client command-line tools like wget or curl and pipe the content to the tar command, which extracts it while the download is still in progress.
Then, we learned that the different structure of the .zip file makes this operation slightly more challenging. Specifically, most .zip decompressors expect the archive file in its entirety before they can decompress it, because the index of the .zip file is stored at the end of the archive. Luckily, tools like bsdtar rely only on the local file headers for decompressing and therefore do not require the complete archive file.