1. Overview

In Linux, gzip is a command-line tool that compresses files into the gzip file format. One of the main reasons for using gzip is to reduce file size and save disk space. However, the gzip command can be slow when compressing a large file. Of course, the larger the file, the longer the compression takes. But there are some methods we can use to speed up the compression of a large file.

In this tutorial, we'll learn how to speed up the compression of a file using some gzip options, as well as some alternative tools.
2. Baseline Performance
To compare the different methods, we’ll be using the English Wikipedia 2006 dump as our file to compress. First, we’ll download the file using wget:
$ wget -O enwik9.zip http://mattmahoney.net/dc/enwik9.zip
Then, we unzip the file using the unzip command:
$ unzip -j enwik9.zip
Archive:  enwik9.zip
  inflating: enwik9
$ ls -lh
total 954M
-rwxr-xr-x 1 user user 954M May  6 14:38 enwik9
The -j option prevents the command from creating a folder to store the output. From the output, we can see that the file is a text file with close to 1GB worth of data.
To establish a baseline, we’ll run the gzip command using the default settings on the file and time it:
$ time gzip -c enwik9 > enwik9-gzip-default.gz

real    0m33.929s
user    0m33.537s
sys     0m0.351s
As we can see, without tuning any settings, we manage to complete the compression in roughly 34 seconds. Let’s see how we can improve the performance and some of the tradeoffs we have to consider.
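To quantify the savings ourselves, we can compute the compression ratio from the byte counts of the original and compressed files. Here's a minimal sketch on a small generated file (the file names are illustrative):

```shell
# Generate a small, compressible sample file (illustrative name)
seq 1 100000 > sample.txt

# Compress it with gzip's default settings
gzip -c sample.txt > sample.txt.gz

# Compare byte counts to get the compression ratio
orig=$(wc -c < sample.txt)
comp=$(wc -c < sample.txt.gz)
awk -v o="$orig" -v c="$comp" 'BEGIN { printf "compressed to %.1f%% of original\n", 100 * c / o }'
```

The same two wc -c calls work on any pair of original and compressed files, so we can reuse this to compare the methods below.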
3. Changing the gzip Compression Level
The gzip command offers a compression level option that allows the users to tune the performance of compression.
3.1. The gzip Compression Level
In the gzip command, the compression level ranges from 1 to 9.
At level 1, the compression completes the fastest, but as a tradeoff, the compression is minimal, so the file size reduction is the smallest. As we move up the levels, the file size reduction improves at the expense of a slower compression process. At level 9, the compression is the slowest, but it offers the best possible reduction in file size.
By default, the command runs with a compression level of 6.
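We can check this ourselves: compressing the same input with and without an explicit -6 should produce identical output. In the sketch below, the sample file name is illustrative, and the -n flag omits the original name and timestamp from the gzip header so the two outputs are byte-for-byte comparable:

```shell
# Generate a small sample file (illustrative name)
printf 'hello gzip %.0s' $(seq 1 1000) > sample.txt

# Compress once with the default level, once with -6 explicitly;
# -n drops the original name and timestamp from the header
gzip -n -c sample.txt > default.gz
gzip -n -6 -c sample.txt > level6.gz

# cmp exits successfully only if the two files are identical
cmp default.gz level6.gz && echo "identical output"
```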
3.2. Passing the Compression Level Option
To change the compression level, we can pass the level as an option to gzip using -#, where the # symbol is replaced with the compression level value.
For example, we can compress the file using level 1 compression level by passing the -1 option:
$ gzip -1 -c enwik9 > enwik9-level1.gz
Similarly, we can change the compression level to level 9 using the -9 option:
$ gzip -9 -c enwik9 > enwik9-level9.gz
Using an integer outside of the range causes the command to exit with an error.
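To see the option in action across several levels at once, we can loop over a few values on a generated sample file (the file names are illustrative):

```shell
# Generate a compressible sample file (illustrative name)
seq 1 200000 > sample.txt

# Compress at the fastest, default, and best levels
for level in 1 6 9; do
    gzip -"$level" -c sample.txt > "sample-level$level.gz"
done

# Higher levels should produce smaller (or equal) files
wc -c sample-level*.gz
```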
3.3. Speed and Size Differences
Firstly, let’s look at the speed differences at either extreme of the compression levels, that is, at level 1 and level 9:
$ time gzip -1 -c enwik9 > enwik9-level1.gz

real    0m12.705s
user    0m12.165s
sys     0m0.403s
$ time gzip -9 -c enwik9 > enwik9-level9.gz

real    0m43.461s
user    0m43.064s
sys     0m0.380s
At compression level 1, we spend roughly 13 seconds compressing the file. At the other end, level 9 takes about 30 seconds longer.
The increase in time is expected, as we're asking the compression algorithm to reduce the file size as much as it can, at the cost of running time. To paint the full picture, we can see that the file size of enwik9-level1.gz is larger than that of enwik9-level9.gz:
$ ls -lh enwik9-level9.gz
-rw-r--r-- 1 user user 308M May  7 01:45 enwik9-level9.gz
$ ls -lh enwik9-level1.gz
-rw-r--r-- 1 user user 361M May  7 01:49 enwik9-level1.gz
Specifically, the file produced at level 1 is more than 50MB larger than the one produced at level 9.
4. Using the pigz Command
One of the main downsides of gzip is that it's single-threaded. In other words, the vanilla gzip command cannot take advantage of our system's multi-core processor, even when one is available. This is limiting when the bottleneck of our compression is CPU resources. As an alternative, we can use the pigz command.
The pigz command is an implementation of gzip that makes full use of the available cores of the processors by being multi-threaded.
It works by breaking the input into multiple chunks and performing the compression using different cores in parallel. Furthermore, pigz supports all the options in gzip, making it a drop-in replacement.
4.1. Installing pigz

To obtain the pigz command, we can install the pigz package using our package manager, such as YUM or APT. For example, on Ubuntu-based Linux distributions, we can run apt-get install pigz:
$ apt-get install -y pigz
As a verification step, we can check its version to ensure the binary is accessible through the PATH environment variable:
$ pigz --version
2.4
4.2. Compressing a File With Multiple Cores
Similar to gzip, we can change the compression level using the -# option, where the # symbol is a placeholder for a compression level ranging from 1 to 9. The pigz command also shares gzip's default compression level of 6.
Let’s run the compression using pigz at level 1 and level 9 to compare the result with vanilla gzip:
$ time pigz -c -1 enwik9 > enwik9-pigz-level1.gz

real    0m4.742s
user    0m13.496s
sys     0m0.536s
$ time pigz -c -9 enwik9 > enwik9-pigz-level9.gz

real    0m15.240s
user    0m44.860s
sys     0m0.588s
From the output of the time command, we can see that the real time is much shorter than the user time, which is the hallmark of multi-core processing. In other words, although the total CPU consumption is roughly the same, pigz completes the same compression in a shorter amount of wall-clock time by distributing the work among all the available CPU cores. This makes sense because, underneath, pigz uses the same compression method as gzip, which demands the same amount of CPU work.
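Because pigz emits standard gzip streams, anything it produces can be verified and decompressed by plain gzip. The sketch below falls back to gzip when pigz isn't installed; the file names and the explicit -p thread count are illustrative:

```shell
# Generate a sample file (illustrative name)
seq 1 100000 > sample.txt

if command -v pigz >/dev/null 2>&1; then
    # -p sets the thread count; by default pigz already uses all cores
    pigz -p "$(nproc)" -c sample.txt > sample.txt.gz
else
    # Fall back to single-threaded gzip
    gzip -c sample.txt > sample.txt.gz
fi

# Either way, the result is a valid gzip stream
gzip -t sample.txt.gz && echo "valid gzip stream"
```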
5. Using the lz4 Compression Algorithm
The gzip compression command internally uses the DEFLATE lossless compression method. In fact, a lot of other compression commands such as zip use the same compression method underneath. The widespread popularity of the DEFLATE method is mostly thanks to its good compression ratio, albeit at the cost of compression speed.
The lz4 compression algorithm is one alternative lossless compression algorithm to the DEFLATE method. In general, the lz4 compression algorithm offers a faster compression speed than DEFLATE at the cost of a lower compression ratio. It achieves this by sacrificing some features of DEFLATE, such as using a sub-optimal, but faster repetition-detection code.
Let’s see how lz4 performs on the same file.
5.1. The lz4 Command-Line Tool
To obtain the lz4 command, we can install the lz4 package using our package manager:
$ apt-get install -y lz4
$ lz4 --version
*** LZ4 command line interface 64-bits v1.9.2, by Yann Collet ***
Similar to gzip, lz4 offers multiple compression levels through the -# option (ranging from 1 to 12 in recent versions). For example, to compress a file using compression level 1, we run lz4 -1:
$ lz4 -1 enwik9
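As a quick sanity check, we can round-trip a small file through lz4 and confirm the decompressed output matches the original. This sketch guards against lz4 being absent, and the file names are illustrative:

```shell
# Generate a sample file (illustrative name)
seq 1 100000 > sample.txt

if command -v lz4 >/dev/null 2>&1; then
    # -q silences the banner, -f overwrites existing output
    lz4 -q -f -1 sample.txt sample.txt.lz4
    # -d decompresses
    lz4 -q -f -d sample.txt.lz4 restored.txt
    # cmp exits successfully only if the round trip is lossless
    cmp sample.txt restored.txt && echo "round trip OK"
else
    echo "lz4 not installed; skipping"
fi
```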
5.2. Compression Speed and Ratio Against gzip
Let’s run the lz4 compression on the same file at level 1 and level 9:
$ time lz4 -1 enwik9 enwik9-lz4-level1.gz
Compressed 1000000000 bytes into 509454838 bytes ==> 50.95%

real    0m2.602s
user    0m2.136s
sys     0m0.465s
$ time lz4 -9 enwik9 enwik9-lz4-level9.gz
Compressed 1000000000 bytes into 374839215 bytes ==> 37.48%

real    0m27.220s
user    0m26.770s
sys     0m0.441s
From the output of the time command, we can see that lz4 outperforms the single-threaded gzip implementation at both compression levels. Specifically, to compress the same file, it takes 2.6 seconds at level 1 and 27 seconds at level 9, respectively. This is a tremendous gain in performance compared to the 13 seconds and 43 seconds gzip needs at the same levels.
However, the downside is that the compression ratio will be worse than the result of gzip compression. Concretely, at compression level 1:
$ ls -lh enwik9-lz4-level1.gz
-rwxr-xr-x 1 user user 486M May  6 14:38 enwik9-lz4-level1.gz
$ ls -lh enwik9-level1.gz
-rw-r--r-- 1 user user 361M May  7 01:49 enwik9-level1.gz
We can see that lz4 compression results in a file that’s roughly 120MB larger as compared to the gzip compression. Similarly, let’s check with compression level 9:
$ ls -lh enwik9-lz4-level9.gz
-rwxr-xr-x 1 user user 358M May  6 14:38 enwik9-lz4-level9.gz
$ ls -lh enwik9-level9.gz
-rw-r--r-- 1 user user 308M May  7 01:45 enwik9-level9.gz
There’s roughly a 50MB increase in the compressed file size when compressed using lz4.
6. Characteristics of Files
It’s important to understand that the content of the file affects how fast and how much a file can be compressed.
Specifically, the more repeated substrings a text file contains, the more savings can be achieved. As a rule of thumb, a text file with frequently repeating patterns, such as HTML or English prose, is usually more compressible than a file containing a random sequence of characters or a binary file.
To demonstrate this point, let’s generate a text file with a random sequence of characters:
$ base64 /dev/urandom | head -c 1000000000 > random1gb.txt
Then, we can take a peek at the file and see that it consists mostly of a random sequence of characters:
$ head -n 3 random1gb.txt
gAuluNTIzQWk584fk+X26CeGE9fnJnYJhp1hoTAYL/GWl4ct+NU1UEjae7wFL9njAefSWbIfoF/Q
JUYw3nbkI18MrUfyDDPRP1h4UgqbZvk34RdP9K48VUrawC0gHP75G0N8cRqDkYxAy6qTaT+ICGFl
bBkkcaVUQ9pgYrmixF+KYaVobR/AQ23TIGl1rjOXckaRR2a0kbPCfaijs584fPFlyrDzY/9XarDi
Now, let’s run the gzip command on this file using different levels of compression and observe the speed and compression ratio:
$ time gzip -1 -c random1gb.txt > random1gb-level1.gz

real    0m30.341s
user    0m28.715s
sys     0m0.760s
$ time gzip -9 -c random1gb.txt > random1gb-level9.gz

real    0m33.059s
user    0m32.402s
sys     0m0.651s
The speed difference between the different compression levels is almost negligible in this case as compared to the English Wikipedia dump we’ve been using throughout the tutorial.
Let’s check the file size:
$ ls -lh random1gb-level*.gz
-rw-r--r-- 1 user user 741M May  7 06:42 random1gb-level1.gz
-rw-r--r-- 1 user user 725M May  7 06:43 random1gb-level9.gz
Again, the difference between the two extremes of the compression-level range is insignificant. Additionally, the reduction in file size isn't as significant either: we save roughly 300MB of space, compared to the English Wikipedia dataset, where we observed savings of close to 700MB.
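We can reproduce this effect at a much smaller scale. The sketch below compresses roughly 1MB of highly repetitive text and roughly 1MB of random text with the same settings; the file names are illustrative:

```shell
# ~1MB of repetitive text (illustrative name)
awk 'BEGIN { for (i = 0; i < 25000; i++) print "the quick brown fox jumps over the lazy dog" }' > repetitive.txt

# ~1MB of random base64 text (illustrative name)
head -c 1000000 /dev/urandom | base64 > random.txt

# Compress both at the maximum level
gzip -9 -c repetitive.txt > repetitive.gz
gzip -9 -c random.txt > random.gz

# The repetitive file shrinks dramatically; the random one barely shrinks
wc -c repetitive.gz random.gz
```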
The point is, before jumping into the different methods covered in this tutorial, it's important to first identify the characteristics of the file in question. If the file contains a mostly random sequence of characters, compression might not even be worthwhile in the first place.
7. Conclusion

In this article, we learned about the different compression levels of the gzip command and understood how they can be tuned to speed up compression.
Then, we also saw that the vanilla gzip command is single-threaded and cannot take advantage of the multi-cores available. As an alternative, we used pigz on the same file and observed a tremendous speed-up thanks to its multi-threaded implementation.
Then, we looked at an entirely different compression method, the lz4, and saw how it outperforms gzip in terms of speed while resulting in a larger file. Finally, we learned that the content of the files plays a huge role in the compression performance.