1. Overview

In Linux, gzip is a command-line tool that compresses files into the gzip file format. One of the main reasons for using gzip is to reduce file size and save disk space. However, the gzip command can be slow when compressing a large file. Of course, the larger the file, the longer the compression takes. But there are some methods we can use to speed up the compression of a large file.

In this tutorial, we'll learn how to speed up the compression of a file using some gzip options, as well as some alternative tools.
2. Baseline Performance
To compare the different methods, we’ll be using the English Wikipedia 2006 dump as our file to compress. First, we’ll download the file using wget:
$ wget -O enwik9.zip http://mattmahoney.net/dc/enwik9.zip
Then, we unzip the file using the unzip command:
$ unzip -j enwik9.zip
Archive:  enwik9.zip
  inflating: enwik9
$ ls -lh
total 954M
-rwxr-xr-x 1 user user 954M May  6 14:38 enwik9
The -j option prevents the command from creating a folder to store the output. From the output, we can see that the file is a text file with close to 1GB worth of data.
To establish a baseline, we’ll run the gzip command using the default settings on the file and time it:
$ time gzip -c enwik9 > enwik9-gzip-default.gz

real    0m33.929s
user    0m33.537s
sys     0m0.351s
As we can see, without tuning any settings, we manage to complete the compression in roughly 34 seconds. Let’s see how we can improve the performance and some of the tradeoffs we have to consider.
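To quantify the savings ourselves, we can compute the compression ratio from the byte counts of the original and compressed files. Here's a minimal sketch on a small generated file (the file names are illustrative):

```shell
# Generate a small, compressible sample file (illustrative name)
seq 1 100000 > sample.txt

# Compress it with gzip's default settings
gzip -c sample.txt > sample.txt.gz

# Compare byte counts to get the compression ratio
orig=$(wc -c < sample.txt)
comp=$(wc -c < sample.txt.gz)
awk -v o="$orig" -v c="$comp" 'BEGIN { printf "compressed to %.1f%% of original\n", 100 * c / o }'
```

The same two wc -c calls work on any pair of original and compressed files, so we can reuse this to compare the methods below.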
3. Changing the gzip Compression Level
The gzip command offers a compression level option that allows the users to tune the performance of compression.
3.1. The gzip Compression Level
In the gzip command, the compression level ranges from 1 to 9.
At level 1, the compression completes the fastest, but as a tradeoff, the compression is minimal, so the file size reduction is the smallest. As we move up the levels, the file size reduction improves at the expense of a slower compression process. At level 9, the compression is the slowest, but it offers the best possible reduction in file size.
By default, the command runs with a compression level of 6.
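We can check this ourselves: compressing the same input with and without an explicit -6 should produce identical output. In the sketch below, the sample file name is illustrative, and the -n flag omits the original name and timestamp from the gzip header so the two outputs are byte-for-byte comparable:

```shell
# Generate a small sample file (illustrative name)
printf 'hello gzip %.0s' $(seq 1 1000) > sample.txt

# Compress once with the default level, once with -6 explicitly;
# -n drops the original name and timestamp from the header
gzip -n -c sample.txt > default.gz
gzip -n -6 -c sample.txt > level6.gz

# cmp exits successfully only if the two files are identical
cmp default.gz level6.gz && echo "identical output"
```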
3.2. Passing the Compression Level Option
To change the compression level, we can pass the level as an option to gzip using -#, where the # symbol is replaced with the compression level value.
For example, we can compress the file using level 1 compression level by passing the -1 option:
$ gzip -1 -c enwik9 > enwik9-level1.gz
Similarly, we can change the compression level to level 9 using the -9 option:
$ gzip -9 -c enwik9 > enwik9-level9.gz
Using an integer outside of the range causes the command to exit with an error.
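To see the option in action across several levels at once, we can loop over a few values on a generated sample file (the file names are illustrative):

```shell
# Generate a compressible sample file (illustrative name)
seq 1 200000 > sample.txt

# Compress at the fastest, default, and best levels
for level in 1 6 9; do
    gzip -"$level" -c sample.txt > "sample-level$level.gz"
done

# Higher levels should produce smaller (or equal) files
wc -c sample-level*.gz
```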
3.3. Speed and Size Differences
Firstly, let’s look at the speed differences at either extreme of the compression levels, that is, at level 1 and level 9:
$ time gzip -1 -c enwik9 > enwik9-level1.gz

real    0m12.705s
user    0m12.165s
sys     0m0.403s
$ time gzip -9 -c enwik9 > enwik9-level9.gz

real    0m43.461s
user    0m43.064s
sys     0m0.380s
At compression level 1, we spend roughly 13 seconds compressing the file. At the other end, level 9 takes about 30 seconds longer.
The increase in time is expected, as we're asking the compression algorithm to reduce the file size as much as it can, at the cost of running time. To paint the full picture, we can see that the file size of enwik9-level1.gz is larger than that of enwik9-level9.gz:
$ ls -lh enwik9-level9.gz
-rw-r--r-- 1 user user 308M May  7 01:45 enwik9-level9.gz
$ ls -lh enwik9-level1.gz
-rw-r--r-- 1 user user 361M May  7 01:49 enwik9-level1.gz
Specifically, the file produced at level 1 is more than 50MB larger than the one produced at level 9.
4. Using the pigz Command
One of the main downsides of gzip is that it's single-threaded. In other words, the vanilla gzip command cannot take advantage of our system's multi-core processor, even when one is available. This is limiting when the bottleneck of our compression is CPU resources. As an alternative, we can use the pigz command.
The pigz command is an implementation of gzip that makes full use of the available cores of the processors by being multi-threaded.
It works by breaking the input into multiple chunks and performing the compression using different cores in parallel. Furthermore, pigz supports all the options in gzip, making it a drop-in replacement.
4.1. Installing pigz

To obtain the pigz command, we can install the pigz package using our package manager, such as YUM or APT. For example, on Ubuntu-based Linux distributions, we can run apt-get install pigz:
$ apt-get install -y pigz
As a verification step, we can check its version to ensure the binary is accessible through the PATH environment variable:
$ pigz --version
2.4
4.2. Compressing a File With Multiple Cores
Similar to gzip, we can change the compression level using the -# option, where the # symbol is a placeholder for a compression level ranging from 1 to 9. The pigz command also shares gzip's default compression level of 6.
Let’s run the compression using pigz at level 1 and level 9 to compare the result with vanilla gzip:
$ time pigz -c -1 enwik9 > enwik9-pigz-level1.gz

real    0m4.742s
user    0m13.496s
sys     0m0.536s
$ time pigz -c -9 enwik9 > enwik9-pigz-level9.gz

real    0m15.240s
user    0m44.860s
sys     0m0.588s
From the output of the time command, we can see that the real time is much shorter than the user time, which is the hallmark of multi-core processing. In other words, although the total CPU consumption is roughly the same, pigz completes the same compression in a shorter amount of wall-clock time by distributing the work among all the available CPU cores. This makes sense because, underneath, pigz uses the same compression method as gzip, which demands the same amount of CPU work.
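Because pigz emits standard gzip streams, anything it produces can be verified and decompressed by plain gzip. The sketch below falls back to gzip when pigz isn't installed; the file names and the explicit -p thread count are illustrative:

```shell
# Generate a sample file (illustrative name)
seq 1 100000 > sample.txt

if command -v pigz >/dev/null 2>&1; then
    # -p sets the thread count; by default pigz already uses all cores
    pigz -p "$(nproc)" -c sample.txt > sample.txt.gz
else
    # Fall back to single-threaded gzip
    gzip -c sample.txt > sample.txt.gz
fi

# Either way, the result is a valid gzip stream
gzip -t sample.txt.gz && echo "valid gzip stream"
```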
5. Using the lz4 Compression Algorithm
The gzip compression command internally uses the DEFLATE lossless compression method. In fact, a lot of other compression commands such as zip use the same compression method underneath. The widespread popularity of the DEFLATE method is mostly thanks to its good compression ratio, albeit at the cost of compression speed.
The lz4 compression algorithm is one alternative lossless compression algorithm to the DEFLATE method. In general, the lz4 compression algorithm offers a faster compression speed than DEFLATE at the cost of a lower compression ratio. It achieves this by sacrificing some features of DEFLATE, such as using a sub-optimal, but faster repetition-detection code.
Let’s see how lz4 performs on the same file.
5.1. The lz4 Command-Line Tool
To obtain the lz4 command, we can install the lz4 package using our package manager:
$ apt-get install -y lz4
$ lz4 --version
*** LZ4 command line interface 64-bits v1.9.2, by Yann Collet ***
Similar to gzip, lz4 offers multiple compression levels through the -# option (ranging from 1 to 12 in recent versions). For example, to compress a file using compression level 1, we run lz4 -1:
$ lz4 -1 enwik9
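As a quick sanity check, we can round-trip a small file through lz4 and confirm the decompressed output matches the original. This sketch guards against lz4 being absent, and the file names are illustrative:

```shell
# Generate a sample file (illustrative name)
seq 1 100000 > sample.txt

if command -v lz4 >/dev/null 2>&1; then
    # -q silences the banner, -f overwrites existing output
    lz4 -q -f -1 sample.txt sample.txt.lz4
    # -d decompresses
    lz4 -q -f -d sample.txt.lz4 restored.txt
    # cmp exits successfully only if the round trip is lossless
    cmp sample.txt restored.txt && echo "round trip OK"
else
    echo "lz4 not installed; skipping"
fi
```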
5.2. Compression Speed and Ratio Against gzip
Let’s run the lz4 compression on the same file at level 1 and level 9:
$ time lz4 -1 enwik9 enwik9-lz4-level1.gz
Compressed 1000000000 bytes into 509454838 bytes ==> 50.95%

real    0m2.602s
user    0m2.136s
sys     0m0.465s
$ time lz4 -9 enwik9 enwik9-lz4-level9.gz
Compressed 1000000000 bytes into 374839215 bytes ==> 37.48%

real    0m27.220s
user    0m26.770s
sys     0m0.441s
From the output of the time command, we can see that lz4 outperforms the single-threaded gzip implementation at both compression levels. Specifically, to compress the same file, it takes 2.6 seconds at level 1 and 27 seconds at level 9, respectively. This is a tremendous gain in performance compared to the 13 seconds and 43 seconds gzip needs at the same levels.
However, the downside is that the compression ratio will be worse than the result of gzip compression. Concretely, at compression level 1:
$ ls -lh enwik9-lz4-level1.gz
-rwxr-xr-x 1 user user 486M May  6 14:38 enwik9-lz4-level1.gz
$ ls -lh enwik9-level1.gz
-rw-r--r-- 1 user user 361M May  7 01:49 enwik9-level1.gz
We can see that lz4 compression results in a file that’s roughly 120MB larger as compared to the gzip compression. Similarly, let’s check with compression level 9:
$ ls -lh enwik9-lz4-level9.gz
-rwxr-xr-x 1 user user 358M May  6 14:38 enwik9-lz4-level9.gz
$ ls -lh enwik9-level9.gz
-rw-r--r-- 1 user user 308M May  7 01:45 enwik9-level9.gz
There’s roughly a 50MB increase in the compressed file size when compressed using lz4.
6. Characteristics of Files
It’s important to understand that the content of the file affects how fast and how much a file can be compressed.
Specifically, the more repeated substrings a text file contains, the more savings can be achieved. As a rule of thumb, a text file with frequently repeating patterns, such as HTML or English prose, is usually more compressible than a file containing a random sequence of characters or a binary file.
To demonstrate this point, let’s generate a text file with a random sequence of characters:
$ base64 /dev/urandom | head -c 1000000000 > random1gb.txt
Then, we can take a peek at the file and see that it consists mostly of a random sequence of characters:
$ head -n 3 random1gb.txt
gAuluNTIzQWk584fk+X26CeGE9fnJnYJhp1hoTAYL/GWl4ct+NU1UEjae7wFL9njAefSWbIfoF/Q
JUYw3nbkI18MrUfyDDPRP1h4UgqbZvk34RdP9K48VUrawC0gHP75G0N8cRqDkYxAy6qTaT+ICGFl
bBkkcaVUQ9pgYrmixF+KYaVobR/AQ23TIGl1rjOXckaRR2a0kbPCfaijs584fPFlyrDzY/9XarDi
Now, let’s run the gzip command on this file using different levels of compression and observe the speed and compression ratio:
$ time gzip -1 -c random1gb.txt > random1gb-level1.gz

real    0m30.341s
user    0m28.715s
sys     0m0.760s
$ time gzip -9 -c random1gb.txt > random1gb-level9.gz

real    0m33.059s
user    0m32.402s
sys     0m0.651s
The speed difference between the different compression levels is almost negligible in this case as compared to the English Wikipedia dump we’ve been using throughout the tutorial.
Let’s check the file size:
$ ls -lh random1gb-level*.gz
-rw-r--r-- 1 user user 741M May  7 06:42 random1gb-level1.gz
-rw-r--r-- 1 user user 725M May  7 06:43 random1gb-level9.gz
Again, the difference between the two extremes of the compression-level range is insignificant. Additionally, the reduction in file size isn't as significant either: we save roughly 300MB of space, compared to the English Wikipedia dataset, where we observed savings of close to 700MB.
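We can reproduce this effect at a much smaller scale. The sketch below compresses roughly 1MB of highly repetitive text and roughly 1MB of random text with the same settings; the file names are illustrative:

```shell
# ~1MB of repetitive text (illustrative name)
awk 'BEGIN { for (i = 0; i < 25000; i++) print "the quick brown fox jumps over the lazy dog" }' > repetitive.txt

# ~1MB of random base64 text (illustrative name)
head -c 1000000 /dev/urandom | base64 > random.txt

# Compress both at the maximum level
gzip -9 -c repetitive.txt > repetitive.gz
gzip -9 -c random.txt > random.gz

# The repetitive file shrinks dramatically; the random one barely shrinks
wc -c repetitive.gz random.gz
```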
The point is, before jumping into the different methods covered in this tutorial, it's important to first identify the characteristics of the file in question. If the file contains a mostly random sequence of characters, compression might not even be worthwhile in the first place.
7. Conclusion

In this article, we learned about the different compression levels of the gzip command and understood how they can be tuned to speed up compression.
Then, we also saw that the vanilla gzip command is single-threaded and cannot take advantage of the multi-cores available. As an alternative, we used pigz on the same file and observed a tremendous speed-up thanks to its multi-threaded implementation.
Then, we looked at an entirely different compression method, the lz4, and saw how it outperforms gzip in terms of speed while resulting in a larger file. Finally, we learned that the content of the files plays a huge role in the compression performance.