1. Overview

In this tutorial, we’ll see how to determine whether a file is compressed in one of the classic formats handled by archive managers such as File Roller, Ark, and others.

Our goal is to detect any tape archive file or any other data compressed with gzip, bzip2, compress, lrzip, lzip, lzma, lzop, rar, 7z, zip, xz, or zstd. The approaches we’ll see can be extended to other compression algorithms.

On the other hand, we won’t discuss formats that are intrinsically compressed, such as images, audio, and video. Nor will we deal with ZIP archives that are actually document formats, such as EPUB, Office Open XML, and OpenDocument.

2. Our Testing Files

Let’s save this Bash script as create_compressed_files.sh and run it to create enough test files to cover a wide range of compressed file types commonly used in the Linux world:

#!/bin/bash

# Check for the presence of required compression programs
required_programs=("7z" "rar" "zip" "gzip" "bzip2" "compress" "lrzip" "lzip" "lzma" "lzop" "xz" "zstd")

for program in "${required_programs[@]}"; do
    if ! command -v "$program" &> /dev/null; then
        echo "Error: $program is not installed. Please install it and try again."
        exit 1
    fi
done

# Create sample files
echo "Test 1" > test1.txt
echo "Test 2" > test2.txt
echo "Test 3" > test3.txt

# Create 7-Zip archive
7z a -t7z test.7z test1.txt test2.txt test3.txt

# Create RAR archive
rar a test.rar test1.txt test2.txt test3.txt

# Create ZIP archive
zip test.zip test1.txt test2.txt test3.txt

# Create tape archive file compressed with gzip
tar czf test.tar.gz test1.txt test2.txt test3.txt

# Create tape archive file compressed with bzip2
tar cjf test.tar.bz2 test1.txt test2.txt test3.txt

# Create tape archive file compressed with compress
tar cZf test.tar.Z test1.txt test2.txt test3.txt

# Create tape archive file compressed with lrzip
tar -I lrzip -cf test.tar.lrz test1.txt test2.txt test3.txt

# Create tape archive file compressed with lzip
tar --lzip -cf test.tar.lz test1.txt test2.txt test3.txt

# Create tape archive file compressed with lzma
tar --lzma -cf test.tar.lzma test1.txt test2.txt test3.txt

# Create tape archive file compressed with lzop
tar --lzop -cf test.tar.lzo test1.txt test2.txt test3.txt

# Create tape archive file compressed with 7z (requires p7zip-full package)
tar cf - test1.txt test2.txt test3.txt | 7z a -si test.tar.7z

# Create tape archive file compressed with xz
tar --xz -cf test.tar.xz test1.txt test2.txt test3.txt

# Create tape archive file compressed with zstd
tar --zstd -cf test.tar.zst test1.txt test2.txt test3.txt

# Compress test1.txt with various compressors
gzip -c test1.txt > test.gz
bzip2 -c test1.txt > test.bz2
compress -c test1.txt > test.Z
lrzip -q -o test.lrz test1.txt
lzip -c test1.txt > test.lz
lzma -c test1.txt > test.lzma
lzop -q -o test.lzo test1.txt
xz -c test1.txt > test.xz
zstd -q -o test.zst test1.txt

# Clean up temporary files
rm test1.txt test2.txt test3.txt

echo "All sample compressed files were created successfully."

The check at the beginning of the script reports any missing compression utility, whose package name is usually the name of the program itself. But in the case of compress, the relevant package on Debian and Fedora is ncompress.
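
As a small sketch of that mapping, the hypothetical helper below (package_for is our own name, not part of the script above) prints the package to install for a given program. Only the ncompress and p7zip-full names come from this tutorial; the fallback assumes the package is named after the program, which may not hold on every distribution:

```bash
#!/bin/bash

# Hypothetical helper: map a program name to the package that provides it.
# Assumption: apart from the two special cases, package name == program name.
package_for() {
  case "$1" in
    compress) echo "ncompress" ;;   # Debian and Fedora
    7z)       echo "p7zip-full" ;;  # as noted in the script's comment
    *)        echo "$1" ;;
  esac
}

package_for compress
package_for zstd
```

The error message in the initial check could then print the result of package_for "$program" instead of the bare program name.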

If all goes well, we’ll see 22 compressed files in our file manager:

[Screenshot: create_compressed_files.sh executed]

All of these test files have the correct extensions. We’ll address the possibility of invalid or missing extensions when we deal with magic numbers.

3. File Extensions

The first step in recognizing a compressed file is to check its extension. Although this doesn’t guarantee the actual contents of the file, it’s a good place to start.

The same type of compressed file can have more than one valid extension. Instead of trying to remember all the possibilities, we can create a Bash script that takes a file as input and returns the file type based on the extension as output. Let’s save it as check_extension.sh:

#!/bin/bash

# Get the file name without path
filename=$(basename "$1")

# Get the extension(s) of the file
ext="${filename##*.}"
if [[ "$ext" == "$filename" ]]; then
  echo "Extension absent"
  exit 1;
fi

# Check if there is a double extension
base="${filename%.*}"
ext2="${base##*.}"
if [[ "$ext2" == "tar" ]]; then
  ext="$ext2.$ext"
fi

# Match the extension with the file type
case "$ext" in
  7z) echo "7-Zip archive";;
  bz2) echo "file compressed with \"bzip2\"";;
  gz) echo "file compressed with \"gzip\"";;
  lrz) echo "file compressed with \"lrzip\"";;
  lz) echo "file compressed with \"lzip\"";;
  lzma) echo "file compressed with \"lzma\"";;
  lzo) echo "file compressed with \"lzop\"";;
  rar) echo "RAR archive";;
  tar.7z) echo "tape archive file compressed with \"7z\"";;
  tar.bz2|tbz2|tb2|tbz|tz2) echo "tape archive file compressed with \"bzip2\"";;
  tar.gz|tgz|taz) echo "tape archive file compressed with \"gzip\"";;
  tar.lrz|tlrz) echo "tape archive file compressed with \"lrzip\"";;
  tar.lz) echo "tape archive file compressed with \"lzip\"";;
  tar.lzma|tlz) echo "tape archive file compressed with \"lzma\"";;
  tar.lzo|tzo) echo "tape archive file compressed with \"lzop\"";;
  tar.xz|txz) echo "tape archive file compressed with \"xz\"";;
  tar.Z|taZ|tZ) echo "tape archive file compressed with \"compress\"";;
  tar.zst|tzst) echo "tape archive file compressed with \"zstd\"";;
  xz) echo "file compressed with \"xz\"";;
  Z) echo "file compressed with \"compress\"";;
  zip) echo "ZIP archive";;
  zst) echo "file compressed with \"zstd\"";;
  *) echo "Extension not recognized";;
esac

We can do some tests. There is no need for the tested file to actually exist since this script relies only on the file extension without accessing the file contents:

$ ./check_extension.sh test.7z
7-Zip archive

$ ./check_extension.sh test.tar.7z
tape archive file compressed with "7z"

$ ./check_extension.sh test.tar.bz2
tape archive file compressed with "bzip2"

$ ./check_extension.sh test.tb2
tape archive file compressed with "bzip2"

$ ./check_extension.sh test.xyz
Extension not recognized

$ ./check_extension.sh test
Extension absent
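
The script’s handling of double extensions rests on two Bash parameter expansions. The short sketch below isolates them on a sample name, so we can see what each one yields:

```bash
#!/bin/bash

filename="test.tar.bz2"

ext="${filename##*.}"   # strip the longest prefix matching "*."  -> bz2
base="${filename%.*}"   # strip the shortest suffix matching ".*" -> test.tar
ext2="${base##*.}"      # same prefix rule applied to the base    -> tar

echo "$ext | $base | $ext2"
```

When the name contains no dot, "${filename##*.}" leaves the string unchanged, which is exactly the condition the script uses to report a missing extension.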

The upcoming methods will be more robust because they will try to determine the actual contents of the files.

4. Magic Number

A magic number, also known as a file signature or magic bytes, is a fixed sequence of bytes, usually at the beginning of a file, that identifies its format. It’s especially helpful for determining the file format regardless of the extension, which can sometimes be missing or incorrect.

For example, let’s run od -t x1 test.gz to see the contents of the file test.gz in hexadecimal format, with 16 bytes of data displayed on each line:

$ od -t x1 test.gz
0000000 1f 8b 08 08 18 3e e2 64 00 03 74 65 73 74 31 2e
[...]

Each pair of characters represents a byte in hexadecimal notation. For example, 1f is actually a byte in this context, corresponding to the sequence of bits 00011111. The same goes for 8b, which corresponds to 10001011 and so on.

In this example, the first two bytes 1f 8b are the magic number that identifies data compressed with gzip.
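
To make this concrete, here is a minimal sketch of a test for that two-byte signature. The function name is_gzip is our own, and we assume head supports the -c option, as it does in GNU coreutils:

```bash
#!/bin/bash

# Succeeds (exit status 0) if the file starts with the gzip magic number 1f 8b.
is_gzip() {
  local signature
  # Dump the first two bytes as hex and strip the spaces and newline od adds
  signature=$(head -c 2 -- "$1" | od -An -t x1 | tr -d ' \n')
  [ "$signature" = "1f8b" ]
}
```

For instance, is_gzip test.gz && echo "gzip data" prints the message only when the signature matches.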

4.1. file Command

The file utility uses a database of magic numbers that is updated by the maintainers of file, and it can vary between different versions and distributions of Linux. It also uses strategies to detect certain types of files that don’t have a magic number.

In most cases, it provides accurate information regardless of the file extension:

$ file test.tar.Z
test.tar.Z: compress'd data 16 bits
$ file test.zst
test.zst: Zstandard compressed data (v0.8+), Dictionary ID: None
$ cp test.tar.lzo noextension
$ cp test.tar.lzo invalidext.zip
$ file noextension 
noextension: lzop compressed data - version 1.040, LZO1X-1, os: Unix
$ file invalidext.zip 
invalidext.zip: lzop compressed data - version 1.040, LZO1X-1, os: Unix

In this case, it returns a result that is correct for all of our example files. Let’s keep the files noextension and invalidext.zip because we’ll use them again.
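
For scripting, parsing the human-readable description can be brittle. A possible alternative, sketched below, is to ask file for a MIME type with its --brief and --mime-type options and compare the answer against a list. The list and the function names mime_is_compressed and is_compressed are our own assumptions. The exact MIME strings differ between versions of file, so the list should be checked against the local installation:

```bash
#!/bin/bash

# Succeeds if the MIME type given as $1 is in our list of compressed formats.
# Assumption: this list is illustrative, not exhaustive.
mime_is_compressed() {
  case "$1" in
    application/gzip|application/x-gzip|application/x-bzip2|application/x-xz|\
    application/zstd|application/x-zstd|application/zip|application/x-7z-compressed|\
    application/x-rar|application/vnd.rar|application/x-lzma|application/x-lzip|\
    application/x-lzop|application/x-compress|application/x-lrzip) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeeds if file reports a MIME type from the list above for the file $1.
is_compressed() {
  mime_is_compressed "$(file --brief --mime-type -- "$1")"
}
```

Keeping the comparison in its own function makes it easy to adjust the list without touching the call to file.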

Although file is generally sufficient for a quick check, there are some drawbacks and situations where manually checking the first few bytes of a file or using custom scripts to compare them against a magic number database may be more appropriate:

  • file may not recognize some file formats that are unusual, custom, too old, or too new
  • In security-sensitive scenarios, we can’t rely on file alone because there may be malware designed to fool it
  • We might need to extract additional information from a file header that file doesn’t provide, such as specific metadata or attributes
  • Writing custom scripts allows us to have full control over the identification process, especially on systems where file isn’t available or behaves differently
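
As an illustration of extracting header metadata ourselves, the sketch below reads the modification time stored in a gzip header. According to RFC 1952, the two magic bytes are followed by the compression method, a flags byte, and then a four-byte little-endian timestamp. The function name gzip_mtime is our own, and we assume the input really is gzip data of at least eight bytes:

```bash
#!/bin/bash

# Print the MTIME field of a gzip header as seconds since the Unix epoch.
# Assumption: $1 is gzip data with a complete header (no validation is done).
gzip_mtime() {
  # od prints the first eight bytes as unsigned decimal numbers
  head -c 8 -- "$1" | od -An -t u1 | {
    read -r id1 id2 method flags m0 m1 m2 m3
    # Bytes 4-7 hold the timestamp, least significant byte first
    echo $(( m0 + (m1 << 8) + (m2 << 16) + (m3 << 24) ))
  }
}
```

For the test.gz header shown earlier, whose timestamp bytes are 18 3e e2 64, this prints 1692548632.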

In addition, a custom script can help us check the results of file for any discrepancies or anomalies. Before we go any further, let’s see what magic numbers we should be looking for.

4.2. Magic Numbers of Compression Formats

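These are the magic numbers, in hexadecimal, of all the compression formats we’re considering:

  • gzip: 1f 8b (or 1f 9e)
  • bzip2: 42 5a 68 (or 42 5a 30)
  • compress: 1f 9d (or 1f a0)
  • lrzip: 4c 52 5a 49
  • lzip: 4c 5a 49 50
  • lzma: 5d 00 00
  • lzop: 89 4c 5a 4f 00 0d 0a 1a 0a
  • rar: 52 45 7e 5e, 52 61 72 21 1a 07 00, or 52 61 72 21 1a 07 01 00
  • 7z: 37 7a bc af 27 1c
  • zip: 50 4b 03 04
  • xz: fd 37 7a 58 5a
  • zstd: 28 b5 2f fd (or 25 b5 2f fd)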

These magic numbers are specific to the most common versions and implementations of each compression format. Different versions or custom implementations may have slightly different magic numbers.

4.3. Bash Script to Check Magic Numbers

Now that we know the magic numbers of the formats we’re interested in, let’s create a Bash script that takes a file as input and returns the compression tool that created it.

It requires the head and xxd utilities. Let’s save it as check_magic_number.sh:

#!/bin/bash

# A function to check the magic number of a file and return the compression tool
check_magic_number () {
  # Read the first 12 bytes of the file in hexadecimal format
  magic_number=$(head -c 12 "$1" | xxd -p)

  # Compare the magic number with the known values and print the tool name
  case $magic_number in
    1f8b*|1f9e*) echo "data compressed with: gzip";;
    425a68*|425a30*) echo "data compressed with: bzip2";;
    1f9d*|1fa0*) echo "data compressed with: compress";;
    4c525a49*) echo "data compressed with: lrzip";;
    4c5a4950*) echo "data compressed with: lzip";;
    5d0000*) echo "data compressed with: lzma";;
    894c5a4f000d0a1a0a*) echo "data compressed with: lzop";;
    52457e5e*|526172211a0700*|526172211a070100*) echo "data compressed with: rar";;
    377abcaf271c*) echo "data compressed with: 7z";;
    504b0304*) echo "data compressed with: zip";;
    fd377a585a*) echo "data compressed with: xz";;
    28b52ffd*|25b52ffd*) echo "data compressed with: zstd";;
    *) echo "Unknown magic number";;
  esac
}

# Check if a file name is provided as an argument
if [ $# -eq 0 ]; then
  echo "Please provide a file name as an argument."
  exit 1
fi

# Check if the file exists and is readable
if [ ! -f "$1" ] || [ ! -r "$1" ]; then
  echo "The file does not exist or cannot be read."
  exit 2
fi

# Check if the file is empty
if [ ! -s "$1" ]; then
  echo "The file is empty."
  exit 3
fi

# Call the function to check the magic number and print the result
check_magic_number "$1"

Now we can check some example files we previously created:

$ ./check_magic_number.sh test.7z
data compressed with: 7z
$ ./check_magic_number.sh test.bz2
data compressed with: bzip2
$ ./check_magic_number.sh test.gz
data compressed with: gzip

We won’t report all 22 checks, as they are as expected, with the sole exception of test.tar.lzma, whose magic number is that of xz. This is acceptable because, according to the xz man page, lzma is an alias of xz --format=lzma and uses the same compression algorithm.
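
The magic number identifies only the outer compression layer, so test.gz and test.tar.gz produce the same answer. One possible way to look one level deeper, sketched here for gzip only, is to decompress the first block and search for the tar signature. The function name gzip_contains_tar is our own, and we assume the archive uses the ustar or GNU tar format:

```bash
#!/bin/bash

# Succeeds if the gzip-compressed file given as $1 contains a tar archive.
# Assumption: the archive uses the ustar or GNU format, both of which store
# the string "ustar" at byte offset 257 of the first header block.
gzip_contains_tar() {
  local marker
  # Keep the first 512-byte block, skip 257 bytes, and read the next five
  marker=$(gzip -dc -- "$1" 2>/dev/null | head -c 512 | tail -c +258 | head -c 5)
  [ "$marker" = "ustar" ]
}
```

The same idea should carry over to the other compressors by swapping gzip -dc for the matching decompression command.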

Much more interesting is what happens with a missing or invalid extension:

$ ./check_magic_number.sh noextension 
data compressed with: lzop
$ ./check_magic_number.sh invalidext.zip 
data compressed with: lzop

The result is as expected because the magic number check is completely independent of the extension.
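
Building on this, a script can also flag files whose extension disagrees with their contents. The following is a minimal sketch under stated assumptions: the function names expected_magic and extension_matches_content are hypothetical, and the table covers only a few of the formats discussed above:

```bash
#!/bin/bash

# Hypothetical helper: print the hex signature we expect for an extension.
# Assumption: only a handful of formats are listed, to keep the sketch short.
expected_magic() {
  case "${1##*.}" in
    gz|tgz) echo "1f8b" ;;
    bz2)    echo "425a68" ;;
    xz|txz) echo "fd377a585a" ;;
    zst)    echo "28b52ffd" ;;
    zip)    echo "504b0304" ;;
    7z)     echo "377abcaf271c" ;;
    lzo)    echo "894c5a4f" ;;
    *)      return 1 ;;
  esac
}

# Exit status: 0 = extension and contents agree, 1 = mismatch,
# 2 = extension not in our table.
extension_matches_content() {
  local expected actual
  expected=$(expected_magic "$1") || return 2
  actual=$(head -c 12 -- "$1" | od -An -t x1 | tr -d ' \n')
  case "$actual" in
    "$expected"*) return 0 ;;
    *)            return 1 ;;
  esac
}
```

With the files created earlier, extension_matches_content invalidext.zip returns 1, because the file carries the lzop signature rather than the ZIP one.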

5. Conclusion

In this article, we’ve seen how to tell if a file is compressed under Linux.

First, we looked at how to tell the file type by its extension, although this approach isn’t always reliable. Then, we found that the simplest approach is to use the file command, which already has a large internal database of magic numbers that will suffice in the vast majority of cases.

We then discussed situations where manual inspection or custom scripts with their own magic number database are needed to handle special cases, custom formats, or when additional control or security is required.
