Course – LS – All

Get started with Spring and Spring Boot, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Introduction

In this tutorial, we’ll explore different Java libraries that we can use to extract tar archives. The tar format originated as a Unix-based utility to package files together, uncompressed. But today, it’s very common to compress tar archives with gzip. So, we’ll see how compressed vs. uncompressed tar archives affect our code.

2. Creating a Base Class for Implementations

To avoid boilerplate, let’s start with an abstract class we’ll use as the basis for our implementations. This class will define a single abstract method, untar(), which will perform the extraction:

public abstract class TarExtractor {

    private InputStream tarStream;
    private boolean gzip;
    private Path destination;

    // ...

    public abstract void untar() throws IOException;
}

Now, let’s define a couple of constructors for our base class. The primary constructor will receive a tar archive as an InputStream, a flag indicating whether the contents are compressed, and a Path to where the files will be extracted:

protected TarExtractor(InputStream in, boolean gzip, Path destination) throws IOException {
    this.tarStream = in;
    this.gzip = gzip;
    this.destination = destination;

    Files.createDirectories(destination);
}

Note that the constructor also creates the destination directory, including any missing parents, with Files.createDirectories(). This way, callers don’t need to create the destination folder themselves. For the sake of simplicity, we’re using a boolean to indicate whether our archive uses gzip, so we don’t need to write code to detect the actual file type from its contents.
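If we did want to detect gzip from the contents instead of passing a flag, one option is to peek at the first two bytes of the stream, since every gzip stream starts with the magic bytes 0x1f 0x8b. The following is only a minimal sketch of that idea; the GzipSniffer class and its method name are hypothetical and aren’t part of the tutorial’s code:

```java
import java.io.BufferedInputStream;
import java.io.IOException;

class GzipSniffer {

    // Peeks at the first two bytes and rewinds, so the caller can still read the whole stream.
    static boolean isGzip(BufferedInputStream in) throws IOException {
        in.mark(2);
        int first = in.read();   // -1 if the stream is empty
        int second = in.read();
        in.reset();

        // gzip magic number: 0x1f followed by 0x8b
        return first == 0x1f && second == 0x8b;
    }
}
```

Because the method needs mark() and reset(), it takes a BufferedInputStream, which is the same wrapper the extractors already place around the archive stream.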

Then, in our second constructor, we’ll accept a Path to a tar archive and determine if it’s compressed based on the file name. Note that this relies on the file name being correct:

protected TarExtractor(Path tarFile, Path destination) throws IOException {
    // Path.endsWith() matches whole path elements, so we compare the name as a String
    this(Files.newInputStream(tarFile), tarFile.toString().endsWith("gz"), destination);
}
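When matching on the file name, it’s worth remembering that Path.endsWith(String) compares whole path elements rather than characters, so the check has to run on the name as a String. As a rough sketch (the GzipNames class below is a hypothetical helper, and treating “.tgz” as gzip is our own assumption), this could look like:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

class GzipNames {

    // Checks the last path element as text, ignoring case.
    static boolean hasGzipExtension(Path tarFile) {
        Path fileName = tarFile.getFileName(); // null for a root path such as "/"
        if (fileName == null) {
            return false;
        }
        String name = fileName.toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".gz") || name.endsWith(".tgz");
    }

    public static void main(String[] args) {
        Path archive = Paths.get("backups", "test.tar.gz");

        // Path.endsWith() looks for a trailing path element named "gz", which doesn't exist here
        System.out.println(archive.endsWith("gz"));    // prints false
        System.out.println(hasGzipExtension(archive)); // prints true
    }
}
```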

Finally, to simplify tests, we’ll create an interface with a static method that returns a tar archive from our resources folder:

public interface Resources {
    
    static InputStream tarGzFile() {
        return Resources.class.getResourceAsStream("/untar/test.tar.gz");
    }
}

This can be any tar archive compressed with gzip. We wrap it in a method so that each call returns a fresh stream, which avoids “stream closed” errors when more than one test reads the archive.

3. Extraction Using Apache Commons Compression

In our first implementation, we’ll use the Apache Commons library commons-compress:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.23.0</version>
</dependency>

The solution involves instantiating a TarArchiveInputStream that reads our archive stream. If the archive uses gzip, we first wrap the archive stream in a GzipCompressorInputStream, so the TarArchiveInputStream receives decompressed data:

public class TarExtractorCommonsCompress extends TarExtractor {

    protected TarExtractorCommonsCompress(InputStream in, boolean gzip, Path destination) throws IOException {
        super(in, gzip, destination);
    }

    public void untar() throws IOException {
        try (BufferedInputStream inputStream = new BufferedInputStream(getTarStream());
          TarArchiveInputStream tar = new TarArchiveInputStream(
          isGzip() ? new GzipCompressorInputStream(inputStream) : inputStream)) {
            ArchiveEntry entry;
            while ((entry = tar.getNextEntry()) != null) {
                Path extractTo = getDestination().resolve(entry.getName());
                if (entry.isDirectory()) {
                    Files.createDirectories(extractTo);
                } else {
                    Files.copy(tar, extractTo);
                }
            }
        }
    }
}

First, we iterate over the entries of our TarArchiveInputStream by calling getNextEntry() until it returns null, which signals the end of the archive. Then, if the entry is a directory, we create it relative to our destination folder. This way, we don’t get an error when writing a file inside it. Otherwise, we use Files.copy() to write the entry’s contents from our tar stream to the path where we want to extract it.
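One thing the loop above doesn’t guard against is an entry name such as ../../outside.txt, which would resolve to a path outside our destination folder. If we extract archives we don’t fully trust, we may want to validate each path before writing. Here’s a minimal sketch of such a check using only java.nio; the SafePaths class and resolveInside() method are hypothetical names, not part of commons-compress:

```java
import java.nio.file.Path;

class SafePaths {

    // Resolves entryName against destination and rejects results that escape it.
    static Path resolveInside(Path destination, String entryName) {
        Path root = destination.toAbsolutePath().normalize();
        Path target = root.resolve(entryName).normalize();

        // After normalize(), any ".." segments are collapsed, so startsWith() is reliable
        if (!target.startsWith(root)) {
            throw new IllegalArgumentException("Entry escapes destination: " + entryName);
        }
        return target;
    }
}
```

With a helper like this, the extraction loop could call SafePaths.resolveInside(getDestination(), entry.getName()) instead of resolving the entry name directly.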

Let’s test it by extracting the archive contents into an arbitrary folder:

@Test
public void givenTarGzFile_whenUntar_thenExtractedToDestination() throws IOException {
    Path destination = Paths.get("/tmp/commons-compress-gz");

    new TarExtractorCommonsCompress(Resources.tarGzFile(), true, destination).untar();

    try (Stream<Path> files = Files.list(destination)) {
        assertTrue(files.findFirst().isPresent());
    }
}
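Note that Files.copy() throws a FileAlreadyExistsException when the target file exists, so running this test twice against the same fixed folder would fail on the second run. One way around that, sketched below with hypothetical helper names, is to extract into a fresh temporary directory and delete it afterward:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

class TempFolders {

    // Creates a new, empty directory under the system temp folder.
    static Path create(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deletes the directory and everything below it, children before parents.
    static void deleteRecursively(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder())
              .forEach(path -> {
                  try {
                      Files.delete(path);
                  } catch (IOException e) {
                      throw new UncheckedIOException(e);
                  }
              });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test could then use TempFolders.create("commons-compress-gz") as its destination and call TempFolders.deleteRecursively() in a cleanup step.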

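If our archive weren’t using gzip, we’d only need to pass false when instantiating our TarExtractorCommonsCompress object. Also, note that commons-compress isn’t limited to gzip: the library ships other compressor streams, such as BZip2CompressorInputStream, that we could wrap around our input stream in the same way.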

4. Extraction Using Apache Ant

With Apache Ant, we can get close to a core Java implementation, as we can use GZIPInputStream from java.util.zip in case our archive is using gzip:

<dependency>
    <groupId>org.apache.ant</groupId>
    <artifactId>ant</artifactId>
    <version>1.10.13</version>
</dependency>

We’ll have a very similar implementation:

public class TarExtractorAnt extends TarExtractor {

    // standard delegate constructor

    public void untar() throws IOException {
        try (TarInputStream tar = new TarInputStream(new BufferedInputStream(
          isGzip() ? new GZIPInputStream(getTarStream()) : getTarStream()))) {
            TarEntry entry;
            while ((entry = tar.getNextEntry()) != null) {
                Path extractTo = getDestination().resolve(entry.getName());
                if (entry.isDirectory()) {
                    Files.createDirectories(extractTo);
                } else {
                    Files.copy(tar, extractTo);
                }
            }
        }
    }
}

The logic is the same here, but we use TarInputStream and TarEntry from Apache Ant instead of TarArchiveInputStream and ArchiveEntry. We can test it the same way as the previous solution:

@Test
public void givenTarGzFile_whenUntar_thenExtractedToDestination() throws IOException {
    Path destination = Paths.get("/tmp/ant-gz");

    new TarExtractorAnt(Resources.tarGzFile(), true, destination).untar();

    try (Stream<Path> files = Files.list(destination)) {
        assertTrue(files.findFirst().isPresent());
    }
}
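Since the gzip layer in this implementation comes from the JDK rather than from Ant, we can try it in isolation. The following sketch, which uses a hypothetical GzipRoundTrip class, compresses a few bytes in memory with GZIPOutputStream and reads them back through GZIPInputStream, the same class our extractor wraps around the archive stream:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTrip {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer into the buffer
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "hello tar".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8)); // prints hello tar
    }
}
```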

5. Extraction Using Apache VFS

In our last example, we’ll use Apache commons-vfs2, which supports different file system schemes with a single API. One of them is tar archives:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-vfs2</artifactId>
    <version>2.9.0</version>
</dependency>

But, since we’re reading from an input stream, we’ll first need to save our stream to a temp file so we can generate a URI afterward:

public class TarExtractorVfs extends TarExtractor {

    // standard delegate constructor

    public void untar() throws IOException {
        Path tmpTar = Files.createTempFile("temp", isGzip() ? ".tar.gz" : ".tar");
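        // createTempFile() already created an empty file, so we must allow overwriting it
        Files.copy(getTarStream(), tmpTar, StandardCopyOption.REPLACE_EXISTING);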

        // ...

        Files.delete(tmpTar);
    }
}

We’ll delete our temp file at the end of our extraction. Next, we’ll get an instance of a FileSystemManager and resolve our file URI into a FileObject, which we’ll then use to iterate over our archive contents:

FileSystemManager fsManager = VFS.getManager();
String uri = String.format("%s:file://%s", isGzip() ? "tgz" : "tar", tmpTar);
FileObject tar = fsManager.resolveFile(uri);

Note that, for resolveFile(), we construct our URI differently if we’re using gzip, prefixing it with “tgz” (which means tar+gzip) instead of “tar”. Then, at last, we iterate over our archive contents, extracting each file:

for (FileObject entry : tar) {
    Path extractTo = Paths.get(
      getDestination().toString(), entry.getName().getPath());

    if (entry.isReadable() && entry.getType() == FileType.FILE) {
        Files.createDirectories(extractTo.getParent());

        try (FileContent content = entry.getContent(); 
          InputStream stream = content.getInputStream()) {
            Files.copy(stream, extractTo);
        }
    }
}

And, because we might receive our items out of order, we’ll check if our entry is a file and call createDirectories() on its parent. This way, we don’t risk creating a file before creating its directory. Lastly, the entry path is returned with a leading slash, and Path.resolve() returns an absolute argument unchanged instead of appending it. So, unlike in previous implementations, we build our destination path with Paths.get(). Let’s test it:

@Test
public void givenTarGzFile_whenUntar_thenExtractedToDestination() throws IOException {
    Path destination = Paths.get("/tmp/vfs-gz");

    new TarExtractorVfs(Resources.tarGzFile(), true, destination).untar();

    try (Stream<Path> files = Files.list(destination)) {
        assertTrue(files.findFirst().isPresent());
    }
}
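The leading-slash behavior described above is easy to check with plain java.nio. In this small sketch (the LeadingSlashDemo class is hypothetical, and the paths assume a Unix-like file system), resolve() discards the destination when given an absolute path, whereas Paths.get() joins the two strings:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class LeadingSlashDemo {

    // Mirrors the approach used by the commons-compress and Ant extractors
    static Path viaResolve(Path destination, String entryPath) {
        return destination.resolve(entryPath);
    }

    // Mirrors the approach used by the VFS extractor
    static Path viaJoin(Path destination, String entryPath) {
        return Paths.get(destination.toString(), entryPath);
    }

    public static void main(String[] args) {
        Path destination = Paths.get("/tmp/vfs-gz");
        String entryPath = "/dir/file.txt"; // VFS-style entry path with a leading slash

        // On a Unix-like system:
        System.out.println(viaResolve(destination, entryPath)); // prints /dir/file.txt
        System.out.println(viaJoin(destination, entryPath));    // prints /tmp/vfs-gz/dir/file.txt
    }
}
```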

This solution is mostly worthwhile if we already use VFS in our project, as it requires a little more code and an intermediate temp file.

6. Conclusion

In this article, we learned how to extract tar archives using different libraries. Our implementations extended from a base class, reducing our code and making them simpler to use.

And as always, the source code is available over on GitHub.
