Course – LS – All

Get started with Spring and Spring Boot, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Overview

In this tutorial, we’ll learn how to split a large file in Java. First, we’ll compare reading files in memory with reading files using streams. Later, we’ll learn to split files based on their size and number.

2. Read File In-Memory vs. Stream

Whenever we read a file in memory, the JVM keeps its entire content in memory. This is a good choice for small files. For large files, however, it can result in an OutOfMemoryError.

Streaming is another way to read a file, and Java offers several ways to do it. Because only a small part of the file is in memory at any time, streaming consumes less memory and works well with large files.

For our examples, we’ll use streams to read the large files.
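To make the difference concrete, here's a minimal sketch that counts the lines of a file both ways. The ReadStrategies class and its method names are hypothetical and only serve as an illustration; the first method holds every line in a list, while the second holds only the current line plus the reader's internal buffer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, used only to contrast the two reading styles
class ReadStrategies {

    // In-memory: all lines of the file are held in the list at the same time
    static long countLinesInMemory(Path file) throws IOException {
        List<String> allLines = Files.readAllLines(file, StandardCharsets.UTF_8);
        return allLines.size();
    }

    // Streaming: only the current line and the reader's buffer are held in memory
    static long countLinesStreaming(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path sample = Files.createTempFile("read-demo-", ".txt");
        try {
            Files.write(sample, Arrays.asList("first", "second", "third"), StandardCharsets.UTF_8);
            System.out.println(countLinesInMemory(sample));  // prints 3
            System.out.println(countLinesStreaming(sample)); // prints 3
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

Both methods return the same result. The difference is only in how much of the file has to fit into the heap at once.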

3. File Split by File Size

So far, we've looked at how to read large files. Sometimes, though, we need to split them into smaller files or send them over the network in smaller chunks.
First, we'll split the large file into smaller files, each with a specific maximum size.
For our example, we'll take one 4.3MB file, largeFile.txt, in our project's src/main/resources folder and split it into files of 1MB each, stored under the target/split directory.
Let’s first get the large file and open an input stream on it:

File largeFile = new File("LARGE_FILE_PATH");
InputStream inputstream = Files.newInputStream(largeFile.toPath());

Here, we only open the file. Its content isn't loaded into memory until we start reading from the stream.

For our example, we use a constant, fixed size. In practical use cases, the maxSizeOfSplitFiles value can be read dynamically and changed as the application needs.

Now, let's define a method that takes the largeFile object, the maxSizeOfSplitFiles for each split file, and the directory in which to store the split files:

public List<File> splitByFileSize(File largeFile, int maxSizeOfSplitFiles, String splitFileDirPath) 
  throws IOException {
    // ...
}

Now, let's create a SplitLargeFile class and implement the splitByFileSize() method:

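import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FilenameUtils; // Apache Commons IO

class SplitLargeFile {

    public List<File> splitByFileSize(File largeFile, int maxSizeOfSplitFiles, String splitFileDirPath) 
      throws IOException {

        // a zero-length buffer would never make progress, so reject it up front
        if (maxSizeOfSplitFiles <= 0) {
            throw new IllegalArgumentException("maxSizeOfSplitFiles must be positive");
        }
        List<File> listOfSplitFiles = new ArrayList<>();
        String baseName = FilenameUtils.removeExtension(largeFile.getName());
        try (InputStream in = Files.newInputStream(largeFile.toPath())) {
            // only one chunk of at most maxSizeOfSplitFiles bytes is in memory at a time
            final byte[] buffer = new byte[maxSizeOfSplitFiles];
            // readNBytes() (Java 9+) fills the buffer unless the stream ends first,
            // whereas a single read() may return fewer bytes; it returns 0 at end of stream
            int dataRead = in.readNBytes(buffer, 0, buffer.length);
            while (dataRead > 0) {
                File splitFile = getSplitFile(baseName, buffer, dataRead, splitFileDirPath);
                listOfSplitFiles.add(splitFile);
                dataRead = in.readNBytes(buffer, 0, buffer.length);
            }
        }
        return listOfSplitFiles;
    }

    private File getSplitFile(String largeFileName, byte[] buffer, int length, String splitFileDirPath) 
      throws IOException {

        // createTempFile() generates a unique name such as <largeFileName>-<random>-split
        File splitFile = File.createTempFile(largeFileName + "-", "-split", new File(splitFileDirPath));
        try (FileOutputStream fos = new FileOutputStream(splitFile)) {
            // write only the bytes that were actually read; the last chunk is usually shorter
            fos.write(buffer, 0, length);
        }
        return splitFile;
    }
}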

Using maxSizeOfSplitFiles, we specify the maximum number of bytes each split file can contain.
At most maxSizeOfSplitFiles bytes are loaded into memory at a time and written to a small file. The next chunk is then read into the same buffer, replacing the previous one. This keeps memory usage bounded by the chunk size and avoids an OutOfMemoryError.
As a final step, the method returns the list of split files stored under splitFileDirPath.
We can store the split files in a temporary directory or in any custom directory.
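The chunk-by-chunk reading described above can be sketched in isolation with JDK classes only. The ChunkedRead class and its fill() helper are hypothetical names introduced for illustration. Since a single InputStream.read() call is allowed to return fewer bytes than requested, this sketch keeps reading until the buffer is full or the stream ends:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class that only records how large each chunk would be
class ChunkedRead {

    // Reads until the buffer is full or the stream ends; returns the bytes read (0 at end of stream)
    static int fill(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }

    // Returns the size of every chunk produced when reading the stream with one reusable buffer
    static List<Integer> chunkSizes(InputStream in, int maxChunkSize) throws IOException {
        if (maxChunkSize <= 0) {
            throw new IllegalArgumentException("maxChunkSize must be positive");
        }
        List<Integer> sizes = new ArrayList<>();
        byte[] buffer = new byte[maxChunkSize]; // the only large allocation, reused for each chunk
        int read;
        while ((read = fill(in, buffer)) > 0) {
            sizes.add(read);
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // 10 bytes read in chunks of at most 4 bytes
        System.out.println(chunkSizes(new ByteArrayInputStream(new byte[10]), 4)); // prints [4, 4, 2]
    }
}
```

In a real split, each recorded chunk would be written to its own file before the buffer is reused.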
Now, let’s test this:

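import static org.junit.jupiter.api.Assertions.assertEquals;

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

public class SplitLargeFileUnitTest {

    // JUnit 5 counterpart of JUnit 4's @BeforeClass
    @BeforeAll
    static void prepareData() throws IOException {
        Files.createDirectories(Paths.get("target/split"));
    }

    private String splitFileDirPath() {
        return Paths.get("target", "split").toString();
    }

    private Path largeFilePath() throws Exception {
        return Paths.get(this.getClass().getClassLoader().getResource("largeFile.txt").toURI());
    }

    @Test
    void givenLargeFile_whenSplitLargeFile_thenSplitBySize() throws Exception {
        File input = largeFilePath().toFile();
        SplitLargeFile slf = new SplitLargeFile();
        List<File> splitFiles = slf.splitByFileSize(input, 1024_000, splitFileDirPath());

        // together, the split files must contain exactly as many bytes as the original
        assertEquals(input.length(), splitFiles.stream().mapToLong(File::length).sum());
    }
}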

Finally, when we run the test, the program splits the large file into four files of about 1MB each and one file of 240KB, and puts them under the project's target/split directory.
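We can predict this result with a little arithmetic. The sketch below assumes a hypothetical file size of 4,336,000 bytes, chosen only because it's consistent with the numbers above (the exact size of largeFile.txt isn't stated), and the SplitArithmetic class is likewise just an illustration:

```java
// Hypothetical helper class for predicting the outcome of a size-based split
class SplitArithmetic {

    // Number of files needed when each file holds at most maxBytesPerFile bytes
    static long expectedFileCount(long totalBytes, long maxBytesPerFile) {
        long fullFiles = totalBytes / maxBytesPerFile;
        return totalBytes % maxBytesPerFile == 0 ? fullFiles : fullFiles + 1;
    }

    // Size of the last file, which holds whatever is left after the full-size files
    static long lastFileSize(long totalBytes, long maxBytesPerFile) {
        if (totalBytes == 0) {
            return 0;
        }
        long remainder = totalBytes % maxBytesPerFile;
        return remainder == 0 ? maxBytesPerFile : remainder;
    }

    public static void main(String[] args) {
        long totalBytes = 4_336_000L; // assumed size, not taken from the article
        long maxBytes = 1024_000L;    // same limit as in the test
        System.out.println(expectedFileCount(totalBytes, maxBytes)); // prints 5
        System.out.println(lastFileSize(totalBytes, maxBytes));      // prints 240000
    }
}
```

Note that 1024_000 bytes is slightly less than one binary megabyte (1,048,576 bytes), which is why the split files are only approximately 1MB.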

4. File Split by File Count

Now, let's split the given large file into a specified number of smaller files. For this, we'll first compute the size of each small file from the size of the large file and the requested number of files, rounding up so that all the data fits.

Also, we’ll use the same method splitByFileSize() from earlier internally for the actual splitting.

Let’s create a method splitByNumberOfFiles():

class SplitLargeFile {

    public List<File> splitByNumberOfFiles(File largeFile, int noOfFiles, String splitFileDirPath)
      throws IOException {
        return splitByFileSize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles), splitFileDirPath);
    }

    private int getSizeInBytes(long largefileSizeInBytes, int numberOfFilesforSplit) {
        if (numberOfFilesforSplit <= 0) {
            throw new IllegalArgumentException("number of files must be positive");
        }
        // round up, so that all the data fits into the requested number of files
        long x = largefileSizeInBytes / numberOfFilesforSplit;
        if (largefileSizeInBytes % numberOfFilesforSplit != 0) {
            x++;
        }
        // the chunk is held in a byte array, so its size must fit into an int
        if (x > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("size too large");
        }
        return (int) x;
    }
}

Now, let’s test this:

@Test
void givenLargeFile_whenSplitLargeFile_thenSplitByNumberOfFiles() throws Exception { 
    File input = largeFilePath().toFile(); 
    SplitLargeFile slf = new SplitLargeFile(); 
    slf.splitByNumberOfFiles(input, 3, splitFileDirPath()); 
}

Finally, when we run the test, the program splits the large file into three files of about 1.4MB each and puts them under the project's target/split directory.
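To see how the rounded-up size distributes the data, here's a small sketch that lists the resulting file sizes for a given total size and file count. The SplitByCount class is hypothetical, and the 4,336,000-byte size is an assumption for illustration, not the actual size of largeFile.txt:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for predicting the outcome of a count-based split
class SplitByCount {

    // Per-file size: the total divided by the number of files, rounded up
    static long chunkSizeFor(long totalBytes, int numberOfFiles) {
        if (numberOfFiles <= 0) {
            throw new IllegalArgumentException("numberOfFiles must be positive");
        }
        long size = totalBytes / numberOfFiles;
        return totalBytes % numberOfFiles == 0 ? size : size + 1;
    }

    // Sizes of the files produced when splitting totalBytes with that per-file size
    static List<Long> resultingSizes(long totalBytes, int numberOfFiles) {
        long chunkSize = chunkSizeFor(totalBytes, numberOfFiles);
        List<Long> sizes = new ArrayList<>();
        long remaining = totalBytes;
        while (remaining > 0) {
            long current = Math.min(chunkSize, remaining);
            sizes.add(current);
            remaining -= current;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(resultingSizes(4_336_000L, 3)); // prints [1445334, 1445334, 1445332]
        System.out.println(resultingSizes(5L, 4));         // prints [2, 2, 1]
    }
}
```

The second call shows an edge case worth knowing: because the size is rounded up, a very small input can produce fewer files than requested. Here, five bytes split into four files yields only three.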

5. Conclusion

In this article, we saw the differences between reading files in memory and via streams, which helps us choose the appropriate approach for a given use case. Then, we learned two ways to split a large file into smaller ones: by file size and by number of files.

As always, the example code used in this article is over on GitHub.
