Working With Large Files in Linux

1. Introduction

Large files come about in many forms for users and system administrators alike. Their main benefits are encapsulation, centralization, and space efficiency. Yet, big files can also lead to many pitfalls.

In this tutorial, we explore ways to deal with large files in Linux. First, we check some common problems such files pose, as well as some of their solutions. After that, we describe tools that help when partially processing huge files. Next, several big file editors come into play with some statistics for each. Finally, we delve deeper into the subject with type optimizations and special physical memory usage.

We tested the code in the guide on Debian 10.10 (Buster) with GNU Bash 5.0.3 and Xfce 4.16. The examples use GNU long options, so they should work in any environment that provides the GNU versions of these tools.

2. Large File Size Issues

File size categorization depends on the available hardware. In short, low machine specifications can cause issues with smaller “large files”. We’ll be using a machine with 8GB RAM and an SSD.

For our purposes, let’s define big files as any files above 50MB. Furthermore, we’ll consider files above 1GB to be huge files. Working with such files presents challenges in many areas.

2.1. Transfer

Large file transfers take a long time. For example, a 5GB file would take around 8 minutes to transfer with the current world-average broadband download speed. The same file would require more than 17 minutes on mobile. If we need to download big files regularly, that’s a huge setback.

The obvious solution here is to increase the download speed. If that’s not an option, we can follow the suggestions in the next sections.
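To make the arithmetic behind such estimates concrete, here is a minimal sketch. The 85Mbps speed is an assumed figure chosen for illustration, not a measured average, and the size uses decimal units:

```shell
# Estimate how long a download takes at a given line speed.
# Both numbers are illustrative assumptions, not measurements.
size_gb=5        # file size in gigabytes (decimal, 1GB = 1000MB)
speed_mbps=85    # hypothetical download speed in megabits per second

# 1 byte = 8 bits, so the size in megabits is size_gb * 1000 * 8.
size_megabits=$((size_gb * 1000 * 8))
seconds=$((size_megabits / speed_mbps))

printf '%dGB at %dMbps: about %d min %d s\n' \
    "$size_gb" "$speed_mbps" $((seconds / 60)) $((seconds % 60))
```

With these assumed values, the estimate comes out just under 8 minutes. Halving the speed doubles the time, which is why regular transfers of big files add up quickly.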

Of course, after a transfer, we must store the file somewhere.

2.2. Storage

Disk IO remains a bottleneck in modern computers. Partial or full loading of a huge file from the main drive into the much faster RAM takes time.

We can reduce that time if we replace old hard disk drives with modern solid-state drives. Alternatively, old disks can employ RAID.

Whatever the storage, operating systems partition it into separate file systems, which handle actual operations.

2.3. File System

Old file systems have many limitations. For example, FAT32 limits a single file to just under 4GB. Separately, the bigger the files, the harder they are to place in the free space that remains on a partition.

Finally, allocation unit sizes (blocks) play a big role when processing big files. They can lead to a large number of chunks, but also fragmentation — when a file is stored in many far-away blocks on the physical medium.

We should use contemporary file systems with good support for big files, such as Btrfs, ext4, and XFS. In contrast, ReiserFS is optimized for a large number of small files. Furthermore, we should use allocation unit sizes that follow best practices for big files. Regular defragmentation is also vital when dealing extensively with huge files.
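Before choosing where to keep a huge file, we can check which file system and block size a location uses. This is a small sketch that assumes GNU coreutils; the target variable is a name introduced here for illustration:

```shell
# Inspect the file system that holds a given path (GNU coreutils assumed).
target=${1:-.}   # path to inspect; defaults to the current directory

# File system type, total size, and free space for the mount holding $target.
df --human-readable --output=fstype,size,avail,target "$target"

# Type name and fundamental block size in bytes (%T and %S).
fs_info=$(stat --file-system --format='%T %S' "$target")
echo "type and block size: $fs_info"
```

The output depends entirely on the machine, so we don’t show it here.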

Disregarding these pointers can slow down operations with the files in question. One such operation is reading.

2.4. Reading

When performing a data read, we must consider the available RAM and swap sizes. For instance, we can’t fit a whole 1GB file in 512MB of physical memory without swap. Even with swap, we may encounter thrashing. These are real concerns when working with huge files.

Short of adding RAM and configuring fast swap space, we can only choose the proper tools to read files. The burden is on the software to properly buffer and organize not only file reads but also edits.
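As a rough pre-check, we can compare a file’s size with the memory the kernel reports as available. The sketch below assumes Linux with a MemAvailable entry in /proc/meminfo and GNU stat; the fits_in_ram function is a hypothetical helper, and it ignores the memory a tool needs beyond the raw file contents:

```shell
# Compare a file's size with the memory the kernel reports as available.
# Assumes Linux with /proc/meminfo and GNU stat.
fits_in_ram() {
    file_bytes=$(stat --format=%s "$1")
    # MemAvailable is reported in kB; convert it to bytes.
    avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
    avail_bytes=$((avail_kb * 1024))
    [ "$file_bytes" -lt "$avail_bytes" ]
}

# Small sample file so the sketch runs anywhere.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'This is line #0\n' > "$sample"

if fits_in_ram "$sample"; then
    echo "file fits in available memory"
else
    echo "file is larger than available memory"
fi
```

When the check fails for a real file, the partial processing tools in the next section are usually the safer choice.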

2.5. Editing

Most of the time, editing presents many of the same challenges as reading. In addition, it causes seeks within the file after the initial load (read). We need seeks to find and change or add data. The seek operation can be costly, especially with heavy fragmentation.

Evidently, data organization within the file can relieve jumping around excessively. Ultimately, however, users and their tools decide how much of a problem file editing will present.

Let’s dive deeper into edits, as they combine all factors that huge files introduce in contrast to small ones.

3. Partial Processing

Multiple tools exist that can read and edit files of any size without issues. Some tools only use buffering, while others leverage partial reads, depending on available memory and swap.

We’ll be working with the same 12GB hugefile, composed of 500,000,000 lines in the format “This is line #LINE_NUMBER”:

$ time cat hugefile
This is line #0
This is line #1
[...]
This is line #499999999
This is line #500000000

real 106m25.480s
user 0m0.883s
sys 8m27.648s

Furthermore, we time all commands for easier comparisons like in the output above. Here’s a timing of the same file, output to another file instead of the screen:

$ time cat hugefile > hugefilecopy

real 2m28.119s
user 0m0.461s
sys 0m10.236s

Notice the times. We output to screen very slowly, while output to a file is much faster (by a factor of about 43). This is critical for big files.
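The article doesn’t show how hugefile was created. One possible way to produce a file in the same format is sketched below, scaled down to 1,000 lines; last_line and outfile are names introduced here, and raising last_line grows the file accordingly:

```shell
# Generate a small file in the same "This is line #N" format.
# last_line is an assumption; the file shown above ends at #500000000.
last_line=${last_line:-999}
outfile=$(mktemp)
trap 'rm -f "$outfile"' EXIT

# seq prints the numbers; sed prepends the fixed text to each one.
seq 0 "$last_line" | sed 's/^/This is line #/' > "$outfile"

head --lines=1 "$outfile"
tail --lines=1 "$outfile"
wc --lines < "$outfile"
```

Because the output is redirected to a file rather than the screen, generation benefits from the same speed difference we measured above.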

Importantly, we clear caches and swap after each command. The following commands need root privileges:

$ echo 3 > /proc/sys/vm/drop_caches
$ swapoff -a
$ swapon -a

3.1. split

The most basic way of dealing with huge files is to not have them in the first place. That doesn’t necessarily mean permanently restructuring data. We can just temporarily split the file in question:

$ ls -lh
total 12G
-rwxrwxrwx 1 x x 12G Aug 19 12:53 hugefile
$ time split --bytes=50M hugefile

real    2m19.451s
user    0m0.610s
sys     0m10.708s
$ ls -lh
total 23G
-rwxrwxrwx 1 x x 12G Aug 01 00:00 hugefile
-rwxrwxrwx 1 x x 50M Aug 01 00:00 xaa
[...]
-rwxrwxrwx 1 x x 50M Aug 01 00:03 xir
-rwxrwxrwx 1 x x 39M Aug 01 00:03 xis

Before the split, we can see the file was around 12GB. Splitting into 50MB chunks (--bytes=50M), we get multiple full 50MB files and one 39MB file. Their names are the split defaults.

Similarly, after we finish processing, we can rejoin the chunks via cat (concatenate) and redirection:

$ time cat x* > rejoinedhugefile

real 2m26.441s
user 0m0.437s
sys 0m10.016s
$ ls -lh rejoinedhugefile
-rwxrwxrwx 1 x x 12G Aug 01 00:00 rejoinedhugefile
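To confirm that nothing was lost between splitting and rejoining, we can compare the result with the original. The sketch below uses a small sample file and a custom chunk. prefix, both introduced here for illustration. It also swaps in --line-bytes so that no line is cut in the middle, which --bytes doesn’t guarantee:

```shell
# Split a sample file into chunks, rejoin them, and verify the result.
workdir=$(mktemp --directory)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

seq 0 9999 | sed 's/^/This is line #/' > samplefile

# --line-bytes keeps whole lines together in chunks of at most 16KB;
# --numeric-suffixes names the chunks chunk.00, chunk.01, and so on.
split --line-bytes=16K --numeric-suffixes samplefile chunk.

# The shell expands chunk.* in sorted order, which matches the split order.
cat chunk.* > rejoined

chunk_count=$(ls chunk.* | wc --lines)
if cmp --silent samplefile rejoined; then
    echo "rejoined file matches the original ($chunk_count chunks)"
fi
```

A distinct prefix also makes the rejoin safer than x*, which would pick up any other file in the directory whose name starts with x.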

This approach is used on many levels. For example, files can be automatically split during transfer. Of course, the network packets during transfer employ the same strategy.

As we’ll see below, we can leverage the same idea for reads and seeks.

3.2. head and tail

In case we already know the line or lines we want to see, head and tail just display them. For example, let’s get the first line of the file only (--lines=1):

$ time head --lines=1 hugefile
This is line #0

real 0m0.014s
user 0m0.000s
sys 0m0.005s

Similarly, we can get the last line:

$ time tail --lines=1 hugefile
This is line #500000000

real 0m0.014s
user 0m0.003s
sys 0m0.001s

Finally, we can also chain the commands to get an excerpt:

$ time tail --lines=100 hugefile | head --lines=3
This is line #499999901
This is line #499999902
This is line #499999903

real 0m0.013s
user 0m0.005s
sys 0m0.002s

Using head or tail yields similar times in almost all cases above, as they only differ by a small number of seek operations.

The way head and tail work ensures that the whole file is not loaded into memory at once. This is a common theme for many tools.
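The same chaining also works for an excerpt counted from the start of the file. Here is a minimal sketch on a small sample file, where first and count are names introduced for illustration:

```shell
# Print a range of lines counted from the start of the file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
seq 0 9999 | sed 's/^/This is line #/' > "$sample"

first=5001   # first line to print (1-based)
count=3      # number of lines to print

# tail --lines=+N starts output at line N; head stops after $count lines,
# which closes the pipe so tail does not read the rest of the file.
excerpt=$(tail --lines=+"$first" "$sample" | head --lines="$count")
printf '%s\n' "$excerpt"
```

Since the sample starts at #0, line 5001 holds the text for #5000. Unlike reading from the end, this form has to scan the file from the beginning to count lines, so it gets slower the deeper the range sits in a huge file.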

3.3. grep

We often don’t know the exact location of data we want to work with inside the file. In these cases, grep (globally search for a regular expression and print matching lines) can help.

Let’s do a complex regular expression search:

$ time grep --line-number --extended-regexp '[0-1]0000666[0-1]|126660002' hugefile
100006660:This is line #100006660
100006661:This is line #100006661
126660002:This is line #126660002

real    0m17.770s
user    0m9.192s
sys     0m2.821s

The --line-number flag prefixes each matching line with its line number. Additionally, the --extended-regexp flag allows the use of POSIX extended regular expressions.
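When we only need the first occurrence, we can let grep stop early instead of scanning the whole file. This is a sketch on a small sample file; the search string is an arbitrary example:

```shell
# Stop at the first match instead of scanning the whole file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
seq 0 99999 | sed 's/^/This is line #/' > "$sample"

# --max-count=1 makes grep stop reading after the first matching line.
# --fixed-strings treats the pattern as plain text, not a regular expression.
# LC_ALL=C selects byte-wise matching, often faster on plain ASCII data.
first_match=$(LC_ALL=C grep --line-number --max-count=1 --fixed-strings \
    '#6660' "$sample")
echo "$first_match"
```

Here the match is reported on line 6661, because grep counts lines from 1 while the sample file starts at #0.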

Whereas grep can only search, there are other tools for when we want to perform targeted replacements.

3.4. sed

Although sed (stream editor) is a very versatile tool, we can also use it for in-place edits. For instance, to replace multiple different strings, we can run:

$ time sed --in-place --regexp-extended 's/[0-1]0000666[0-1]|126660002/xxx666xxx/g' hugefile

real    3m0.502s
user    0m52.071s
sys     0m13.875s
$ grep --line-number 'x' hugefile
100006660:This is line #xxx666xxx
100006661:This is line #xxx666xxx
126660002:This is line #xxx666xxx

Of course, despite being complex, such edits are fairly primitive in that they require prior knowledge of the data inside the file. In other words, they lack the context that other tools can provide.
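When we do know roughly where the data sits, we can limit a substitution to a range of lines and keep a backup of the original. The following sketch runs on a small sample file; the line range and pattern are arbitrary examples:

```shell
# Restrict an in-place substitution to a line range and keep a backup.
workdir=$(mktemp --directory)
trap 'rm -rf "$workdir"' EXIT
seq 0 9999 | sed 's/^/This is line #/' > "$workdir/samplefile"

# 101,200 is an address range: only those lines are candidates for the
# substitution. --in-place=.bak keeps the original as samplefile.bak.
sed --in-place=.bak --regexp-extended \
    '101,200s/#1[0-9]5$/#xxx/' "$workdir/samplefile"

changed=$(grep --count 'xxx' "$workdir/samplefile")
echo "lines changed: $changed"
```

GNU sed implements --in-place by writing a temporary file and renaming it over the original. Consequently, the edit needs free disk space for a full copy of the file, and for a second one when we keep a backup.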

3.5. less

The less command is a terminal pager. It functions similarly to its older and simpler relative, more. We use them to view, search, and scroll through a file, page by page.

The main benefit of terminal pagers is their simple buffering during reading and seek operations. In essence, they allow us to view an arbitrarily big file without burdening memory with more than several pages worth of data.

Both less and more have a keyboard shortcut (v), which starts the default system editor. Moreover, the editor loads the given file at the current line.

4. Full Editors

We can achieve complex targeted file editing via programming languages such as perl, python, and awk. However, such processing is beyond the scope of this article.

The usual tools for convenient file alterations are full-fledged editors. For example, they support:

reads, including pagination
writes, including editing
seeks, including searching
macros for complex operation chaining
mouse/keyboard shortcut bindings

Importantly, not all such editors are fit for huge file processing. We’ll discuss some of the ones that are below.

4.1. vi and gvim

The vi (visual) editor is standard for many Linux distributions. Its improved version is vim, and it has a graphical interface, gvim. All function equivalently for our purpose.

Opening hugefile (12GB) in (g)vi(m) takes around three minutes with SSD storage. Even without swap, the editor uses less than 60% of the 8GB RAM available on the machine. This is possible via the editor’s own swap file.

We can freely scroll through the file, seek to a line, or position and modify contents. The editor loads more data from the file as and when we request it. Consequently, enough physical memory remains for processing.

Nevertheless, limitations do exist. For example, vi searches only through the portions of the file already loaded from the disk. We have to choose either performance or usability. Indeed, saving the file back to disk after a minor edit takes around 5 minutes.

Such trade-offs are unavoidable for any editor, considering the hardware constraints.

4.2. joe

The joe (Joe’s own) editor is also commonly used. Like vi, it’s advertised as being able to edit files larger than physical memory thanks to its own swap file. It has a lot of overlapping functionality with vi.

Because of this and the equivalent hardware, joe has similar stats to vi during operations: it opens hugefile in around 3 minutes and saves it in around 6 minutes.

Of course, similar functionality means similar limitations. In contrast, there are editors that allow a limited set of operations, but in-place, like with sed above.

4.3. Hex Editors

Not being able to append or delete data in a file can be an acceptable sacrifice for performance. Editors that by default allow only read, search, seek, and modify operations are usually hex (from hexadecimal) editors.

They’re called hex editors because of the way that they typically represent data — using raw hexadecimal codes. Even so, most hex editors have a split-screen mode, where data is also interpreted and editable as ASCII.

Furthermore, the main benefit of hex editors is the fact that, by default, they don’t buffer the whole file into memory. Therefore, we have a limited operation set but optimized performance. Saving a file is extremely fast as only edited parts are replaced directly (in-place).

Alternatively, the hexedit (hexadecimal) editor, for example, can buffer the entire file into memory with the --buffer flag. Buffering this way allows append and delete operations but requires enough physical memory to hold the entire file.
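The in-place replacement that makes hex editors fast can be illustrated with dd, which we swap in here as a scriptable stand-in rather than as what any particular hex editor does internally. The sketch overwrites four bytes at a known offset and leaves the file size untouched:

```shell
# Overwrite bytes at a fixed offset without rewriting the whole file.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'This is line #0\nThis is line #1\n' > "$sample"
size_before=$(stat --format=%s "$sample")

# Overwrite 4 bytes starting at byte offset 8 (the word "line" on line 1).
# conv=notrunc keeps the rest of the file; without it dd would truncate.
printf 'LINE' | dd of="$sample" bs=1 seek=8 conv=notrunc status=none

size_after=$(stat --format=%s "$sample")
head --lines=1 "$sample"
```

Only the four replaced bytes are written, so the cost of the edit doesn’t depend on the size of the file. As with hex editors, this only works when the replacement has the same length as the data it overwrites.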

5. RAM Coverage

Some commonly used editors require the whole file to fit in physical memory:

gedit (GNOME editor)
kate (KDE editor)
nano
emacs (editor macros)
mcedit (Midnight Commander editor)

Using these editors might be the best choice due to performance reasons: they’re much faster than using a (disk-based) swap. When plenty of RAM is available for the file we want to work with, the bottleneck will only be disk IO. In that case, we can use a special approach to leverage whole file buffering — a RAM drive.

RAM drives are used to speed up operations with especially big files. They allocate part of the physical memory and mount it on the filesystem. In the simplest case, to create a 1GB (-o size=1g) RAM drive and mount it on /mnt/ramdisk, we can mount a tmpfs (virtual memory filesystem) as root:

$ mkdir /mnt/ramdisk
$ mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk

Any files stored at /mnt/ramdisk are held in memory, although the kernel can move tmpfs pages to swap when swap is enabled. Moving any created or modified files from this location to another directory on the disk can be considered “saving”.
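The copy, edit, and save cycle can be wrapped in a small helper. This is a sketch only: edit_in_ram is a hypothetical function, and the demonstration uses temporary directories as stand-ins so that it runs without root or a mounted RAM drive:

```shell
# Copy a file to a RAM-backed directory, run a command on the copy,
# then move the result back to disk ("saving").
edit_in_ram() {
    ram_dir=$1      # for example /mnt/ramdisk or any mounted tmpfs
    target=$2       # file on disk that we want to edit
    shift 2
    copy="$ram_dir/$(basename "$target")"
    cp "$target" "$copy" || return 1
    # "$@" is the editing command; the path of the copy is appended to it.
    if "$@" "$copy"; then
        mv "$copy" "$target"
    else
        rm -f "$copy"
        return 1
    fi
}

# Demonstration with stand-in directories created by mktemp.
ramdir=$(mktemp --directory)    # stands in for /mnt/ramdisk
diskdir=$(mktemp --directory)   # stands in for a directory on disk
trap 'rm -rf "$ramdir" "$diskdir"' EXIT
printf 'This is line #0\n' > "$diskdir/samplefile"

edit_in_ram "$ramdir" "$diskdir/samplefile" sed --in-place 's/line/LINE/'
cat "$diskdir/samplefile"
```

On a real RAM drive, the drive must be large enough for the file plus any temporary copies the editing command creates.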

6. Large File Types

Many of the discussed tools can skip large chunks of data if we know it to be irrelevant for a given edit. Knowing the contents of a file can drastically reduce the complexity and increase the performance of any operation.

To that end, the type of file we are dealing with is very important. Let’s explore some examples.

6.1. Logs

Log files usually have a fixed format. Their organization around date/time and severity allows for easy sorting, which is very useful when it comes to manual editing.
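Sorted timestamps let a tool stop as soon as it passes the period of interest. The sketch below assumes a hypothetical log format with an ISO date in the first field and uses awk to print a single day:

```shell
# Print one day's entries from a date-sorted log and stop right after.
log=$(mktemp)
trap 'rm -f "$log"' EXIT
# Hypothetical log format: ISO date, time, severity, message.
cat > "$log" <<'EOF'
2021-08-17 09:00:01 INFO service started
2021-08-18 10:15:42 WARN disk usage at 91%
2021-08-18 23:59:59 ERROR write failed
2021-08-19 00:00:03 INFO retry succeeded
2021-08-20 07:30:00 INFO service stopped
EOF

day=2021-08-18
# ISO dates sort as plain strings, so a string comparison is enough.
# Because the log is sorted, awk can exit at the first later date
# instead of reading the rest of the file.
selected=$(awk -v day="$day" '$1 > day { exit } $1 == day' "$log")
printf '%s\n' "$selected"
```

The early exit only helps when the entries really are in chronological order, so it’s worth confirming that before relying on it.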

6.2. Databases

Databases are rarely housed in a monolithic file, but they can be. Even when they’re not a single file, their separate parts can be huge. Database organization is optimized for a specific kind of access, so employing the tools for this specific access can drastically decrease the burden of file size.

6.3. Files With Headers

Generally, many files have headers in the beginning. In essence, they contain information about the file, such as its size, format, and structure. These headers work similarly to disk partition or filesystem information. The metadata headers can prove invaluable when traversing and editing files since they can enable us to pinpoint the necessary information without resource-heavy searches.
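To illustrate, the sketch below invents a tiny file format: a single header line that states the record size, followed by fixed-size records. The format and all names in it are hypothetical. Knowing the record size lets us compute the byte offset of any record and jump straight to it:

```shell
# Use a header to jump directly to a record instead of searching for it.
data=$(mktemp)
trap 'rm -f "$data"' EXIT

# Hypothetical format: one header line "recsize=N", then N-byte records.
{
    printf 'recsize=8\n'
    printf 'rec00000rec00001rec00002rec00003'
} > "$data"

header=$(head --lines=1 "$data")
recsize=${header#recsize=}
header_bytes=$(( ${#header} + 1 ))   # +1 for the newline after the header

index=2                               # zero-based record number to fetch
offset=$((header_bytes + index * recsize))

# tail --bytes=+K starts output at byte K (1-based), so add 1 to the offset.
record=$(tail --bytes=+$((offset + 1)) "$data" | head --bytes="$recsize")
echo "record $index: $record"
```

The lookup reads the header and one record, regardless of how many records the file holds.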

7. Summary

In this article, we looked at ways to handle large files in Linux.

First, after discussing some common problems and their solutions, we explored tools that help when partially processing huge files. After that, several big file editors were discussed. Finally, we noted some special cases with regard to memory and file types.

In conclusion, dealing with huge files is not as complex as it initially may seem, but certain pitfalls need to be considered.
