Large files come about in many forms for users and system administrators alike. Their main benefits are encapsulation, centralization, and space efficiency. Yet, big files can also lead to many pitfalls.
In this tutorial, we explore ways to deal with large files in Linux. First, we check some common problems such files pose, as well as some of their solutions. After that, we describe tools that help when partially processing huge files. Next, several big file editors come into play with some statistics for each. Finally, we delve deeper into the subject with type optimizations and special physical memory usage.
We tested the code in the guide on Debian 10.10 (Buster) with GNU Bash 5.0.3 and Xfce 4.16. It is POSIX-compliant and should work in any such environment.
2. Large File Size Issues
File size categorization depends on the available hardware. In short, low machine specifications can cause issues with smaller “large files”. We’ll be using a machine with 8GB RAM and an SSD.
On average, let’s define big files as any files above 50MB. Furthermore, we’ll consider files above 1GB to be huge files. Working with such files presents challenges in many areas.
2.1. Transfer

Large file transfers take a long time. For example, a 5GB file would take around 8 minutes to transfer at the current world-average broadband download speed. The same file would require more than 17 minutes on mobile. If we need to download big files regularly, that's a huge setback.
The obvious solution here is to increase the download speed. If that’s not an option, we can follow the suggestions in the next sections.
Of course, after a transfer, we must store the file somewhere.
2.2. Storage

Disk IO remains a bottleneck in modern computers. Partial or full loading of a huge file from the main drive into the much faster RAM takes time.
We can reduce that time if we replace old hard disk drives with modern technology solid-state drives. Alternatively, old disks can employ RAID.
Whatever the storage, operating systems partition it into separate file systems, which handle actual operations.
2.3. File System
Old file systems have many limitations. For example, FAT32 caps individual files at 4GB. Separately, the bigger the files, the harder they are to distribute across the free space available on different partitions.
Finally, allocation unit sizes (blocks) play a big role when processing big files. They can lead to a large number of chunks, but also fragmentation — when a file is stored in many far-away blocks on the physical medium.
We should use contemporary file systems with good support for big files, such as Btrfs, ext4, and XFS. In contrast, ReiserFS is optimized for a large number of small files. Furthermore, we should use allocation unit sizes that follow best practices for big files. Regular defragmentation is also vital when dealing extensively with huge files.
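To see how these factors apply to a concrete file, we can query the filesystem's block size and the file's extent count, the latter via filefrag from e2fsprogs. A small stand-in file serves for illustration:

```shell
# Create a small stand-in file for illustration
seq 0 999 | sed 's/^/This is line #/' > samplefile

# Allocation unit (block) size of the filesystem holding the file
stat --file-system --format='Block size: %s bytes' samplefile

# Extent count as a rough fragmentation indicator
# (guarded, since filefrag needs e2fsprogs and FIEMAP support)
command -v filefrag > /dev/null && filefrag samplefile

rm samplefile
```

A file spread over many extents is fragmented, so a high extent count suggests defragmentation may help.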
Disregarding these pointers can slow down operations with the files in question. One such operation is reading.
2.4. Reading

When performing a data read, we must consider the available RAM and swap sizes. For instance, we can't fit a whole 1GB file in 512MB of physical memory without swap. Even with swap, we may encounter thrashing. These are real concerns when working with huge files.
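Before loading a big file, we can check both totals directly:

```shell
# Physical memory and swap totals straight from the kernel
grep --extended-regexp 'MemTotal|SwapTotal' /proc/meminfo

# The same information, human-readable (from the procps package)
free --human
```

If the file is larger than MemTotal plus SwapTotal, whole-file buffering is simply not an option.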
Short of adding RAM and configuring fast swap space, we can only choose the proper tools to read files. The burden is on the software to properly buffer and organize not only file reads but also edits.
2.5. Editing

Most of the time, editing presents many of the same challenges as reading. In addition, it causes seeks within the file after the initial load (read). We need seeks to find and change or add data. The seek operation can be costly, especially with heavy fragmentation.
Evidently, good data organization within the file can reduce excessive jumping around. Ultimately, however, users and their tools decide how much of a problem file editing presents.
Let’s dive deeper into edits, as they combine all factors that huge files introduce in contrast to small ones.
3. Partial Processing
Multiple tools exist that can read and edit files of any size without issues. Some tools only use buffering, while others leverage partial reads, depending on available memory and swap.
We’ll be working with the same 12GB hugefile, composed of 500,000,000 lines in the format “This is line #LINE_NUMBER”:
$ time cat hugefile
This is line #0
This is line #1
[...]
This is line #499999999
This is line #500000000

real    106m25.480s
user    0m0.883s
sys     8m27.648s
Furthermore, we time all commands for easier comparisons like in the output above. Here’s a timing of the same file, output to another file instead of the screen:
$ time cat hugefile > hugefilecopy

real    2m28.119s
user    0m0.461s
sys     0m10.236s
Notice the times: output to the screen is very slow, while output to a file is faster by a factor of more than 40. This is critical for big files.
Importantly, caches and swap will be cleared after each command via a series of simple commands (run as root):

$ echo 3 > /proc/sys/vm/drop_caches
$ swapoff -a
$ swapon -a
3.1. split

The most basic way of dealing with huge files is to not have them in the first place. That doesn't necessarily mean permanently restructuring data. We can just temporarily split the file in question:
$ ls -lh
total 12G
-rwxrwxrwx 1 x x 12G Aug 19 12:53 hugefile
$ time split --bytes=50M hugefile

real    2m19.451s
user    0m0.610s
sys     0m10.708s
$ ls -lh
total 23G
-rwxrwxrwx 1 x x 12G Aug 01 00:00 hugefile
-rwxrwxrwx 1 x x 50M Aug 01 00:00 xaa
[...]
-rwxrwxrwx 1 x x 50M Aug 01 00:03 xir
-rwxrwxrwx 1 x x 39M Aug 01 00:03 xis
Before the split, we can see the file was around 12GB. Splitting into 50MB chunks (--bytes=50M), we get multiple full 50MB files and one final 39MB file. Their names are the split defaults.
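split also supports other chunking schemes. For instance, we can split by line count instead of bytes and request numeric suffixes with a custom prefix, shown here on a small stand-in file:

```shell
# Stand-in file with 100 lines in the same format
seq 0 99 | sed 's/^/This is line #/' > samplefile

# Four 25-line chunks named part00 ... part03
split --lines=25 --numeric-suffixes samplefile part
ls part*

rm samplefile part*
```

Splitting on line boundaries keeps each chunk independently readable, which byte-based splitting does not guarantee.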
Conversely, concatenating the chunks back together rejoins the file:

$ time cat x* > rejoinedhugefile

real    2m26.441s
user    0m0.437s
sys     0m10.016s
$ ls -lh rejoinedhugefile
-rwxrwxrwx 1 x x 12G Aug 01 00:00 rejoinedhugefile
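After rejoining, checksums confirm that the result is byte-identical to the original; a small stand-in file keeps the demonstration quick:

```shell
# Stand-in file, split and rejoined
seq 0 99 > samplefile
split --bytes=100 samplefile
cat x* > rejoined

# cmp exits with status 0 only if the files match byte for byte
cmp samplefile rejoined && echo 'identical'

# Matching digests give the same guarantee across machines
sha256sum samplefile rejoined

rm samplefile rejoined x*
```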
This approach is used on many levels. For example, files can be automatically split during transfer. Of course, the network packets during transfer employ the same strategy.
As we’ll see below, we can leverage the same idea for reads and seeks.
3.2. head and tail
Using head, we can instantly extract the first line of the file:

$ time head --lines=1 hugefile
This is line #0

real    0m0.014s
user    0m0.000s
sys     0m0.005s
Similarly, we can get the last line:
$ time tail --lines=1 hugefile
This is line #500000000

real    0m0.014s
user    0m0.003s
sys     0m0.001s
Finally, we can also chain the commands to get an excerpt:
$ time tail --lines=100 hugefile | head --lines=3
This is line #499999901
This is line #499999902
This is line #499999903

real    0m0.013s
user    0m0.005s
sys     0m0.002s
Using head or tail yields similar times in almost all cases above, as they only differ by a small number of seek operations.
The way head and tail work ensures that the whole file is not loaded into memory at once. This is a common theme for many tools.
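The same partial-read principle extends to arbitrary positions. For example, sed can print a single line and quit immediately, and tail/head can slice out a raw byte range; a small stand-in file keeps the demonstration quick:

```shell
seq 0 999 | sed 's/^/This is line #/' > samplefile

# Print line 500 (1-indexed) and quit; the rest of the file is never read
sed --quiet '500{p;q}' samplefile

# Extract an arbitrary byte range: skip the first 100 bytes, take the next 20
tail --bytes=+101 samplefile | head --bytes=20

rm samplefile
```

Since the file's first line is numbered 0, line 500 here is "This is line #499".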
3.3. grep

We often don't know the exact location of data we want to work with inside the file. In these cases, grep (globally search for a regular expression and print matching lines) can help.
Let’s do a complex regular expression search:
$ time grep --line-number --extended-regexp '[0-1]0000666[0-1]|126660002' hugefile
100006660:This is line #100006660
100006661:This is line #100006661
126660002:This is line #126660002

real    0m17.770s
user    0m9.192s
sys     0m2.821s
The --line-number flag prefixes each matching line with its line number. Additionally, the --extended-regexp flag allows the use of POSIX extended regular expressions.
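When one match is enough, grep can also stop scanning early, which matters greatly on multi-gigabyte files; here on a small stand-in file:

```shell
seq 0 9999 | sed 's/^/This is line #/' > samplefile

# --max-count=1 stops at the first match instead of scanning to the end
grep --max-count=1 --line-number '#1234$' samplefile

rm samplefile
```

On the actual hugefile, an early exit like this can turn a 17-second scan into a near-instant one if the match occurs early.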
Whereas grep can only search, there are other tools for when we want to perform targeted replacements.
3.4. sed

For example, sed can perform the same search but replace the matches in-place:

$ time sed --in-place --regexp-extended 's/[0-1]0000666[0-1]|126660002/xxx666xxx/g' hugefile

real    3m0.502s
user    0m52.071s
sys     0m13.875s
$ grep --line-number 'x' hugefile
100006660:This is line #xxx666xxx
100006661:This is line #xxx666xxx
126660002:This is line #xxx666xxx
Of course, despite being complex, such edits are fairly primitive in that they require prior knowledge of the data inside the file. In other words, they lack the context that other tools can provide.
3.5. less and more

The main benefit of terminal pagers is their simple buffering during reading and seek operations. In essence, they allow us to view an arbitrarily big file without burdening memory with more than several pages worth of data.
Both less and more have a keyboard shortcut (v), which starts the default system editor. Moreover, the editor loads the given file at the current line.
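For instance, less accepts start-up options that are particularly useful on big files; these are standard flags, though exact behavior can vary by version:

```shell
# Jump straight to the end of the file, like tail but interactive
less +G hugefile

# Open at the first occurrence of a pattern
less +/'#499999901' hugefile

# -n suppresses line-number calculation, which speeds up huge files
less -n hugefile
```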
4. Full Editors
The usual tools for convenient file alterations are full-fledged editors. For example, they support:
- reads, including pagination
- writes, including editing
- seeks, including searching
- macros for complex operation chaining
- mouse/keyboard shortcut bindings
Importantly, not all such editors are fit for huge file processing. We’ll discuss some of the ones that are below.
4.1. vi and gvim
The vi (visual) editor is standard on many Linux distributions. Its improved version is vim, which also has a graphical interface, gvim. All three function equivalently for our purposes.
Opening hugefile (12GB) in (g)vi(m) takes around three minutes with SSD storage. Even without swap, the editor uses less than 60% of the 8GB RAM available on the machine. This is possible via the editor’s own swap file.
We can freely scroll through the file, seek to a line, or position and modify contents. The editor loads more data from the file as and when we request it. Consequently, enough physical memory remains for processing.
Nevertheless, limitations do exist. For example, vi searches only through the portions of the file already loaded from disk, so we trade usability for performance. Indeed, saving the file back to disk after even a minor edit takes around 5 minutes.
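One common way to tilt vim toward performance is to disable the features that cost the most on huge files; the flags below are standard, though the actual time savings depend on the setup:

```shell
# -R: read-only; -n: no swap file; -u NONE: skip vimrc and plugins;
# -i NONE: no viminfo -- each of these features slows vim down on huge files
vim -R -n -u NONE -i NONE hugefile
```

Note that -n removes vim's own swap file, so the whole file must then fit in available memory.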
Such trade-offs are unavoidable for any editor, considering the hardware constraints.
4.2. joe

The joe (Joe's own) editor is also commonly used. It's advertised as being able to edit files larger than physical memory, again due to its own swap file. It has a lot of overlapping functionality with vi.
Because of this and the equivalent hardware, joe has similar stats to vi during operations: opens hugefile in around 3 minutes and saves it in around 6 minutes.
Of course, similar functionality means similar limitations. In contrast, there are editors that allow a limited set of operations, but in-place, like with sed above.
4.3. Hex Editors
Not being able to append or delete data in a file can be an acceptable sacrifice for performance. Editors that by default allow only read, search, seek, and modify operations are usually hex (from hexadecimal) editors.
They’re called hex editors because of the way that they typically represent data — using raw hexadecimal codes. Even so, most hex editors have a split-screen mode, where data is also interpreted and editable as ASCII.
Furthermore, the main benefit of hex editors is the fact that, by default, they don’t buffer the whole file into memory. Therefore, we have a limited operation set but optimized performance. Saving a file is extremely fast as only edited parts are replaced directly (in-place).
Alternatively, the hexedit (hexadecimal) editor, for example, can buffer the entire file into memory with the --buffer flag. Buffering this way allows append and delete operations but requires enough physical memory to hold the entire file.
5. RAM Coverage
Some commonly used editors require the whole file to fit in physical memory:
- gedit (GNOME editor)
- kate (KDE editor)
- emacs (editor macros)
- mcedit (Midnight Commander editor)
Using these editors might be the best choice for performance reasons: RAM is much faster than (disk-based) swap. When plenty of RAM is available for the file we want to work with, the bottleneck is only disk IO. In that case, we can use a special approach to leverage whole-file buffering: a RAM drive.
5.1. RAM Drives

RAM drives are used to speed up operations with especially big files. They allocate part of the physical memory and mount it on the filesystem. In the simplest case, to create a 1GB (-o size=1g) RAM drive mounted at /mnt/ramdisk, we can use tmpfs (a virtual memory filesystem):
$ mkdir /mnt/ramdisk
$ mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk
Any files stored at /mnt/ramdisk will be held in physical memory. Moving any created or modified files from this location to another directly on the disk can be considered “saving”.
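A full round trip might look like the sketch below; the file name bigfile and the $EDITOR variable are placeholders, and the mount and umount steps require root:

```shell
# Hypothetical workflow: work on a file in RAM, then "save" back to disk
mkdir --parents /mnt/ramdisk
mount -t tmpfs -o size=1g tmpfs /mnt/ramdisk   # needs root

cp bigfile /mnt/ramdisk/          # load the file into RAM
"$EDITOR" /mnt/ramdisk/bigfile    # edits now happen at memory speed
mv /mnt/ramdisk/bigfile .         # "save": move the result back to disk

umount /mnt/ramdisk               # release the memory
```

Since tmpfs contents vanish on unmount or power loss, moving the result back to disk is the only durable "save".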
6. Large File Types
Many of the discussed tools can skip large chunks of data if we know it to be irrelevant for a given edit. Knowing the contents of a file can drastically reduce the complexity and increase the performance of any operation.
To that end, the type of file we are dealing with is very important. Let’s explore some examples.
6.1. Log Files

Log files usually have a fixed format. Their organization around date/time and severity allows for easy sorting, which is very useful when it comes to manual editing.
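For example, with a fixed "DATE TIME SEVERITY message" layout (a hypothetical format here), standard tools can sort and filter without any parsing logic:

```shell
# Small stand-in log in a fixed format
cat > sample.log << 'EOF'
2021-08-01 10:00:02 ERROR disk full
2021-08-01 09:59:58 INFO job started
2021-08-01 10:00:01 WARN high memory
EOF

# Sort chronologically by the first two fields (date, then time)
sort --key=1,2 sample.log

# Keep only one severity
awk '$3 == "ERROR"' sample.log

rm sample.log
```

Because the timestamp fields sort lexicographically, no date parsing is needed at all.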
6.2. Databases

Databases are rarely housed in a monolithic file, but they can be. Even when they're not a single file, their separate parts can be huge. Database organization is optimized for a specific kind of access, so employing the tools for this specific access can drastically decrease the burden of file size.
6.3. Files With Headers
Generally, many files begin with headers. In essence, headers contain information about the file, such as its size, format, and structure. They work similarly to disk partition or filesystem metadata. Such headers can prove invaluable when traversing and editing files, since they enable us to pinpoint the necessary information without resource-heavy searches.
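For instance, the first few bytes often identify a file on their own. Here, a stand-in file gets the PNG signature (written with octal escapes), and we inspect only its header:

```shell
# Stand-in file starting with the PNG signature (89 50 4E 47 0D 0A 1A 0A)
printf '\211PNG\r\n\032\n' > sample.bin

# Identify the type from the header alone
file sample.bin

# Hex dump of just the first bytes; the rest of the file is never read
head --bytes=8 sample.bin | od --format=x1 --address-radix=x

rm sample.bin
```

For a huge file, reading eight bytes instead of gigabytes is exactly the kind of shortcut a known header enables.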
7. Conclusion

In this article, we looked at ways to handle large files in Linux.
First, after discussing some common problems and their solutions, we explored tools that help when partially processing huge files. After that, several big file editors were discussed. Finally, we noted some special cases with regard to memory and file types.
In conclusion, dealing with huge files is not as complex as it initially may seem, but certain pitfalls need to be considered.