In this article, we’ll discuss parallel file archiving and compression on Linux symmetric multiprocessing systems. A working knowledge of files and filesystems will make the material easier to follow.
2. What Is Symmetric Multiprocessing (SMP)
Symmetric multiprocessing systems contain multiple processors that share the same memory and run under a single operating system. In other words, every processor has access to the same resources.
The advantage of this architecture is that workloads can be balanced across processors. Whatever process is currently running, any processor can reach the shared memory and I/O devices, so work can be placed on whichever processor is free. Most modern operating systems support SMP. However, a single application only benefits from SMP if it's optimized for multi-threading.
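Before picking a thread count for a multi-threaded tool, it helps to know how many processors the operating system actually exposes to our process. The following is a minimal Python sketch of that check; the helper name `usable_cpus` is hypothetical and chosen only for illustration:

```python
import os


def usable_cpus() -> int:
    """Return the number of logical CPUs this process is allowed to run on."""
    try:
        # Available on Linux: honours the CPU affinity mask of the process,
        # for example one set with taskset.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the total count.
        return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"logical CPUs visible to the OS: {os.cpu_count()}")
    print(f"logical CPUs usable by this process: {usable_cpus()}")
```

The second number is usually the better upper bound for worker threads, since it accounts for any affinity restrictions placed on the process.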
Other multiprocessing architectures exist as well. These include:
- Massively Parallel Processing (MPP) systems, whose processors don’t share resources: each processor has its own OS and memory and processes data in parallel, which can provide broader scalability than SMP systems
- Asymmetric Multiprocessing (AMP) systems, which don’t treat all processors equally and could, for example, rely on only one processor to run the operating system
Popular SMP applications for servers include SQL databases, FTP storage, Plex streaming, and other workloads that benefit from software multi-threading.
3. Parallel File Operations on File Systems
Multiprocessing (MP) systems can leverage multi-threading, and the files we compress or archive can be spread over many disks. In such a setup, the compression or archiving tool itself can become the bottleneck, to the point where it runs far slower than the bus that the disks are connected to can deliver data.
Data can also be spread across multiple devices. Therefore, we need different mechanisms to store and compress files on MP systems.
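One simple mechanism is to compress several files at the same time, one file per worker. The sketch below illustrates the idea with Python's standard library; the function names `gzip_file` and `gzip_many` are hypothetical. It uses threads on the assumption that we run CPython, whose zlib bindings release the global interpreter lock while compressing, so the workers can occupy several cores:

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor


def gzip_file(path: str) -> str:
    """Compress one file to <path>.gz and return the new path."""
    out = path + ".gz"
    with open(path, "rb") as src, gzip.open(out, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the file in blocks
    return out


def gzip_many(paths, workers=None):
    """Compress every file in paths concurrently, one file per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input paths.
        return list(pool.map(gzip_file, paths))
```

This approach pays off when there are many files of comparable size. A single very large file still keeps only one worker busy, which is the case that the parallel compressors discussed later address.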
Reddit’s data hoarding community discusses many file server configurations for archiving. For parallel archiving and network-accessible storage on Linux in particular, many different solutions exist. Let’s review some of the most popular options:
- Btrfs, a filesystem and logical volume manager originally developed at Oracle
- ZFS, which is available for Ubuntu as well as for other distributions, although Linus Torvalds disapproves of this project because of licensing issues
- Unraid, a proprietary alternative, offers game server hosting, server data monitoring, container support, and more
- FreeNAS, a network storage system based on FreeBSD
These are only a few of the available options for setting up network-accessible file storage.
For compression that leverages all the processing cores in multi-threaded systems, we can use the following applications:
- 7-Zip, whose -mmt switch controls multithreading
- pbzip2, a parallel implementation of bzip2
- Parallel XZ (pxz), which runs LZMA compression of different parts of an input file on multiple cores and processors simultaneously and is compatible with xz compression in Linux
- plzip, another LZMA compression utility that leverages multiprocessing
- pigz, a parallel implementation of gzip
- lrzip, a compression tool that leverages LZMA and ZPAQ
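The common idea behind these parallel compressors is to cut the input into blocks and compress the blocks on different cores. The following Python sketch illustrates that idea only; it doesn't reproduce the internals or exact output format of any tool listed above. The name `parallel_gzip` and the block size are hypothetical, and the sketch assumes CPython, whose zlib bindings release the global interpreter lock during compression. It relies on the fact that concatenated gzip members form a valid gzip stream:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # illustrative block size, not a tuned value


def parallel_gzip(data: bytes, workers: int = 4,
                  chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress data as a sequence of independently compressed gzip members."""
    # Split the input into fixed-size blocks; keep one empty block for
    # empty input so the result is still a valid gzip stream.
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each block is compressed on its own; map() preserves block order.
        members = pool.map(gzip.compress, chunks)
        return b"".join(members)


if __name__ == "__main__":
    payload = b"parallel compression " * 50000
    blob = parallel_gzip(payload)
    # The standard decompressor reads all members back as one stream.
    assert gzip.decompress(blob) == payload
    print(f"{len(payload)} bytes -> {len(blob)} bytes")
```

Compressing blocks independently trades a little compression ratio for speed, because each block starts without the history of the previous ones. Larger blocks reduce that loss but offer fewer units of work to distribute.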
A detailed analysis of the different compression algorithms is outside the scope of this article.
For comparison, a benchmark of some of these options on a 70 KB text file produced the following results:
|Method|File Size|% of Original|
Other published tables compare the speed of these different algorithms.
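To collect size and speed figures of this kind on our own data, we can run a small measurement script. The sketch below is an assumption-laden illustration rather than the benchmark referred to above: it uses the single-threaded gzip, bzip2, and LZMA bindings from Python's standard library instead of the parallel tools, and the sample input is a hypothetical 70,000-byte text. The percentage is computed as compressed size divided by original size, times 100:

```python
import bz2
import gzip
import lzma
import time

# Standard-library codecs standing in for the command-line tools.
CODECS = {
    "gzip": gzip.compress,
    "bzip2": bz2.compress,
    "xz (LZMA)": lzma.compress,
}


def benchmark(data: bytes):
    """Return (method, compressed size, % of original, seconds) per codec.

    data must be non-empty, since the percentage divides by its length.
    """
    rows = []
    for name, compress in CODECS.items():
        start = time.perf_counter()
        size = len(compress(data))
        elapsed = time.perf_counter() - start
        rows.append((name, size, 100.0 * size / len(data), elapsed))
    return rows


if __name__ == "__main__":
    # Hypothetical sample: 28 bytes repeated 2500 times = 70,000 bytes.
    sample = b"some repetitive sample text\n" * 2500
    print(f"{'Method':<10} {'File Size':>10} {'% of Original':>14} {'Seconds':>9}")
    for name, size, pct, secs in benchmark(sample):
        print(f"{name:<10} {size:>10} {pct:>13.1f}% {secs:>9.4f}")
```

Results depend heavily on the input, so it's worth running such a script against files that resemble the real archive contents rather than relying on synthetic text.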
In this article, we looked at symmetric multiprocessing, storage options for parallel archiving, and multi-threaded compression tools in Linux.