1. Overview

Copying data is a daily part of system administration. We need it for backups, file organization, or data sharing.

rsync is a Unix tool commonly used for copying files. However, copying files sequentially doesn’t usually take full advantage of our computational resources. The ability to parallelize transfers can reduce transfer time substantially. This way, we can move several files at once, resulting in faster copying speeds.

In this tutorial, we’ll look at different methods to parallelize data transfer with rsync. The commands presented here are written with the Bash shell in mind, so they might not work with other shells.

2. Using parallel

One way to parallelize a command is with parallel. It runs multiple commands at once, taking advantage of available CPU cores in a system.

Let’s use parallel to create several instances of rsync:

$ parallel -j 3 --eta rsync -a {} destination/ ::: Downloads/*

Computers / CPU cores / Max jobs to run
1:local / 4 / 3

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 0 AVG: 0.25s  local:0/40/100%/0.2s

First, we defined the maximum number of concurrent jobs with the -j flag. In this instance, at most 3 parallel processes were allowed to run. The --eta flag ensures we get progress updates in the output, including information on the number of jobs running and the estimated time left.

Then, we supplied rsync as the command that parallel should execute:

  • -a (archive) preserves several file properties like permissions and modification dates when copying
  • {} gets replaced by parallel with each input file in order
  • destination/ is the destination directory for the operation

Finally, ::: separates the command from its input. parallel substitutes each input item for {} in turn and runs rsync once per item. In this case, we’re selecting all the files and directories directly under Downloads with the * wildcard. Alternatively, we can provide input by piping the output of another command.

For each input file, parallel spawns a different process and concurrently runs up to as many as defined in the -j flag.

As we can imagine, this strategy adds overhead because it launches a separate rsync instance for each file and directory. This is particularly noticeable when the input consists of many small files: each one takes so little time to copy that the cost of starting a process can outweigh the transfer itself.

3. Launching Multiple rsync Sessions

Another way to parallelize rsync is to launch multiple processes with different inputs. To employ this strategy, we usually need to first break the input into different chunks.

Similarly to parallel, we can define how many concurrent instances we want by splitting the list of files into equal parts. For this, we can use split to divide the file list into smaller chunks by line.

3.1. Listing All Files

To build the file list, we use find, a search tool that can list, filter, and execute commands on files:

$ find Downloads/ -type f -printf "%P\n" | split -n l/3 - split_

Here, find lists all files in the source directory Downloads and any subdirectories, since it uses recursion by default.

We use the -type flag to only output files (f) and not directories. The -printf flag changes the format of the output. Here, we’re using %P, which prints the file path relative to the starting directory, and \n to print a newline.
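To see what this find call produces on its own, here's a small sketch that runs it against a throwaway tree. The directory layout and file names are hypothetical, and -printf assumes GNU find:

```shell
#!/usr/bin/env bash
# Hypothetical sample tree, created in a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/Downloads/music"
touch "$workdir/Downloads/a.txt" "$workdir/Downloads/music/song.mp3"

cd "$workdir" || exit 1
# %P strips the starting point (Downloads/) from each path;
# sort is only here to make the listing order predictable.
file_list=$(find Downloads/ -type f -printf "%P\n" | sort)
printf '%s\n' "$file_list"
# a.txt
# music/song.mp3
```

The paths are relative to Downloads/, which is the form rsync expects when it later reads them with Downloads/ as the source directory.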

3.2. Dividing the File List Into Parts

Next, we pipe the output of find to split:

$ find Downloads/ -type f -printf "%P\n" | split -n l/3 - split_

The split command divides the file list into several parts of roughly equal size.

We use the -n flag to define the number of parts. In this case, we define l/3 to divide the output into 3 different files, without splitting any lines.

The - tells split to read its data from standard input instead of a file. After the input, we define the prefix of the output files. In the example, we use split_.
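As an aside, the same split call can be tried in isolation on a throwaway list. This is a sketch assuming GNU split; the nine file names are invented. The list is written to a file first, which avoids depending on how a given split version handles -n with piped input:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical file list with nine entries, one path per line.
printf 'file%d.txt\n' 1 2 3 4 5 6 7 8 9 > file_list.txt

# Split into 3 chunks without breaking any line; split_ is the output prefix.
split -n l/3 file_list.txt split_

# Show how many lines landed in each chunk.
wc -l split_*
```

Concatenating the chunks in order gives back the original list, so no path is lost or duplicated.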

Now, we should have three different files, as we can see with ls:

$ ls
destination/  Downloads/  split_aa  split_ac
Documents/    Pictures/   split_ab  Videos/

split names each consecutive output file by adding suffixes. These go from aa to zz, as we can see in our case (split_aa, split_ab, split_ac). If needed, we could add the -d flag to change the suffixes to numeric characters (e.g., split_00).
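For instance, a quick sketch of the numeric variant, again on a hypothetical list in a temporary directory and assuming GNU split:

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'file%d.txt\n' 1 2 3 4 5 6 > file_list.txt   # hypothetical list

# -d switches the suffixes from letters to digits.
split -d -n l/3 file_list.txt split_
ls split_*
# split_00
# split_01
# split_02
```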

3.3. Launching rsync as a Background Process

Let’s launch different rsync processes for each of the files we generated:

$ for f in split_*; do rsync -a --files-from="$f" Downloads/ destination/ & done; wait
[1] 800916
[2] 800917
[3] 800919
[1]   Done                    rsync -a --files-from="$f" Downloads/ destination/
[2]-  Done                    rsync -a --files-from="$f" Downloads/ destination/
[3]+  Done                    rsync -a --files-from="$f" Downloads/ destination/

First, we create a for loop that iterates over the split_ files, which we match with the * wildcard.

In the loop, we use the --files-from flag to provide the current split_ file as input. Afterward, we define the source and destination directories. The & runs the process in the background. This way, the loop doesn’t wait for the instruction to finish and launches the rsync instances concurrently.

Finally, we use wait to halt the execution of the program until all launched instances finish. This way, the program ends only when all the background processes are done.

In the output, we can see the process ID of the launched jobs and a message when they finish.

4. Conclusion

When we’re copying large amounts of data, running transfers concurrently can make better use of our computational resources. This way, we can speed up our transfers by copying more files at the same time.

In this article, we looked at how we can parallelize the rsync command. We analyzed different approaches, from the parallel command to running different processes in the background. While the first approach is usually simpler, the second creates less overhead and, therefore, can be faster.

Finally, we also learned more about the parallel, rsync, and split commands, by looking at their usage and flags.
