Perform Incremental Backups in Linux

1. Introduction

In this tutorial, we’ll look at how to perform incremental backups on Linux systems. We’ll first review what an incremental backup is, and then see how to set one up and run it.

2. Incremental Backups

First, let’s understand why incremental backups exist in the first place.

2.1. Motivation

Backups are vital in the event of any type of hardware, system, or process failure.

Databases are a good example to illustrate the importance of backups. If the wrong database update gets deployed, the disk fails, or the data becomes corrupted for some reason, we need a backup to restore our data. However, a common problem is that the system we’re backing up might be quite large. For instance, a database can contain terabytes of data, and if we make a full copy of it each evening, we’ll quickly run out of space for these backups.

Fortunately, an incremental backup helps in this situation by allowing us to back up only the data that has changed since the last backup.

2.2. Incremental Backup Benefits

An incremental backup allows us to capture changes between backups.

Let’s assume that initially, we have a database of two terabytes. The very first backup will be a full backup of the entire database. Afterward, we only need to back up the changes made since the last backup. Subsequent backups don’t need to save the whole database again.

So, incremental backups allow us to save disk space and network usage by intelligently saving recent changes to our backup target.

2.3. Incremental Backup Challenges

An incremental backup can be tricky to restore because we’ll need to reapply all changes in the precise order they were saved initially. In contrast, a complete backup only requires the most recent save to recover.

Often, we’ll also need special software to restore incremental backups.

3. Implementation via rsync

The rsync utility is one of the most common ways to implement incremental backup. Its operation is quite simple: It just ensures that the files in the destination match the files in the source. With a little bit of code and the help of various flags, we can implement incremental backups via rsync. Let’s dive in.

3.1. Local Backup

For simplicity, we assume the backup happens within a single machine. Real-world systems often back up to another host, so we’ll revisit this example a bit later to perform a backup onto a remote host.

Before we create any backups, we should double-check the state of the source and destination directories for our backup. To that effect, let’s write a script that verifies the source and destination directories and separates incremental backup slices by date in the backup destination. Finally, we’ll also be creating the LATEST symbolic link to our last backup, as a means of helping us determine if we’re handling the first or incremental backup:

#!/bin/bash

# Expects two arguments: the source directory and the backup destination
TODAY=$(date +%Y-%m-%d)
SRC="$1"
DEST="$2"
LATEST="latest"   # name of the symlink that points at the most recent backup

if [[ -z "$SRC" ]]
then
    echo "[ERROR] The source directory parameter is absent"
    exit 1
fi

if [[ -z "$DEST" ]]
then
    echo "[ERROR] The destination parameter is absent"
    exit 1
fi

TARGET="${DEST}/${TODAY}"
echo "[DEBUG] Initiating the backup from ${SRC} to ${DEST} for ${TODAY}."

if [[ ! -d "${TARGET}" ]]
then
    # -p also creates DEST itself if it is missing
    mkdir -p "${TARGET}"
    echo "[DEBUG] Backup target : ${TARGET}"
elif [[ -n "$(ls -A "${TARGET}")" ]]
then
    # ls -A prints something only when the directory has entries
    echo "[ERROR] Backup target exists and contains some files. Aborting the backup"
    exit 1
else
    echo "[DEBUG] Backup target already exists"
fi

If the target directory exists and isn’t empty, we most likely have a misconfiguration, so the script exits. The LATEST variable plays a crucial role in making the backups incremental because it names the symbolic link that points to the most recent backup. If that link is absent, we haven’t made any backups yet.
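Because the script names each slice with the %Y-%m-%d format, the directory names sort chronologically when sorted as plain text. The following is a minimal sketch of that property and isn’t part of the script itself. The list_snapshots helper and the sample dates are hypothetical, and GNU find is assumed for -printf:

```shell
#!/bin/bash
# Sketch: dated snapshot directories sort chronologically as plain text.

DEST="$(mktemp -d)"                      # throwaway backup root for the demo
mkdir "${DEST}/2024-02-10" "${DEST}/2023-12-31" "${DEST}/2024-01-05"
ln -s "2024-02-10" "${DEST}/latest"      # the symlink must not count as a snapshot

# Hypothetical helper: print snapshot directory names, oldest first.
# -type d skips the "latest" symlink because find does not follow it here.
list_snapshots() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -name '????-??-??' -printf '%f\n' | sort
}

OLDEST="$(list_snapshots "${DEST}" | head -n 1)"
NEWEST="$(list_snapshots "${DEST}" | tail -n 1)"

echo "oldest snapshot: ${OLDEST}"
echo "newest snapshot: ${NEWEST}"
```

Sorting by name rather than by directory modification time keeps the order stable even if a snapshot directory is touched later.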

Now, once we’ve done the setup, we can finally review the code that does the backup itself.

3.2. Usage of rsync

The rsync part of the script looks like this:

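OPTS="-azvP --mkpath --delete"

if [[ ! -L "${DEST}/${LATEST}" ]]
then
    echo "[WARN] Latest dir was not found in ${DEST}. Performing initial complete backup."
else
    # rsync resolves a relative --link-dest against the destination directory,
    # so DEST should be an absolute path
    OPTS="${OPTS} --link-dest ${DEST}/${LATEST}"
fi

# OPTS is left unquoted on purpose so that it splits into separate options
# (this breaks if DEST contains spaces)
RESULT_OF_RSYNC=$(rsync ${OPTS} "${SRC}" "${TARGET}")
# Save rsync's exit status before any other command overwrites $?
PREV_RES=$?

echo "$RESULT_OF_RSYNC"

if [[ ${PREV_RES} -eq 0 ]]
then
    echo "[DEBUG] Backup completed successfully"
    # Repoint the symlink inside DEST at today's backup (relative to DEST)
    rm -f "${DEST}/${LATEST}"
    ln --symbolic "${TODAY}" "${DEST}/${LATEST}"
else
    echo "[ERROR] There is an error during backup"
    exit "${PREV_RES}"
fi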

Let’s review the snippet above in detail. First, we set a few rsync options that we need for an incremental backup:

The -a flag turns on archive mode. rsync then backs up not only the SRC directory but all of its subdirectories with their content, and it preserves symbolic links, permissions, timestamps, and (where permitted) ownership of the files and directories we’re backing up.
The -v flag increases the verbosity of the output.
The -P flag makes rsync show the progress of the transfer and keep partially transferred files.
The -z flag instructs rsync to compress file data during the transfer.
The --mkpath option tells rsync to create missing directories in the destination path. It’s available starting with rsync 3.2.3.
The --link-dest option names the directory that serves as the baseline for the rsync file comparison. In other words, rsync compares files in SRC with files in the directory given to --link-dest, and it hard-links unchanged files into the new backup instead of copying them. We add this option only when the LATEST symbolic link is present.
Finally, the --delete flag removes files and directories that are present on the destination side but no longer present on the sending side.

Next, the script checks whether the LATEST symbolic link exists in the destination directory. If it does, we add --link-dest to the rsync options. If it doesn’t, we assume we need to perform the initial complete backup. Let’s now review how rsync does the incremental backup.

4. Detailed Sync Explanation

As mentioned, a missing LATEST symbolic link is a sign that we haven’t backed up anything yet. In this case, we don’t specify --link-dest, the baseline against which rsync compares the files from SRC. Therefore, the script does a full backup into the TARGET directory and then points the symbolic link at it. The next time the script runs, the LATEST symlink will be there, and rsync will only copy new and changed files.
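To make the role of the symlink concrete, here’s a minimal sketch that simulates two runs without invoking rsync at all. The pick_mode helper and the sample dates are hypothetical and serve only as an illustration:

```shell
#!/bin/bash
# Sketch: how the "latest" symlink separates the first run from later runs.
# No rsync is involved; the dated directories are created by hand.

DEST="$(mktemp -d)"      # throwaway backup root for the demo
LATEST="latest"

# Hypothetical helper: report which kind of backup a run would perform
pick_mode() {
    if [[ -L "${DEST}/${LATEST}" ]]
    then
        echo "incremental against $(readlink "${DEST}/${LATEST}")"
    else
        echo "full"
    fi
}

FIRST_MODE="$(pick_mode)"                    # no symlink yet
mkdir "${DEST}/2024-01-01"
ln -sfn "2024-01-01" "${DEST}/${LATEST}"     # relative link inside DEST

SECOND_MODE="$(pick_mode)"                   # the symlink now exists
mkdir "${DEST}/2024-01-02"
ln -sfn "2024-01-02" "${DEST}/${LATEST}"     # repoint to the newest snapshot

echo "first run:  ${FIRST_MODE}"
echo "second run: ${SECOND_MODE}"
echo "latest now: $(readlink "${DEST}/${LATEST}")"
```

The -n flag of ln matters here: without it, ln would follow the existing symlink and create the new link inside the previous snapshot directory instead of replacing the link itself.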

Here, it’s vital to understand that each new incremental backup directory is a “snapshot” of all files present in the source. However, for files that haven’t changed, the snapshot contains hard links to the copies in the previous backup (the previous LATEST), and only files that were added or changed are stored as real new copies. This may seem a bit sophisticated, but it’s a very important concept. It also makes handling deletions straightforward: deleted files are simply omitted from the new snapshot, while they remain available in the earlier snapshots in case we need to access them.
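The hard-link behavior can be reproduced by hand. The sketch below uses plain ln in place of rsync to show that two snapshot directories can share one file on disk, and that removing the older snapshot doesn’t lose the data. The directory and file names are hypothetical, and GNU stat is assumed for the -c option:

```shell
#!/bin/bash
# Sketch: two snapshot directories sharing one unchanged file via a hard link.

ROOT="$(mktemp -d)"
mkdir "${ROOT}/snap1" "${ROOT}/snap2"

echo "unchanged content" > "${ROOT}/snap1/report.txt"

# For an unchanged file, the new snapshot gets a second name, not a second copy
ln "${ROOT}/snap1/report.txt" "${ROOT}/snap2/report.txt"

INODE_1="$(stat -c %i "${ROOT}/snap1/report.txt")"
INODE_2="$(stat -c %i "${ROOT}/snap2/report.txt")"
LINK_COUNT="$(stat -c %h "${ROOT}/snap2/report.txt")"
echo "inodes: ${INODE_1} and ${INODE_2}, link count: ${LINK_COUNT}"

# Removing the older snapshot keeps the data: one name for it still remains
rm -r "${ROOT}/snap1"
REMAINING="$(cat "${ROOT}/snap2/report.txt")"
echo "after deleting snap1: ${REMAINING}"
```

Since hard links can’t cross filesystems, all snapshots have to live on the same filesystem for this scheme to save space.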

The question now is: how does rsync compare files?

5. Changes Detection

So now let’s understand how rsync detects changes in files. If the file is present in SRC and absent in DEST, this is a new file. Likewise, in case a file is present in DEST and absent in SRC, this is a deleted file. However, detecting changes is a bit trickier.

By default, rsync takes a “quick check” approach: it compares the size and the last modification timestamp of each file in the source and the destination. In the majority of use cases, this approach works fine. However, it isn’t foolproof, because on Linux it’s possible to modify a file while keeping both its size in bytes and its modification timestamp. For such cases, rsync offers the --checksum (-c) option, which compares file contents by checksum instead. This catches every change except in the event of a checksum collision, which is extremely rare, but it also takes significantly longer and requires more disk I/O.
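The following sketch reproduces such a blind spot without running rsync. It only imitates the two comparison strategies: quick_signature and content_hash are hypothetical helpers, sha256sum is a stand-in for whatever checksum rsync uses internally, and GNU coreutils are assumed:

```shell
#!/bin/bash
# Sketch: a change that a size+mtime comparison misses but a checksum catches.

WORK="$(mktemp -d)"
echo "balance=100" > "${WORK}/src.txt"
cp -p "${WORK}/src.txt" "${WORK}/backup.txt"     # -p keeps the modification time

# Change the content without changing the size, then restore the old mtime
echo "balance=900" > "${WORK}/src.txt"
touch -r "${WORK}/backup.txt" "${WORK}/src.txt"

# Hypothetical helper: size in bytes and mtime in seconds
quick_signature() { stat -c '%s %Y' "$1"; }
# Hypothetical helper: hash of the file content
content_hash() { sha256sum "$1" | cut -d' ' -f1; }

if [[ "$(quick_signature "${WORK}/src.txt")" == "$(quick_signature "${WORK}/backup.txt")" ]]
then QUICK="same"; else QUICK="different"; fi

if [[ "$(content_hash "${WORK}/src.txt")" == "$(content_hash "${WORK}/backup.txt")" ]]
then FULL="same"; else FULL="different"; fi

echo "size and mtime comparison: ${QUICK}"
echo "checksum comparison:       ${FULL}"
```

The trade-off is visible in the helpers themselves: the quick signature needs only file metadata, whereas the hash has to read every byte of both files.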

6. rsync with a Remote Destination

rsync can also back up files to a remote server, but there are a few things to consider. First, the server we’re backing up onto must have rsync installed. Second, rsync works on top of ssh by default. That means we must ensure that an ssh server is running on the remote server and that we have access to it. Assuming this is ready, we can change the rsync command to account for the remote destination and include the ssh credentials:

rsync -e "ssh -i /path/to/private/key" "${SRC}" user@remote-host:"${TARGET}"

All the other options stay the same as before. The TARGET directory now resides on the remote server. The -e option specifies the remote shell program. Although ssh is the default remote shell program, we have to name it explicitly in order to pass the private key file that ssh will use to authenticate with the remote host. Finally, we prefix the destination with the remote user and the host, which can be a hostname or an IP address. Here, user and remote-host are placeholders.
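When the remote command is assembled inside a script, a bash array keeps the ssh invocation together as one argument. Below is a minimal sketch that only builds and prints the command and executes nothing. Every value in it is a placeholder, and the option set is just an example:

```shell
#!/bin/bash
# Sketch: assemble the remote rsync invocation as an array and print it.
# Nothing is executed; all values below are placeholders.

SRC="/data/"                        # hypothetical source directory
REMOTE="user@remote-host"           # hypothetical remote user and host
TARGET="/backups/2024-01-01"        # hypothetical directory on the remote side
KEY="/path/to/private/key"

# The whole ssh command must reach rsync as a single argument of -e
CMD=(rsync -az --delete -e "ssh -i ${KEY}" "${SRC}" "${REMOTE}:${TARGET}")

# %q prints each word quoted the way the shell would need it to be retyped
printf '%q ' "${CMD[@]}"
echo

# To run it for real: "${CMD[@]}"
```

Printing the command first is a cheap way to review what would run before pointing it at real data.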

Note that, regardless of usage, if the rsync process is killed, it might leave the destination directory in an inconsistent state. We need to be aware of that and handle such cases accordingly.
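One possible way to handle this, shown here as a sketch rather than as part of the script above, is to sync into a temporary directory and rename it only after the sync succeeds. The safe_sync helper and the .partial suffix are hypothetical, and cp stands in for rsync so that the example runs anywhere:

```shell
#!/bin/bash
# Sketch: write into "<target>.partial" and rename only after success,
# so an interrupted run never leaves a half-written dated directory.

# Hypothetical helper. Usage: safe_sync TARGET COMMAND [ARGS...]
# The partial directory is appended as the last argument of COMMAND.
safe_sync() {
    local target="$1"; shift
    local partial="${target}.partial"

    if [[ -e "${target}" ]]
    then
        echo "[ERROR] ${target} already exists" >&2
        return 1
    fi

    rm -rf "${partial}"              # discard leftovers from an earlier failed run
    mkdir -p "${partial}"

    "$@" "${partial}"
    local status=$?

    if [[ ${status} -eq 0 ]]
    then
        mv "${partial}" "${target}"  # a rename within one filesystem is atomic
    else
        rm -rf "${partial}"
    fi
    return "${status}"
}

ROOT="$(mktemp -d)"
mkdir "${ROOT}/src"
echo "data" > "${ROOT}/src/a.txt"

# cp -a stands in for rsync here
safe_sync "${ROOT}/backup" cp -a "${ROOT}/src/."
echo "successful run: exit status $?"

# 'false' simulates a sync that fails
safe_sync "${ROOT}/failed" false
FAIL_STATUS=$?
echo "failed run: exit status ${FAIL_STATUS}"
```

If the process is killed outright, the cleanup branch never runs, but the leftover .partial directory is removed at the start of the next run and never gets mistaken for a finished backup.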

7. Conclusion

In this article, we reviewed the incremental backup strategy. It saves disk space by storing only the changes made since the last backup. However, the downside is that restoring from incremental backups is more complex. To perform the incremental backups, we used rsync with its --link-dest option.

As always, the code from this article can be found over on GitHub.
