How to organize a cost-effective backup using hard links

Egor Orlov. I have been in IT for over 24 years; I teach at SPbPU and write for the media outlet ENTER. In this article we will look at what hard links are in UNIX-like operating systems and how they can be put to use. Specifically, how they can save a significant amount of space when backing up data, producing backups that are incremental in terms of the space they occupy, yet as convenient to browse as full copies.

File index descriptors

Let's dive a little into the peculiarities of organizing file storage in file systems of UNIX-like operating systems and, in particular, Linux. Let's start with a description of a file, which technically consists of three parts stored separately from each other.

The first part is the file name

The name, together with the file's inode identifier, is stored in the directory to which the file belongs. A directory is a special type of file with a table of file names and their inode numbers.

You can view the contents of a directory using the ls command; to make it display the directory's own table of names and inode numbers, rather than just the names, run it with this set of options:

# ls -1ia /var/log | head
    352 .
    267 ..
 112716 ahttpd
1905346 alterator-net-iptables
 114533 audit
5842784 btmp
5842785 btmp.1.xz
  75955 chrony
3025653 clamav
 259603 cups

The second part is the file metadata

The metadata stores the file's access modes, owner information, and creation and modification timestamps. You can view a file's metadata using the stat command:

# stat /etc/passwd
  File: /etc/passwd
  Size: 3847          Blocks: 8          IO Block: 4096   regular file
Device: 0,28    Inode: 5033772     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-06-26 08:25:52.076518894 +0300
Modify: 2024-02-27 13:29:57.552367529 +0300
Change: 2024-02-27 13:29:57.553367531 +0300
 Birth: 2024-02-27 13:29:57.551367526 +0300

Everything after the first line of this command's output comes from the metadata. The metadata is stored in the file's inode, whose number is tied to the file's name by the entry in the directory. This connection between a file name and an inode is what is called a hard link.

Inodes are located in a dedicated area of the file system called the inode table. Each inode holding the metadata of a specific file is a record of this table, which can loosely be called a row. Most of the rest of the file system is occupied by data blocks.

The third part is the file contents

The contents are stored in the file system's data blocks. The list of blocks holding the file's data is kept in its inode.

Thus, the name is associated with the inode, and the inode with the data blocks. All three parts of a file are stored separately but are logically linked to one another.
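A quick way to see all three parts side by side is a minimal sketch in a scratch directory (file name and contents here are arbitrary):

```shell
cd "$(mktemp -d)"             # work in a scratch directory
echo "some content" > file1
ls -i file1                   # part 1: the name, shown with its inode number
stat file1                    # part 2: the metadata stored in the inode
cat file1                     # part 3: the contents read from the data blocks
```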

Hard links

The same inode can be pointed to by multiple entries in the same directory or in different ones; that is, the metadata and data blocks of the same file can be reachable through several path names.

Let's say we have a file file1. We can create a hard link to it with the ln command:

$ ln file1 link1
$ ls -il
total 8
12525 -rw-r--r-- 2 egor egor 42 Jun 27 12:46 file1
12525 -rw-r--r-- 2 egor egor 42 Jun 27 12:46 link1

The -i option of ls prints the inode numbers of the listed files; in the listing above they appear in the first column. Note that they are the same for the file file1 and for the hard link link1 created by ln. In other words, file1 is just as much a hard link to the inode as link1, and the two links are completely equal.

The number of links to a file is stored in its metadata and appears in the ls output right after the access modes. In our case it is now 2.

The file contents can be accessed through any of its hard links: read and write operations work through every path name. Deleting one hard link merely decrements the link counter; all the others keep working. Only when the last hard link is removed, and nothing refers to the file's metadata any more, are the inode table entry and the data blocks marked as free.
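These semantics are easy to verify; the sketch below uses GNU coreutils `stat -c %h` to print the link count:

```shell
cd "$(mktemp -d)"            # scratch directory
echo "hello" > file1
ln file1 link1               # second hard link to the same inode
stat -c %h file1             # link count is now 2
rm file1                     # removes one name; the data stays
cat link1                    # contents are still accessible
stat -c %h link1             # link count is back to 1
```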

Hard links can also be created for objects in other directories. The only restriction is that a hard link and its target must reside on the same file system.

There are various useful applications of the hard link mechanism. Let's look at one of them.

Using hard links when backing up data

Let's say we have a data directory that needs to be backed up every day. If we do this by plain copying, the total backup size is N × S, where N is the number of backups kept and S is the space occupied by the data directory. For example, 30 daily copies of a 100 GB directory would take about 3 TB.

In reality, not all files that need to be backed up will change every day. Most likely, the opposite is true: most data does not change for weeks, because users actively work only with their current projects, without changing other files.

The logic for backing up using hard links is as follows:

  • The first backup is made by simply copying the original data.

  • For the second backup, only the files that have changed relative to the first one are actually copied; unchanged files are represented in the new copy by hard links to the files in the first copy.

  • Every subsequent backup likewise copies only the data changed since the previous backup and fills in the rest with hard links.
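The steps above can be sketched by hand with GNU `cp -al`, which copies a directory tree as hard links (directory and file names here are illustrative, not from the article):

```shell
WORK=$(mktemp -d)
mkdir "$WORK/backup.1"
echo "old report" > "$WORK/backup.1/report.txt"
echo "constants"  > "$WORK/backup.1/static.txt"

# New snapshot: every file starts out as a hard link to the previous copy.
cp -al "$WORK/backup.1" "$WORK/backup.2"

# report.txt has changed since the last run. Writing through the link
# would corrupt backup.1, so write a fresh file and rename it over the
# link instead; backup.1/report.txt stays untouched.
echo "new report" > "$WORK/backup.2/report.txt.tmp"
mv "$WORK/backup.2/report.txt.tmp" "$WORK/backup.2/report.txt"
```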

With this approach we get three advantages over the classic backup scheme:

  1. Significant savings in disk space. A hard link costs only a directory entry; it does not consume an extra inode or any data blocks. The storage space required for the backups is roughly the same as if one full copy plus incremental copies were kept.

  2. Accessing and restoring data from the backups stays just as convenient. All hard links are equal and give access to the file data, so opening any of the regular backup directories shows the complete data set.

  3. The backups accumulated in this way can also be deleted regularly, starting with the oldest ones. The data will remain accessible via hard links in more recent backups.
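The space savings from point 1 are easy to demonstrate with `du` (sizes and directory names here are illustrative):

```shell
cd "$(mktemp -d)"
mkdir backup1 backup2
dd if=/dev/zero of=backup1/big.dat bs=1M count=10 2>/dev/null
ln backup1/big.dat backup2/big.dat   # "second copy" costs no data blocks

du -sm backup1                       # about 10 MB
du -sm backup2                       # about 10 MB on its own as well
du -sm .                             # total is still about 10 MB, not 20:
                                     # du counts hard-linked data once
```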

Using rsync for backup

Let's look at an example implementation of this approach. Below is a very simple script built around the rsync utility, with options that create hard links for data unchanged since the previous copy.

#!/bin/sh

OPT="-avHAX"
TGT="$2/$(date +%F--%H-%M-%S)"
LINK="--link-dest=../last"

# Copy the source ($1) into a new date-stamped directory under $2,
# hard-linking everything unchanged relative to the "last" copy.
rsync $OPT $LINK "$1" "$TGT"

# Re-point the "last" symlink at the copy just created.
LAST="$2/last"
rm -f "$LAST"
ln -s "$TGT" "$LAST"

The script accepts two positional parameters: the directory with the original data and the directory for storing backup copies. Inside the backup directory, subdirectories are created whose names carry the current date and time (the TGT variable). To make it easy to find the previous copy on subsequent runs, the script finishes by setting a symbolic link named last to the directory with the newly created copy; the previous link is removed first.

The following rsync options are used:

  • -a — standard archive mode;

  • -H — preserve hard links as hard links;

  • -A — preserve POSIX ACLs;

  • -X — preserve extended attributes (including the SELinux context);

  • -v — print the list of files and a short summary;

  • --link-dest=../last — directory with the previous backup, used to create hard links instead of copying unchanged files (a relative path here is resolved against the destination directory).
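A small check that --link-dest really does hard-link unchanged files can be run by hand (it requires rsync to be installed; the paths are scratch directories, not from the article):

```shell
WORK=$(mktemp -d)
mkdir -p "$WORK/src" "$WORK/dst"
echo "unchanged data" > "$WORK/src/a.txt"

# First, a full copy; then a second copy linked against the first.
rsync -a "$WORK/src/" "$WORK/dst/copy1/"
rsync -a --link-dest="$WORK/dst/copy1" "$WORK/src/" "$WORK/dst/copy2/"

# Identical inode numbers mean the file is stored on disk only once.
stat -c %i "$WORK/dst/copy1/a.txt" "$WORK/dst/copy2/a.txt"
```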

What to consider before implementation

For hard links to work correctly under real-world conditions, the example above needs to be adapted to your needs; in its current form it is only a prototype. Also take a look at ready-made backup tools built on rsync and hard links. One of the best known of these ready-made solutions is the rsnapshot script.

