How to sync repositories on the master and replicas

One of the important tasks in building a fault-tolerant distributed system is keeping the data on the master node in sync with the slave nodes. From here on, we will call slave nodes replicas. There are many synchronization methods, and sometimes one that takes the specifics of the stored data into account is more effective.

I am Roman Solovyov, a leading IT engineer in the RnD and Ready Solutions Department of the Product Development Department at SberTech. Today I will tell you how we synchronize Git repositories on two nodes, what alternatives exist, and why this is necessary at all.

What is this for?

Let's start from the end (because the beginning is boring). After many companies left the Russian market in 2022–2023, domestic businesses have increasingly been thinking about building their own architectural solutions. In-house tools are convenient, secure, and reliable. Thus SberTech created GitVerse — a platform for collaborative development and code hosting. So we at RnD asked ourselves: what scalable code storage systems could be integrated into any workflow?

We conducted a market analysis and found that similar solutions exist in Bitbucket, GitHub, and several other systems. More precisely, we are sure they exist, otherwise those systems simply would not work. But no one has seen them, and no one will: their vendors are in no hurry to publish the source code or documentation as open source.

An obvious open source competitor is Gitaly, but its huge drawback is that it is tightly glued and screwed to GitLab. Integrating it into another code storage system that speaks the Git protocol is a laborious and thankless process.

From these thoughts, Gardener was born — a project for a fault-tolerant distributed storage system for Git repositories. We chose this name because the project indirectly works with Git trees and branches: it creates them, clones them, and transplants them to the plots of other nodes. The main, though not the only, tool of our gardener is the Git protocol, which is as versatile as a trusty shovel.

In the process of creating Gardener, we encountered an expected problem: we needed to somehow achieve the same state of repositories on the master and replicas. The bare git repository that we store on the node is just a directory with special files. Therefore, we had two options: to cleverly copy repositories as regular directories or to come up with something more optimal based on the specifics of the stored data. Below, I will tell you which path we chose and what methods we analyzed.
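To see that a bare repository really is just a directory of ordinary files (and could, in principle, be copied as one), it is enough to create one and list it — a minimal sketch, with a throwaway path of our choosing:

```shell
# A bare repository is a plain directory: HEAD, config, objects/, refs/, ...
git init -q --bare /tmp/gr-demo.git
ls /tmp/gr-demo.git
```

Everything Git knows about the repository — refs, objects, configuration — lives in these files, which is why "just copy the directory" is a viable first idea.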

SCP

This utility is probably familiar even to novice programmers. Under the hood it uses SSH (and SFTP with OpenSSH >= 9.0) and simply copies files from one machine to another. It is a solid tool, but it does not suit our task: in a typical commit, files are modified rather than created or deleted, so we would have to compute the list of changed files, delete the previous versions, and upload the new ones (transactionally, too). Its only advantage over the alternatives is that scp ships with Linux by default — which, of course, proves little by itself.
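The naive scp approach would be a whole-directory copy, as in the sketch below (the replica host and paths are placeholders of ours). Locally, `cp -r` performs the same "copy everything" operation, minus the network:

```shell
# Real deployment (sketch): copy the whole bare repository to the replica.
#   scp -r /srv/git/project.git replica:/srv/git/project.git
# Local simulation of the same whole-directory copy:
git init -q --bare /tmp/gr-scp-src.git
cp -r /tmp/gr-scp-src.git /tmp/gr-scp-dst.git
```

Every sync retransmits the full repository, which is exactly why this method loses when only a handful of files change.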

Rsync

Rsync was originally created as a replacement for rcp and scp for the case where the receiving node already holds an older version of the data. Mirroring is done with a single stream in each direction, rather than one or more streams per file. Its delta-transfer algorithm splits the target file on the receiving machine into non-overlapping, fixed-size chunks, computes checksums for each chunk, and sends those checksums to the node it synchronizes with; the sender then transmits only the data that does not match, optionally compressing the stream with zlib (the -z flag). For our purposes, this utility is one of the strongest candidates.

Why not zsync?

Zsync is a tool similar to rsync that is optimized for multiple downloads per file version. We don't consider this option for several reasons:

  1. Gardener works primarily over SSH, and tying synchronization to HTTP would mean multiplying entities beyond necessity.

  2. The utility is mainly intended for distributing files from one server to many machines, whereas in our case the master node for the repository can change.

  3. The utility is aimed mainly at large files (blobs, ISO images, etc.).

Git fetch (via SSH)

And finally, the most logical and intuitive method: simply have the replica do a Git fetch from the master. Just like rsync, Git detects what has changed and transfers only that, in the form of special pack files. The sending side gathers the needed objects (including "loose" ones), delta-compresses them, deflates them with zlib, packs everything into a single pack file, and streams it to the replica.
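A replica-side sync via fetch can be sketched as follows (hosts and paths are placeholders of ours). A --mirror clone configures the refspec `+refs/*:refs/*`, so each subsequent fetch brings the replica to the master's exact ref state; the simulation below runs entirely on local paths:

```shell
# One-time setup on the replica (a real deployment would use an SSH URL):
#   git clone --mirror ssh://master/srv/git/project.git /srv/git/project.git
# Local simulation: the master commits, the replica catches up in one fetch.
git init -q /tmp/gr-ft-master
git -C /tmp/gr-ft-master -c user.name=g -c user.email=g@example.com \
    commit -q --allow-empty -m "first"
git clone -q --mirror /tmp/gr-ft-master /tmp/gr-ft-replica.git
git -C /tmp/gr-ft-master -c user.name=g -c user.email=g@example.com \
    commit -q --allow-empty -m "second"
git -C /tmp/gr-ft-replica.git fetch -q --prune origin
```

Only the objects missing on the replica travel over the wire, packed and compressed; --prune additionally removes refs deleted on the master.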

To summarize, rsync and fetch will compete for a “place in the sun”. To choose the best method, we conducted several tests. The scenario and results will be described in the following sections.

Experiment

To make the results clear, we selected several cases:

  1. Uploading the repository to a clean node.

  2. Changing one large file (adding 100,000 lines to readme.md).

  3. Changing a large number of files (every character a in the files is replaced with b).

  4. Creating one heavy file (an archive of 1,000 photos weighing 31 MB).

  5. Creating 100 lightweight files.

We applied all the described cases to a heavy and a light repository. As the heavy one we chose NixOS (https://github.com/NixOS/nix.git), and as the light one, the Yandex Toolchain Registry (https://github.com/yandex/toolchain-registry.git). In case 3, 1552 files were changed in the heavy repository and 39 in the light one. The heavy repository is 95 MB (1691 files); the light one is 860 KB (40 files).
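The article does not publish its benchmark scripts, but a plausible harness (all paths and names below are ours) prepares each replica once and then times a single sync round per method, e.g. by wrapping each command in the shell's `time`:

```shell
# Prepare a master repo with one commit.
git init -q /tmp/gr-bm-master
git -C /tmp/gr-bm-master -c user.name=g -c user.email=g@example.com \
    commit -q --allow-empty -m "init"
# Method 1: rsync the .git directory as plain files
# (in a real run: time rsync -a --delete ...).
rsync -a --delete /tmp/gr-bm-master/.git/ /tmp/gr-bm-rsync.git/
# Method 2: mirror-clone once, then measure the incremental fetch
# (in a real run: time git -C ... fetch --prune origin).
git clone -q --mirror /tmp/gr-bm-master /tmp/gr-bm-fetch.git
git -C /tmp/gr-bm-fetch.git fetch -q --prune origin
```

Each case in the list above then reduces to mutating the master repository and re-timing the two sync commands.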

Heavy repository (NixOS) — sync time in seconds per experiment:

| Method | 1 | 2 | 3 (1552 files) | 4 | 5 |
|--------|-----------|-------|----------------|-------|-------|
| rsync | 1 m 45.40 | 0.472 | 4.374 | 0.230 | 0.330 |
| fetch | 1 m 37.77 | 0.257 | 1.232 | 0.895 | 0.182 |
Lightweight repository (Yandex Toolchain Registry) — sync time in seconds per experiment:

| Method | 1 | 2 | 3 (39 files) | 4 | 5 |
|--------|-------|-------|--------------|-------|-------|
| rsync | 0.178 | 0.179 | 0.187 | 0.231 | 0.177 |
| fetch | 0.154 | 0.156 | 0.181 | 0.797 | 0.159 |
The results show that fetch copes better with uploading and changing a large number of files, since rsync applies changes file by file, while rsync wins at transferring a single large file (experiment 4). On the small repository, the difference is negligible.

Conclusion

Both fetch and rsync do the job well. The extreme cases described in the tests will probably be rare in practice, but they show that fetch does almost everything better. Although rsync transfers a single large file faster, Git fetch is the preferred method for synchronizing repositories.
