About the brazen client, or quick-and-dirty borg backup monitoring in Prometheus

I have a server in the Hetzner cloud and needed to back it up to a storage box, Hetzner's online storage offering.

The storage box accepts connections on ports 22 and 23 (this will be important later in the story).

Hetzner itself recommends borg for this task, so why not. I tried it: it works, it deduplicates, it compresses, it doesn't ask to be fed.

I wrote two bash scripts. The first one initializes the borg repository, the second one performs the backup. Nothing special, except that in the second script borg connects first and creates a new backup, and then an sftp client connects, just in case, to check how much free space is left.

Borg connects over ssh to port 23 (Hetzner calls this the extended ssh service), while the sftp client uses port 22 by default.

The ssh key for connections to both ports is the same. The public key is uploaded to the storage box in a special way, and the private key lives at ~/.ssh/id_rsa on the server.
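
For reference, an init script for such a setup could look roughly like the sketch below (the repository path "folder", the repokey encryption mode and the BORG_PASSPHRASE handling are illustrative choices, not necessarily what you will want):

#!/bin/bash
# Minimal sketch: create the borg repository on the storage box once,
# over the extended ssh service on port 23.
# BORG_PASSPHRASE is one way to supply the repo passphrase non-interactively.
export BORG_PASSPHRASE='***'
borg init --encryption=repokey ssh://myuser@myuser.your-backup.de:23/./folder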

If you do everything manually, then on the first connection to the storage box you will be asked to confirm the host key fingerprint:

Are you sure you want to continue connecting (yes/no/[fingerprint])?

When we agree, a line for the storage box will be added to the ~/.ssh/known_hosts file, and further connections will proceed without any questions asked.

If you want to automate the installation of the backup script, you can add the required line to known_hosts in advance. (I do this, like all installation and configuration, with an ansible role.)
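
One convenient way to fetch that line is ssh-keyscan; a rough sketch (run it once from a trusted machine and verify the fingerprints out of band before shipping the result with your configuration management):

# Grab the storage box host keys for the default port 22 and append them to known_hosts
ssh-keyscan -t rsa,ed25519 myuser.your-backup.de >> ~/.ssh/known_hosts
# And similarly for the extended ssh service on port 23
ssh-keyscan -p 23 -t rsa,ed25519 myuser.your-backup.de >> ~/.ssh/known_hosts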

And now attention, a question for the experts! Can one host have different fingerprints for different ports? And if so, what would it look like in the known_hosts file?
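
(A hint for the second half of the question: the known_hosts format does allow per-port entries; the host name may be wrapped in square brackets and followed by a non-standard port number, so such an entry could look roughly like this:)

[myuser.your-backup.de]:23 ssh-ed25519 AAAA%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%...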

But let's move on. So, I added a line to known_hosts in advance:

myuser.your-backup.de ssh-rsa AAAA%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%...

and quickly discovered that borg was still asking for fingerprint confirmation, and the fingerprint it showed was for an ssh-ed25519 key. Where did ssh-ed25519 come from?

Question number two for the experts. If we connect to a host with one key type (ssh-rsa), can the client request fingerprint confirmation for a different key type (ssh-ed25519 for example)?

I admit that at first I didn’t bother to figure out why this was happening, I just added a second line to known_hosts:

myuser.your-backup.de ssh-rsa AAAA%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%...
myuser.your-backup.de ssh-ed25519 AAAA%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%...

And everything worked.

But it didn't work for long. One day the script stopped creating backups, and I only noticed by accident. Well, yes, that's life. And yes, nothing was being monitored.

It took some time to discover that the second line had magically disappeared from the known_hosts file.

Someone had carefully made a backup copy of my file as known_hosts.old, and a new known_hosts had appeared without the ssh-ed25519 key.

After poking around in various directions, as one does, I found the culprit. After raising the sftp client's log level, I saw the following:

$ sftp -4 -vvv -i ~/.ssh/id_rsa myuser@myuser.your-backup.de
...
debug3: hostkeys_foreach: reading file "~/.ssh/known_hosts"
debug3: hostkeys_find: found ssh-rsa key at ~/.ssh/known_hosts:1
debug3: hostkeys_find: deprecated ssh-ed25519 key at ~/.ssh/known_hosts:2
..
debug3: client_input_hostkeys: 1 server keys: 0 new, 1 retained, 0 incomplete match. 1 to remove
debug2: check_old_keys_othernames: checking for 1 deprecated keys
debug3: check_old_keys_othernames: searching ~/.ssh/known_hosts for myuser.your-backup.de / (none)
debug3: hostkeys_foreach: reading file "~/.ssh/known_hosts"
...
Deprecating obsolete hostkey: ED25519 SHA256:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%5
debug3: hostkeys_foreach: reading file "~/.ssh/known_hosts"
debug3: host_delete: RSA key already at ~/.ssh/known_hosts:1
~/.ssh/known_hosts:2: Removed ED25519 key for host myuser.your-backup.de
...

I admit, I couldn’t imagine that a client could so shamelessly start deleting fingerprints from known_hosts.

When connecting to the host on port 22, the sftp client did not find a match for the ssh-ed25519 entry among the host keys offered there, decided it was obsolete and deleted it. And it doesn't care that the entry is needed for another port of the same host!

Now, attention, expert question number three: why did the script keep working for a while before this happened? (I don't know the answer to this one myself.)

Okay, let's move on. I solved the problem by adding two lines to the ~/.ssh/config file:

Host myuser.your-backup.de
    UpdateHostKeys no
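
(UpdateHostKeys is the OpenSSH mechanism that lets the server send its full set of host keys to the client after authentication, so the client can add new entries and prune ones it considers obsolete; switching it off for this host stops the pruning. Another option I could have gone with is to give the borg connection its own host alias and its own known_hosts file, so the two ports never share entries; the alias name and file path below are made up:)

Host storagebox-borg
    HostName myuser.your-backup.de
    Port 23
    User myuser
    IdentityFile ~/.ssh/id_rsa
    UserKnownHostsFile ~/.ssh/known_hosts_borg

With something like this, borg would point at ssh://storagebox-borg/./folder instead of the full host name and port.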

Now it's time for what should have been done right away – monitoring.

Writing an exporter to monitor backups in prometheus

I decided to build the simplest possible, yet functional, backup monitoring: a small exporter for Prometheus, especially since there is already a working Prometheus server.

Let's go. Here's the bash code:

#!/bin/bash
# A poor man's exporter: answer every connection on port 9999 with minimal
# HTTP headers followed by the contents of the metrics file.
# Note: the -p and -q options assume the traditional/Debian netcat;
# other nc variants may need slightly different flags.
while true; do
    {
        echo -e "HTTP/1.1 200 OK\nContent-Type: text/plain\n"
        cat /path/to/file_with_metrics
    } | nc -l -p 9999 -q 1
done

What do we have here? An infinite loop that listens on port 9999 and sends the text from our metrics file there, complete with HTTP headers.

We save the script, for example, as /usr/local/bin/borg_exporter.sh. Then we create a systemd unit that will run it:

[Unit]
Description=Borgbackup prometheus exporter
After=network-online.target

[Service]
ExecStart=/usr/local/bin/borg_exporter.sh
Type=simple
Restart=always

[Install]
WantedBy=multi-user.target
Save it in /etc/systemd/system/borg_exporter.service.

Let's activate our new service:

sudo systemctl daemon-reload
sudo systemctl enable borg_exporter
sudo systemctl start borg_exporter

Yes, you need to remember to make the script executable, and yes, all of this should not run as root, and yes, you may need to open the port in the firewall, and so on…
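
A quick smoke test, assuming the service is running and the metrics file already exists:

# Should print the contents of /path/to/file_with_metrics
curl -s http://localhost:9999/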

But this simple solution can actually work with Prometheus!

Add to its settings (in prometheus.yml, in scrape_configs):

- job_name: borg
  static_configs:
    - targets:
      - 1.2.3.4:9999 # IP of your server
  scheme: http

where 1.2.3.4 is the server's IP address (if Prometheus runs on the same server, replace it with localhost).
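
Before restarting Prometheus it doesn't hurt to validate the edited config with promtool, which ships with Prometheus (adjust the path to wherever your prometheus.yml actually lives):

promtool check config /etc/prometheus/prometheus.yml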

We restart Prometheus, and if the file /path/to/file_with_metrics contains at least one line in the format

metric_name value

then Prometheus will scrape this metric. The value can only be an integer or a floating-point number.
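
(The text format also accepts optional # HELP and # TYPE comment lines, if you want to be a bit more civilized about it, for example:)

# HELP metric_name Short human-readable description of the metric
# TYPE metric_name gauge
metric_name 42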

All that remains for us is to configure the update of the necessary metrics in the file, as well as alerts.

Let's return to my main script, the one that runs the backup (and whose code I haven't shown yet).

Let me discard the unnecessary stuff and leave only what concerns monitoring.

#!/bin/bash
...
# Create a new archive on the storage box (extended ssh service, port 23)
borg create ssh://myuser@myuser.your-backup.de:23/./folder::bckp_name /target/
borg_last_backup_error_code=$?
# Start the metrics file from scratch with the exit code of borg create
echo "borg_last_backup_error_code $borg_last_backup_error_code" > /path/to/file_with_metrics

# Timestamp of this run
echo "borg_last_run_timestamp $(date +%s)" >> /path/to/file_with_metrics

# Free and total space on the storage box (in blocks), taken from `df` over ssh (port 22)
borg_storage_free_space=$(ssh myuser@myuser.your-backup.de 'df' | tail -n 1 | awk '{print $4}')
echo "borg_storage_free_space $borg_storage_free_space" >> /path/to/file_with_metrics

borg_storage_disk_space=$(ssh myuser@myuser.your-backup.de 'df' | tail -n 1 | awk '{print $2}')
echo "borg_storage_disk_space $borg_storage_disk_space" >> /path/to/file_with_metrics

What's going on here?

borg create … starts creating a backup and saves it to the storage box. For the command to work, a few additional environment variables are needed, but I'm omitting those here.

Next, we write the exit code of the borg create command to the metrics file: 0 if everything is ok.

Then we append the current time as a Unix timestamp to the metrics file.

And the last two: we get the free space on the storage box (in blocks) and append it as the third metric line, then get the total disk size and append it as the fourth.

After running the script, the /path/to/file_with_metrics file should look like this:

borg_last_backup_error_code 0
borg_last_run_timestamp 1728745782
borg_storage_free_space 444444444
borg_storage_disk_space 555555555

And these metrics will be read by prometheus.

We could have gotten by with three metrics, but to avoid worrying about block sizes, we export both the free space and the total size and work with the occupied space as a percentage.
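
For example, with the sample values above: (555555555 - 444444444) / 555555555 * 100 = 111111111 / 555555555 * 100 ≈ 20, i.e. about 20% of the storage box is occupied, no matter what the block size actually is.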

How often you run the backup script is up to you. In my case, it runs once a day via cron.

And lastly, alerts.

- name: borgbackup_alerts
  interval: 30s
  rules:
    - alert: BorgBackupFailed
      expr: borg_last_backup_error_code > 0
      for: 60s
      annotations:
        description: 'Oh, no, borg backup failed!'
        summary: 'Borg error code is not equal to zero'
      labels:
        severity: warning

Above is an example of one alert rule (these live in a rules file that Prometheus loads via rule_files, not in prometheus.yml itself). In total, I suggest creating four such alert blocks, with the following conditions (expr); a sketch of the remaining rules follows right after the list:

  1. up == 0 – if our exporter becomes unavailable

  2. borg_last_backup_error_code > 0 – if the borg create command returned an error (for example, it could not connect to the storage box)

  3. time() - borg_last_run_timestamp > 2 * 24 * 3600 – if the last run of the script was more than two days ago

  4. (borg_storage_disk_space - borg_storage_free_space) / borg_storage_disk_space * 100 > 80 – if more than 80% of the storage box space is occupied
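
Here is roughly what the remaining three rule blocks could look like, continuing the rules: list from the example above (the alert names, for: durations and annotation texts are just placeholders to adapt to taste):

    - alert: BorgExporterDown
      expr: up{job="borg"} == 0
      for: 5m
      annotations:
        summary: 'Borg exporter is unreachable'
      labels:
        severity: warning

    - alert: BorgBackupTooOld
      expr: time() - borg_last_run_timestamp > 2 * 24 * 3600
      for: 60s
      annotations:
        summary: 'Last borg backup ran more than 2 days ago'
      labels:
        severity: warning

    - alert: BorgStorageAlmostFull
      expr: (borg_storage_disk_space - borg_storage_free_space) / borg_storage_disk_space * 100 > 80
      for: 60s
      annotations:
        summary: 'Storage box is more than 80% full'
      labels:
        severity: warning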

Well, that seems to be all, simple but quite effective backup monitoring.

And another question for the experts. Could something be wrong with the backup while none of these alerts fire? )

Well, of course. At a minimum, it would be worth running the borg check command periodically (and monitoring its result); it can verify both an individual archive and the entire repository.
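
In the spirit of the script above, that could look roughly like this (the metric name is made up, and it assumes the check runs in the same script right after the backup, so the extra line lands in the same metrics file):

# Verify the repository and export the exit code as an extra metric
borg check ssh://myuser@myuser.your-backup.de:23/./folder
borg_last_check_error_code=$?
echo "borg_last_check_error_code $borg_last_check_error_code" >> /path/to/file_with_metrics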

Going further, you could automate a full restore test: create a temporary server, restore the backup onto it, check that everything works (reporting the result to Prometheus), and delete the server afterwards.

If you have never tested restoring from a backup, consider that you don't have one.

Of course, it depends on what kind of data we are backing up. And most likely, if quick-and-dirty solutions like this are in use, these are some rather small, local backups.

Mostly just playing around, but interesting nonetheless )

Thank you all for your attention!
