How Proxmox falls and rises

Proxmox is a dedicated virtualization and containerization distribution based on Debian Linux.
When your needs outgrow a single bare-metal server that does everything, but are not yet large enough to justify Kubernetes, a number of solutions come to the rescue: they let you manage a cluster of several hosts and organize High Availability, replication, and centralized backup of containers and virtual machines. Proxmox is one of them.

We have been using it for more than two years and are very satisfied. It greatly simplifies many things: slicing and reserving resources, live migration (QEMU VMs only), centralized metrics collection (without cramming an exporter or agent into each guest), and management (via the WebUI, API, and ssh).
Our network has grown from three servers to a dozen, and the number of Proxmox hosts among them has grown from zero to eight. But sometimes it breaks too.

What can fail with Proxmox?

In fact, a lot. Network mounts, corosync, an unreleased lock…
Some failures require manual intervention, others fix themselves. Some self-remediate in a rather nasty way called fencing, which is especially nasty if the affected service does not have its own clustering. Some failures happen only in a cluster and never on a single host, others vice versa.

Failures of single hosts (nodes)

Stuck lock

Proxmox often puts a lock on a container or VM; in the interface this is shown by a lock icon on the corresponding guest. Changing the configuration, removing a backup, deploying, cloning, migrating: any of these operations takes a lock, and the lock is not always released correctly afterwards.
The problem is that a stuck lock blocks every other operation that needs the lock.

Fortunately, this is easy to deal with: the stuck lock only needs to be released by hand.
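As a minimal sketch, assuming the standard Proxmox CLI tools: a lock can be released with pct unlock for containers and qm unlock for QEMU VMs. The wrapper name unlock_guest and the DRY_RUN switch are hypothetical conveniences, not part of Proxmox.

```shell
# Hypothetical helper (not part of Proxmox): release a stuck lock on one guest.
# Usage: unlock_guest <lxc|qemu> <vmid>
# Set DRY_RUN=1 to print the command instead of running it.
unlock_guest() {
    kind=$1
    vmid=$2
    case "$kind" in
        lxc)  ${DRY_RUN:+echo} pct unlock "$vmid" ;;   # containers
        qemu) ${DRY_RUN:+echo} qm unlock "$vmid" ;;    # virtual machines
        *)    echo "unknown guest type: $kind" >&2; return 2 ;;
    esac
}
```

For example, DRY_RUN=1 unlock_guest lxc 101 only prints the command it would run. Before unlocking, it is worth making sure that the operation that took the lock is really no longer running.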

Criticality: low
This type of failure affects only one guest and does not affect its functioning.

NAS failure

It usually shows up as a question-mark icon on the storage. Proxmox persistently tries to mount a storage that is enabled in the cluster configuration for the host, but it does not try to unmount a storage that has hung.

Easy to solve:

umount -nlf <mountpoint>
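Before forcing the unmount, it can help to confirm that the mount is really hung. A minimal sketch, assuming GNU coreutils (timeout, stat); the function name mount_responds and the 5-second default are hypothetical choices:

```shell
# Hypothetical helper: succeed if the mountpoint answers a stat call in time.
# A hung network mount typically blocks stat until timeout kills it.
# Note: it also fails if the path simply does not exist.
# Usage: mount_responds <mountpoint> [timeout-seconds]
mount_responds() {
    timeout -s KILL "${2:-5}" stat -- "$1" >/dev/null 2>&1
}
```

Usage could look like mount_responds <mountpoint> || umount -nlf <mountpoint>. SIGKILL is used because a process blocked on a dead NFS server usually ignores milder signals.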

Criticality: low to high
Depending on the type of storage and its purpose, this failure can be anything from completely non-critical (installation images and container images) to severe, for example if your backup storage or an NFS share holding running containers fails.

Failure of internal services of Proxmox itself

In the interface, this looks like question marks on the host itself, its guests, and its storages. A host can stay in this state indefinitely, even within a cluster, but it cannot be managed from the web interface at all, and any attempt to use the Proxmox utilities from the console ends in a timeout and/or a hang.

The best cure is a reboot, via ssh or through the host's KVM console.
However, if a reboot is not acceptable, you can try stopping and restarting the Proxmox services:

for s in pveproxy spiceproxy pvestatd pve-cluster corosync; do systemctl stop $s; done
for s in corosync pve-cluster pvestatd spiceproxy pveproxy; do systemctl start $s; done

Criticality: medium
Despite the complete unmanageability of the host through the WebUI, running guests continue to work.

Cluster level failures

Replication and HA configuration mismatch

Proxmox can replicate guests between hosts on its own and restart them elsewhere if one of the hosts fails. However, these settings are not checked against each other: you can set up replication to one host and HA on another, after which an attempt to move the guest ends with it being unable to start.
Very frustrating, but easy enough to fix: either recreate the HA config for the guest so that it includes the correct host, or destroy the HA config, move /etc/pve/nodes/<host>/<lxc|qemu-server>/<guest>.conf to the correct location (the corresponding folder of a live host that holds a valid replica), and start the guest manually.
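The manual move can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name move_guest_conf and the PVE_ROOT override (there so the function can be tried outside a real cluster) are hypothetical, and the node names in the usage note are placeholders.

```shell
# Hypothetical helper: move a guest config into another node's folder.
# Usage: move_guest_conf <lxc|qemu-server> <vmid> <from-node> <to-node>
# PVE_ROOT is an assumption for dry runs; on a real host it is /etc/pve.
move_guest_conf() {
    kind=$1; vmid=$2; from=$3; to=$4
    root=${PVE_ROOT:-/etc/pve}
    src="$root/nodes/$from/$kind/$vmid.conf"
    dst="$root/nodes/$to/$kind/$vmid.conf"
    if [ ! -f "$src" ]; then
        echo "no such config: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "target already exists: $dst" >&2
        return 1
    fi
    mv -- "$src" "$dst"
}
```

For a container 101 stranded on a dead node this might be move_guest_conf lxc 101 deadnode livenode, followed by pct start 101 on the live node (qm start for a VM). The move only works while the cluster has quorum, because /etc/pve is read-only otherwise.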

Criticality: catastrophic
In fact, this is a time bomb leading to DoS. You don’t want to discover this.

ZFS replication errors

We ran into this once or twice a month before moving to version 7. It usually occurs when the replica host and the source host have somehow diverged in snapshots: replication keeps only one snapshot per replication target, and if the snapshots on the source and the target do not match, replication fails.

Solution: delete the corresponding dataset/volume on the replica side.
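Before deleting anything, it is possible to confirm that the two sides really have no snapshot in common. A minimal sketch, assuming the snapshot names have been saved to files with zfs list on each host; the function name common_snapshots and the file names in the usage note are hypothetical:

```shell
# Hypothetical helper: print snapshot names (the part after '@') that appear
# in both listings. Each argument is a file containing the output of
#   zfs list -H -r -t snapshot -o name <dataset>
# taken on the source host and on the replica host respectively.
common_snapshots() {
    a=$(mktemp)
    b=$(mktemp)
    sed 's/^[^@]*@//' "$1" | sort -u > "$a"
    sed 's/^[^@]*@//' "$2" | sort -u > "$b"
    comm -12 "$a" "$b"    # lines present in both sorted files
    rm -f "$a" "$b"
}
```

If common_snapshots source.txt replica.txt prints nothing, there is no shared base for an incremental send, which matches the failure described above.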

Criticality: catastrophic
Another version of the previous time bomb.

No quorum in the cluster

A non-obvious and hard-to-diagnose condition. Guests keep running where they were before quorum collapsed (Proxmox is generally quite careful with guests), but any operation on hosts or guests becomes impossible, including logging in through the WebUI. This happens because pmxcfs, which stores the cluster settings and replicates them between hosts, goes read-only when there is no quorum.
It can be caused by a failure of internal services, network loss, firewall misconfiguration, or even prolonged network degradation in the recent past.
You can check for quorum via ssh with the command pvecm status. Normal output will contain the following block:

Votequorum information  
Expected votes:   7  
Highest expected: 7  
Total votes:      7  
Quorum:           4     
Flags:            Quorate

This means that seven votes are expected in the cluster (Expected votes), one per host, at least four are required for quorum (Quorum), all seven hosts are currently active (Total votes), and quorum is met (Flags: Quorate). If the number of votes falls below the quorum, the cluster is blocked until quorum is restored.
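The same check can be scripted. A minimal sketch that reads pvecm status output from stdin and looks only at the Flags line shown above; the function names is_quorate and quorum_for are hypothetical, and the majority formula assumes one vote per host:

```shell
# Hypothetical helper: read `pvecm status` output on stdin and succeed only
# if the Flags line contains "Quorate".
is_quorate() {
    awk '/^Flags:/ { ok = ($0 ~ /Quorate/) } END { exit (ok ? 0 : 1) }'
}

# Votes needed for a simple majority of n votes: floor(n / 2) + 1.
quorum_for() {
    echo $(( $1 / 2 + 1 ))
}
```

For example, pvecm status | is_quorate || echo "no quorum" could run from a monitoring job, and quorum_for 7 gives the 4 seen in the output above.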

Solution: depends on the circumstances. Once, after prolonged problems in the host network, what helped was restarting the internal services on a host that seemed to have nothing to do with the observed effect (it had been a regular member of the quorum all along).

Criticality: catastrophic
All cluster functions stop working, jobs do not start, and hosts with HA can go into fencing.

Network errors

In general, Proxmox is quite tolerant of network errors, packet loss and the like. However, if UDP packets stop getting through properly, events develop as follows: regular hosts keep trying to reconnect to the cluster until they succeed (and may end up in the failed-internal-services state), while on a host configured with HA you will see fencing (see the next section).

Criticality: from high to catastrophic
Network problems are always unpleasant, even if they do not lead to degradation of the service.

Fencing (sudden reboot)!

In general, fencing works in the Proxmox user's favor. Its job, before a guest is restarted on another host according to the HA rules, is to reliably isolate the failed host and prevent two instances of the guest from fighting over the same address: after the reboot, the host does not start its guests until it has received the current cluster configuration.
This should help, for example, if your network card or switch port has failed, or if you have picked up an update that is unstable on your hardware.
However, if the network itself is having problems, it is easy to get confused when diagnosing what is going on.

Fencing happens when quorum is lost on hosts that participate in HA: each such host tries as hard as it can to reconnect to the cluster. If it cannot, it performs a hard reboot, which forcibly kills the services running on it.
In syslog it looks like a sudden reboot without a shutdown: ordinary working records, immediately followed by the boot log.
The process is especially visible on virtual machines: the console shows no usual system shutdown log, just an immediate jump to the boot screen, as if someone had pressed Reset.
Solution: fix the network, roll back the last update, or move the cluster network to a physically separate circuit, depending on what causes the fencing.
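On hosts with a persistent systemd journal, the "no shutdown before the boot log" symptom can be checked from the command line. This is a rough sketch under one loudly stated assumption: that a clean shutdown leaves journald's "Journal stopped" message near the end of the previous boot's log, while a fenced host leaves none. The function name ended_cleanly is hypothetical.

```shell
# Hypothetical helper: read the tail of a previous boot's journal on stdin
# and succeed only if it contains journald's clean-stop message.
# Assumption: a clean shutdown logs "Journal stopped"; a hard reset does not.
ended_cleanly() {
    grep -q 'Journal stopped'
}
```

A possible use is journalctl -b -1 -n 50 --no-pager | ended_cleanly || echo "previous boot ended without a shutdown". Treat the result as a hint only, and confirm it by reading the log itself.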

Criticality: catastrophic, but working as designed

Instead of a conclusion

As with any complex system, the key is to know three things: where to look, what to hold, and what to press. But the experience that gives this knowledge, as always, comes right after it was needed…

Trouble-free uptime, gentlemen!
