How Proxmox falls and rises
Proxmox is a dedicated virtualization and containerization distribution based on Debian Linux.
When your needs outgrow one iron server responsible for everything, but are not yet large enough for Kubernetes, various solutions come to the rescue that let you manage a cluster of several hosts and organize high availability, replication, and centralized backup of containers and virtual machines. Proxmox is one of them.
We have been using it for more than two years and are very satisfied: it greatly simplifies a lot of things: slicing and reserving resources, live migration (QEMU VMs only), centralized collection of metrics (without cramming an exporter/agent into each guest), and management (via WebUI, API, and ssh).
Our network has grown from three servers to a dozen, of which the number of Proxmox hosts has grown from zero to eight at the moment. But sometimes it breaks too.
What can fail with Proxmox?
In fact, a lot. Network mounts, corosync, unreleased lock…
Some failures require manual intervention, others fix themselves. Some self-remediate in a rather nasty way called fencing – especially nasty if the affected service had no clustering of its own. Some failures happen only in a cluster and never on a single host, and vice versa.
Failures of single hosts (nodes)
Stuck lock
Proxmox often hangs a padlock on a container or VM; in the interface this shows up as a lock icon on the corresponding guest. Changing the configuration, removing a backup, deploying, cloning, migrating – any of these operations takes a lock, and the release does not always complete correctly.
The problem is that a stuck lock prevents other operations that require the lock from being executed.
Fortunately, this is easy to deal with. On the corresponding Proxmox host, run
pct unlock <CTid>
for containers, or
qm unlock <VMid>
for virtual machines.
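When it is not obvious which guests are stuck, note that the lock is also recorded as a "lock: <reason>" line in the guest config files under pmxcfs, so a single grep finds all of them at once. A sketch, assuming the standard Proxmox config layout:

```shell
# List every guest config that currently holds a lock ("lock: backup" etc.)
grep -H '^lock:' /etc/pve/nodes/*/lxc/*.conf \
                 /etc/pve/nodes/*/qemu-server/*.conf 2>/dev/null
```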
Criticality: low
This type of failure affects only one guest and does not affect its functioning.
NAS failure
As a rule, it shows up as a question-mark icon on the storage. Proxmox persistently retries mounting a storage that is enabled in the cluster configuration for the corresponding host, but it does not try to unmount one that is stuck.
Easy to solve:
umount -nlf <mountpoint>
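A hung network mount is easy to confirm before unmounting: any stat on the mountpoint blocks, so bound it with a timeout. A sketch with a hypothetical mountpoint name:

```shell
# A plain stat on a dead NFS mount blocks forever; timeout turns it into a test
timeout 5 stat -t /mnt/pve/backup-nas >/dev/null 2>&1 || echo "storage is hung"
# -n: skip mtab update, -l: lazy detach, -f: force even if the server is gone
umount -nlf /mnt/pve/backup-nas
```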
Criticality: low to high
Depending on the type of storage and its purpose, this failure ranges from completely non-critical (installation images and container templates) to severe – for example, if your backup storage or an NFS share hosting running containers goes away.
Failure of internal services of Proxmox itself
In the interface, this looks like question marks on the host itself, its guests, and its storages. The host can stay in this state indefinitely, even in a cluster, but it will be completely inoperable from the web interface, and an attempt to use the Proxmox utilities from the console will end with a timeout and/or hang.
The best remedy is a reboot, over ssh or via an IP-KVM console.
However, if a restart is not acceptable, you can try stopping and restarting the Proxmox services:
for s in pveproxy spiceproxy pvestatd pve-cluster corosync; do systemctl stop $s; done
for s in corosync pve-cluster pvestatd spiceproxy pveproxy; do systemctl start $s; done
Criticality: medium
Despite the complete unmanageability of the host through the WebUI, running guests continue to work.
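After the services come back, it is worth verifying that the stack is actually healthy before touching anything else. A minimal check, assuming the service names of a stock Proxmox install:

```shell
# Each unit should report "active"
for s in corosync pve-cluster pvestatd pveproxy; do
  printf '%s: %s\n' "$s" "$(systemctl is-active "$s")"
done
# The API answering is the real test that pve-cluster is back
pvesh get /nodes >/dev/null && echo "API OK"
```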
Cluster level failures
Replication and HA configuration mismatch
Proxmox can replicate guests between hosts on its own and re-raise them if one of the hosts fails. However, no coherency check of these settings is performed: you can point replication at one host and HA at another, after which an attempt to fail the guest over ends with it being unable to start.
Very frustrating, but easy enough to fix: either recreate the HA config for the guest so that it includes the correct host, or destroy the HA entry, move the guest config /etc/pve/nodes/<host>/<lxc|qemu-server>/<guest>.conf to the correct location (the appropriate folder of the live host holding a valid replica), and start it manually.
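Since Proxmox does not cross-check these settings itself, a periodic manual audit is the only defence. A sketch of what to compare:

```shell
# Which guests replicate, and to which target nodes
cat /etc/pve/replication.cfg
# Which guests are under HA, and which groups/nodes HA may place them on
ha-manager config
ha-manager groupconfig
```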
Criticality: catastrophic
In fact, this is a time bomb leading to DoS. You don’t want to discover this.
ZFS replication errors
We met this once or twice a month before switching to version 7. It usually occurs when the host with the replica and the source host somehow diverge in snapshots: replication keeps only one snapshot per replication target, and if the snapshots on the source and target do not match, replication fails.
Solution: delete the corresponding dataset/volume on the replication target, after which the next run performs a full sync.
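A sketch of the repair on the replication target; the pool and volume names below are examples, substitute your own:

```shell
# First compare the replication snapshots on source and target
zfs list -t snapshot -o name,creation -r rpool/data/vm-100-disk-0
# If they have diverged, drop the target copy; the next replication
# run will notice the missing dataset and do a full resync
zfs destroy -r rpool/data/vm-100-disk-0
```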
Criticality: catastrophic
Another variant of the previous time bomb.
No quorum in the cluster
An unobvious and difficult-to-diagnose condition. Guests keep working (Proxmox is generally quite careful with guests) wherever they were before the quorum collapsed, but any operation on hosts or guests becomes impossible, including logging in via the WebUI. This happens because pmxcfs – which stores, and replicates between hosts, the cluster configuration – goes read-only in the absence of quorum.
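The read-only state is easy to confirm directly: a write into /etc/pve fails while the node is not quorate. A sketch; the probe file name is arbitrary:

```shell
# Succeeds with quorum, fails with "Permission denied" without it
touch /etc/pve/.quorum-probe && rm -f /etc/pve/.quorum-probe \
  || echo "pmxcfs is read-only - no quorum"
```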
It may be due to a failure of internal services, network losses, firewall misconfiguration, or even be the result of prolonged network degradation in the recent past.
You can check for quorum over ssh with pvecm status. Normal output will contain the following block:
Votequorum information
----------------------
Expected votes: 7
Highest expected: 7
Total votes: 7
Quorum: 4
Flags: Quorate
This means that there are seven hosts in the cluster (Expected votes), of which at least four are required for quorum (Quorum); right now all seven are active (Total votes) and quorum is met (Flags: Quorate). If the number of votes falls below the quorum threshold, the cluster is blocked until quorum is restored.
Solution: according to circumstances. Once, after prolonged problems in the host network, we were saved by restarting the internal services on a host that seemingly had nothing to do with the observed effect (it was consistently part of the quorum).
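When hunting for such a culprit, it helps to see every node's own view of the quorum at once. A sketch over ssh; the hostnames are placeholders:

```shell
# Each node reports its own membership view; a disagreeing node is suspect
for h in pve1 pve2 pve3; do
  printf '== %s ==\n' "$h"
  ssh "$h" "pvecm status | grep -E 'Quorate|Total votes'"
done
```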
Criticality: catastrophic
Any cluster functions stop working, jobs do not start, hosts with HA can go into fencing.
Network errors
In general, Proxmox is quite tolerant of network errors, packet loss and the like; however, if UDP packets stop flowing normally, events develop as follows:
- due to packet loss, the exchange of service information between nodes is disrupted
- the corosync quorum is rebuilt
- one or more hosts lose connection to the cluster and try to rejoin it
- if HA is configured for the host, then it performs a hard reboot
It looks something like this:
systemd[1]: rsyslog.service: Sent signal SIGHUP to main process 3087 (rsyslogd) on client request.
systemd[1]: logrotate.service: Succeeded.
systemd[1]: Finished Rotate log files.
systemd[1]: logrotate.service: Consumed 1.067s CPU time.
spiceproxy[1503847]: worker exit
spiceproxy[3908]: worker 1503847 finished
pveproxy[777811]: worker exit
pveproxy[1280387]: worker exit
pveproxy[888694]: error AnyEvent::Util: Runtime error in AnyEvent::guard callback: Can't call method "_put_session" on an undefined value at /usr/lib/x86_64-linux-gnu/perl5/5.32/AnyEvent/Handle.pm line 2259 during global destruction.
pveproxy[3902]: worker 777811 finished
pveproxy[3902]: worker 888694 finished
pveproxy[3902]: worker 1280387 finished
// The link to another host is flapping
corosync[3813]: [KNET ] link: host: 8 link: 0 is down
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] host: host: 8 has no active links
corosync[3813]: [KNET ] rx: host: 8 link: 0 is up
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
pveproxy[1643087]: got inotify poll request in wrong process - disabling inotify
pveproxy[1643087]: worker exit
// corosync UDP packets start disappearing
corosync[3813]: [TOTEM ] Token has not been received in 5266 ms
corosync[3813]: [KNET ] link: host: 1 link: 0 is down
corosync[3813]: [KNET ] link: host: 8 link: 0 is down
corosync[3813]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] host: host: 1 has no active links
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] host: host: 8 has no active links
corosync[3813]: [TOTEM ] Token has not been received in 12169 ms
corosync[3813]: [KNET ] rx: host: 1 link: 0 is up
corosync[3813]: [KNET ] host: host: 1 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] rx: host: 8 link: 0 is up
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [TOTEM ] Token has not been received in 19848 ms
// Quorum falls apart
corosync[3813]: [QUORUM] Sync members[1]: 5
corosync[3813]: [QUORUM] Sync left[4]: 1 3 4 8
corosync[3813]: [TOTEM ] A new membership (5.13620) was formed. Members left: 1 3 4 8
corosync[3813]: [TOTEM ] Failed to receive the leave message. failed: 1 3 4 8
pmxcfs[3703]: [dcdb] notice: members: 5/3703
pmxcfs[3703]: [status] notice: members: 5/3703
// HA hosts may fence themselves at this point
corosync[3813]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
corosync[3813]: [QUORUM] Members[1]: 5
// This cake is a lie: the host is blocked
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
// ...because there is no quorum
pmxcfs[3703]: [status] notice: node lost quorum
pmxcfs[3703]: [dcdb] crit: received write while not quorate - trigger resync
pmxcfs[3703]: [dcdb] crit: leaving CPG group
pve-ha-lrm[3910]: lost lock 'ha_agent_pve5_lock - cfs lock update failed - Operation not permitted
pve-ha-lrm[3910]: status change active => lost_agent_lock
// Quorum restoration
corosync[3813]: [QUORUM] Sync members[5]: 1 3 4 5 8
corosync[3813]: [QUORUM] Sync joined[4]: 1 3 4 8
corosync[3813]: [TOTEM ] A new membership (1.13624) was formed. Members joined: 1 3 4 8
pmxcfs[3703]: [status] notice: members: 1/3409, 3/7179, 4/3856, 5/3703, 8/2042
pmxcfs[3703]: [status] notice: starting data syncronisation
// The host is unblocked for the moment
corosync[3813]: [QUORUM] This node is within the primary component and will provide service.
corosync[3813]: [QUORUM] Members[5]: 1 3 4 5 8
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
pmxcfs[3703]: [status] notice: node has quorum
pmxcfs[3703]: [status] notice: received sync request (epoch 1/3409/000000A6)
pmxcfs[3703]: [status] notice: received sync request (epoch 1/3409/000000A7)
pmxcfs[3703]: [dcdb] notice: start cluster connection
// Problems persist
pmxcfs[3703]: [dcdb] crit: cpg_join failed: 14
pmxcfs[3703]: [dcdb] crit: can't initialize service
pve-ha-crm[3900]: lost lock 'ha_manager_lock - cfs lock update failed - Device or resource busy
pve-ha-crm[3900]: status change master => lost_manager_lock
// Another point where fencing can occur
pve-ha-crm[3900]: watchdog closed (disabled)
pve-ha-crm[3900]: status change lost_manager_lock => wait_for_quorum
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
// ...the same message repeats ~20 more times
pve-ha-crm[3900]: status change wait_for_quorum => slave
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
pvescheduler[1634357]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
corosync[3813]: [TOTEM ] Token has not been received in 5396 ms
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
pmxcfs[3703]: [dcdb] crit: cpg_send_message failed: 9
corosync[3813]: [TOTEM ] Token has not been received in 12299 ms
corosync[3813]: [TOTEM ] Retransmit List: b e
corosync[3813]: [KNET ] link: host: 8 link: 0 is down
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] host: host: 8 has no active links
corosync[3813]: [KNET ] rx: host: 8 link: 0 is up
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
// Everything looks fine for now
pmxcfs[3703]: [status] notice: received all states
corosync[3813]: [QUORUM] Sync members[5]: 1 3 4 5 8
corosync[3813]: [TOTEM ] A new membership (1.13630) was formed. Members
corosync[3813]: [QUORUM] Members[5]: 1 3 4 5 8
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
pmxcfs[3703]: [status] notice: cpg_send_message retried 1 times
pmxcfs[3703]: [status] notice: all data is up to date
pmxcfs[3703]: [status] notice: dfsm_deliver_queue: queue length 106
pmxcfs[3703]: [dcdb] notice: members: 1/3409, 3/7179, 4/3856, 5/3703, 8/2042
pmxcfs[3703]: [dcdb] notice: starting data syncronisation
pmxcfs[3703]: [dcdb] notice: received sync request (epoch 1/3409/000000C2)
pmxcfs[3703]: [dcdb] notice: received all states
pmxcfs[3703]: [dcdb] notice: leader is 1/3409
pmxcfs[3703]: [dcdb] notice: synced members: 1/3409, 3/7179, 4/3856, 8/2042
pmxcfs[3703]: [dcdb] notice: waiting for updates from leader
pmxcfs[3703]: [dcdb] notice: dfsm_deliver_queue: queue length 2
pmxcfs[3703]: [dcdb] notice: update complete - trying to commit (got 7 inode updates)
pmxcfs[3703]: [dcdb] notice: all data is up to date
pmxcfs[3703]: [dcdb] notice: dfsm_deliver_queue: queue length 2
pve-ha-lrm[3910]: successfully acquired lock 'ha_agent_pve5_lock'
pve-ha-crm[3900]: successfully acquired lock 'ha_manager_lock'
pve-ha-lrm[3910]: status change lost_agent_lock => active
pve-ha-crm[3900]: watchdog active
pve-ha-crm[3900]: status change slave => master
// Bad again
corosync[3813]: [TOTEM ] Token has not been received in 5363 ms
corosync[3813]: [TOTEM ] Token has not been received in 12264 ms
corosync[3813]: [QUORUM] Sync members[5]: 1 3 4 5 8
corosync[3813]: [TOTEM ] A new membership (1.1363c) was formed. Members
corosync[3813]: [QUORUM] Members[5]: 1 3 4 5 8
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
pvescheduler[1653190]: jobs: cfs-lock 'file-jobs_cfg' error: got lock request timeout
pvescheduler[1653189]: replication: cfs-lock 'file-replication_cfg' error: got lock request timeout
pveproxy[1641105]: proxy detected vanished client connection
corosync[3813]: [TOTEM ] Token has not been received in 5399 ms
corosync[3813]: [TOTEM ] Token has not been received in 12301 ms
corosync[3813]: [QUORUM] Sync members[5]: 1 3 4 5 8
corosync[3813]: [TOTEM ] A new membership (1.13648) was formed. Members
corosync[3813]: [QUORUM] Members[5]: 1 3 4 5 8
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
corosync[3813]: [TOTEM ] Token has not been received in 5402 ms
corosync[3813]: [KNET ] link: host: 8 link: 0 is down
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [KNET ] host: host: 8 has no active links
corosync[3813]: [TOTEM ] Token has not been received in 12303 ms
corosync[3813]: [KNET ] rx: host: 8 link: 0 is up
corosync[3813]: [KNET ] host: host: 8 (passive) best link: 0 (pri: 1)
corosync[3813]: [QUORUM] Sync members[5]: 1 3 4 5 8
corosync[3813]: [TOTEM ] A new membership (1.13654) was formed. Members
corosync[3813]: [QUORUM] Members[5]: 1 3 4 5 8
corosync[3813]: [MAIN ] Completed service synchronization, ready to provide service.
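corosync ships its own diagnostic tools, which are handy for confirming that UDP loss, and not something else, is to blame:

```shell
# Per-link connectivity as knet sees it from this node
corosync-cfgtool -s
# Live quorum/membership view, independent of the Proxmox layer
corosync-quorumtool -s
```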
Criticality: from high to catastrophic
Network problems are always unpleasant, even if they do not lead to degradation of the service.
Regular hosts will keep trying to reconnect to the cluster until they succeed (and may end up in the failed-internal-services state described above), while a host configured with HA will treat you to…
Fencing (sudden reboot)!
In general, fencing works in the Proxmox user's favor. Its task, before the guest is re-raised on another host according to the HA rules, is to reliably isolate the failed host and prevent two instances of the guest from conflicting over the same address: after the reboot, the host does not start its guests until it receives the current cluster configuration.
This should help, for example, if a network card or a switch port has died, or you have caught an update that is unstable on your hardware.
However, if the network itself is experiencing problems, diagnosing what is happening can easily be confusing.
Fencing happens when quorum is lost on hosts that participate in HA: each such host tries as hard as it can to reconnect to the cluster, and if it cannot, it performs a hard reboot, forcibly killing the services running on it.
In syslog it looks like a sudden reboot without a shutdown sequence: ordinary working records, then immediately the boot log.
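On systems with a persistent journal, the same abrupt cut-off can be read after the fact from the previous boot (-b -1 selects the boot before the current one):

```shell
# Tail of the previous boot: after a fence it ends mid-work,
# with no shutdown sequence at all
journalctl -b -1 -n 50 --no-pager
```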
Solution: disinfect the network, roll back the last update, move the cluster network to a physically separate circuit – depending on what causes fencing.
The process is especially visible on virtual machines: the physical console shows no usual system shutdown log, just an immediate jump to the boot splash, as if someone had pressed Reset.
Criticality: catastrophic, but standard
Instead of an afterword
As with any complex system, the key is to know three things: where to look, what to hold, and what to press. But the experience that gives this knowledge, as always, arrives right after it was needed…
Trouble-free uptime, gentlemen!