What Happened
A node in the Proxmox cluster became unreachable (ARP unreachable). In a 2-node cluster without an external QDevice this means quorum is lost, and both nodes immediately block all VM operations: no start, no stop, no migration.
This is not a bug but Corosync’s design. Quorum requires a strict majority of the votes, and with 2 votes that means both of them (2 of 2). If one node fails, the survivor holds 1 of 2 votes, which is not a majority.
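The majority rule can be checked with a few lines of shell arithmetic. This is a minimal sketch that assumes every voter carries exactly one vote; the function names `quorum_needed` and `failures_tolerated` are illustrative and not part of any Proxmox or Corosync tool.

```shell
# Votes needed for quorum: a strict majority of the expected votes.
quorum_needed() {
  echo $(( $1 / 2 + 1 ))
}

# Voters that may fail before quorum is lost (one vote per voter assumed).
failures_tolerated() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

echo "2 votes: quorum $(quorum_needed 2), tolerates $(failures_tolerated 2) failure(s)"
echo "3 votes: quorum $(quorum_needed 3), tolerates $(failures_tolerated 3) failure(s)"
```

With 2 votes the cluster tolerates no failure at all, while a third vote allows one voter to drop out.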
Diagnosis
# Check quorum status
pvecm status
# Output:
# Quorum information
# Nodes: 2
# Expected votes: 2
# Total votes: 1 ← Problem: only 1/2 votes
# Quorum votes: 2
# Flags: Quorate: no ← No quorum
VMs located on the reachable node could not be started even though the node itself was fully online.
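For scripted monitoring, the same check could be automated. The following is a rough sketch that reads `pvecm status` output on standard input and looks for a `Quorate:` field set to yes, as in the output above; the helper name `is_quorate` is hypothetical, and the exact field layout may differ between Proxmox versions.

```shell
# Hypothetical helper: exits 0 if the `pvecm status` text on stdin reports
# the cluster as quorate. Assumes a "Quorate: Yes/No" field in the output.
is_quorate() {
  grep -Eiq 'Quorate:[[:space:]]*yes'
}

# Usage on a cluster node (commented out so the snippet has no side effects):
# if pvecm status | is_quorate; then echo "quorate"; else echo "NO QUORUM"; fi
```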
Root Cause
The root cause was an ARP table inconsistency between the node and the switch. The node itself was still reachable (SSH worked), but on the Corosync heartbeat interface it was reported as unreachable because of stale ARP entries.
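To spot this kind of problem, the kernel's neighbor (ARP) table can be inspected on the Corosync interface. The helper below is a sketch that filters `ip neigh show` output for entries in the STALE, FAILED, or INCOMPLETE state. The function name and the interface name `vmbr1` are assumptions for illustration. A STALE entry on its own is normal; it only becomes suspicious when it belongs to the peer that Corosync cannot reach.

```shell
# Hypothetical helper: reads `ip neigh show` output on stdin and prints the
# address and state of every entry that is not confirmed reachable.
suspect_neighbors() {
  awk '$NF ~ /^(STALE|FAILED|INCOMPLETE)$/ { print $1, $NF }'
}

# Usage on the affected node (the interface name is an assumption):
# ip neigh show dev vmbr1 | suspect_neighbors
# ip neigh flush dev vmbr1    # drop cached entries so they are resolved again
```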
Immediate Action
# On the still reachable node:
# Force quorum with 1 vote (CAUTION: emergency only)
pvecm expected 1
# Then start VMs, stabilize system
Warning: pvecm expected 1 is an emergency measure and creates a split-brain risk if both nodes are actually running in parallel. Only use it when you are certain the other node is truly offline.
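One way to make that check harder to skip is a small guard that probes the peer on every known address before quorum is forced. This is only a sketch: the function name `peer_is_down`, the `PROBE_CMD` variable, and the addresses are hypothetical. A missing ping reply does not prove that a node is powered off, as this incident showed, so an out-of-band check (console, IPMI) is still advisable.

```shell
# Hypothetical guard: succeeds only if none of the given peer addresses answers.
# PROBE_CMD is the reachability check; it defaults to two ICMP echo requests.
PROBE_CMD="${PROBE_CMD:-ping -c 2 -W 2}"

peer_is_down() {
  # Refuse to report "down" when no address was given at all.
  [ "$#" -gt 0 ] || return 2
  for addr in "$@"; do
    # Intentionally unquoted so the command and its options are split.
    if $PROBE_CMD "$addr" >/dev/null 2>&1; then
      echo "peer still answers on $addr, do not force quorum" >&2
      return 1
    fi
  done
  return 0
}

# Usage (placeholders for the peer's management and Corosync addresses):
# peer_is_down 192.0.2.10 198.51.100.10 && pvecm expected 1
```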
Permanent Solution: Corosync QDevice
A QDevice is a lightweight third voter: not a full Proxmox node, but a small service running on any Linux system. With a QDevice the cluster has 3 votes (Node A + Node B + QDevice) and quorum is reached at 2 of 3. If one node fails, the remaining node and the QDevice together still have quorum.
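# Install the QDevice daemon (on the external system, e.g. a small VPS)
apt install corosync-qnetd
# Install the QDevice client (on every Proxmox node)
apt install corosync-qdevice
# Add the QDevice to the Proxmox cluster (run on one Proxmox node)
pvecm qdevice setup <qdevice-host>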
# Check status
pvecm status
# Expected votes: 3
# Quorum votes: 2 ← Now quorum with 1 failed node
Takeaways
- 2-node cluster without QDevice = single point of failure, regardless of which node fails.
- ARP issues are insidious: the node is reachable, yet Corosync doesn’t see it.
- QDevice costs almost no resources: a t2.nano or similar mini VPS is sufficient.
- Document recovery: the next incident will come, and you don’t want to search for instructions then.
Recovery Time
From recognizing the problem to restoring normal operations: ~25 minutes. 15 minutes diagnosis (identifying ARP as root cause), 10 minutes recovery.