When a Server Silently Crashes, How Long Can Your Cluster Survive? Inside the Heartbeat Failover Mechanism
The article explains how distributed systems detect silently dead nodes using heartbeat mechanisms—both push and pull models—covers trade‑offs between interval and timeout, introduces advanced detectors like Cassandra's Φ, gossip protocols, and quorum rules, and shows real‑world implementations in Kubernetes and etcd.
"I'm alive!" — The Underlying Logic of Heartbeats
The simplest heartbeat mechanism is a "life-or-death contract" between nodes: a node periodically sends an "I am alive" pulse, which the receiver uses to decide whether the sender is still functional.
This is a push model where the node actively reports its status.
Below is a minimal Go implementation of a heartbeat sender and monitor:
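package main

import (
	"fmt"
	"sync"
	"time"
)

// Heartbeat is the message a node sends to signal that it is alive.
type Heartbeat struct {
	NodeID    string
	Timestamp time.Time
	Sequence  uint64
}

// ----------------- Heartbeat sender -----------------

// StartHeartbeatSender emits one heartbeat per interval from a background goroutine.
func StartHeartbeatSender(nodeID string, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		var seq uint64
		for range ticker.C {
			seq++
			hb := Heartbeat{NodeID: nodeID, Timestamp: time.Now(), Sequence: seq}
			// Simulate sending the heartbeat over the network
			fmt.Printf("Node %s sends heartbeat: Seq %d\n", hb.NodeID, hb.Sequence)
		}
	}()
}

// ----------------- Heartbeat monitor -----------------

// Monitor remembers when each node was last heard from.
type Monitor struct {
	mu             sync.RWMutex
	lastHeartbeats map[string]time.Time
	timeout        time.Duration
}

func NewMonitor(timeout time.Duration) *Monitor {
	return &Monitor{lastHeartbeats: make(map[string]time.Time), timeout: timeout}
}

// ReceiveHeartbeat records the local receipt time rather than hb.Timestamp,
// so clock skew between nodes cannot distort the timeout check.
func (m *Monitor) ReceiveHeartbeat(hb Heartbeat) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.lastHeartbeats[hb.NodeID] = time.Now()
}

// IsNodeDead reports whether a node has been silent for longer than the timeout.
// A node that has never sent a heartbeat is treated as dead.
func (m *Monitor) IsNodeDead(nodeID string) bool {
	m.mu.RLock()
	defer m.mu.RUnlock()
	lastSeen, exists := m.lastHeartbeats[nodeID]
	if !exists {
		return true
	}
	return time.Since(lastSeen) > m.timeout
}

// main is a small demo that wires the two pieces together by hand.
func main() {
	monitor := NewMonitor(3 * time.Second)
	StartHeartbeatSender("node-1", time.Second)
	// The sender above only prints, so deliver one heartbeat to the monitor manually.
	monitor.ReceiveHeartbeat(Heartbeat{NodeID: "node-1", Timestamp: time.Now(), Sequence: 1})
	time.Sleep(2500 * time.Millisecond) // let the sender tick twice
	fmt.Println("node-1 dead:", monitor.IsNodeDead("node-1"))
	fmt.Println("node-2 dead:", monitor.IsNodeDead("node-2"))
}

Besides the push model, there is a pull model in which a monitor actively queries nodes, as seen in Kubernetes liveness probes or Prometheus metric scraping.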
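The pull model can be sketched in the same style. The example below is a minimal illustration, not how Kubernetes or Prometheus implement their checks: `ProbeFunc`, `HTTPProbe`, `PullMonitor`, and `ErrUnhealthy` are hypothetical names introduced here, and the two test servers stand in for real nodes. The only borrowed convention is that Kubernetes HTTP probes treat status codes from 200 up to (but not including) 400 as success.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ProbeFunc is a hypothetical health check that returns nil when the target is healthy.
type ProbeFunc func() error

// ErrUnhealthy is returned when a target answers with a failing status code.
var ErrUnhealthy = errors.New("unhealthy status code")

// HTTPProbe builds a probe that issues a GET with a hard client-side timeout.
// Status codes in [200, 400) count as healthy.
func HTTPProbe(url string, timeout time.Duration) ProbeFunc {
	client := &http.Client{Timeout: timeout}
	return func() error {
		resp, err := client.Get(url)
		if err != nil {
			return err // connection refused, timeout, DNS failure, ...
		}
		defer resp.Body.Close()
		if resp.StatusCode < 200 || resp.StatusCode >= 400 {
			return ErrUnhealthy
		}
		return nil
	}
}

// PullMonitor polls registered targets and remembers the latest result.
// It is not safe for concurrent use; a real monitor would add locking.
type PullMonitor struct {
	probes map[string]ProbeFunc
	alive  map[string]bool
}

func NewPullMonitor() *PullMonitor {
	return &PullMonitor{probes: map[string]ProbeFunc{}, alive: map[string]bool{}}
}

func (p *PullMonitor) Register(nodeID string, probe ProbeFunc) {
	p.probes[nodeID] = probe
}

// PollOnce runs every probe once and stores whether it succeeded.
func (p *PullMonitor) PollOnce() {
	for id, probe := range p.probes {
		p.alive[id] = probe() == nil
	}
}

// IsAlive reports the result of the most recent poll; unknown nodes are not alive.
func (p *PullMonitor) IsAlive(nodeID string) bool {
	return p.alive[nodeID]
}

func main() {
	// Two local test servers play the role of a healthy and a broken node.
	healthy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer healthy.Close()
	broken := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusInternalServerError)
	}))
	defer broken.Close()

	m := NewPullMonitor()
	m.Register("node-a", HTTPProbe(healthy.URL, time.Second))
	m.Register("node-b", HTTPProbe(broken.URL, time.Second))
	m.PollOnce()
	fmt.Println("node-a alive:", m.IsAlive("node-a"))
	fmt.Println("node-b alive:", m.IsAlive("node-b"))
}
```

In a long-running monitor, `PollOnce` would typically be driven by a `time.Ticker`, which puts the polling interval under the monitor's control instead of the node's.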
Architectural Trade‑off: Heartbeat Interval vs Timeout
Choosing the right interval and timeout is a classic trade‑off:
Too fast (e.g., every 500 ms): rapid failure detection but high bandwidth and false alarms under slight network jitter.
Too slow (e.g., every 30 s): low overhead but a dead node may go unnoticed for long, causing massive request timeouts.
A common rule of thumb is to set the timeout to roughly ten times the average RTT (typically under 10 ms on a LAN) or three times the heartbeat interval, whichever is larger. The following Go function computes such a timeout:
// CalculateTimeout derives a timeout from the measured RTT and the heartbeat interval.
func CalculateTimeout(rtt time.Duration, interval time.Duration) time.Duration {
	rttBased := rtt * 10
	intervalBased := interval * 3
	// Choose the larger value to avoid false positives caused by occasional jitter
	if rttBased > intervalBased {
		return rttBased
	}
	return intervalBased
}

Robust systems also tolerate a few missed heartbeats (typically 3-5) before evicting a node from the load-balancer pool.
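Tolerating a few missed heartbeats can be sketched as a per-node counter of consecutive misses. The `MissTracker` type and its methods are hypothetical names used only for illustration, and the threshold of 3 is simply the lower end of the range mentioned above.

```go
package main

import "fmt"

// MissTracker counts consecutive missed heartbeat checks per node.
type MissTracker struct {
	maxMisses int
	misses    map[string]int
}

func NewMissTracker(maxMisses int) *MissTracker {
	return &MissTracker{maxMisses: maxMisses, misses: make(map[string]int)}
}

// Observe records the outcome of one check interval for a node.
// It returns true once the node has missed maxMisses checks in a row.
// Any successful heartbeat resets the counter.
func (t *MissTracker) Observe(nodeID string, heartbeatSeen bool) bool {
	if heartbeatSeen {
		t.misses[nodeID] = 0
		return false
	}
	t.misses[nodeID]++
	return t.misses[nodeID] >= t.maxMisses
}

func main() {
	tracker := NewMissTracker(3)
	// Two misses, a recovery, then three misses in a row.
	checks := []bool{true, false, false, true, false, false, false}
	for i, seen := range checks {
		evict := tracker.Observe("node-1", seen)
		fmt.Printf("check %d: seen=%v evict=%v\n", i+1, seen, evict)
	}
}
```

Only the final check in this sequence reports an eviction, because the heartbeat at check 4 resets the counter. The cost of this tolerance is slower detection: a dead node stays in the pool for roughly `maxMisses` check intervals.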
When Basic Mechanisms Fail, How Big Companies Design Fault Detection
Fixed timeouts become problematic at massive scale because network conditions vary. Leading open‑source projects adopt more sophisticated detectors.
Advanced Algorithm 1: Cassandra's Φ Accrual Failure Detector
Cassandra records the historical inter-arrival times of heartbeats and computes a suspicion level Φ from them. A slight delay (e.g., 1 s) raises Φ only modestly, so the node is not marked dead. When Φ exceeds a threshold (default 8, which by the definition Φ = -log10(P) corresponds to roughly a 10^-8 probability that the suspicion is mistaken), the node is considered offline.
Note: The Φ algorithm originates from the paper "The φ accrual failure detector".
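The idea can be illustrated with a small sketch. It is not Cassandra's code: `PhiDetector` and its methods are hypothetical names, timestamps are plain seconds, and the inter-arrival times are assumed to follow an exponential distribution, which is a simplification (the original paper estimates the probability from a normal distribution). Under that assumption the probability of seeing no heartbeat for t seconds is exp(-t/mean), so Φ = -log10(exp(-t/mean)) = t / (mean · ln 10).

```go
package main

import (
	"fmt"
	"math"
)

// PhiDetector keeps a sliding window of heartbeat inter-arrival times in seconds.
type PhiDetector struct {
	window      []float64
	maxSize     int
	lastArrival float64
	hasLast     bool
}

func NewPhiDetector(maxSize int) *PhiDetector {
	return &PhiDetector{maxSize: maxSize}
}

// Heartbeat records a heartbeat that arrived at time now (seconds).
func (d *PhiDetector) Heartbeat(now float64) {
	if d.hasLast {
		d.window = append(d.window, now-d.lastArrival)
		if len(d.window) > d.maxSize {
			d.window = d.window[1:] // drop the oldest sample
		}
	}
	d.lastArrival = now
	d.hasLast = true
}

// Phi returns the suspicion level at time now.
// Assumption: exponential inter-arrival times, so phi = elapsed / (mean * ln 10).
// With no usable history the detector reports 0 (no suspicion).
func (d *PhiDetector) Phi(now float64) float64 {
	if len(d.window) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range d.window {
		sum += v
	}
	mean := sum / float64(len(d.window))
	if mean <= 0 {
		return 0
	}
	elapsed := now - d.lastArrival
	return elapsed / (mean * math.Ln10)
}

func main() {
	d := NewPhiDetector(100)
	for i := 0; i <= 10; i++ {
		d.Heartbeat(float64(i)) // one heartbeat per second, last one at t=10
	}
	for _, now := range []float64{11, 15, 20, 30} {
		fmt.Printf("t=%.0fs silent for %.0fs phi=%.2f\n", now, now-10, d.Phi(now))
	}
}
```

Because Φ scales with the observed mean interval, the same threshold adapts to both fast and slow links: a node that normally reports every second is convicted much sooner than one that normally reports every ten seconds.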
Advanced Algorithm 2: Decentralized Gossip Protocol
In very large clusters, a central monitor would become a bottleneck. Gossip lets each node randomly exchange heartbeat lists with a few peers, spreading failure information exponentially, similar to gossip spreading in a village.
// Minimal Gossip node state merge logic
type GossipNode struct {
	NodeID           string
	HeartbeatCounter uint64
}

// MergeGossipList merges a received gossip list into the local view.
func MergeGossipList(local map[string]uint64, received map[string]uint64) {
	for nodeID, receivedCount := range received {
		localCount, exists := local[nodeID]
		// Keep the larger counter (proves newer information)
		if !exists || receivedCount > localCount {
			local[nodeID] = receivedCount
		}
	}
}

The Ultimate Nightmare: Split‑Brain and Quorum Rules
Network partitions can cause two halves of a cluster to lose contact, each believing the other is dead and electing its own leader, leading to split‑brain where concurrent writes diverge.
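Quorum safeguards require a strict majority of the nodes (⌊N/2⌋ + 1) to be reachable before the cluster accepts writes. Because two disjoint partitions can never both hold a majority, at most one side keeps writing, which preserves consistency.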
// Quorum-based protection logic
type QuorumMonitor struct {
	TotalNodes int
}

// HasQuorum reports whether a strict majority of the cluster is reachable.
func (q *QuorumMonitor) HasQuorum(reachableNodes int) bool {
	quorumSize := (q.TotalNodes / 2) + 1 // integer division: floor(N/2) + 1
	return reachableNodes >= quorumSize
}

// CanAcceptWrites rejects writes whenever quorum is lost.
func (q *QuorumMonitor) CanAcceptWrites(reachableNodes int) bool {
	if !q.HasQuorum(reachableNodes) {
		fmt.Println("Quorum lost! Reject all writes to prevent split-brain!")
		return false
	}
	return true
}

When a partition occurs, any side that cannot reach a majority of the nodes automatically stops serving writes, preserving data integrity.
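To make the majority rule concrete, the sketch below evaluates both sides of a few example partitions. The helper names `quorumSize` and `SideCanWrite` and the cluster sizes are illustrative choices, not taken from any particular system.

```go
package main

import "fmt"

// quorumSize returns the smallest strict majority of n nodes: floor(n/2) + 1.
func quorumSize(n int) int {
	return n/2 + 1
}

// SideCanWrite reports whether one side of a partition, which can reach
// `reachable` nodes (including itself), may keep accepting writes.
func SideCanWrite(totalNodes, reachable int) bool {
	return reachable >= quorumSize(totalNodes)
}

func main() {
	// Each case splits a cluster into side A and side B = total - sideA.
	cases := []struct{ total, sideA int }{
		{5, 3}, // majority on side A
		{4, 2}, // even split: neither side has a majority
		{3, 1}, // a single isolated node
	}
	for _, c := range cases {
		sideB := c.total - c.sideA
		fmt.Printf("N=%d quorum=%d | A(%d) writes=%v | B(%d) writes=%v\n",
			c.total, quorumSize(c.total),
			c.sideA, SideCanWrite(c.total, c.sideA),
			sideB, SideCanWrite(c.total, sideB))
	}
}
```

The even split shows why clusters usually run an odd number of nodes: with four nodes the quorum is three, so a 2/2 partition leaves the whole cluster read-only, and a four-node cluster tolerates no more failures than a three-node one.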
Summary: Heartbeat Mechanisms You Use Every Day
Kubernetes: With commonly cited default settings, the kubelet sends a heartbeat every 10 s; after 40 s of silence the node is marked NotReady. Pod liveness/readiness probes are pull-model heartbeats.
etcd: Based on Raft, the leader sends heartbeats to followers every 100 ms by default; a follower that hears nothing for the election timeout (1000 ms by default) triggers a new election.
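The follower side of this rule can be sketched with a logical clock. This is an illustration under simplifying assumptions, not etcd's implementation: the `Follower` type is hypothetical, time is passed in as plain milliseconds, and the election timeout is fixed, whereas Raft randomizes it per node so that followers do not all start elections at the same moment.

```go
package main

import "fmt"

// Follower tracks when it last heard from the leader (logical milliseconds).
type Follower struct {
	electionTimeoutMs int64
	lastHeartbeatMs   int64
}

func NewFollower(electionTimeoutMs, nowMs int64) *Follower {
	return &Follower{electionTimeoutMs: electionTimeoutMs, lastHeartbeatMs: nowMs}
}

// OnHeartbeat resets the election timer.
func (f *Follower) OnHeartbeat(nowMs int64) {
	f.lastHeartbeatMs = nowMs
}

// ShouldStartElection reports whether the leader has been silent
// for at least the election timeout.
func (f *Follower) ShouldStartElection(nowMs int64) bool {
	return nowMs-f.lastHeartbeatMs >= f.electionTimeoutMs
}

func main() {
	const heartbeatIntervalMs, electionTimeoutMs = 100, 1000
	f := NewFollower(electionTimeoutMs, 0)
	for now := int64(0); now <= 2000; now += heartbeatIntervalMs {
		if now <= 500 {
			f.OnHeartbeat(now) // the leader is alive until t=500 ms, then crashes
		}
		if f.ShouldStartElection(now) {
			fmt.Printf("t=%d ms: leader silent for %d ms, starting election\n",
				now, now-f.lastHeartbeatMs)
			break
		}
	}
}
```

With a 100 ms interval and a 1000 ms timeout, the follower tolerates about ten consecutive missed heartbeats before it reacts, which is the same interval-versus-timeout trade-off discussed earlier.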
When designing high-availability microservices, avoid naively relying on fixed sleeps or simple error checks. Consider network latency, retry tolerance, gossip dissemination, and quorum protection. Heartbeat mechanisms are the last line of defense against cascading failures.
References:
https://arpitbhayani.me/blogs/phi-accrual
https://arpitbhayani.me/blogs/heartbeats-in-distributed-systems
https://www.semanticscholar.org/paper/The-spl-phi-accrual-failure-detector-Hayashibara-D%C3%A9fago/11ae4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8
TonyBai
Tony Bai's tech world (tonybai.com). Not satisfied with just "knowing how", we strive for mastery. Focused on Go language internals, high-quality engineering practices, and cloud‑native architecture, exploring cutting‑edge intersections of Go and AI. Gophers who pursue technology are welcome—follow me and evolve with Go.
