
When a Server Silently Crashes, How Long Can Your Cluster Survive? Inside the Heartbeat Failover Mechanism

The article explains how distributed systems detect silently dead nodes using heartbeat mechanisms—both push and pull models—covers trade‑offs between interval and timeout, introduces advanced detectors like Cassandra's Φ, gossip protocols, and quorum rules, and shows real‑world implementations in Kubernetes and etcd.

TonyBai

Mar 20, 2026

"I'm alive!" — The Underlying Logic of Heartbeats

The simplest heartbeat mechanism is a "life‑or‑death contract" between nodes: a node periodically sends an "I am alive" pulse, and the receiver uses the arrival (or absence) of that pulse to decide whether the sender is still functional.

This is a push model where the node actively reports its status.

Below is a minimal Go implementation of a heartbeat sender and monitor:

package main

import (
    "fmt"
    "sync"
    "time"
)

// Heartbeat message struct
type Heartbeat struct {
    NodeID    string
    Timestamp time.Time
    Sequence  uint64
}

// ----------------- Heartbeat sender -----------------
func StartHeartbeatSender(nodeID string, interval time.Duration) {
    go func() {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        var seq uint64 = 0
        for range ticker.C {
            seq++
            hb := Heartbeat{NodeID: nodeID, Timestamp: time.Now(), Sequence: seq}
            // Simulate sending heartbeat over network
            fmt.Printf("Node %s sends heartbeat: Seq %d\n", hb.NodeID, hb.Sequence)
        }
    }()
}

// ----------------- Heartbeat monitor -----------------
type Monitor struct {
    mu             sync.RWMutex
    lastHeartbeats map[string]time.Time
    timeout        time.Duration
}

func NewMonitor(timeout time.Duration) *Monitor {
    return &Monitor{lastHeartbeats: make(map[string]time.Time), timeout: timeout}
}

func (m *Monitor) ReceiveHeartbeat(hb Heartbeat) {
    m.mu.Lock()
    defer m.mu.Unlock()
    m.lastHeartbeats[hb.NodeID] = time.Now() // record local receipt time
}

// Check if a node is dead
func (m *Monitor) IsNodeDead(nodeID string) bool {
    m.mu.RLock()
    defer m.mu.RUnlock()
    lastSeen, exists := m.lastHeartbeats[nodeID]
    if !exists {
        return true
    }
    return time.Since(lastSeen) > m.timeout
}

Besides the push model, there is a pull model where a monitor actively queries nodes, as seen in Kubernetes liveness probes or Prometheus metric scraping.

Architectural Trade‑off: Heartbeat Interval vs Timeout

Choosing the right interval and timeout is a classic trade‑off:

Too fast (e.g., every 500 ms): rapid failure detection but high bandwidth and false alarms under slight network jitter.

Too slow (e.g., every 30 s): low overhead but a dead node may go unnoticed for long, causing massive request timeouts.

A common rule of thumb is to set the timeout to roughly ten times the average RTT (typically <10 ms on a LAN) or three times the heartbeat interval, whichever is larger. For example, with a 5 ms RTT and a 1 s interval the two candidates are 50 ms and 3 s, so the timeout is 3 s. The following Go function computes a reasonable timeout:

// Calculate a reasonable timeout
func CalculateTimeout(rtt time.Duration, interval time.Duration) time.Duration {
    rttBased := rtt * 10
    intervalBased := interval * 3
    // Choose the larger to avoid mis‑judgment due to occasional jitter
    if rttBased > intervalBased {
        return rttBased
    }
    return intervalBased
}

Robust systems also tolerate a few missed heartbeats (typically 3‑5) before evicting a node from the load‑balancer pool.

When Basic Mechanisms Fail, How Big Companies Design Fault Detection

Fixed timeouts become problematic at massive scale because network conditions vary. Leading open‑source projects adopt more sophisticated detectors.

Advanced Algorithm 1: Cassandra's Φ Accrual Failure Detector

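Cassandra records the inter‑arrival times of recent heartbeats and computes a suspicion value Φ from how long the current silence has lasted relative to that history. A slight delay (e.g., 1 s) raises Φ only modestly, so the node is not marked dead. When Φ exceeds a threshold (phi_convict_threshold, default 8), the node is considered offline. Because Φ is defined as −log10 of the probability that the heartbeat is merely late, Φ = 8 corresponds to roughly a 10⁻⁸ chance, under the detector's model, that the node is in fact still alive.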

Note: The Φ algorithm originates from the paper "The φ accrual failure detector".

Advanced Algorithm 2: Decentralized Gossip Protocol

In very large clusters, a central monitor would become a bottleneck. Gossip lets each node randomly exchange heartbeat lists with a few peers, spreading failure information exponentially, similar to gossip spreading in a village.

// Minimal Gossip node state merge logic
type GossipNode struct {
    NodeID           string
    HeartbeatCounter uint64
}

// Merge received gossip list into local view
func MergeGossipList(local map[string]uint64, received map[string]uint64) {
    for nodeID, receivedCount := range received {
        localCount, exists := local[nodeID]
        // Keep the larger counter (proves newer information)
        if !exists || receivedCount > localCount {
            local[nodeID] = receivedCount
        }
    }
}
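As a usage illustration, the sketch below runs one gossip exchange and then flags peers whose counter did not advance between two snapshots of the local view. `View` and `Stalled` are hypothetical names; a real system would flag a peer only after its counter has been stuck for a configured time, not after a single round.

```go
package main

import (
	"fmt"
	"sort"
)

// View maps node ID to the highest heartbeat counter seen for that node.
type View map[string]uint64

// Merge folds a received view into the local one, keeping the larger counter.
func (v View) Merge(received View) {
	for id, n := range received {
		if cur, ok := v[id]; !ok || n > cur {
			v[id] = n
		}
	}
}

// Stalled returns, in sorted order, the IDs whose counter has not advanced
// since the previous snapshot. These are candidates for suspicion.
func Stalled(prev, cur View) []string {
	var ids []string
	for id, n := range cur {
		if old, ok := prev[id]; ok && n <= old {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	// Node A's view at the end of the previous round.
	prev := View{"A": 5, "B": 3, "C": 9}

	// This round: A bumps its own counter, then hears from B.
	cur := View{"A": 6, "B": 3, "C": 9}
	cur.Merge(View{"A": 5, "B": 4})

	fmt.Println(cur)                // merged view
	fmt.Println(Stalled(prev, cur)) // peers with no progress this round
}
```

B's stale entry for A (5) does not overwrite A's newer value (6), B's own counter advances to 4, and C is the only node with no progress.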

The Ultimate Nightmare: Split‑Brain and Quorum Rules

Network partitions can cause two halves of a cluster to lose contact, each believing the other is dead and electing its own leader, leading to split‑brain where concurrent writes diverge.

Quorum safeguards require a strict majority of the nodes, ⌊N/2⌋ + 1, to be reachable before the cluster accepts writes. Two disjoint groups cannot both hold a majority, so at most one side of a partition can keep writing.

// Quorum‑based protection logic
type QuorumMonitor struct { TotalNodes int }

func (q *QuorumMonitor) HasQuorum(reachableNodes int) bool {
    quorumSize := (q.TotalNodes / 2) + 1
    return reachableNodes >= quorumSize
}

func (q *QuorumMonitor) CanAcceptWrites(reachableNodes int) bool {
    if !q.HasQuorum(reachableNodes) {
        fmt.Println("Quorum lost! Reject all writes to prevent split‑brain!")
        return false
    }
    return true
}

When a partition occurs, any side that cannot reach a majority (including both halves of an even split) automatically stops serving writes, preserving data integrity.
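A short worked example shows why odd cluster sizes are usually preferred. `QuorumSize` and `FaultTolerance` are hypothetical helpers that apply the same N/2 + 1 integer arithmetic as the snippet above.

```go
package main

import "fmt"

// QuorumSize returns the smallest strict majority of n nodes.
func QuorumSize(n int) int { return n/2 + 1 }

// FaultTolerance returns how many nodes can be lost while a majority remains.
func FaultTolerance(n int) int { return n - QuorumSize(n) }

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerates=%d\n", n, QuorumSize(n), FaultTolerance(n))
	}
	// A 4-node cluster split 2/2: neither side reaches the quorum of 3.
	fmt.Println("2 of 4 has quorum:", 2 >= QuorumSize(4))
}
```

Going from three nodes to four raises the quorum from 2 to 3 but still tolerates only one failure, and an even split leaves both sides read-only; the fifth node is what buys tolerance for a second failure.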

Summary: Heartbeat Mechanisms You Use Every Day

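Kubernetes: by default the kubelet reports to the control plane about every 10 s. If the node controller hears nothing for its grace period (40 s in the defaults described here; the value of node-monitor-grace-period can differ between releases), it sets the node's Ready condition to Unknown and the node shows up as NotReady. Pod liveness/readiness probes are pull‑model heartbeats.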

etcd: Based on Raft, the leader sends heartbeats to followers every 100 ms by default; a follower that hears nothing for the election timeout (1000 ms by default) starts a new election.

When designing high‑availability microservices, avoid naïvely using fixed sleeps or simple error checks. Consider network latency, retry tolerance, gossip dissemination, and quorum protection—heartbeat mechanisms are the last line of defense against system avalanches.

References:

https://arpitbhayani.me/blogs/phi-accrual

https://arpitbhayani.me/blogs/heartbeats-in-distributed-systems

https://www.semanticscholar.org/paper/The-spl-phi-accrual-failure-detector-Hayashibara-D%C3%A9fago/11ae4c0c0d0c36dc177c1fff5eb84fa49aa3e1a8
