How Ceph Detects Node Failures: Heartbeat, Reporting, and Monitor Strategies
This article explains Ceph's fault detection mechanism, detailing how OSD peers exchange heartbeats, report failures to the Monitor, and how the Monitor aggregates reports and applies configurable thresholds to reliably identify and handle downed OSD nodes in a distributed storage cluster.
Background Introduction
Node failure detection is an unavoidable issue in distributed systems; clusters must sense node liveness and adjust accordingly. Heartbeat mechanisms are commonly used, assuming a node that maintains regular heartbeats can provide services. An effective fault‑detection strategy should be timely, apply appropriate pressure, tolerate network jitter, and propagate state changes throughout the cluster.
Ceph Fault Detection Mechanism
Ceph uses a centralized architecture in which the Ceph Monitor maintains cluster metadata. When a node's status changes, the Monitor must discover the change, update its map, and notify all OSDs. Having the Monitor heartbeat every OSD directly would overload it at large scale, so Ceph distributes the detection workload to the OSDs and clients themselves.
1. OSD‑to‑OSD Heartbeat
OSDs belonging to the same placement group (PG) act as peer OSDs, exchanging PING/PONG messages and recording timestamps. If a peer OSD times out, the OSD adds it to a failure_queue for later reporting.
Parameters:
osd_heartbeat_interval (default 6 s): interval for sending ping messages, with a random jitter to avoid spikes.
osd_heartbeat_grace (default 20 s): time without a reply after which the peer is considered down.
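The peer-heartbeat bookkeeping described above can be modeled in a few lines. The following is a simplified sketch, not Ceph's C++ implementation; the class and method names are illustrative, while the two constants mirror `osd_heartbeat_interval` and `osd_heartbeat_grace`:

```python
import random
import time

HEARTBEAT_INTERVAL = 6.0   # mirrors osd_heartbeat_interval (seconds)
HEARTBEAT_GRACE = 20.0     # mirrors osd_heartbeat_grace (seconds)

class PeerHeartbeat:
    """Sketch of one OSD tracking its PG peers (names are illustrative)."""

    def __init__(self, peers):
        now = time.monotonic()
        self.last_rx = {p: now for p in peers}  # last PONG timestamp per peer
        self.failure_queue = set()              # peers to report to the Monitor

    def next_ping_delay(self):
        # Randomize the interval so all OSDs do not ping in lockstep.
        return random.uniform(0.5, 1.0) * HEARTBEAT_INTERVAL

    def on_pong(self, peer, now):
        # A reply arrived: refresh the timestamp and clear any queued failure.
        self.last_rx[peer] = now
        self.failure_queue.discard(peer)

    def check_peers(self, now):
        # Any peer silent longer than the grace period is queued for reporting.
        for peer, last in self.last_rx.items():
            if now - last > HEARTBEAT_GRACE:
                self.failure_queue.add(peer)
        return self.failure_queue
```

The random factor in `next_ping_delay` corresponds to the jitter mentioned above: it spreads pings over time so a large cluster does not generate synchronized traffic spikes.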
2. OSD Reports Peer Failures to Monitor
The OSD periodically checks failure_queue, sends failure reports to the Monitor, moves the entries to failure_pending, and removes them from the queue. If the OSD later receives a heartbeat from a peer it has already reported, it sends the Monitor a message cancelling that failure report.
If the connection to the Monitor is lost and later restored, pending reports are re‑queued for retransmission.
Monitor's handling of reports:
Collects failure reports from OSDs.
When a reported failure has persisted beyond the (possibly dynamically adjusted) grace period and enough distinct reporters, aggregated at the configured subtree level, confirm it, the Monitor marks the OSD down.
Parameters:
osd_heartbeat_grace (20 s): threshold to confirm OSD failure.
mon_osd_reporter_subtree_level ("host"): level at which reports are aggregated.
mon_osd_min_down_reporters (2): minimum distinct reporters required.
mon_osd_adjust_heartbeat_grace (true): adjusts the grace period based on historical latency.
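The aggregation rule can be illustrated with a short sketch. This is a simplified model with invented names, not the Monitor's real data structures; the two constants mirror `mon_osd_min_down_reporters` and `osd_heartbeat_grace`, and reporters are collapsed by host to reflect `mon_osd_reporter_subtree_level = "host"`:

```python
MIN_DOWN_REPORTERS = 2   # mirrors mon_osd_min_down_reporters
GRACE = 20.0             # mirrors osd_heartbeat_grace (seconds)

class MonitorFailureTracker:
    """Sketch of Monitor-side report aggregation (names are illustrative)."""

    def __init__(self, reporter_host):
        self.reporter_host = reporter_host  # reporter OSD -> host (subtree level)
        self.reports = {}                   # target OSD -> {host: first report time}

    def on_failure_report(self, target, reporter, now):
        # Collapse reporters by host; keep the earliest report time per host.
        host = self.reporter_host[reporter]
        self.reports.setdefault(target, {}).setdefault(host, now)

    def on_cancel(self, target, reporter):
        # The source OSD withdrew its report: drop that host's entry.
        host = self.reporter_host[reporter]
        self.reports.get(target, {}).pop(host, None)

    def should_mark_down(self, target, now):
        by_host = self.reports.get(target, {})
        # Enough distinct hosts AND the earliest report has aged past the grace.
        return (len(by_host) >= MIN_DOWN_REPORTERS
                and now - min(by_host.values(), default=now) >= GRACE)
```

Note how two reporters on the same host count only once: aggregating at the host level prevents a single failed machine (whose OSDs would all lose contact with the same peers) from unilaterally taking another node down.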
3. OSD‑to‑Monitor Heartbeat
When an OSD experiences PG state changes or after a set interval, it sends a MSG_PGSTATS message to the Monitor (doubling as a heartbeat). The Monitor replies with MSG_PGSTATSACK and records the timestamp as last_osd_report. Periodically, the Monitor scans these timestamps; OSDs lacking recent reports are marked down.
Parameters:
mon_osd_report_timeout (900 s): time without a report before marking Down.
osd_mon_report_interval_max (600 s): maximum interval between OSD reports.
osd_mon_report_interval_min (5 s): minimum interval between OSD reports.
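The Monitor's periodic scan of last_osd_report can be sketched as below. Again this is an illustrative model, not Ceph's implementation; the constant mirrors `mon_osd_report_timeout`:

```python
REPORT_TIMEOUT = 900.0   # mirrors mon_osd_report_timeout (seconds)

class OsdLivenessScanner:
    """Sketch of the Monitor scanning last_osd_report (names are illustrative)."""

    def __init__(self):
        self.last_osd_report = {}  # osd id -> timestamp of last MSG_PGSTATS

    def on_pg_stats(self, osd, now):
        # On MSG_PGSTATS the Monitor acks (MSG_PGSTATSACK) and records the time.
        self.last_osd_report[osd] = now

    def scan(self, now):
        # OSDs silent longer than the timeout are candidates for marking down.
        return [osd for osd, t in self.last_osd_report.items()
                if now - t > REPORT_TIMEOUT]
```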
Summary
Ceph discovers failed OSDs through peer OSD reports and Monitor‑aggregated heartbeats. It satisfies the key fault‑detection criteria:
Timeliness : Peer OSDs can detect failures within seconds and the Monitor can take OSDs down within minutes, though client writes may block due to consistency requirements.
Appropriate Pressure : By offloading detection to peers, the Monitor’s heartbeat interval can be as long as 600 s, and its detection threshold up to 900 s, reducing central pressure and improving scalability.
Network Jitter Tolerance : The Monitor waits for multiple conditions—exceeding a dynamic grace period, receiving reports from enough distinct hosts, and no cancellation from the source OSD—before declaring an OSD down.
Diffusion : After updating the OSD map, the Monitor lazily waits for OSDs and clients to pull the new state, minimizing broadcast overhead.
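The jitter-tolerance point above hinges on the dynamic grace period that mon_osd_adjust_heartbeat_grace enables: OSDs with a history of lagging are given longer before being declared down. A heavily simplified model of that idea (not Ceph's exact formula, which uses exponentially decayed estimates of laggy behavior) is:

```python
BASE_GRACE = 20.0  # mirrors osd_heartbeat_grace (seconds)

def effective_grace(laggy_probability, laggy_interval):
    """Simplified model of mon_osd_adjust_heartbeat_grace: stretch the
    grace period in proportion to how often (laggy_probability) and how
    long (laggy_interval, seconds) this OSD has historically lagged.
    Not Ceph's exact formula."""
    return BASE_GRACE + laggy_probability * laggy_interval
```

An OSD behind a flaky link that previously lagged half the time for about a minute would thus get a grace of roughly 50 s instead of 20 s, absorbing transient jitter without flapping.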
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.