Detecting and Recovering Unhealthy Nodes in Microservice Architectures
This article explores various service health‑checking techniques in microservice environments, detailing how consumers, providers, and registration centers can identify unhealthy nodes through passive and active checks, heartbeat mechanisms, TCP connection monitoring, and registration‑center probing, while weighing trade‑offs in reliability, timeliness, and resource consumption.
Several months ago I wrote an article titled "Four Experiments to Fully Understand TCP Connection Termination" and left service health checking as an open thread; this article fills that gap.
In a microservice architecture, a provider may have multiple nodes, and a consumer selects a healthy node based on load‑balancing. Determining whether a provider node is healthy is the core of service health checking.
A healthy node can respond normally to consumer requests; an unhealthy node cannot, whether because of power loss, network outage, hardware failure, high latency, a process crash, or an inability to process requests. In short, an unhealthy node is one that can no longer handle requests yet has not been removed from traffic. The causes fall into three categories:
System anomalies such as power loss, network loss, hardware failures, or OS crashes.
Process crashes, including abrupt termination or kill signals before deregistration.
Process unable to handle requests, e.g., during a full GC pause.
Detecting the healthy → unhealthy transition is called probing death, and the unhealthy → healthy transition probing aliveness; this article refers to both directions collectively as liveness probing.
Consumer Passive Health Check
The simplest method is passive checking on the consumer side: if a request to a provider fails, the consumer removes that provider. To avoid false positives from transient network glitches, a time window and failure threshold (N failures) are used before removal, and the provider is restored after the window expires.
Tools like Nginx support this out of the box: the max_fails and fail_timeout directives configure the failure count and time window after which an upstream server is marked unavailable, based on connection errors or specific HTTP status codes (4xx, 5xx). The drawback is that real traffic is needed for detection, so some requests inevitably fail or slow down before the threshold is reached.
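A minimal sketch of this bookkeeping on the consumer side, assuming a hypothetical markFailure/isAvailable API called by the load balancer; the threshold and window values are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Passive health check: mark a provider unavailable after N failures
// within a time window, and restore it once the window expires.
public class PassiveHealthChecker {
    private static final int FAILURE_THRESHOLD = 3;    // N failures...
    private static final long WINDOW_MILLIS = 10_000;  // ...within 10s

    private static class State {
        int failures;
        long windowStart;   // start of the current failure-counting window
        long removedUntil;  // provider is skipped until this timestamp
    }

    private final Map<String, State> states = new ConcurrentHashMap<>();

    // Called by the load balancer before routing a request.
    public boolean isAvailable(String providerAddress) {
        State s = states.get(providerAddress);
        return s == null || System.currentTimeMillis() >= s.removedUntil;
    }

    // Called when a request to the provider fails.
    public void markFailure(String providerAddress) {
        long now = System.currentTimeMillis();
        State s = states.computeIfAbsent(providerAddress, k -> new State());
        synchronized (s) {
            if (now - s.windowStart > WINDOW_MILLIS) {
                s.windowStart = now;  // too long since last failure: new window
                s.failures = 0;
            }
            if (++s.failures >= FAILURE_THRESHOLD) {
                // Remove from traffic; the provider is restored
                // automatically once the window expires.
                s.removedUntil = now + WINDOW_MILLIS;
                s.failures = 0;
            }
        }
    }
}
```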
Consumer Active Health Check
Passive checks rely on real traffic; an active approach uses a side‑channel to probe providers. Alibaba’s Tengine open‑sources the nginx_upstream_check_module for active health checks. The side‑channel continuously probes providers, marking them unavailable when anomalies are detected and restoring them when they recover.
Some RPC frameworks also embed active health checks on the consumer side, such as iQIYI’s Dubbo implementation.
Reference: "iQIYI’s Microservice Architecture Practice under Dubbo" (https://developer.aliyun.com/article/771495)
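As a rough illustration of the consumer-side active approach, the sketch below runs a probe loop on a side channel; a plain TCP connect stands in for whatever probe a real framework sends, and the intervals are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Active health check over a side channel: probe every provider on a
// fixed schedule, independent of real request traffic.
public class ActiveHealthChecker {
    private final Set<String> unavailable = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(Iterable<InetSocketAddress> providers) {
        scheduler.scheduleWithFixedDelay(() -> {
            for (InetSocketAddress provider : providers) {
                if (probe(provider)) {
                    unavailable.remove(provider.toString()); // recovered
                } else {
                    unavailable.add(provider.toString());    // probing death
                }
            }
        }, 0, 5, TimeUnit.SECONDS); // probe interval is illustrative
    }

    public boolean isAvailable(InetSocketAddress provider) {
        return !unavailable.contains(provider.toString());
    }

    // Simplest possible probe: can we open a TCP connection in time?
    private boolean probe(InetSocketAddress provider) {
        try (Socket socket = new Socket()) {
            socket.connect(provider, 1_000); // 1s connect timeout
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```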
Provider Heartbeat Reporting
When a registry is present, health checking can be delegated to it. The simplest method is heartbeat: after registration, a provider continuously sends heartbeat packets to the registry. The registry marks the provider unavailable if heartbeats stop for a prolonged period and restores it when they resume.
Components like etcd and Redis support lease renewal, enabling similar mechanisms. Nacos 1.x uses temporary instances for this purpose.
Shorter heartbeat intervals improve detection timeliness but increase resource consumption on the registry, which must process every heartbeat and update each provider's last‑heartbeat timestamp.
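A sketch of the registry-side bookkeeping, with the transport that actually delivers heartbeat packets elided and the expiry timeout illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Registry-side heartbeat bookkeeping: each heartbeat refreshes the
// provider's last-seen timestamp; a background sweeper marks providers
// unavailable once heartbeats stop for too long.
public class HeartbeatRegistry {
    private static final long EXPIRE_MILLIS = 15_000; // e.g. 3 missed 5s beats

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final Map<String, Boolean> available = new ConcurrentHashMap<>();

    public HeartbeatRegistry() {
        ScheduledExecutorService sweeper =
                Executors.newSingleThreadScheduledExecutor();
        sweeper.scheduleWithFixedDelay(this::sweep, 5, 5, TimeUnit.SECONDS);
    }

    // Called whenever a heartbeat packet arrives from a provider.
    public void onHeartbeat(String provider) {
        lastHeartbeat.put(provider, System.currentTimeMillis());
        available.put(provider, true); // heartbeats resumed: restore
    }

    // Periodically expire providers whose heartbeats have stopped.
    private void sweep() {
        long now = System.currentTimeMillis();
        lastHeartbeat.forEach((provider, seen) -> {
            if (now - seen > EXPIRE_MILLIS) {
                available.put(provider, false); // probing death
            }
        });
    }

    public boolean isAvailable(String provider) {
        return available.getOrDefault(provider, false);
    }
}
```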
Provider‑Registry Session Keep‑Alive
To reduce heartbeat overhead, TCP connections can serve as implicit health checks. However, relying solely on TCP has limitations, especially when the OS crashes or the network is lost, which may prevent timely detection.
With the operating system's TCP keepalive at its default settings (a two‑hour idle timer on Linux), a dead peer may go unnoticed for hours if no data is exchanged. An application‑level heartbeat (e.g., at a one‑minute interval) is therefore layered on top for faster detection.
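A sketch of that combination on one end of a long connection; the heartbeat frame format is a placeholder and onConnectionDead is a hypothetical hook:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Long-connection keep-alive: enable OS-level TCP keepalive as a
// backstop, and layer a one-minute application-level heartbeat on top
// so a dead peer is noticed in minutes rather than hours.
public class SessionKeepAlive {
    private static final byte[] PING_FRAME = {0x0F}; // placeholder frame

    public void keepAlive(Socket connection) throws IOException {
        // OS-level keepalive: on Linux the first probe is sent only
        // after tcp_keepalive_time (7200s by default).
        connection.setKeepAlive(true);

        ScheduledExecutorService heartbeats =
                Executors.newSingleThreadScheduledExecutor();
        heartbeats.scheduleWithFixedDelay(() -> {
            try {
                OutputStream out = connection.getOutputStream();
                out.write(PING_FRAME); // application-level heartbeat
                out.flush();
            } catch (IOException e) {
                // Write failed: the connection is gone, so mark the peer
                // dead immediately instead of waiting for the OS timer.
                heartbeats.shutdown();
                onConnectionDead(connection);
            }
        }, 60, 60, TimeUnit.SECONDS); // one-minute interval
    }

    // Hook for the registry or consumer to react to a dead session,
    // e.g. mark the provider unavailable and close the socket.
    protected void onConnectionDead(Socket connection) {
    }
}
```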
Statistics show that system anomalies, the one failure class a bare TCP connection can miss, account for only about 1% of failures, so the probability of missing a dead node is relatively low.
Common implementations include Zookeeper used by Dubbo, Nacos 2.x temporary instances, and SOFARegistry.
SOFARegistry introduces a "connection‑sensitive" long‑connection model, which combines TCP with an application‑level heartbeat encapsulated in the SOFABolt framework.
This approach can cause frequent up/down events if network conditions are poor, as TCP connections may drop.
Registry Active Probing
Beyond the methods above, a registry (or a dedicated component) can actively probe providers. Active probing can check port liveness, HTTP endpoint responses, or even protocol‑specific health (MySQL, Redis, etc.).
Nacos’s permanent instances employ this strategy. Active probing requires high availability of the probing component, often achieved with master‑slave replication or distributed coordination.
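A sketch of both probe levels, assuming a hypothetical healthUrl exposed by the provider; the timeouts are illustrative:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Registry-side active probing: check TCP port liveness, or go one
// level up and call an HTTP health endpoint.
public class RegistryProber {
    private final HttpClient http = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(1))
            .build();

    // Port-level probe: the process is at least accepting connections.
    public boolean probePort(String host, int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 1_000);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // HTTP-level probe: the process can actually serve a request.
    public boolean probeHttp(String healthUrl) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(healthUrl))
                .timeout(Duration.ofSeconds(2))
                .GET()
                .build();
        try {
            HttpResponse<Void> response =
                    http.send(request, HttpResponse.BodyHandlers.discarding());
            return response.statusCode() == 200;
        } catch (IOException | InterruptedException e) {
            return false;
        }
    }
}
```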
Three key evaluation metrics for health checks are:
Functionality – can the issue be detected?
Timeliness – how quickly is it detected?
Stability – does the check produce false positives?
Semantic active probing offers the strongest functionality, while long‑connection session keep‑alive provides the best timeliness. Stability is managed by limiting the proportion of nodes that can be removed (e.g., no more than 50% of a cluster) to avoid cascading failures.
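A sketch of such a protection threshold, using the 50% cap above and names of my own choosing:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Stability guard: never remove more than a fixed fraction of a
// cluster, even if health checks report more nodes as dead. This
// trades some accuracy for protection against mass false positives,
// e.g. a network partition between the checker and the cluster.
public class RemovalGuard {
    private static final double MAX_REMOVAL_RATIO = 0.5; // at most 50%

    // Returns the set of nodes that may actually be taken out of traffic.
    public static Set<String> applyCap(List<String> allNodes,
                                       Set<String> reportedDead) {
        int cap = (int) (allNodes.size() * MAX_REMOVAL_RATIO);
        if (reportedDead.size() <= cap) {
            return reportedDead;
        }
        // Over the cap: likely a checker-side problem, so remove only
        // up to the cap and keep the rest serving despite failed checks.
        return reportedDead.stream().limit(cap).collect(Collectors.toSet());
    }
}
```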