Re‑engineering a Scalable Service Health‑Check System for Cloud‑Native Ops
This article details the redesign of a service health‑check component, covering its original limitations, industry alternatives, the chosen centralized active checking approach, architectural modules, concurrency model, scaling mechanisms, gray‑release strategy, and performance optimizations for reliable distributed systems.
Background
Service Health Check Overview
Service health checks address unhealthy nodes in distributed applications. Consumers obtain provider addresses from a registry and select a machine via load balancing. If a machine goes down unnoticed, the consumer keeps routing traffic to it and requests are lost; a detection mechanism that promptly removes unhealthy nodes therefore significantly improves online service stability.
Original Implementation
The original health‑check component, part of a self‑developed registry, followed three stages:
Obtain instances (IP and port) to be checked from the registry.
Initiate a TCP connection to each address; a successful connection indicates health.
Remove unhealthy instances and restore previously unhealthy but now healthy ones via registry APIs.
The process also handles details such as excluding services that should not be probed (e.g., MySQL, Redis), requiring multiple consecutive failures before a node is marked unhealthy, and enforcing a removal threshold so an entire cluster is never taken offline at once.
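A minimal sketch of the original check loop, assuming a plain TCP dial as the health signal; probeTCP, the timeout value, and the sample addresses are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP treats a successful TCP connection as "healthy",
// mirroring the original component's port-connectivity signal.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// In the real component these addresses come from the registry.
	addrs := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	for _, addr := range addrs {
		fmt.Printf("%s healthy=%v\n", addr, probeTCP(addr, 3*time.Second))
	}
}
```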
Problems with the Original Component
1. Capacity Issue
The component was originally designed to run on a single physical machine with few instances to check. As the number of instances grew, the single-machine design became a bottleneck, which was worked around by manually sharding instances across machines via configuration.
2. Disaster‑Recovery Issue
A single‑machine deployment means that if a node crashes, its health‑check tasks fail, causing a loss of monitoring for the instances it was responsible for.
3. Deployment Efficiency Issue
Configuration changes are required for scaling or hardware replacement, leading to low operational efficiency and high error risk.
4. New‑Requirement Support Efficiency Issue
Emerging cloud‑native requirements (e.g., richer health semantics beyond simple port connectivity) and the need to reuse the component for services outside the registry created a conflict with the legacy design.
5. Stability During Iteration Issue
The original component lacked a gray‑release mechanism; new features were deployed all at once, risking large‑scale failures.
Technical Investigation
Industry Health‑Check Solutions
We examined both registry‑center and non‑registry health‑check approaches.
Registry‑Center Health Checks
Typical solutions include SDK heartbeat reporting, SDK long‑connection with heartbeat, and centralized active health checks. Each has trade‑offs such as resource consumption, implementation complexity, or centralized load pressure.
Non‑Registry Health Checks – K8s LivenessProbe
Kubernetes provides native, distributed health checks: the kubelet on each node probes the containers running locally. Compared with centralized checks, these probes support multiple protocols (TCP, HTTP, exec, gRPC) and restart unhealthy containers, but they offer no fallback strategy and are not directly applicable to our mixed physical-machine environment.
Choosing a Solution for Our Company
Given a heterogeneous stack (Java, Go, PHP, C++), we favored a thin-SDK approach to keep per-language integration costs low. Consequently, we excluded the SDK long-connection + heartbeat scheme, and also ruled out K8s LivenessProbe because it terminates unhealthy nodes and would require custom fallback logic.
We ultimately selected a centralized active health‑check model, the same paradigm as the original component, but with a redesigned architecture.
Ideal State
Automatic failover.
Horizontal scalability.
Rapid support for flexible new requirements.
Stable iteration while adding new features.
Design & Development
Overall Architecture
The new component consists of four modules:
Dispatcher – fetches data from sources and dispatches tasks.
Prober – executes health‑check tasks.
Decider – decides whether to change health status based on results.
Performer – performs actions (e.g., updating the registry) according to decisions.
Modules expose interfaces externally while hiding internal implementations; data sources are abstracted via interfaces for easy replacement.
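A sketch of how the four modules and their boundaries might look as Go interfaces; every name and signature below is an assumption for illustration, not the component's actual API:

```go
package healthcheck

// Task identifies one address to probe; Decision records the action
// chosen for it. Both are illustrative simplifications.
type Task struct {
	Service, Cluster, Addr string
}

type Result struct {
	Task    Task
	Healthy bool
}

type Decision struct {
	Task   Task
	Remove bool // false means "restore a previously removed instance"
}

// Dispatcher pulls services, clusters, and addresses from a data
// source and emits one probe task per address.
type Dispatcher interface {
	Dispatch(out chan<- Task) error
}

// Prober executes health-check tasks and reports raw results.
type Prober interface {
	Probe(in <-chan Task, out chan<- Result)
}

// Decider turns raw results into status-change decisions, applying
// strategies such as consecutive-failure and removal thresholds.
type Decider interface {
	Decide(in <-chan Result, out chan<- Decision)
}

// Performer applies decisions, e.g., via the registry's update APIs.
type Performer interface {
	Perform(in <-chan Decision) error
}
```

Keeping the modules behind interfaces like these is what lets a data source or probe protocol be swapped without touching the pipeline.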
Service Discovery Model
A service name is unique within the company. A service may contain multiple clusters (physical or logical), each holding a set of addresses.
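In code, this model might be as simple as the following sketch (field names are assumptions):

```go
package discovery

// Service names are unique company-wide; each service owns one or
// more clusters, and each cluster holds the addresses to check.
type Service struct {
	Name     string
	Clusters []Cluster
}

type Cluster struct {
	Name      string
	Addresses []string // "ip:port" entries
}
```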
Coroutine Model Design
Go was chosen because health checks are I/O‑intensive and its goroutine scheduler fits the workload. The data source is cached in memory and refreshed every minute, while address data is fetched in real time.
Dispatcher obtains all services, then clusters, and spawns N goroutines to fetch addresses for each cluster, creating individual tasks for the Prober.
Prober maintains a queue of tasks and runs many goroutines to perform I/O‑bound health checks, forwarding results to Decider.
Decider must ensure sequential decision‑making per cluster to avoid over‑removal. It routes results to N queues based on a hash of service + cluster, guaranteeing that each cluster’s results are processed by a single goroutine.
Performer simply calls the update interface using the same goroutine that made the decision.
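A condensed sketch of the routing step between Prober and Decider, assuming a fixed number of decider queues; the hash guarantees all results for one cluster land on the same goroutine (the queue count and FNV hashing are illustrative choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const nDeciders = 8 // illustrative queue count

type Result struct {
	Service, Cluster, Addr string
	Healthy                bool
}

// routeKey maps service+cluster to a fixed decider queue so each
// cluster's results are always decided sequentially by one goroutine.
func routeKey(service, cluster string) int {
	h := fnv.New32a()
	h.Write([]byte(service + "/" + cluster))
	return int(h.Sum32() % nDeciders)
}

func main() {
	queues := make([]chan Result, nDeciders)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan Result, 128)
		wg.Add(1)
		go func(q <-chan Result) { // one decider goroutine per queue
			defer wg.Done()
			for r := range q {
				// Sequential per-cluster decision-making happens here,
				// e.g., consecutive-failure counting and removal caps.
				fmt.Println("decide:", r)
			}
		}(queues[i])
	}

	// The Prober would feed results like this:
	r := Result{Service: "svc-a", Cluster: "c1", Addr: "10.0.0.1:80"}
	queues[routeKey(r.Service, r.Cluster)] <- r

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```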
Horizontal Scaling & Automatic Failover
Each instance registers itself with a central coordinator (e.g., etcd) and watches other nodes' status. Tasks are sharded by hashing the service name; a node only processes tasks assigned to it, enabling dynamic scaling and automatic failover.
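A minimal sketch of the task-ownership check, assuming the live node list comes from a coordinator watch; the sort-and-modulo scheme is an illustrative sharding choice, not necessarily the one used:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ownsService reports whether this node should check the given
// service, given the list of live checker nodes. When nodes join or
// leave, the watch refreshes the list and tasks re-shard, which is
// what provides automatic failover.
func ownsService(service, self string, nodes []string) bool {
	sort.Strings(nodes) // all nodes must agree on the ordering
	h := fnv.New32a()
	h.Write([]byte(service))
	return nodes[int(h.Sum32())%len(nodes)] == self
}

func main() {
	nodes := []string{"checker-1", "checker-2", "checker-3"}
	fmt.Println(ownsService("svc-a", "checker-2", nodes))
}
```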
Small‑Traffic Mechanism
Two clusters are deployed: a normal cluster handling most services and a small‑traffic cluster handling less critical services for gray‑release testing. Configuration is shared and can be scoped by organization, service, cluster, and environment.
Extensibility
Pluggable Data Sources
Data sources are abstracted as read and write interfaces, allowing seamless integration of new sources.
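For example, the abstraction could be as small as a pair of interfaces (a sketch; the method names are assumptions):

```go
package source

// Reader supplies the data to be checked; Writer applies decisions
// back. Any registry or external system that implements both can be
// plugged in without touching the core pipeline.
type Reader interface {
	Services() ([]Service, error)
	Addresses(service, cluster string) ([]string, error)
}

type Writer interface {
	SetHealthy(service, cluster, addr string, healthy bool) error
}

type Service struct {
	Name     string
	Clusters []string
}
```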
Extensible Check Methods
Health checks are defined by an address and configuration. Implementations currently include TCP and HTTP; future extensions may add Dubbo, gRPC, Thrift, etc.
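A sketch of what the pluggable checker could look like, shown here with an HTTP implementation; the Checker interface and its fields are illustrative, and a TCP, Dubbo, gRPC, or Thrift checker would implement the same interface:

```go
package probe

import (
	"net/http"
	"time"
)

// Checker abstracts one check protocol behind a common interface.
type Checker interface {
	Check(addr string) bool
}

// HTTPChecker treats a 2xx response on a configured path as healthy.
type HTTPChecker struct {
	Path    string
	Timeout time.Duration
}

func (c HTTPChecker) Check(addr string) bool {
	client := http.Client{Timeout: c.Timeout}
	resp, err := client.Get("http://" + addr + c.Path)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}
```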
Filters
A responsibility‑chain pattern filters out services, clusters, or instances that should not be checked, making it easy to add or remove filtering logic.
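One way to express such a chain in Go (a sketch; the Filter signature and the sample rule are assumptions):

```go
package filter

// Filter returns true if the task should be skipped.
type Filter func(service, cluster, addr string) bool

// Chain applies filters in order; the first match short-circuits,
// so adding or removing a rule is just editing the slice.
func Chain(filters ...Filter) Filter {
	return func(service, cluster, addr string) bool {
		for _, f := range filters {
			if f(service, cluster, addr) {
				return true
			}
		}
		return false
	}
}

// Example rule: skip dependencies that should not be probed,
// such as MySQL or Redis.
func SkipDatastores(service, cluster, addr string) bool {
	return service == "mysql" || service == "redis"
}
```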
Gray Release
Two measures were taken to replace the old component smoothly:
A degradation switch configurable by organization, service, cluster, and environment, with three levels: no degradation, partial degradation (only recovery actions are applied, never removals), and full degradation (checks still run but trigger neither removal nor recovery); a sketch of how these levels gate the Performer follows this list.
Gradual migration using the small‑traffic design: gray‑released services use the new component, others continue with the old one until full migration.
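A sketch of the degradation gate described above; the level names and Decision fields are illustrative:

```go
package perform

type DegradeLevel int

const (
	NoDegrade      DegradeLevel = iota // removals and recoveries both apply
	PartialDegrade                     // recoveries apply, removals are suppressed
	FullDegrade                        // checks still run, but nothing is applied
)

type Decision struct {
	Addr   string
	Remove bool // false means "restore to healthy"
}

// allowed reports whether a decision may be acted on at this level.
func allowed(level DegradeLevel, d Decision) bool {
	switch level {
	case FullDegrade:
		return false
	case PartialDegrade:
		return !d.Remove // only recovery passes through
	default:
		return true
	}
}
```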
Pitfalls & Optimizations
During gray release, large clusters (over 1,000 nodes) caused per-cluster sequential decision processing to exceed the 3-second target for a full check round. The first optimization filtered out disabled machines at task-dispatch time, reducing unnecessary checks.
In production, disabled‑machine filtering had limited effect. The second optimization classified results: healthy results were dispatched randomly to any queue, while unhealthy results were routed deterministically (service + cluster) to ensure ordered decision‑making. This balanced queue load and preserved correctness.
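A sketch of the revised routing rule: healthy results can go to any queue, while unhealthy results stay pinned to their cluster's queue so removal decisions remain strictly ordered (the random selection and FNV hashing are illustrative choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const nQueues = 8

type Result struct {
	Service, Cluster string
	Healthy          bool
}

// pickQueue spreads healthy results randomly to balance queue load,
// but routes unhealthy results deterministically by service+cluster
// so per-cluster removal decisions keep their ordering guarantee.
func pickQueue(r Result) int {
	if r.Healthy {
		return rand.Intn(nQueues)
	}
	h := fnv.New32a()
	h.Write([]byte(r.Service + "/" + r.Cluster))
	return int(h.Sum32() % nQueues)
}

func main() {
	fmt.Println(pickQueue(Result{"svc-a", "c1", true}))
	fmt.Println(pickQueue(Result{"svc-a", "c1", false}))
}
```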
Conclusion
The article presented the background of service health checks, identified the legacy component's shortcomings, surveyed industry solutions, selected a suitable approach, and designed a more robust system, covering the full lifecycle from design through gray release. It emphasizes that system design is an exercise in trade-offs: the best solution is the one that fits the specific business context, not a blind copy of someone else's.