Re‑engineering a Scalable Service Health‑Check System for Cloud‑Native Ops
This article details the redesign of a service health‑check component, covering its original limitations, industry alternatives, the chosen centralized active checking approach, architectural modules, concurrency model, scaling mechanisms, gray‑release strategy, and performance optimizations for reliable distributed systems.
Background
Service Health Check Overview
Service health checks address unhealthy nodes in distributed applications. Consumers obtain provider addresses from a registry and select a machine via load balancing. If a machine goes down unnoticed, the consumer keeps routing traffic to it and requests are lost; a detection mechanism that promptly removes unhealthy nodes therefore significantly improves online service stability.
Original Implementation
The original health‑check component, part of a self‑developed registry, followed three stages:
Obtain instances (IP and port) to be checked from the registry.
Initiate a TCP connection to each address; a successful connection indicates health.
Remove unhealthy instances and restore previously unhealthy but now healthy ones via registry APIs.
The process also handles details such as excluding services that should not be probed (e.g., MySQL, Redis), requiring multiple consecutive failures before a node is marked unhealthy, and enforcing a removal threshold so an entire cluster is never taken offline at once.
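A minimal sketch of the original check loop, assuming a plain TCP dial as the health signal; probeTCP, the timeout value, and the sample addresses are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// probeTCP treats a successful TCP connection as "healthy",
// mirroring the original component's port-connectivity signal.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

func main() {
	// In the real component these addresses come from the registry.
	addrs := []string{"10.0.0.1:8080", "10.0.0.2:8080"}
	for _, addr := range addrs {
		fmt.Printf("%s healthy=%v\n", addr, probeTCP(addr, 3*time.Second))
	}
}
```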
Problems with the Original Component
1. Capacity Issue
The component was originally designed to run on a single physical machine with few instances to check. As the number of instances grew, the single-machine design became a bottleneck, which was worked around by manually sharding instances across machines via configuration.
2. Disaster‑Recovery Issue
A single‑machine deployment means that if a node crashes, its health‑check tasks fail, causing a loss of monitoring for the instances it was responsible for.
3. Deployment Efficiency Issue
Configuration changes are required for scaling or hardware replacement, leading to low operational efficiency and high error risk.
4. New‑Requirement Support Efficiency Issue
Emerging cloud‑native requirements (e.g., richer health semantics beyond simple port connectivity) and the need to reuse the component for services outside the registry created a conflict with the legacy design.
5. Stability During Iteration Issue
The original component lacked a gray‑release mechanism; new features were deployed all at once, risking large‑scale failures.
Technical Investigation
Industry Health‑Check Solutions
We examined both registry‑center and non‑registry health‑check approaches.
Registry‑Center Health Checks
Typical solutions include SDK heartbeat reporting, SDK long‑connection with heartbeat, and centralized active health checks. Each has trade‑offs such as resource consumption, implementation complexity, or centralized load pressure.
Non‑Registry Health Checks – K8s LivenessProbe
Kubernetes provides native, distributed health checks: the kubelet on each node probes the containers running locally. Compared with centralized checks, these probes support multiple protocols (TCP, HTTP, exec, gRPC) and restart unhealthy containers, but they offer no fallback strategy and are not directly applicable to our mixed physical-machine environment.
Choosing a Solution for Our Company
Given a heterogeneous stack (Java, Go, PHP, C++), we favored a thin-SDK approach to keep per-language integration costs low. Consequently, we excluded the SDK long-connection + heartbeat scheme, and also ruled out K8s LivenessProbe because it terminates unhealthy nodes and would require custom fallback logic.
We ultimately selected a centralized active health‑check model, the same paradigm as the original component, but with a redesigned architecture.
Ideal State
Automatic failover.
Horizontal scalability.
Rapid support for flexible new requirements.
Stable iteration while adding new features.
Design & Development
Overall Architecture
The new component consists of four modules:
Dispatcher – fetches data from sources and dispatches tasks.
Prober – executes health‑check tasks.
Decider – decides whether to change health status based on results.
Performer – performs actions (e.g., updating the registry) according to decisions.
Modules expose interfaces externally while hiding internal implementations; data sources are abstracted via interfaces for easy replacement.
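A sketch of how the four modules and their boundaries might look as Go interfaces; every name and signature below is an assumption for illustration, not the component's actual API:

```go
package healthcheck

// Task identifies one address to probe; Decision records the action
// chosen for it. Both are illustrative simplifications.
type Task struct {
	Service, Cluster, Addr string
}

type Result struct {
	Task    Task
	Healthy bool
}

type Decision struct {
	Task   Task
	Remove bool // false means "restore a previously removed instance"
}

// Dispatcher pulls services, clusters, and addresses from a data
// source and emits one probe task per address.
type Dispatcher interface {
	Dispatch(out chan<- Task) error
}

// Prober executes health-check tasks and reports raw results.
type Prober interface {
	Probe(in <-chan Task, out chan<- Result)
}

// Decider turns raw results into status-change decisions, applying
// strategies such as consecutive-failure and removal thresholds.
type Decider interface {
	Decide(in <-chan Result, out chan<- Decision)
}

// Performer applies decisions, e.g., via the registry's update APIs.
type Performer interface {
	Perform(in <-chan Decision) error
}
```

Keeping the modules behind interfaces like these is what lets a data source or probe protocol be swapped without touching the pipeline.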
Service Discovery Model
A service name is unique within the company. A service may contain multiple clusters (physical or logical), each holding a set of addresses.
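In code, this model might be as simple as the following sketch (field names are assumptions):

```go
package discovery

// Service names are unique company-wide; each service owns one or
// more clusters, and each cluster holds the addresses to check.
type Service struct {
	Name     string
	Clusters []Cluster
}

type Cluster struct {
	Name      string
	Addresses []string // "ip:port" entries
}
```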
Coroutine Model Design
Go was chosen because health checks are I/O‑intensive and its goroutine scheduler fits the workload. The data source is cached in memory and refreshed every minute, while address data is fetched in real time.
Dispatcher obtains all services, then clusters, and spawns N goroutines to fetch addresses for each cluster, creating individual tasks for the Prober.
Prober maintains a queue of tasks and runs many goroutines to perform I/O‑bound health checks, forwarding results to Decider.
Decider must ensure sequential decision‑making per cluster to avoid over‑removal. It routes results to N queues based on a hash of service + cluster, guaranteeing that each cluster’s results are processed by a single goroutine.
Performer simply calls the update interface using the same goroutine that made the decision.
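A condensed sketch of the routing step between Prober and Decider, assuming a fixed number of decider queues; the hash guarantees all results for one cluster land on the same goroutine (the queue count and FNV hashing are illustrative choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const nDeciders = 8 // illustrative queue count

type Result struct {
	Service, Cluster, Addr string
	Healthy                bool
}

// routeKey maps service+cluster to a fixed decider queue so each
// cluster's results are always decided sequentially by one goroutine.
func routeKey(service, cluster string) int {
	h := fnv.New32a()
	h.Write([]byte(service + "/" + cluster))
	return int(h.Sum32() % nDeciders)
}

func main() {
	queues := make([]chan Result, nDeciders)
	var wg sync.WaitGroup
	for i := range queues {
		queues[i] = make(chan Result, 128)
		wg.Add(1)
		go func(q <-chan Result) { // one decider goroutine per queue
			defer wg.Done()
			for r := range q {
				// Sequential per-cluster decision-making happens here,
				// e.g., consecutive-failure counting and removal caps.
				fmt.Println("decide:", r)
			}
		}(queues[i])
	}

	// The Prober would feed results like this:
	r := Result{Service: "svc-a", Cluster: "c1", Addr: "10.0.0.1:80"}
	queues[routeKey(r.Service, r.Cluster)] <- r

	for _, q := range queues {
		close(q)
	}
	wg.Wait()
}
```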
Horizontal Scaling & Automatic Failover
Each instance registers itself with a central coordinator (e.g., etcd) and watches other nodes' status. Tasks are sharded by hashing the service name; a node only processes tasks assigned to it, enabling dynamic scaling and automatic failover.
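A minimal sketch of the task-ownership check, assuming the live node list comes from a coordinator watch; the sort-and-modulo scheme is an illustrative sharding choice, not necessarily the one used:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ownsService reports whether this node should check the given
// service, given the list of live checker nodes. When nodes join or
// leave, the watch refreshes the list and tasks re-shard, which is
// what provides automatic failover.
func ownsService(service, self string, nodes []string) bool {
	sort.Strings(nodes) // all nodes must agree on the ordering
	h := fnv.New32a()
	h.Write([]byte(service))
	return nodes[int(h.Sum32())%len(nodes)] == self
}

func main() {
	nodes := []string{"checker-1", "checker-2", "checker-3"}
	fmt.Println(ownsService("svc-a", "checker-2", nodes))
}
```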
Small‑Traffic Mechanism
Two clusters are deployed: a normal cluster handling most services and a small‑traffic cluster handling less critical services for gray‑release testing. Configuration is shared and can be scoped by organization, service, cluster, and environment.
Extensibility
Pluggable Data Sources
Data sources are abstracted as read and write interfaces, allowing seamless integration of new sources.
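For example, the abstraction could be as small as a pair of interfaces (a sketch; the method names are assumptions):

```go
package source

// Reader supplies the data to be checked; Writer applies decisions
// back. Any registry or external system that implements both can be
// plugged in without touching the core pipeline.
type Reader interface {
	Services() ([]Service, error)
	Addresses(service, cluster string) ([]string, error)
}

type Writer interface {
	SetHealthy(service, cluster, addr string, healthy bool) error
}

type Service struct {
	Name     string
	Clusters []string
}
```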
Extensible Check Methods
Health checks are defined by an address and configuration. Implementations currently include TCP and HTTP; future extensions may add Dubbo, gRPC, Thrift, etc.
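A sketch of what the pluggable checker could look like, shown here with an HTTP implementation; the Checker interface and its fields are illustrative, and a TCP, Dubbo, gRPC, or Thrift checker would implement the same interface:

```go
package probe

import (
	"net/http"
	"time"
)

// Checker abstracts one check protocol behind a common interface.
type Checker interface {
	Check(addr string) bool
}

// HTTPChecker treats a 2xx response on a configured path as healthy.
type HTTPChecker struct {
	Path    string
	Timeout time.Duration
}

func (c HTTPChecker) Check(addr string) bool {
	client := http.Client{Timeout: c.Timeout}
	resp, err := client.Get("http://" + addr + c.Path)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}
```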
Filters
A responsibility‑chain pattern filters out services, clusters, or instances that should not be checked, making it easy to add or remove filtering logic.
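One way to express such a chain in Go (a sketch; the Filter signature and the sample rule are assumptions):

```go
package filter

// Filter returns true if the task should be skipped.
type Filter func(service, cluster, addr string) bool

// Chain applies filters in order; the first match short-circuits,
// so adding or removing a rule is just editing the slice.
func Chain(filters ...Filter) Filter {
	return func(service, cluster, addr string) bool {
		for _, f := range filters {
			if f(service, cluster, addr) {
				return true
			}
		}
		return false
	}
}

// Example rule: skip dependencies that should not be probed,
// such as MySQL or Redis.
func SkipDatastores(service, cluster, addr string) bool {
	return service == "mysql" || service == "redis"
}
```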
Gray Release
Two measures were taken to replace the old component smoothly:
A degradation switch configurable by organization, service, cluster, and environment, with three levels: no degradation, partial degradation (only recovery actions are applied, never removals), and full degradation (checks still run but trigger neither removal nor recovery); a sketch of how these levels gate the Performer follows this list.
Gradual migration using the small‑traffic design: gray‑released services use the new component, others continue with the old one until full migration.
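A sketch of the degradation gate described above; the level names and Decision fields are illustrative:

```go
package perform

type DegradeLevel int

const (
	NoDegrade      DegradeLevel = iota // removals and recoveries both apply
	PartialDegrade                     // recoveries apply, removals are suppressed
	FullDegrade                        // checks still run, but nothing is applied
)

type Decision struct {
	Addr   string
	Remove bool // false means "restore to healthy"
}

// allowed reports whether a decision may be acted on at this level.
func allowed(level DegradeLevel, d Decision) bool {
	switch level {
	case FullDegrade:
		return false
	case PartialDegrade:
		return !d.Remove // only recovery passes through
	default:
		return true
	}
}
```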
Pitfalls & Optimizations
During gray release, large clusters (over 1,000 nodes) caused per-cluster sequential decision processing to exceed the 3-second target for a full check round. The first optimization filtered out disabled machines at task-dispatch time, reducing unnecessary checks.
In production, disabled‑machine filtering had limited effect. The second optimization classified results: healthy results were dispatched randomly to any queue, while unhealthy results were routed deterministically (service + cluster) to ensure ordered decision‑making. This balanced queue load and preserved correctness.
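A sketch of the revised routing rule: healthy results can go to any queue, while unhealthy results stay pinned to their cluster's queue so removal decisions remain strictly ordered (the random selection and FNV hashing are illustrative choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const nQueues = 8

type Result struct {
	Service, Cluster string
	Healthy          bool
}

// pickQueue spreads healthy results randomly to balance queue load,
// but routes unhealthy results deterministically by service+cluster
// so per-cluster removal decisions keep their ordering guarantee.
func pickQueue(r Result) int {
	if r.Healthy {
		return rand.Intn(nQueues)
	}
	h := fnv.New32a()
	h.Write([]byte(r.Service + "/" + r.Cluster))
	return int(h.Sum32() % nQueues)
}

func main() {
	fmt.Println(pickQueue(Result{"svc-a", "c1", true}))
	fmt.Println(pickQueue(Result{"svc-a", "c1", false}))
}
```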
Conclusion
The article presented the background of service health checks, identified the legacy component's shortcomings, surveyed industry solutions, selected a suitable approach, and designed a more robust system, covering the full lifecycle from design through gray release. It emphasizes that system design is an exercise in trade-offs: the best solution is the one that fits the specific business context, not a blind copy of someone else's.