How Nacos Guarantees High Availability for Service Registries
This article explains how Nacos achieves high availability through client retry mechanisms, the distro consistency protocol, local cache failover, heartbeat synchronization, and robust cluster deployment strategies, providing a comprehensive guide for selecting a reliable service registry in microservice architectures.
Preface
Service registration and discovery is a long‑standing topic. In the early days of Dubbo, Zookeeper was the default registry and many equated a registry with Zookeeper. Later, Spring Cloud introduced Eureka, and Alibaba launched Nacos as a new registry.
Kirito's considerations when choosing a registry: open source, active community, strong features, stability backed by large‑scale usage, cloud‑native and security features.
This article focuses on Nacos high availability, aiming to give readers a deeper understanding of Nacos.
High Availability Overview
What does high availability mean?
System availability reaching 99.99%
Partial node failures do not affect overall operation
Server deployed as a cluster of multiple nodes
Nacos achieves high availability not only on the server side but also on the client side and through related features.
Client Retry
In a microservice architecture there are three roles: Consumer, Provider and Registry. In Nacos the Registry is the nacos‑server, while Consumer and Provider are nacos‑clients.
In production we usually deploy a Nacos cluster and configure Dubbo with the cluster address:
<dubbo:registry protocol="nacos" address="192.168.0.1:8848,192.168.0.2:8848,192.168.0.3:8848"/>
If one machine goes down, the client will retry the remaining addresses until a request succeeds.
The retry logic is implemented on the nacos‑client side.
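The idea of trying each configured address in turn can be sketched in plain Java. This is a minimal illustration under stated assumptions: the class `RegistryFailover` and the method `callWithRetry` are hypothetical names, and the sequential order is a simplification rather than a description of how nacos-client actually selects servers.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of client-side failover across registry addresses.
// The class and method names are illustrative, not nacos-client APIs.
final class RegistryFailover {

    /**
     * Tries each server address in order and returns the first successful
     * result. Throws only if every address fails.
     */
    static <T> T callWithRetry(List<String> servers, Function<String, T> request) {
        RuntimeException last = null;
        for (String server : servers) {
            try {
                return request.apply(server);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next address
            }
        }
        throw new IllegalStateException("all registry servers failed: " + servers, last);
    }

    public static void main(String[] args) {
        List<String> servers =
                List.of("192.168.0.1:8848", "192.168.0.2:8848", "192.168.0.3:8848");
        // Simulate the first machine being down.
        String result = callWithRetry(servers, addr -> {
            if (addr.startsWith("192.168.0.1")) {
                throw new RuntimeException("connection refused: " + addr);
            }
            return "registered via " + addr;
        });
        System.out.println(result); // prints "registered via 192.168.0.2:8848"
    }
}
```

The caller only sees an error when every address in the list has failed, which is why listing the whole cluster in the registry address matters.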
Consistency Protocol distro
This article does not dive into the implementation of the consistency protocol but explains its relevance to high availability. Nacos distinguishes two service types: Ephemeral and Persistent.
Ephemeral services are removed from the list after a health-check failure and are typically used for service registration and discovery.
Persistent services are marked unhealthy after a health-check failure and are commonly used in DNS scenarios.
Ephemeral services use a private protocol called distro with an AP consistency model, while Persistent services use Raft with a CP model. Nacos therefore cannot be described with a single consistency label; the model depends on the service type.
The distro protocol works as follows:
Nacos nodes synchronize all data from remote nodes on startup.
Each node can handle write requests and propagates new data to other nodes.
Each node periodically sends a checksum of its responsible data to other nodes to keep consistency.
When a node fails, its responsibilities are transferred to other nodes, preserving cluster availability.
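The "responsible data" and failover steps above can be illustrated with a small ownership function. This is only a sketch: the hash-modulo rule and the names `DistroOwnership` and `responsibleNode` are assumptions made for illustration, not a statement of how Nacos implements distro.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the hash-modulo ownership rule is an assumption
// made for this sketch, not a description of Nacos internals.
final class DistroOwnership {

    /** Picks the node responsible for a service among the healthy nodes. */
    static String responsibleNode(String serviceName, List<String> healthyNodes) {
        if (healthyNodes.isEmpty()) {
            throw new IllegalArgumentException("no healthy nodes");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(serviceName.hashCode(), healthyNodes.size());
        return healthyNodes.get(index);
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>(List.of("node-1", "node-2", "node-3"));
        String service = "com.example.OrderService";

        String owner = responsibleNode(service, nodes);
        System.out.println(service + " is owned by " + owner);

        // When the owner fails, it leaves the healthy list and the same
        // function hands the service to one of the surviving nodes.
        nodes.remove(owner);
        System.out.println("after failure, owned by " + responsibleNode(service, nodes));
    }
}
```

Because every node evaluates the same function over the same healthy-node list, ownership moves without any central coordinator, which fits the AP character described above.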
Local Cache Failover Mechanism
If the entire server side fails, Nacos still provides high availability through a local cache.
Dubbo stores a copy of service addresses in memory, which also serves as a fallback when the registry is unavailable. The client also writes a snapshot to disk at {USER_HOME}/nacos/naming/.
To load this snapshot at startup, set namingLoadCacheAtStart=true when constructing NacosNaming. Dubbo 2.7.4+ supports the parameter via dubbo.registry.address=nacos://127.0.0.1:8848?namingLoadCacheAtStart=true.
In production we recommend turning on this parameter to avoid service unavailability when the registry crashes. The failover directory contains files that can be manually edited for extreme scenarios.
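The fallback behavior can be sketched with the standard library alone. The class `SnapshotFallback`, the method `lookup`, and the one-address-per-line file format are all assumptions for illustration; the real nacos-client snapshot format and loading logic are not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: the class name and one-address-per-line file format
// are assumptions for illustration, not the nacos-client snapshot format.
final class SnapshotFallback {

    /** Asks the registry first; if it is down, falls back to the disk snapshot. */
    static List<String> lookup(Supplier<List<String>> registry, Path snapshot) {
        List<String> addresses;
        try {
            addresses = registry.get();
        } catch (RuntimeException registryDown) {
            if (!Files.exists(snapshot)) {
                throw registryDown; // nothing cached, so the failure is surfaced
            }
            try {
                return Files.readAllLines(snapshot, StandardCharsets.UTF_8);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        try {
            // Refresh the snapshot after every successful lookup.
            Files.write(snapshot, addresses, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return addresses;
    }

    public static void main(String[] args) {
        Path snapshot = Paths.get(System.getProperty("java.io.tmpdir"), "demo-naming-snapshot.txt");
        lookup(() -> List.of("10.0.0.1:20880", "10.0.0.2:20880"), snapshot);
        // Simulate a full registry outage: the cached addresses are returned.
        List<String> cached = lookup(() -> {
            throw new IllegalStateException("registry unavailable");
        }, snapshot);
        System.out.println(cached);
    }
}
```

A process that restarts during an outage has no in-memory copy, so only the on-disk snapshot can serve it; that is the case the startup parameter is meant to cover.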
Heartbeat Synchronization Service
Heartbeats are widely used in distributed systems to confirm liveness. Nacos includes full service information in each heartbeat to improve availability.
If all server nodes crash, the heartbeat can recreate services when the servers recover.
If a network partition occurs, the heartbeat can still create services, preserving basic availability.
In a test on an Alibaba Cloud MSE Nacos cluster, a service deleted through the OpenAPI was re-registered automatically within a few seconds.
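Why a heartbeat that carries full service information can restore a lost registration is easiest to see from the server's side. The sketch below is hypothetical: `HeartbeatRegistry` and its methods are illustrative names, and the in-memory map stands in for whatever storage the real server uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical server-side sketch; names are illustrative, not Nacos classes.
final class HeartbeatRegistry {
    // key: "service@address", value: timestamp of the last heartbeat
    private final Map<String, Long> lastBeat = new ConcurrentHashMap<>();

    /**
     * Records a heartbeat. Because the heartbeat carries the service name and
     * address, an unknown instance can simply be registered on the spot.
     * Returns true when the instance had to be (re)created.
     */
    boolean onHeartbeat(String serviceName, String address, long nowMillis) {
        return lastBeat.put(serviceName + "@" + address, nowMillis) == null;
    }

    /** Removes an instance, for example after a manual delete or a server crash. */
    void delete(String serviceName, String address) {
        lastBeat.remove(serviceName + "@" + address);
    }

    boolean isRegistered(String serviceName, String address) {
        return lastBeat.containsKey(serviceName + "@" + address);
    }

    public static void main(String[] args) {
        HeartbeatRegistry registry = new HeartbeatRegistry();
        registry.onHeartbeat("order-service", "10.0.0.1:20880", 1_000L);
        registry.delete("order-service", "10.0.0.1:20880");

        // The next heartbeat restores the deleted instance.
        boolean recreated = registry.onHeartbeat("order-service", "10.0.0.1:20880", 6_000L);
        System.out.println("recreated=" + recreated); // prints "recreated=true"
    }
}
```

If the heartbeat carried only an instance ID, the server could not rebuild the entry after losing its state, and the client would have to detect the loss and register again explicitly.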
Cluster Deployment Mode High Availability
Nacos high availability also depends on deployment architecture.
Node Count
For production clusters a single node is insufficient. The distro protocol works with ≥2 nodes, while Raft recommends 2n + 1 nodes. Three nodes are the minimum; five or more improve throughput and resilience.
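The 2n + 1 recommendation follows from majority arithmetic: Raft needs more than half of the nodes to agree, so a cluster of 2n + 1 nodes tolerates n failures. A small worked example, with `RaftQuorum` as a hypothetical helper name:

```java
// Worked arithmetic for majority quorums; RaftQuorum is a hypothetical helper.
final class RaftQuorum {

    /** Smallest number of nodes that forms a majority. */
    static int majority(int nodes) {
        return nodes / 2 + 1;
    }

    /** Number of nodes that can fail while a majority still survives. */
    static int tolerableFailures(int nodes) {
        return (nodes - 1) / 2;
    }

    public static void main(String[] args) {
        for (int nodes = 3; nodes <= 5; nodes++) {
            System.out.println(nodes + " nodes: majority " + majority(nodes)
                    + ", tolerates " + tolerableFailures(nodes) + " failure(s)");
        }
        // 3 nodes: majority 2, tolerates 1 failure(s)
        // 4 nodes: majority 3, tolerates 1 failure(s)
        // 5 nodes: majority 3, tolerates 2 failure(s)
    }
}
```

A fourth node raises the majority size without raising the number of tolerated failures, which is why odd cluster sizes are preferred.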
Multi‑AZ Deployment
Nodes should have low network latency and be spread across different availability zones to avoid single‑point failures.
Deployment Mode
There are two main deployment modes: ECS (a simple three-machine cluster) and Kubernetes (cloud-native and self-healing). Because Nacos is stateful, Kubernetes deployments typically use a StatefulSet and an Operator.
MSE Nacos High‑Availability Best Practices
When creating a multi‑node cluster, MSE automatically distributes nodes across different AZs, providing transparent high availability.
MSE runs Nacos on Kubernetes; if a node crashes, Kubernetes quickly recreates it, often unnoticed by users.
For example, after a pod is deleted in a three-node cluster, leader election happens automatically and the node recovers within minutes.
Conclusion
This article has summarized how Nacos ensures high availability through client retry, consistency protocols, local cache failover, heartbeat synchronization, and deployment strategies. These guarantees go beyond what Zookeeper offers, making Nacos an excellent choice when selecting a service registry.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
