How a Misconfigured Nacos Cluster Cost $170 Million: A Deep P0 Incident Postmortem
A leading financial platform suffered a six‑hour outage and a $170 million loss when its Nacos service‑registry cluster entered a split‑brain state during a network partition. The incident exposed flaws in the AP‑mode deployment and gaps in monitoring that allowed failures to cascade, and it was later remediated through a Raft migration, a multi‑active architecture, and client‑side resilience.
Incident Overview
On 2024‑01‑18 at 09:17, a top‑tier financial platform experienced a severe outage when its Nacos 2.x service‑registry cluster suffered a split‑brain caused by a severed network cable. The core credit system was unavailable for over six hours, resulting in a direct economic loss of $170 million and a 13.1 % YoY revenue decline.
Technical Root Cause
1. Deployment and Protocol Issues
The Nacos cluster was deployed across three nodes in two availability zones (AZ1 with one node, AZ2 with two nodes) using the Distro protocol in AP mode. Distro favours availability and does not enforce a strict quorum, making it vulnerable to split‑brain scenarios.
When third‑party construction work accidentally severed the backbone fiber, AZ1 became isolated with a single node that automatically promoted itself to leader, while AZ2 formed a majority partition and elected another leader. Both partitions continued to accept client requests, leading to divergent service‑instance metadata.
Key configuration parameters such as nacos.core.protocol.raft.election_timeout_ms were still tuned for AP‑mode operation, and essential monitoring of leader count and partition status was missing.
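To make the quorum point concrete, here is a minimal illustrative sketch (plain Java, not Nacos code) of the majority check that a CP protocol enforces and Distro does not: with the 1/2 split described above, only the AZ2 partition would be allowed to keep a leader.
public class QuorumCheck {
    // A partition may elect or keep a leader only if it holds a strict majority of the cluster.
    static boolean hasQuorum(int nodesInPartition, int clusterSize) {
        return nodesInPartition > clusterSize / 2;
    }

    public static void main(String[] args) {
        int clusterSize = 3;
        System.out.println("AZ1 partition (1 of 3) has quorum: " + hasQuorum(1, clusterSize)); // false
        System.out.println("AZ2 partition (2 of 3) has quorum: " + hasQuorum(2, clusterSize)); // true
        // Under AP-mode Distro both partitions kept accepting writes anyway,
        // which is exactly what produced two divergent registry views.
    }
}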
2. Data Inconsistency and Service‑List Divergence
During the three‑minute split‑brain window, newly started instances registered only with their local partition, causing the partitions to hold different service lists. For example, the risk‑control service had instances 10.0.1.101 and 10.0.1.102 in the majority partition but only 10.0.1.103 in the minority partition.
Clients connected to AZ1 received stale or invalid instances, causing a 40 % request‑failure rate and triggering cascading failures across credit approval, payment routing, and risk‑control modules.
3. Avalanche Effect
The inconsistent data persisted for about five minutes, leading to four failure stages:
Direct call failures (09:17‑09:30): timeout rate rose from 0.1 % to 89 %.
Circuit‑breaker activation (09:30‑09:33): the API gateway rejected requests; payment‑routing capacity dropped 60 %.
Resource exhaustion (09:33‑09:40): retry storms, DB connection‑pool depletion, and memory leaks caused OOM crashes.
Full system avalanche (09:40‑09:50): configuration version mismatches produced false‑positive risk assessments and health‑check failures.
Root‑Cause Summary
The outage resulted from a combination of protocol misuse (AP/Distro instead of CP/Raft), unbalanced AZ node distribution, lack of cross‑AZ redundancy, missing split‑brain detection, and absent monitoring of critical metrics such as leader count and data‑sync latency.
Remediation Measures
1. Architecture Redesign – Multi‑Active Deployment
Adopt a three‑region, five‑center active‑active Nacos architecture with dedicated MySQL clusters and DRBD storage replication. Use two dedicated network links for cross‑region sync and introduce the Nacos‑Sync component for bidirectional incremental synchronization.
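As an illustration of the bidirectional sync idea (not the internals of the real Nacos‑Sync component), the sketch below uses the standard Nacos client API to subscribe to a service on one cluster and mirror its instances into the other. The service name and cluster addresses are placeholders; a second instance running in the opposite direction would complete the bidirectional loop.
import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.listener.NamingEvent;
import com.alibaba.nacos.api.naming.pojo.Instance;

public class RegistrySyncSketch {
    public static void main(String[] args) throws NacosException {
        NamingService source = NacosFactory.createNamingService("bj-cluster:8848"); // placeholder address
        NamingService target = NacosFactory.createNamingService("sh-cluster:8848"); // placeholder address

        // Mirror every change of the subscribed service onto the peer cluster.
        source.subscribe("risk-control-service", event -> {
            if (!(event instanceof NamingEvent)) return;
            NamingEvent namingEvent = (NamingEvent) event;
            for (Instance instance : namingEvent.getInstances()) {
                try {
                    target.registerInstance(namingEvent.getServiceName(), instance);
                } catch (NacosException e) {
                    // The real Nacos-Sync also handles de-registration, loop prevention, and batching.
                    System.err.println("Sync to peer cluster failed: " + e.getErrMsg());
                }
            }
        });
    }
}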
2. Protocol Upgrade to Raft (CP)
Enable Raft consensus by setting nacos.core.protocol.raft.enabled=true, increasing the election timeout to 30 s (a deliberately long value that tolerates cross‑AZ network jitter at the cost of slower failover), and enabling anti‑split‑brain detection. Raft ensures that only the majority partition can elect a leader, preventing dual‑leader scenarios.
# Nacos Raft core configuration
nacos.core.protocol.raft.enabled=true
nacos.core.protocol.raft.election_timeout_ms=30000
nacos.core.protocol.raft.heartbeat_interval_ms=5000
nacos.core.protocol.raft.anti_split_brain_enabled=true
3. Client‑Side Resilience
Implement dual‑registry writes to primary and backup clusters, a local cache with persistent disk fallback, and multi‑level circuit breakers (service‑level, component‑level, global‑level). Use static‑route “escape pods” when registration centers are unavailable; a sketch of this fallback follows the cache example below.
public class DualRegistry implements ServiceRegistry {
    // imports from com.alibaba.nacos.api.* and org.slf4j.* omitted
    private static final Logger log = LoggerFactory.getLogger(DualRegistry.class);
    private final List<NamingService> clusters;

    public DualRegistry() throws NacosException {
        // createNamingService throws a checked NacosException, so it cannot run in a field initializer
        clusters = List.of(
            NacosFactory.createNamingService("bj-cluster:8848"),
            NacosFactory.createNamingService("sh-cluster:8848"));
    }

    public void register(Instance instance) {
        clusters.forEach(ns -> {
            try { ns.registerInstance(instance.getServiceName(), instance); }
            catch (NacosException e) { log.warn("Registration on one cluster failed", e); }
        });
    }
}
The local‑cache implementation stores service lists in memory and persists them to disk, falling back to the cache when Nacos is unreachable.
public class LocalCacheServiceRegistry implements ServiceRegistry {
    private static final Logger log = LoggerFactory.getLogger(LocalCacheServiceRegistry.class);
    private final ConcurrentMap<String, List<ServiceInstance>> serviceCache = new ConcurrentHashMap<>();
    private final ServiceRegistry delegate;

    public LocalCacheServiceRegistry(ServiceRegistry delegate) { this.delegate = delegate; }

    public List<ServiceInstance> getInstances(String serviceId) {
        try { List<ServiceInstance> instances = delegate.getInstances(serviceId); updateCache(serviceId, instances); return instances; }
        catch (Exception e) { log.warn("Nacos unreachable, falling back to local cache", e); return getFromCache(serviceId); }
    }
    // updateCache and getFromCache (in-memory map plus disk persistence) omitted for brevity
}
4. Operations Enhancements
Upgrade monitoring to cover leader heartbeat, partition status, and write‑quorum metrics. Extend network timeouts, shorten heartbeat intervals, and tune DNS caching. Institutionalise chaos‑engineering drills that simulate network partitions, node failures, and registry outages.
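A sketch of the split‑brain alarm that was missing before the incident: poll every registry node for its self‑reported leader status and page the on‑call unless exactly one node claims leadership. The probe path and response shape are placeholders, since the exact leader metric exposed depends on the Nacos version and monitoring stack.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

public class LeaderCountProbe {
    private static final List<String> NODES = List.of(
        "http://az1-nacos-1:8848", "http://az2-nacos-1:8848", "http://az2-nacos-2:8848");
    private static final String LEADER_PROBE_PATH = "/internal/leader-state"; // hypothetical endpoint

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        int leaders = 0;
        for (String node : NODES) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(node + LEADER_PROBE_PATH)).GET().build();
            try {
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                if (body.contains("\"leader\":true")) leaders++; // assumed response shape
            } catch (Exception e) {
                System.err.println("Node unreachable: " + node); // an unreachable node may itself indicate a partition
            }
        }
        if (leaders != 1) {
            // 0 leaders: election stuck; 2+ leaders: split-brain. Either way, alert immediately.
            System.err.println("ALERT: leader count is " + leaders + ", expected exactly 1");
        }
    }
}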
# Example chaos‑blade script for quarterly disaster‑recovery drill
chaosblade create network partition --time 180 --interface eth0 \
--local-port 8848 --remote-port 7848 --percent 100
stress-ng --cpu 8 --timeout 300s &
monitor collect --duration 5m --output chaos-report-$(date +%Y%m%d).json
Industry Impact
The incident prompted Alibaba Cloud to release an anti‑split‑brain patch for Nacos and the China Banking Association to publish a technical specification for registration‑center deployment, while the platform open‑sourced its local‑cache component as a reference implementation.
Takeaways
Choosing the appropriate consistency protocol, balancing AZ node distribution, implementing robust monitoring, and preparing automated failover are essential for building financially‑critical, high‑availability microservice systems.