Taming etcd Instability: Lessons from Managing Million‑Node Kubernetes Clusters
This article details how Tencent Cloud’s TKE team identified, analyzed, reproduced, and resolved multiple etcd stability and performance issues—including data inconsistency, memory leaks, mvcc deadlocks, and WAL crashes—while sharing the lessons learned and the optimizations applied to support million‑node Kubernetes deployments.
Background and Challenges
With the rapid growth of Tencent’s internal cloud and its public-cloud user base, the number of TKE (Tencent Kubernetes Engine) services and the CPU cores they manage surged. The various container service types (managed clusters, independent clusters, EKS, edge, mesh, serverless Knative) all rely on Kubernetes, whose core storage component is etcd. Tencent manages thousands of etcd clusters that back tens of thousands of Kubernetes clusters.
Key stability risks stem from legacy etcd architecture, performance limits in certain scenarios, insufficient test coverage, lax change management, incomplete monitoring, and lack of automated health checks.
Stability Optimization Cases
Data Inconsistency
Two critical bugs caused data inconsistency:
During etcd restart, an authorization interface replayed stale version numbers, leading to divergent data across nodes.
During version upgrade with authentication enabled, a mismatch in lease revoke permissions between v3.2 and v3.3 caused key‑count and MVCC version divergence.
Both bugs were reproduced using chaos‑monkey style fault injection, logged extensively, and fixed via PRs that were merged into etcd v3.4.9 and v3.3.22. Additional consistency alerts (revision and key‑count differences) were added.
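The consistency alerts mentioned above can be illustrated with a minimal sketch. It assumes that per-member revision and key-count values have already been collected by some monitoring job; the MemberStatus type, the find_divergence function, and the threshold parameters are hypothetical names for illustration, not part of etcd or of the team's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberStatus:
    name: str
    revision: int   # latest MVCC revision reported by this member
    key_count: int  # total number of keys reported by this member


def find_divergence(members, max_revision_lag=0, max_key_count_diff=0):
    """Return a list of alert strings; an empty list means no divergence."""
    alerts = []
    if len(members) < 2:
        return alerts
    revisions = [m.revision for m in members]
    counts = [m.key_count for m in members]
    if max(revisions) - min(revisions) > max_revision_lag:
        alerts.append(f"revision spread {min(revisions)}..{max(revisions)}")
    if max(counts) - min(counts) > max_key_count_diff:
        alerts.append(f"key-count spread {min(counts)}..{max(counts)}")
    return alerts
```

In a live cluster followers can briefly trail the leader, so a real alert would likely use non-zero thresholds and require the gap to persist across several samples before firing.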
Memory Leak (OOM)
In March, a follower node’s memory grew to 23 GB while the leader stayed at 4 GB. Investigation revealed that followers never removed lease entries from the lease heap, so the heap grew without bound; the leak affected 3.4 releases prior to the fix. The fix removed lease-heap maintenance from followers and rebuilt the heap when a node is elected leader. The change landed in etcd v3.4.6 and later.
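The shape of that fix can be sketched as follows. This is a simplified model, not etcd's lessor code: the Lessor class and its method names are hypothetical, and leases are reduced to an id and an expiry timestamp. The point is that only the leader keeps an expiry heap, and the heap is rebuilt from the lease map on promotion.

```python
import heapq


class Lessor:
    def __init__(self):
        self.leases = {}       # lease_id -> expiry time, kept on every member
        self.expiry_heap = []  # (expiry, lease_id), maintained only while leader
        self.is_leader = False

    def grant(self, lease_id, expiry):
        self.leases[lease_id] = expiry
        if self.is_leader:  # followers no longer push to the heap
            heapq.heappush(self.expiry_heap, (expiry, lease_id))

    def revoke(self, lease_id):
        # Stale heap entries are skipped lazily in expired().
        self.leases.pop(lease_id, None)

    def promote(self):
        """Called on leader election: rebuild the heap from live leases."""
        self.is_leader = True
        self.expiry_heap = [(exp, lid) for lid, exp in self.leases.items()]
        heapq.heapify(self.expiry_heap)

    def demote(self):
        self.is_leader = False
        self.expiry_heap = []

    def expired(self, now):
        """Pop and return ids of live leases whose expiry is <= now."""
        out = []
        while self.expiry_heap and self.expiry_heap[0][0] <= now:
            exp, lid = heapq.heappop(self.expiry_heap)
            if self.leases.get(lid) == exp:  # ignore revoked or renewed leases
                out.append(lid)
        return out
```

With this layout a follower's memory for leases is bounded by the number of live leases, because nothing accumulates in a heap that only the leader ever drains.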
MVCC Deadlock
During load testing, a node hung and could not recover. Tracing showed a deadlock between the snapshot-loading goroutine (holding the MVCC lock) and a background compaction goroutine that required the same lock. The issue affected etcd 3.x versions before the fix under heavy write load. A PR fixing the lock ordering was merged into v3.3.21 and v3.4.8.
WAL Crash (Panic)
A CRC-mismatch panic appeared after new WAL validation logic was introduced upstream. The mismatch occurred only after the first WAL file had been recycled, so it crashed long-running clusters. Adding proper CRC handling and tests resolved the issue in etcd v3.4.9 and v3.3.22.
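Why a check can pass on a fresh cluster and fail once the oldest file is gone is easier to see in a toy model of a chained CRC. The sketch below is not etcd's WAL format or its validation code: it assumes each file starts with a header carrying the running CRC of everything before it, uses zlib.crc32 as a stand-in checksum, and uses the hypothetical helpers write_segments and validate. A validator that always starts from zero works only while the first file still exists; one that seeds from the header of whichever file it starts at works in both cases.

```python
import zlib


def write_segments(segments, seed=0):
    """Build WAL-like files: each file is (header_crc, [(payload, crc), ...])."""
    files, running = [], seed
    for payloads in segments:
        header = running  # CRC carried over from the previous file
        records = []
        for payload in payloads:
            running = zlib.crc32(payload, running)
            records.append((payload, running))
        files.append((header, records))
    return files


def validate(files):
    """Validate a contiguous run of files that may start after older ones were removed."""
    running = None
    for header, records in files:
        if running is None:
            running = header      # seed from the first file we actually have
        elif running != header:
            return False          # chain broken between files
        for payload, crc in records:
            running = zlib.crc32(payload, running)
            if running != crc:
                return False      # record corrupted
    return True
```

For example, validate(files[1:]) succeeds on an intact chain whose first file has been dropped, while a corrupted payload in any remaining file makes it return False.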
Quota & QoS Measures
To prevent overload from expensive reads/writes (full keyspace scans, massive event listings, etc.), the team applied multiple safeguards:
Kubernetes apiserver rate limits (e.g., 100 writes/s, 200 reads/s).
Resource quotas for Pods, ConfigMaps, CRDs.
Controller-manager terminated-Pod GC thresholds (--terminated-pod-gc-threshold).
Segregating event/configmap data into separate etcd clusters.
Admission webhooks to throttle event traffic.
Dynamic TTL adjustments for events.
Prototype QoS rules based on request type, key prefix, traffic, CPU, latency.
Comprehensive multi‑dimensional alerts (traffic, memory, QPS spikes).
These measures reduced the impact of “expensive” operations and helped detect abnormal client behavior early.
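As an illustration of the per-second limits listed above, here is a minimal token-bucket sketch. It is a generic technique, not the mechanism the apiserver or the team's QoS prototype uses; the TokenBucket class is hypothetical, and the caller passes in the clock value so the behavior is deterministic.

```python
class TokenBucket:
    """Allow up to `rate` requests per second with bursts of up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A write limiter in the spirit of the 100 writes/s example would be TokenBucket(rate=100, burst=100); giving expensive requests such as full keyspace scans a larger cost is one way to weight them more heavily than point reads.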
Performance Optimization Cases
Startup Time & Key‑Count Query
When the DB size reached 4 GB with millions of keys, a node restart took up to 5 minutes, and key-count queries took about 21 seconds, long enough to time out, while consuming extra memory. Analysis showed:
Key‑count was implemented by scanning the entire B‑tree and storing revisions in a slice, causing heavy allocations.
Introducing a lightweight CountRevision eliminated the slice and cut query time from 21 s to 7 s with no extra memory.
Pushing the limit parameter down to the index layer improved limited‑record queries by orders of magnitude.
Startup profiling revealed that 9 % of the time was spent opening the backend DB (mmap), while 91 % was spent rebuilding the in‑memory B‑tree. Optimizing the consistent‑index implementation reduced total startup time from ~5 minutes to ~2 minutes 30 seconds.
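The two query-side optimizations above (counting without building a slice, and pushing the limit down to the index) can be sketched on a simplified index. This is a model under stated assumptions, not etcd's treeIndex: a sorted list stands in for the B-tree, and KeyIndex with its methods is hypothetical. The count still visits every key in range, which fits the reported 21 s to 7 s improvement rather than a constant-time answer.

```python
import bisect
from itertools import islice


class KeyIndex:
    def __init__(self, keys):
        self._keys = sorted(keys)

    def _visit(self, start, end):
        """Yield keys in [start, end) one at a time, like a tree traversal."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._keys[i]
            i += 1

    def count_via_slice(self, start, end):
        # Original approach: materialize every match, then take the length.
        return len(list(self._visit(start, end)))

    def count(self, start, end):
        # Optimized approach: same traversal, but only a counter is kept.
        total = 0
        for _ in self._visit(start, end):
            total += 1
        return total

    def range(self, start, end, limit=0):
        # Limit pushed into the traversal: stop after `limit` keys.
        it = self._visit(start, end)
        return list(islice(it, limit)) if limit > 0 else list(it)
```

With the limit applied inside the traversal, a query for a handful of keys under a prefix holding millions of entries touches only those few keys instead of walking the whole range and truncating afterwards.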
Password Authentication Performance
High concurrency turned password authentication into a bottleneck: each bcrypt password check took ~60 ms while holding a lock, so requests were serialized and latencies exceeded 5 s. By narrowing the lock scope and reducing the bcrypt cost, throughput increased from ~18 ops/s to ~202 ops/s on an 8-core machine, roughly an 11× improvement. The fix was merged into etcd v3.4.9.
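The numbers are roughly what serialization predicts: if a ~60 ms hash runs under a single lock, at most about 1 / 0.06 ≈ 17 checks can complete per second regardless of core count, which is close to the observed ~18 ops/s. The sketch below shows the lock-narrowing idea only. It is not etcd's auth store: AuthStore is a hypothetical class, and hashlib.pbkdf2_hmac stands in for bcrypt because bcrypt is not in the Python standard library.

```python
import hashlib
import hmac
import threading


class AuthStore:
    def __init__(self, iterations=10_000):
        self._lock = threading.Lock()
        self._users = {}  # name -> (salt, digest)
        self._iterations = iterations  # work factor; lower is cheaper per check

    def _hash(self, password, salt):
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, self._iterations)

    def add_user(self, name, password, salt):
        digest = self._hash(password, salt)  # hash before taking the lock
        with self._lock:
            self._users[name] = (salt, digest)

    def check_password(self, name, password):
        # Hold the lock only long enough to read the stored credentials.
        with self._lock:
            entry = self._users.get(name)
        if entry is None:
            return False
        salt, digest = entry
        # The expensive hash runs outside the lock, so checks can overlap.
        candidate = self._hash(password, salt)
        return hmac.compare_digest(candidate, digest)
```

Lowering the work factor trades some resistance to offline guessing for throughput, so it is a deliberate security decision rather than a free optimization.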
Summary
The team described how they discovered, reproduced, and solved a range of etcd stability and performance challenges in massive Kubernetes environments, contributed patches upstream, and distilled best practices for monitoring, testing, change management, and resource protection. Ongoing work includes automating safe upgrades, enhancing backup mechanisms, and further integrating with Kubernetes API Priority and Fairness for fine‑grained request throttling.
References
v3.4.9: https://github.com/etcd-io/etcd/releases/tag/v3.4.9
v3.3.22: https://github.com/etcd-io/etcd/releases/tag/v3.3.22
Kubernetes issue & PR: https://github.com/kubernetes/kubernetes/issues/91266
grpc crash issue: https://github.com/etcd-io/etcd/issues/9956
grpc crash PR: https://github.com/grpc/grpc-go/pull/2695
WAL crash issue: https://github.com/etcd-io/etcd/issues/11918
API Priority and Fairness: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md
