
Taming etcd Instability: Lessons from Managing Million‑Node Kubernetes Clusters

This article details how Tencent Cloud’s TKE team identified, analyzed, reproduced, and resolved multiple etcd stability and performance issues, including data inconsistency, memory leaks, MVCC deadlocks, and WAL crashes. It also shares the lessons learned and the optimizations applied to support million‑node Kubernetes deployments.

Cloud Native Technology Community

Background and Challenges

With the rapid growth of Tencent’s self‑built cloud and its public‑cloud user base, the number of TKE (Tencent Kubernetes Engine) services and CPU cores surged. The various container service types (managed clusters, independent clusters, EKS, edge, mesh, serverless Knative) all rely on Kubernetes, whose core storage component is etcd. Tencent manages thousands of etcd clusters that back tens of thousands of Kubernetes clusters.

Key stability risks stem from legacy etcd architecture, performance limits in certain scenarios, insufficient test coverage, lax change management, incomplete monitoring, and lack of automated health checks.

Stability Optimization Cases

Data Inconsistency

Two critical bugs caused data inconsistency:

During etcd restart, an authorization interface replayed stale version numbers, leading to divergent data across nodes.

During version upgrade with authentication enabled, a mismatch in lease revoke permissions between v3.2 and v3.3 caused key‑count and MVCC version divergence.

Both bugs were reproduced with chaos‑monkey‑style fault injection and additional logging, then fixed via PRs that were merged into etcd v3.4.9 and v3.3.22. The team also added consistency alerts that fire on revision and key‑count differences between members.
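A consistency alert of the kind mentioned above can be sketched as a comparison of per-member statistics. The following is a minimal illustration, not the team's monitoring code: `MemberStatus`, `find_inconsistencies`, and the `max_revision_lag` tolerance are hypothetical names, and the statistics are assumed to have been collected from each member by some other job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberStatus:
    """Snapshot of one etcd member, as a monitoring job might collect it."""
    name: str
    revision: int   # latest MVCC revision reported by the member
    key_count: int  # total number of keys reported by the member


def find_inconsistencies(members, max_revision_lag=0):
    """Return alert strings for a set of members that disagree.

    max_revision_lag is a hypothetical tolerance for normal replication lag;
    a production check would likely also require the difference to persist
    across several samples before alerting.
    """
    alerts = []
    if len(members) < 2:
        return alerts

    revisions = [m.revision for m in members]
    spread = max(revisions) - min(revisions)
    if spread > max_revision_lag:
        alerts.append(f"revision spread {spread} exceeds {max_revision_lag}")

    key_counts = sorted({m.key_count for m in members})
    if len(key_counts) > 1:
        alerts.append(f"key counts differ: {key_counts}")
    return alerts
```

Comparing the spread against a tolerance, rather than demanding exact equality, is one way to avoid paging on ordinary follower lag.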

Memory Leak (OOM)

In March, a follower node’s memory grew to 23 GB while the leader stayed at 4 GB. Investigation revealed that followers never removed lease entries from the lease heap, a leak that affected every 3.4 release up to that point. The fix stopped followers from maintaining the lease heap and rebuilt the heap on the new leader after an election. The change landed in etcd v3.4.6 and later.
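The shape of that fix can be sketched with a toy lease manager. This is an illustration of the idea only, not etcd's lessor: the class and method names are hypothetical, and expiry times are plain numbers.

```python
import heapq


class Lessor:
    """Toy lease manager: only the leader keeps an expiry heap."""

    def __init__(self):
        self.leases = {}       # lease_id -> expiry time (present on every member)
        self.expiry_heap = []  # (expiry, lease_id); maintained on the leader only
        self.is_leader = False

    def grant(self, lease_id, expiry):
        self.leases[lease_id] = expiry
        if self.is_leader:
            # Followers skip this push, so their heap cannot grow unbounded.
            heapq.heappush(self.expiry_heap, (expiry, lease_id))

    def revoke(self, lease_id):
        # Heap entries are dropped lazily in expired(); followers have none.
        self.leases.pop(lease_id, None)

    def promote(self):
        """On winning an election, rebuild the heap from the live leases."""
        self.is_leader = True
        self.expiry_heap = [(exp, lid) for lid, exp in self.leases.items()]
        heapq.heapify(self.expiry_heap)

    def demote(self):
        """On losing leadership, drop the heap entirely."""
        self.is_leader = False
        self.expiry_heap = []

    def expired(self, now):
        """Leader only: pop and return lease IDs whose expiry is <= now."""
        out = []
        while self.expiry_heap and self.expiry_heap[0][0] <= now:
            exp, lid = heapq.heappop(self.expiry_heap)
            # Ignore stale entries for leases that were revoked or re-granted.
            if self.leases.get(lid) == exp:
                out.append(lid)
        return out
```

Rebuilding on promotion costs one pass over the live leases per election, which is the trade made here for keeping follower memory flat.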

MVCC Deadlock

During load testing, a node hung and could not recover. Tracing showed a deadlock between the snapshot‑loading goroutine, which held the MVCC lock, and a background compaction goroutine that needed the same lock. The issue affected all etcd 3.x versions under heavy write load. A PR fixing the lock ordering was merged into v3.3.21 and v3.4.8.

WAL Crash (Panic)

A CRC‑mismatch panic appeared after new WAL validation logic was introduced upstream. The mismatch occurred only after the first WAL file had been recycled, so it crashed long‑running clusters. Adding proper CRC handling and tests resolved the issue in etcd v3.4.9 and v3.3.22.
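Why recycling the first file matters becomes clearer with a simplified model of a chained checksum. In the sketch below, each record stores a running CRC over everything written so far, and each segment starts with the running CRC at the moment it was created. The layout, the function names, and the use of `zlib.crc32` are stand-ins chosen for illustration and are not etcd's on-disk format.

```python
import zlib


def write_segments(payloads, per_segment):
    """Split payloads into segments; each one records the running CRC so far."""
    segments, running = [], 0
    for start in range(0, len(payloads), per_segment):
        segment = {"seed_crc": running, "records": []}
        for payload in payloads[start:start + per_segment]:
            running = zlib.crc32(payload, running)  # chain onto the previous CRC
            segment["records"].append((payload, running))
        segments.append(segment)
    return segments


def verify(segments, use_seed=True):
    """Check every record's CRC; return False on the first mismatch.

    With use_seed=False the reader always starts from zero, which is only
    correct when the very first segment ever written is still present.
    """
    if not segments:
        return True
    running = segments[0]["seed_crc"] if use_seed else 0
    for segment in segments:
        for payload, stored_crc in segment["records"]:
            running = zlib.crc32(payload, running)
            if running != stored_crc:
                return False
    return True
```

In this model, a validator that ignores the seed works for a fresh log and fails as soon as the oldest segment is gone, which matches the pattern of a crash that only shows up in long-running clusters.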

Quota & QoS Measures

To prevent overload from expensive reads/writes (full keyspace scans, massive event listings, etc.), the team applied multiple safeguards:

Kubernetes apiserver rate limits (e.g., 100 writes/s, 200 reads/s).

Resource quotas for Pods, ConfigMaps, CRDs.

kube‑controller‑manager garbage‑collection threshold for terminated Pods (the --terminated-pod-gc-threshold flag).

Segregating event/configmap data into separate etcd clusters.

Admission webhooks to throttle event traffic.

Dynamic TTL adjustments for events.

Prototype QoS rules based on request type, key prefix, traffic, CPU, latency.

Comprehensive multi‑dimensional alerts (traffic, memory, QPS spikes).

These measures reduced the impact of “expensive” operations and helped detect abnormal client behavior early.
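Per-class rate limits like the read and write figures in the list above are often built on a token bucket. The sketch below is a generic illustration and does not describe how the apiserver or the TKE team implemented their limits; `TokenBucket` and `make_limiters` are hypothetical names, and the burst sizes are assumptions.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def make_limiters(clock=time.monotonic):
    """Separate buckets per request class, using the example figures above."""
    return {
        "write": TokenBucket(rate=100, burst=100, clock=clock),
        "read": TokenBucket(rate=200, burst=200, clock=clock),
    }
```

A caller would look up the bucket for the request class and reject or queue the request when `allow()` returns False. Passing a larger `cost` for known expensive operations, such as full keyspace scans, is one way to extend the same structure toward the QoS rules described above.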

Performance Optimization Cases

Startup Time & Key‑Count Query

When the DB size reached 4 GB with millions of keys, a node restart took up to 5 minutes, and key‑count queries took about 21 seconds (long enough to time out) while consuming extra memory. Analysis showed:

Key‑count was implemented by scanning the entire B‑tree and storing revisions in a slice, causing heavy allocations.

Introducing a lightweight CountRevision eliminated the slice and cut query time from 21 s to 7 s with no extra memory.

Pushing the limit parameter down to the index layer improved limited‑record queries by orders of magnitude.

Startup profiling revealed that 9 % of the time was spent opening the backend DB (mmap), while 91 % was spent rebuilding the in‑memory B‑tree. Optimizing the consistent‑index implementation reduced total startup time from ~5 minutes to ~2 minutes 30 seconds.
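The key-count and limit changes described above share one idea: let the index walk decide what to keep and when to stop, instead of materializing every match first. A minimal sketch follows, with a sorted list standing in for the in-memory B-tree; `KeyIndex` and its methods are hypothetical and are not etcd's index API.

```python
from bisect import bisect_left


class KeyIndex:
    """Toy ordered key index; a sorted list stands in for the B-tree."""

    def __init__(self, keys):
        self._keys = sorted(keys)
        self.visited = 0  # instrumentation so the example can show early exit

    def visit(self, start, end, fn):
        """Call fn(key) for each key in [start, end) until fn returns False."""
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        for i in range(lo, hi):
            self.visited += 1
            if not fn(self._keys[i]):
                return

    def count(self, start, end):
        """Count matches with a plain counter; no result list is allocated."""
        total = 0

        def bump(_key):
            nonlocal total
            total += 1
            return True

        self.visit(start, end, bump)
        return total

    def range(self, start, end, limit=0):
        """Collect matches, stopping the walk once `limit` keys are found."""
        out = []

        def collect(key):
            out.append(key)
            return limit <= 0 or len(out) < limit

        self.visit(start, end, collect)
        return out
```

With the limit applied inside the walk, a query for the first three keys under a prefix touches three index entries rather than every key under that prefix.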

Password Authentication Performance

Under high concurrency, password authentication became a bottleneck: each bcrypt hash check held a lock for ~60 ms, pushing request latencies above 5 s. By narrowing the lock scope and reducing the bcrypt cost, the team raised throughput from ~18 ops/s to ~202 ops/s on an 8‑core machine, roughly an 11× improvement. The fix was merged into etcd v3.4.9.
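Narrowing the lock scope means holding the lock only long enough to read the stored credential, then doing the slow hash comparison with no lock held. The sketch below illustrates that pattern and is not etcd's auth store: `AuthStore` is a hypothetical class, and PBKDF2 from the standard library stands in for bcrypt, with the iteration count playing the role of the bcrypt cost.

```python
import hashlib
import hmac
import threading


class AuthStore:
    """Toy credential store that keeps slow hashing outside the lock."""

    def __init__(self, iterations=10_000):
        self._lock = threading.Lock()
        self._users = {}               # username -> (salt, derived_key)
        self._iterations = iterations  # stand-in for the bcrypt cost factor

    def _derive(self, password, salt):
        # Deliberately slow key derivation; PBKDF2 is a stand-in for bcrypt.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, self._iterations)

    def add_user(self, username, password, salt):
        derived = self._derive(password, salt)  # slow work, no lock held
        with self._lock:
            self._users[username] = (salt, derived)

    def check_password(self, username, password):
        # Hold the lock only for the dictionary lookup.
        with self._lock:
            record = self._users.get(username)
        if record is None:
            return False
        salt, expected = record
        # The expensive derivation runs unlocked, so callers do not serialize.
        return hmac.compare_digest(self._derive(password, salt), expected)
```

Lowering the cost factor shortens each check, while the narrower lock lets checks on different cores overlap; the two changes are independent and compound.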

Summary

The team described how they discovered, reproduced, and solved a range of etcd stability and performance challenges in massive Kubernetes environments, contributed patches upstream, and distilled best practices for monitoring, testing, change management, and resource protection. Ongoing work includes automating safe upgrades, enhancing backup mechanisms, and further integrating with Kubernetes API Priority and Fairness for fine‑grained request throttling.

References

v3.4.9: https://github.com/etcd-io/etcd/releases/tag/v3.4.9

v3.3.22: https://github.com/etcd-io/etcd/releases/tag/v3.3.22

Kubernetes issue & PR: https://github.com/kubernetes/kubernetes/issues/91266

grpc crash issue: https://github.com/etcd-io/etcd/issues/9956

grpc crash PR: https://github.com/grpc/grpc-go/pull/2695

WAL crash issue: https://github.com/etcd-io/etcd/issues/11918

API Priority and Fairness: https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md
