
How Ant Group Scales etcd for 10k‑Node Kubernetes Clusters: High‑Availability Secrets

This article examines Ant Group's strategies for achieving high availability of the etcd key‑value store in a massive 10,000‑node Kubernetes cluster, detailing challenges, performance metrics, filesystem upgrades, tuning parameters, operational platform insights, and future directions for distributed etcd deployments.

Ant Group operates one of the world’s largest Kubernetes clusters, exceeding the officially supported limit of 5,000 nodes and reaching 10,000. This article focuses on the high‑availability construction of the etcd layer, the key‑value store that underpins the entire cluster.

Challenges

etcd is the cluster’s KV database, and nearly every control‑plane component depends on it: the kube‑apiserver fronts it as a proxy, while the kubelet, controller‑manager, and scheduler produce and consume its data through the apiserver. In a cluster of more than 7,000 nodes, KV size, event volume, and read/write pressure all explode, leading to minutes‑level latency, OOM kills, and long list‑all operations. Representative figures at this scale:

KV data > 1M entries

Event data > 100k entries

Read QPM > 300k, event read > 10k QPM

Write QPM > 200k, event write > 15k QPM

CPU usage often > 900%

Memory RSS > 60 GiB

Disk usage > 100 GiB

Goroutine count > 9k

User‑space threads > 1.6k

GC latency ~15 ms

High‑Availability Strategies

The availability of a distributed system can generally be improved in three ways:

Enhancing stability and performance of the component itself.

Fine‑grained upstream traffic management.

Ensuring downstream service SLOs.

etcd already provides solid stability; Ant Group further improves resource utilization through scale‑out and scale‑up techniques.

Filesystem Upgrade

Switching from SATA disks to NVMe SSDs raised random write speed to more than 70 MiB/s. Using tmpfs for the dedicated event etcd cluster added another ~20% performance gain. Changing the underlying filesystem from ext4 to XFS with a 16 KiB block size gave modest write improvements, while further gains now come from memory indexing.

etcd Tuning

Key tunable parameters include write batch size (default 10k) and interval (100 ms), and compaction settings (interval, sleep interval, batch limit). In large clusters the batch values should be reduced proportionally to node count, and compaction intervals can be lengthened (up to 1 hour) to avoid frequent pauses.
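The "reduce batch values proportionally to node count" rule can be sketched as a small helper. The proportional formula, the 5,000‑node reference point, and the floor of 500 are illustrative assumptions, not parameters the article specifies.

```go
package main

import "fmt"

// scaledBatchLimit shrinks the write-batch limit in proportion to cluster
// size past a reference point, so the work done in each 100 ms batch
// interval stays roughly bounded as the cluster grows.
func scaledBatchLimit(defaultLimit, referenceNodes, actualNodes int) int {
	if actualNodes <= referenceNodes {
		return defaultLimit
	}
	limit := defaultLimit * referenceNodes / actualNodes
	if limit < 500 {
		limit = 500 // keep enough batching to amortize fsync cost
	}
	return limit
}

func main() {
	// default batch size 10k, tuned here against a 5k-node reference cluster
	fmt.Println(scaledBatchLimit(10000, 5000, 10000)) // prints 5000
}
```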

Compaction can be delegated to the kube‑apiserver layer, allowing dynamic adjustment based on traffic patterns (e.g., longer intervals during low‑load periods).

Operations Platform

The platform provides analysis functions such as longest N KV, top N KV, top N namespace, verb‑resource statistics, connection counts, client source stats, and redundant data analysis. Based on these insights Ant Group performs client rate‑limiting, load balancing, cluster splitting, redundant data deletion, and fine‑grained traffic analysis.
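The "top N KV" analysis reduces to ranking keys by stored value size. A minimal sketch, assuming key sizes have already been collected into a map rather than read from a live etcd snapshot:

```go
package main

import (
	"fmt"
	"sort"
)

// entry is one key with its stored value size in bytes.
type entry struct {
	Key  string
	Size int
}

// topNBySize returns the n largest entries — the "top N KV" view an
// operations platform would surface to spot oversized objects.
func topNBySize(sizes map[string]int, n int) []entry {
	all := make([]entry, 0, len(sizes))
	for k, v := range sizes {
		all = append(all, entry{k, v})
	}
	sort.Slice(all, func(i, j int) bool {
		if all[i].Size != all[j].Size {
			return all[i].Size > all[j].Size
		}
		return all[i].Key < all[j].Key // deterministic order for ties
	})
	if len(all) > n {
		all = all[:n]
	}
	return all
}

func main() {
	sizes := map[string]int{
		"/registry/pods/prod/web-0":    48_000,
		"/registry/events/prod/web-0":  2_100,
		"/registry/configmaps/prod/cm": 910_000,
	}
	for _, e := range topNBySize(sizes, 2) {
		fmt.Println(e.Key, e.Size)
	}
}
```

The same shape extends to top‑N‑by‑namespace by aggregating sizes on the key prefix before ranking.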

Splitting the event data into a separate etcd cluster reduces KV size and external traffic, improving RT and QPS. Removing cold data that has not been accessed for weeks can cut memory key count by ~20% and halve latency.

Future Roadmap

Beyond scale‑up, Ant Group plans to explore distributed etcd clusters (scale‑out) using both proxy‑based and proxy‑less designs. Proxy‑less clusters avoid the 20‑25% performance penalty of proxy routing. Additional work includes tracking upstream etcd features, optimizing compaction algorithms, and integrating multiboltdb architectures.


Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
