How Alibaba Scaled Kubernetes to 10,000 Nodes: Key Optimizations and Lessons
This article describes Alibaba's experience running Kubernetes at very large scale: the performance bottlenecks encountered in etcd, the API Server, controllers, and the scheduler, and the concrete engineering improvements (storage sharding, lease-based heartbeats, load balancing, watch bookmarks, and hot-standby controllers) that enabled stable operation of clusters with tens of thousands of nodes.
Background
Alibaba migrated its production workloads to Kubernetes in 2018, eventually operating clusters with over 10,000 nodes and millions of containers to support major e‑commerce events such as the 2019 Tmall 618 promotion. The article focuses on the technical challenges of scaling Kubernetes and the engineering solutions applied.
Scale Estimation and Simulation
To anticipate bottlenecks, Alibaba estimated a 10k‑node cluster would host roughly 200 k pods and 1 M objects. Using Kubemark, they built a simulation platform where 200 containers each ran 50 Kubemark processes, emulating 10k kubelets. The simulation revealed pod‑scheduling latencies of up to 10 seconds and overall instability.
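The sizing figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constants come from the paragraph above, while the per-node averages are derived values, not numbers the article states.

```python
# Figures quoted in the article.
NODES = 10_000
PODS = 200_000
OBJECTS = 1_000_000
CONTAINERS = 200
KUBEMARK_PER_CONTAINER = 50

# Each Kubemark process emulates one kubelet.
simulated_kubelets = CONTAINERS * KUBEMARK_PER_CONTAINER

# Derived averages (not stated in the article).
pods_per_node = PODS / NODES
objects_per_node = OBJECTS / NODES

print(simulated_kubelets, pods_per_node, objects_per_node)
```

The product of containers and Kubemark processes matches the 10k-node target, and the estimate works out to an average of 20 pods per node.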
Observed Component Bottlenecks at 10k Nodes
etcd suffered severe read/write latency, frequent OOM, and storage limits.
API Server queries for pods/nodes incurred high latency and could cause etcd OOM.
Controllers experienced delayed state perception and minutes‑long recovery after crashes.
Scheduler latency and throughput were insufficient for peak traffic.
etcd Improvements
Three major versions of enhancements were introduced:
Version 1 moved etcd data to a Tair cluster, increasing capacity but adding operational complexity and weaker consistency.
Version 2 partitioned objects across multiple etcd clusters, reducing per‑cluster data volume.
Version 3 redesigned the bbolt page allocation algorithm using a segregated hashmap, achieving O(1) free‑page lookup and enabling etcd storage growth from the recommended 2 GB to 100 GB without noticeable latency increase. Additional features such as raft learners and fully concurrent reads were contributed upstream and landed in etcd 3.4.
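The idea behind a segregated free-page index can be sketched as follows. This is a simplified illustration, not bbolt's actual code: the class name, method names, and the fallback policy for splitting a larger span are assumptions made for the example. Free spans are bucketed by length, so an exact-size request is a dictionary lookup rather than a scan over every free page.

```python
from collections import defaultdict


class SegregatedFreelist:
    """Free page spans bucketed by span length (illustrative sketch)."""

    def __init__(self):
        self.by_size = defaultdict(set)  # span length -> start page ids
        self.forward = {}                # start page id -> span length
        self.backward = {}               # end page id -> span length

    def _add_span(self, start, size):
        self.by_size[size].add(start)
        self.forward[start] = size
        self.backward[start + size - 1] = size

    def _del_span(self, start, size):
        self.by_size[size].discard(start)
        if not self.by_size[size]:
            del self.by_size[size]
        del self.forward[start]
        del self.backward[start + size - 1]

    def free(self, start, size):
        """Return a span to the freelist, merging with adjacent free spans."""
        prev_size = self.backward.get(start - 1)
        if prev_size is not None:
            prev_start = start - prev_size
            self._del_span(prev_start, prev_size)
            start, size = prev_start, size + prev_size
        next_size = self.forward.get(start + size)
        if next_size is not None:
            self._del_span(start + size, next_size)
            size += next_size
        self._add_span(start, size)

    def allocate(self, n):
        """Return the start page of an n-page span, or None if none fits."""
        bucket = self.by_size.get(n)
        if bucket:
            # Exact fit: a constant-time dictionary lookup.
            start = next(iter(bucket))
            self._del_span(start, n)
            return start
        # Otherwise split a larger span. This loop walks the distinct
        # span sizes, not the individual free pages.
        for size in list(self.by_size):
            if size > n:
                start = next(iter(self.by_size[size]))
                self._del_span(start, size)
                self._add_span(start + n, size - n)
                return start
        return None


fl = SegregatedFreelist()
fl.free(10, 5)            # pages 10..14 are free
fl.free(20, 3)            # pages 20..22 are free
print(fl.allocate(3))     # exact-size bucket hit
```

The forward and backward maps let a freed span merge with its neighbors in constant time, which keeps fragmentation in check without a linear pass over the freelist.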
API Server Improvements
Efficient Node Heartbeats
Kubelet originally sent a 15 KB heartbeat every 10 seconds, generating ~1 GB/min transaction logs in etcd and consuming >80% of API Server CPU. Alibaba adopted the built‑in Lease API to decouple heartbeat state from the node object, updating a lightweight Lease every 10 seconds and the node object only every 60 seconds. This reduced CPU load and transaction log volume, and the feature became default in Kubernetes 1.14.
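A rough payload-only estimate shows why splitting the heartbeat helps. The node count, 15 KB status size, and the 10-second and 60-second periods come from the text above; the 300-byte Lease size is an assumption for illustration, and real etcd transaction logs carry extra overhead that this sketch ignores.

```python
NODES = 10_000
HEARTBEAT_PERIOD_S = 10
NODE_UPDATE_PERIOD_S = 60
NODE_STATUS_BYTES = 15_000  # ~15 KB, from the article
LEASE_BYTES = 300           # assumption: a Lease object is a few hundred bytes


def mb_per_minute(payload_bytes, period_s, nodes=NODES):
    """Total write payload per minute, in MB, across all nodes."""
    writes_per_minute = 60 / period_s
    return nodes * payload_bytes * writes_per_minute / 1_000_000


# Before: the full node status is written on every heartbeat.
before = mb_per_minute(NODE_STATUS_BYTES, HEARTBEAT_PERIOD_S)

# After: a small Lease every 10 s plus the full node object every 60 s.
after = (mb_per_minute(LEASE_BYTES, HEARTBEAT_PERIOD_S)
         + mb_per_minute(NODE_STATUS_BYTES, NODE_UPDATE_PERIOD_S))

print(f"before: {before:.0f} MB/min, after: {after:.0f} MB/min")
```

The "before" figure comes out at about 900 MB/min, which is consistent with the roughly 1 GB/min of transaction logs reported above. The "after" figure depends on the assumed Lease size and is dominated by the less frequent node object update.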
API Server Load Balancing
During upgrades or node failures, traffic could concentrate on a single API Server, causing CPU spikes. Alibaba experimented with two load-balancing patterns (a load balancer on the API Server side vs. on the kubelet side) and found that neither fully solved the issue. Instead, they added server-side overload protection: when CPU usage exceeds a threshold, the API Server returns HTTP 429 Too Many Requests and, if the overload persists, closes client connections. Clients back off when they receive 429 or periodically rebuild their connections, and API Server upgrades use maxSurge=3 to smooth out performance during rollout.
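The overload-protection handshake can be sketched as a pair of functions: a server-side admission check and a client-side retry loop. Everything here is hypothetical illustration, not Alibaba's or Kubernetes' code: the 0.8 CPU threshold, the function names, and the exponential backoff with jitter are all assumptions.

```python
import random
import time

CPU_THRESHOLD = 0.8  # hypothetical threshold; the article does not give a value


def admit(cpu_load, threshold=CPU_THRESHOLD):
    """Server side: reject with 429 while the CPU is above the threshold."""
    return 429 if cpu_load > threshold else 200


def call_with_retry(send, max_attempts=5, base=0.5, cap=8.0,
                    sleep=time.sleep, rng=random):
    """Client side: retry on 429 with capped exponential backoff and jitter.

    Returns (final_status, attempts_used).
    """
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status, attempt + 1
        # Jitter spreads retries out so that clients do not return in lockstep.
        sleep(rng.uniform(0, min(cap, base * 2 ** attempt)))
    return 429, max_attempts


# Usage: a fake server whose load drops over successive requests.
loads = iter([0.95, 0.90, 0.50])
result = call_with_retry(lambda: admit(next(loads)), sleep=lambda s: None)
print(result)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a real client would use the default.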
List‑Watch & Bookmark
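The List-Watch mechanism uses a global resourceVersion to track changes. The API Server keeps only a bounded window of recent events, so a client whose resourceVersion has fallen behind that window receives a "too old resource version" error when it reconnects and must relist. This happens easily when the watched resource type rarely changes while other types advance the global version. Alibaba introduced the Watch bookmark, an event that periodically advances the client's resourceVersion even when none of its watched objects have changed. This reduces unnecessary relists and cut the time clients need to resync after an API Server restart from minutes to a few seconds. The feature shipped in Kubernetes 1.15.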
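A toy simulation makes the effect of bookmarks concrete. This is a sketch under simplifying assumptions: the class names, the 100-event window, and the bookmark interval are invented for the example, and the real API Server watch cache is considerably more involved.

```python
class WatchCache:
    """Server side: keeps only the most recent `capacity` events."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []  # (resourceVersion, kind)
        self.rv = 0       # global resourceVersion

    def write(self, kind):
        self.rv += 1
        self.events.append((self.rv, kind))
        del self.events[:-self.capacity]  # drop entries outside the window

    def can_resume(self, client_rv):
        # Every event after client_rv must still be in the window.
        return not self.events or client_rv >= self.events[0][0] - 1


class Reflector:
    """Client side: remembers the last resourceVersion it has seen."""

    def __init__(self):
        self.rv = 0
        self.relists = 0

    def on_bookmark(self, rv):
        self.rv = rv  # no object data, just a newer version

    def reconnect(self, cache):
        if not cache.can_resume(self.rv):
            self.relists += 1  # "too old resource version": full relist
            self.rv = cache.rv


def simulate(bookmarks):
    cache = WatchCache(capacity=100)
    pods = Reflector()
    cache.write("pod")
    pods.rv = cache.rv             # the pod watcher is up to date
    for i in range(1000):
        cache.write("node")        # unrelated churn advances the global version
        if bookmarks and i % 50 == 0:
            pods.on_bookmark(cache.rv)
    pods.reconnect(cache)          # e.g. after an API Server restart
    return pods.relists


print(simulate(bookmarks=False), simulate(bookmarks=True))
```

Without bookmarks the pod watcher's version falls out of the retained window and it has to relist; with bookmarks it can resume the watch from where it left off.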
Cacher & Indexing
Direct queries to the API Server suffered from lack of indexing and large etcd reads. Alibaba designed a cache‑coordinated workflow: the API Server first obtains the current etcd version, waits for its Reflector to catch up, then serves the request from cache. By adding namespace‑node‑name indexes, describe‑node latency dropped from ~5 seconds to 0.3 seconds at ten‑thousand‑node scale, and other get‑operations saw order‑of‑magnitude speedups.
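The read path described above can be sketched in three steps: read the current storage revision, wait for the cache to catch up, then answer from an index. The classes below are hypothetical stand-ins (`Store` for etcd, `IndexedCache` for the API Server cache); the sketch handles only pod creation and drives the sync inline, whereas a real server would block until its watch delivers the events.

```python
from collections import defaultdict


class Store:
    """Stand-in for etcd: an append-only log with a revision counter."""

    def __init__(self):
        self.revision = 0
        self.log = []

    def put(self, pod):
        self.revision += 1
        self.log.append((self.revision, pod))


class IndexedCache:
    """Stand-in for the API Server cache with a node-name index."""

    def __init__(self, store):
        self.store = store
        self.synced_revision = 0
        self.pods_by_node = defaultdict(dict)  # node -> {(ns, name): pod}

    def sync_once(self):
        """Apply the next pending change, as the Reflector's watch would."""
        if self.synced_revision < self.store.revision:
            rev, pod = self.store.log[self.synced_revision]
            key = (pod["namespace"], pod["name"])
            self.pods_by_node[pod["node"]][key] = pod
            self.synced_revision = rev

    def pods_on_node(self, node):
        target = self.store.revision           # 1. read only the revision
        while self.synced_revision < target:   # 2. wait for the cache
            self.sync_once()
        # 3. indexed lookup instead of scanning every pod
        return list(self.pods_by_node.get(node, {}).values())


store = Store()
store.put({"namespace": "default", "name": "a", "node": "node-1"})
store.put({"namespace": "default", "name": "b", "node": "node-2"})
store.put({"namespace": "default", "name": "c", "node": "node-1"})
cache = IndexedCache(store)
print(sorted(p["name"] for p in cache.pods_on_node("node-1")))
```

Waiting for the cache to reach the revision read in step 1 is what lets the response reflect every write committed before the request, while the expensive data read never touches the backing store.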
Controller Failover
Controllers handling millions of objects required minutes to recover after a restart, because a new instance had to rebuild its informer caches from scratch. Alibaba implemented a hot-standby controller whose informers pre-load data before it becomes leader, and the active controller voluntarily releases its leader lease during upgrades so the standby can take over immediately. This reduced controller downtime during upgrades to under 2 seconds. On an unexpected crash, the standby only has to wait for the leader lease to expire (15 seconds) before taking over. The same mechanism also benefits the scheduler.
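The difference between a voluntary release and a crash can be sketched with a minimal lease lock. The class, the function, and the 0.5-second retry interval are assumptions made for illustration; only the 15-second lease duration comes from the text. Time is passed in explicitly so the sketch runs without waiting.

```python
class LeaseLock:
    """A leader lock that expires unless its holder renews it."""

    def __init__(self, duration=15.0):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, who, now):
        """Acquire or renew the lease; fails while another holder is live."""
        if self.holder in (None, who) or now >= self.expires_at:
            self.holder = who
            self.expires_at = now + self.duration
            return True
        return False

    def release(self, who):
        """Voluntary release, as done by the active controller on upgrade."""
        if self.holder == who:
            self.holder = None


def takeover_delay(lock, standby, start, graceful, step=0.5):
    """Seconds until `standby` becomes leader after the active stops at `start`."""
    if graceful:
        lock.release(lock.holder)
    now = start
    while not lock.try_acquire(standby, now):
        now += step  # the standby retries periodically
    return now - start


upgrade = LeaseLock()
upgrade.try_acquire("active", now=0.0)
crash = LeaseLock()
crash.try_acquire("active", now=0.0)

print(takeover_delay(upgrade, "standby", start=1.0, graceful=True))
print(takeover_delay(crash, "standby", start=1.0, graceful=False))
```

The lock only covers leadership handoff. The other half of the design, a standby whose caches are already warm, is what makes the new leader useful as soon as it acquires the lease.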
Customized Scheduler (Brief)
Although not fully detailed, two optimization ideas were shared:
Group pending pod requests into equivalence classes to batch predicate and priority evaluation.
Apply relaxed randomization: stop evaluating all nodes once a sufficient candidate set is found, trading exactness for speed.
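Both ideas can be combined in a short sketch. Since the article does not detail the scheduler, everything below is an assumed illustration: the pod and node fields, the choice of three candidates as "sufficient", and the most-free-CPU scoring rule are not taken from Alibaba's implementation.

```python
import random
from collections import defaultdict


def equivalence_key(pod):
    """Pods with identical requirements share one equivalence class."""
    return (pod["cpu"], pod["mem"])


def fits(node, pod):
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def find_candidates(nodes, pod, enough, rng):
    """Visit nodes in random order and stop once `enough` feasible ones are found."""
    order = list(nodes)
    rng.shuffle(order)
    found = []
    for node in order:
        if fits(node, pod):
            found.append(node)
            if len(found) >= enough:
                break  # relaxed: do not evaluate the remaining nodes
    return found


def schedule(pending, nodes, enough=3, rng=None):
    """Place pods on nodes (mutating node capacity).

    Returns (placements, number_of_candidate_searches).
    """
    rng = rng or random.Random(0)
    classes = defaultdict(list)
    for pod in pending:
        classes[equivalence_key(pod)].append(pod)

    placements, searches = {}, 0
    for pods in classes.values():
        # One candidate search serves the whole equivalence class.
        candidates = find_candidates(nodes, pods[0], enough, rng)
        searches += 1
        for pod in pods:
            feasible = [n for n in candidates if fits(n, pod)]
            if not feasible:  # candidates exhausted: search again
                feasible = candidates = find_candidates(nodes, pod, enough, rng)
                searches += 1
            if not feasible:
                continue      # unschedulable for now
            best = max(feasible, key=lambda n: n["free_cpu"])
            best["free_cpu"] -= pod["cpu"]
            best["free_mem"] -= pod["mem"]
            placements[pod["name"]] = best["name"]
    return placements, searches


nodes = [{"name": f"node-{i}", "free_cpu": 8, "free_mem": 16} for i in range(100)]
pending = [{"name": f"pod-{i}", "cpu": 2, "mem": 4} for i in range(10)]
placements, searches = schedule(pending, nodes)
print(len(placements), searches)
```

In this example ten identical pods are placed after a single candidate search that stopped at three feasible nodes instead of scoring all one hundred. The trade-off is that the chosen nodes are good enough rather than globally optimal.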
Summary of Enhancements
Extended etcd capacity via storage separation, sharding across multiple etcd clusters, and a new bbolt page allocation algorithm, allowing a single etcd cluster to support very large Kubernetes clusters.
Implemented lightweight node heartbeats, improved HA API Server load distribution, added watch bookmarks, and introduced cache‑based indexing to eliminate List‑Watch bottlenecks.
Deployed hot‑standby controllers and schedulers, cutting failover time to seconds and improving overall availability.
Optimized the custom scheduler using equivalence‑class batching and randomization relaxation.
These combined improvements enabled Alibaba to run stable Kubernetes clusters with tens of thousands of nodes, successfully handling the 2019 Tmall 618 shopping festival.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.