How Alibaba Scaled Kubernetes to 10,000 Nodes: Key Optimizations and Lessons

This article describes how Alibaba ran Kubernetes at very large scale. It covers the performance bottlenecks encountered in etcd, the API Server, controllers, and the scheduler, and the concrete engineering improvements that kept clusters of more than 10,000 nodes stable: storage sharding, lease-based heartbeats, load balancing, watch bookmarks, and hot-standby controllers.

Alibaba Cloud Native

Background

Alibaba migrated its production workloads to Kubernetes in 2018, eventually operating clusters with over 10,000 nodes and millions of containers to support major e‑commerce events such as the 2019 Tmall 618 promotion. The article focuses on the technical challenges of scaling Kubernetes and the engineering solutions applied.

Scale Estimation and Simulation

To anticipate bottlenecks, Alibaba estimated a 10k‑node cluster would host roughly 200 k pods and 1 M objects. Using Kubemark, they built a simulation platform where 200 containers each ran 50 Kubemark processes, emulating 10k kubelets. The simulation revealed pod‑scheduling latencies of up to 10 seconds and overall instability.
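
The sizing figures above can be checked with a few lines of arithmetic. This is only a sanity check of the numbers quoted in the text; the variable names are made up for illustration.

```python
# Figures quoted in the text above.
NODES = 10_000
PODS = 200_000
OBJECTS = 1_000_000
SIM_CONTAINERS = 200
KUBEMARK_PER_CONTAINER = 50

# Each Kubemark process stands in for one kubelet.
simulated_kubelets = SIM_CONTAINERS * KUBEMARK_PER_CONTAINER

# Average density implied by the estimate.
pods_per_node = PODS // NODES
objects_per_node = OBJECTS // NODES

print(simulated_kubelets, pods_per_node, objects_per_node)
```

The simulated kubelet count matches the 10k-node target, and the estimate implies an average of 20 pods and 100 objects per node.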

Observed Component Bottlenecks at 10k Nodes

etcd suffered severe read/write latency, frequent OOM, and storage limits.

API Server queries for pods/nodes incurred high latency and could cause etcd OOM.

Controllers experienced delayed state perception and minutes‑long recovery after crashes.

Scheduler latency and throughput were insufficient for peak traffic.

etcd Improvements

Three major versions of enhancements were introduced:

Version 1 moved etcd data to a Tair cluster, increasing capacity but adding operational complexity and weaker consistency.

Version 2 partitioned objects across multiple etcd clusters, reducing per‑cluster data volume.

Version 3 redesigned the bbolt page allocation algorithm using a segregated hashmap, achieving O(1) free‑page lookup and enabling etcd storage growth from the recommended 2 GB to 100 GB without noticeable latency increase. Additional features such as raft learners and fully concurrent reads were contributed upstream and landed in etcd 3.4.
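
To illustrate why a segregated hashmap gives O(1) free-page lookup, here is a minimal sketch of the idea. It is not bbolt's actual code: the class and method names are hypothetical, and merging of adjacent free spans is left out for brevity. Free spans are bucketed by their length, so an exact-fit allocation is a single dictionary lookup instead of a scan over every free page.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class SegregatedFreeList:
    """Free page spans bucketed by span length (illustrative only)."""

    def __init__(self) -> None:
        # span length -> set of starting page ids of free spans of that length
        self.by_size: Dict[int, Set[int]] = defaultdict(set)

    def free(self, start: int, size: int) -> None:
        """Return a span of `size` contiguous pages beginning at `start`."""
        self.by_size[size].add(start)

    def allocate(self, size: int) -> Optional[int]:
        """Return the first page id of a free span of `size` pages, or None."""
        exact = self.by_size.get(size)
        if exact:
            return exact.pop()  # O(1): exact-fit bucket lookup
        # Fallback: split the smallest larger span and keep the remainder.
        larger = sorted(k for k, v in self.by_size.items() if k > size and v)
        if not larger:
            return None
        bigger = larger[0]
        start = self.by_size[bigger].pop()
        self.free(start + size, bigger - size)
        return start
```

A short usage example: after freeing a 4-page span at page 100 and a 2-page span at page 200, a 2-page request is served directly from the 2-page bucket, while a 3-page request splits the 4-page span and leaves one page free.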

API Server Improvements

Efficient Node Heartbeats

Each kubelet originally sent a heartbeat of about 15 KB every 10 seconds by updating its node object. At 10k nodes this generated ~1 GB/min of transaction logs in etcd and consumed more than 80% of API Server CPU. Alibaba adopted the built-in Lease API to decouple the heartbeat from the node object: the kubelet renews a lightweight Lease every 10 seconds and updates the node object only every 60 seconds. This reduced CPU load and transaction log volume, and the feature has been enabled by default since Kubernetes 1.14.
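
A back-of-the-envelope calculation shows where the ~1 GB/min figure comes from and roughly how much the Lease approach saves. This sketch counts payload bytes only, which is a simplification of real transaction log volume, and the Lease object size is an assumption, not a number from the source.

```python
NODES = 10_000
NODE_STATUS_BYTES = 15 * 1000      # ~15 KB per node-object update (from the text)
HEARTBEAT_INTERVAL_S = 10          # from the text
NODE_UPDATE_INTERVAL_S = 60        # node-object update interval with Lease (from the text)
LEASE_BYTES = 300                  # ASSUMPTION: rough size of a Lease object

beats_per_min = 60 // HEARTBEAT_INTERVAL_S
node_updates_per_min = 60 // NODE_UPDATE_INTERVAL_S

# Before: every heartbeat rewrites the full node object.
before = NODES * NODE_STATUS_BYTES * beats_per_min

# After: small Lease renewals plus one node-object update per minute.
after = NODES * (LEASE_BYTES * beats_per_min
                 + NODE_STATUS_BYTES * node_updates_per_min)

print(f"before: {before / 1e9:.2f} GB/min")
print(f"after:  {after / 1e9:.2f} GB/min")
```

Under these assumptions the original scheme writes about 0.9 GB of heartbeat payload per minute, consistent with the ~1 GB/min reported, and the Lease scheme writes several times less.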

API Server Load Balancing

During upgrades or node failures, traffic could concentrate on a single API Server, causing CPU spikes. Alibaba experimented with two load-balancing patterns (a load balancer in front of the API Servers vs. load balancing on the kubelet side) and found that neither fully solved the issue. Instead, they added server-side overload protection: when CPU exceeds a threshold, the API Server returns 429 Too Many Requests and eventually closes connections. Clients back off or periodically rebuild their connections, and API Server upgrades use maxSurge=3 to avoid performance jitter.
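
The interaction between server-side overload protection and client back-off can be sketched as follows. This is an illustrative model, not Alibaba's or Kubernetes' code: the function names, the 0.8 CPU threshold, and the back-off parameters are all assumptions.

```python
import random
import time
from typing import Callable


def handle(cpu_utilization: float, threshold: float = 0.8) -> int:
    """Server side: shed load with 429 once CPU passes the (assumed) threshold."""
    return 429 if cpu_utilization > threshold else 200


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Client side: retry on 429 with capped exponential back-off and jitter."""
    for attempt in range(max_attempts):
        status = send()
        if status != 429:
            return status
        delay = min(cap, base * 2 ** attempt)
        # Random jitter keeps many clients from retrying in lockstep.
        sleep(random.uniform(0, delay))
    return 429
```

In this model a client that hits an overloaded server twice and then finds it healthy sleeps twice and succeeds on the third attempt. Passing `sleep` as a parameter makes the behavior easy to exercise without real delays.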

List‑Watch & Bookmark

The List-Watch mechanism uses a global resourceVersion to track changes. A client that watches only a subset of objects may see no matching events for a long time, so its resourceVersion falls behind. When the server's event queue discards older entries, such a client receives a “too old resource version” error on reconnect and must relist. Alibaba introduced a Watch bookmark, through which the server periodically sends the client the latest resourceVersion even when no matching data has changed. This reduces unnecessary relists and cut the sync time after an API Server restart from minutes to a few seconds. This feature shipped in Kubernetes 1.15.
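
The effect of bookmarks can be shown with a small simulation. This is a toy model under stated assumptions, not the API Server's implementation: the `WatchWindow` class, the window capacity of 3, and the key naming are invented, and a bookmark is sent for every non-matching event rather than periodically to keep the code short.

```python
from collections import deque


class WatchWindow:
    """Bounded history of recent events, standing in for the server's queue."""

    def __init__(self, capacity: int) -> None:
        self.events = deque(maxlen=capacity)  # (resource_version, key)
        self.rv = 0

    def record(self, key: str) -> int:
        self.rv += 1
        self.events.append((self.rv, key))
        return self.rv

    def can_resume(self, client_rv: int) -> bool:
        # Resuming needs every event after client_rv to still be in the window.
        oldest = self.events[0][0] if self.events else self.rv + 1
        return client_rv + 1 >= oldest


def simulate(use_bookmarks: bool) -> bool:
    """Return True if a watcher filtered to node-a can resume without a relist."""
    window = WatchWindow(capacity=3)
    client_rv = 0
    keys = ["node-a/pod-1"] + [f"node-b/pod-{i}" for i in range(5)]
    for key in keys:
        rv = window.record(key)
        if key.startswith("node-a/"):
            client_rv = rv   # a matching event advances the client's version
        elif use_bookmarks:
            client_rv = rv   # a bookmark advances it without carrying any object
    return window.can_resume(client_rv)
```

Without bookmarks the client is stuck at version 1 while the window has moved on to versions 4 through 6, so it must relist. With bookmarks its version stays current and the watch can resume.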

Cacher & Indexing

Direct queries to the API Server suffered from lack of indexing and large etcd reads. Alibaba designed a cache‑coordinated workflow: the API Server first obtains the current etcd version, waits for its Reflector to catch up, then serves the request from cache. By adding namespace‑node‑name indexes, describe‑node latency dropped from ~5 seconds to 0.3 seconds at ten‑thousand‑node scale, and other get‑operations saw order‑of‑magnitude speedups.
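
The cache-coordinated read path can be sketched like this. It is a simplified model, not the API Server's Cacher: `FakeEtcd`, `Cacher`, and the single node-name index are hypothetical stand-ins, and the "wait for the Reflector" step is modeled by applying pending events synchronously.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class FakeEtcd:
    """Append-only log of (revision, pod, node) writes."""

    def __init__(self) -> None:
        self.log: List[Tuple[int, str, str]] = []

    def put(self, pod: str, node: str) -> None:
        self.log.append((len(self.log) + 1, pod, node))

    @property
    def revision(self) -> int:
        return len(self.log)


class Cacher:
    """In-memory cache with a node-name index, kept in sync from the log."""

    def __init__(self, etcd: FakeEtcd) -> None:
        self.etcd = etcd
        self.rv = 0
        self.node_of: Dict[str, str] = {}
        self.pods_by_node: Dict[str, Set[str]] = defaultdict(set)

    def sync_one(self) -> None:
        """Apply the next pending event (stands in for the Reflector)."""
        rv, pod, node = self.etcd.log[self.rv]
        old = self.node_of.get(pod)
        if old is not None:
            self.pods_by_node[old].discard(pod)
        self.node_of[pod] = node
        self.pods_by_node[node].add(pod)
        self.rv = rv

    def pods_on_node(self, node: str) -> List[str]:
        target = self.etcd.revision       # 1. read the current etcd version
        while self.rv < target:           # 2. wait until the cache catches up
            self.sync_one()
        # 3. serve from the index instead of scanning all pods
        return sorted(self.pods_by_node.get(node, set()))
```

The version check is what keeps the cached answer consistent with etcd, and the index is what turns a full scan into a lookup keyed by node name.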

Controller Failover

Controllers handling millions of objects required minutes to recover after a restart. Alibaba implemented a hot-standby controller whose informers pre-load data, and the active controller voluntarily releases its leader lease during an upgrade so that the standby can take over immediately. This reduced controller downtime to under 2 seconds during upgrades. After an unexpected crash, the standby only has to wait for the leader lease to expire (15 seconds) rather than spend minutes resynchronizing data. The same mechanism also benefits the scheduler.
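
The difference between a voluntary lease release and waiting for expiry can be modeled in a few lines. This is a simplified sketch, not the Kubernetes leader-election code: the `LeaderLease` class and `takeover_delay` function are hypothetical, only the 15-second lease duration comes from the text, and the standby's pre-loaded cache is not modeled.

```python
from typing import Optional


class LeaderLease:
    """Minimal leader lease with an expiry time and voluntary release."""

    def __init__(self, duration: float = 15.0) -> None:
        self.duration = duration
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, who: str, now: float) -> bool:
        # Acquire if the lease is free, already ours, or has expired.
        if self.holder in (None, who) or now >= self.expires_at:
            self.holder, self.expires_at = who, now + self.duration
            return True
        return False

    def release(self, who: str) -> None:
        if self.holder == who:
            self.holder = None


def takeover_delay(graceful: bool, poll: float = 1.0) -> float:
    """Seconds until the standby becomes leader after the active one stops."""
    lease = LeaderLease()
    lease.try_acquire("active", now=0.0)   # last renewal happens at t=0
    if graceful:
        lease.release("active")            # upgrade path: give the lease up
    now = 0.0
    while not lease.try_acquire("standby", now):
        now += poll                        # crash path: poll until expiry
    return now
```

In this model `takeover_delay(graceful=True)` returns 0.0 and `takeover_delay(graceful=False)` returns 15.0, which matches the two cases described above.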

Customized Scheduler (Brief)

Although not fully detailed, two optimization ideas were shared:

Group pending pod requests into equivalence classes to batch predicate and priority evaluation.

Apply relaxed randomization: stop evaluating all nodes once a sufficient candidate set is found, trading exactness for speed.
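
Both ideas can be sketched briefly. The source gives no implementation details, so everything here is an assumption made for illustration: pods and nodes are plain dictionaries, the equivalence key uses only CPU and memory requests, and feasibility is a simple capacity check. Upstream kube-scheduler exposes a comparable trade-off through its `percentageOfNodesToScore` setting.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Pod = Dict[str, int]
Node = Dict[str, int]


def equivalence_key(pod: Pod) -> Tuple[int, int]:
    """Pods with identical requests can share one feasibility evaluation."""
    return (pod["cpu"], pod["mem"])


def group_by_class(pods: List[Pod]) -> Dict[Tuple[int, int], List[Pod]]:
    groups: Dict[Tuple[int, int], List[Pod]] = defaultdict(list)
    for pod in pods:
        groups[equivalence_key(pod)].append(pod)
    return groups


def fits(node: Node, pod: Pod) -> bool:
    return node["cpu"] >= pod["cpu"] and node["mem"] >= pod["mem"]


def sample_candidates(
    nodes: List[Node], pod: Pod, enough: int, rng: random.Random
) -> List[Node]:
    """Visit nodes in random order and stop once `enough` feasible ones are found."""
    order = list(nodes)
    rng.shuffle(order)
    found: List[Node] = []
    for node in order:
        if fits(node, pod):
            found.append(node)
            if len(found) >= enough:
                break
    return found
```

Grouping means the candidate search runs once per class rather than once per pod, and the early stop bounds the work per search at the cost of possibly missing the globally best node.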

Summary of Enhancements

Extended etcd capacity via storage separation, sharding, and a new bbolt page allocation algorithm, so that a single etcd cluster can support a very large Kubernetes cluster.

Implemented lightweight node heartbeats, improved HA API Server load distribution, added watch bookmarks, and introduced cache‑based indexing to eliminate List‑Watch bottlenecks.

Deployed hot‑standby controllers and schedulers, cutting failover time to seconds and improving overall availability.

Optimized the custom scheduler using equivalence‑class batching and randomization relaxation.

These combined improvements enabled Alibaba to run stable Kubernetes clusters with more than 10,000 nodes, successfully handling the 2019 Tmall 618 shopping festival.
