How Alibaba Scaled Kubernetes to 10,000 Nodes: Key Optimizations and Lessons

This article details how Alibaba and Ant Financial tackled the performance and stability challenges of running Kubernetes at massive scale, describing their enhancements to etcd, API Server, scheduler, controller failover, load balancing, and other components that enabled a 10k‑node cluster to support the 2019 Tmall 618 promotion.

Alibaba Cloud Developer

Oct 16, 2019

Background

Since Alibaba's earliest AI system in 2013, its cluster management has evolved through several architectures, fully adopting Kubernetes in 2018. Rather than discussing why Kubernetes won, this article focuses on the problems encountered at large scale and the key optimizations made.

Scale Challenges

Alibaba and Ant Financial run over 10k containerized applications, with millions of containers across hundreds of thousands of hosts. Their largest clusters contain tens of thousands of nodes, and scaling Kubernetes to this size presented significant challenges.

A 10k‑node cluster at this scale implies roughly:

200k pods

1M objects

To reproduce these conditions, the team built a simulation platform on Kubemark: 200 4‑core containers, each running 50 Kubemark processes, together simulated 10k kubelet nodes. Under typical workloads, the simulated cluster showed severe latency (up to 10 seconds) and instability.

When the cluster reached 10k nodes, all components exhibited performance issues:

etcd suffered heavy read/write latency and denial‑of‑service conditions, and could not store the massive number of objects.

API Server queries for pods and nodes became extremely slow, and could drive the backing etcd out of memory (OOM).

Controllers lagged in observing changes from the API Server, and recovery after a restart took minutes.

The scheduler's high latency and low throughput could not meet the needs of daily operations or peak traffic.

etcd Improvements

To address these issues, Alibaba Cloud Container Platform made extensive enhancements to etcd.

First, data was offloaded to a Tair cluster, increasing storage capacity but adding operational complexity and compromising consistency.

Second, objects of different types were stored in separate etcd clusters, reducing per‑cluster data volume and improving scalability.
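The split by object type can be pictured as a routing table from resource type to etcd cluster. The sketch below is only a model of that idea: the endpoint names and the choice of which resources get their own cluster are hypothetical, since the article does not list them. Upstream kube-apiserver exposes a per-resource override for this purpose through its `--etcd-servers-overrides` flag.

```python
# Hypothetical endpoints; the article does not name the real clusters.
DEFAULT_ETCD = ["https://etcd-main:2379"]
OVERRIDES = {
    "events": ["https://etcd-events:2379"],
    "pods": ["https://etcd-pods:2379"],
}


def etcd_for(resource):
    """Return the etcd endpoints that store objects of this resource type."""
    return OVERRIDES.get(resource, DEFAULT_ETCD)


print(etcd_for("events"))  # dedicated cluster
print(etcd_for("nodes"))   # falls back to the default cluster
```

Because every resource type maps to exactly one cluster, each cluster holds only a fraction of the total data and can be sized and tuned independently.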

Third, a deep investigation revealed a bottleneck in bbolt’s page allocation algorithm; as data grew, linear page searches degraded performance.

The team designed a segmented hashmap for free‑page management, which enables O(1) page lookup and efficient merging of adjacent free spans. This raised etcd's usable storage from the recommended 2 GB to 100 GB without a significant increase in latency. Additional features, such as the etcd raft learner and fully concurrent reads, were contributed to the open‑source etcd 3.4 release.
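The free‑page idea can be sketched as follows. This is an illustrative model, not bbolt's actual code: the class and method names are hypothetical. Free spans are indexed by length for constant‑time exact‑fit lookup, and two extra maps keyed by a span's first and last page let a freed span find its neighbours without scanning.

```python
from collections import defaultdict


class HashmapFreelist:
    """Free-page bookkeeping keyed by span size (illustrative only)."""

    def __init__(self):
        self.by_size = defaultdict(set)  # span length -> start page ids
        self.forward = {}                # first page id -> span length
        self.backward = {}               # last page id -> span length

    def _add(self, start, size):
        self.by_size[size].add(start)
        self.forward[start] = size
        self.backward[start + size - 1] = size

    def _remove(self, start, size):
        self.by_size[size].discard(start)
        if not self.by_size[size]:
            del self.by_size[size]
        del self.forward[start]
        del self.backward[start + size - 1]

    def free(self, start, size):
        """Return a span to the pool, merging with adjacent free spans."""
        prev_size = self.backward.get(start - 1)
        if prev_size is not None:             # a free span ends right before us
            prev_start = start - prev_size
            self._remove(prev_start, prev_size)
            start, size = prev_start, size + prev_size
        next_size = self.forward.get(start + size)
        if next_size is not None:             # a free span starts right after us
            self._remove(start + size, next_size)
            size += next_size
        self._add(start, size)

    def allocate(self, n):
        """Return the first page of an n-page span, or None if nothing fits."""
        if n in self.by_size:                 # exact fit: one hash lookup
            start = next(iter(self.by_size[n]))
            self._remove(start, n)
            return start
        for size in sorted(self.by_size):     # fallback: split a larger span
            if size > n:
                start = next(iter(self.by_size[size]))
                self._remove(start, size)
                self._add(start + n, size - n)
                return start
        return None


fl = HashmapFreelist()
fl.free(10, 2)
fl.free(14, 2)
fl.free(12, 2)         # bridges the two spans above
print(fl.forward)      # {10: 6}
print(fl.allocate(6))  # 10
```

The fallback path scans distinct span sizes rather than individual pages, so its cost no longer grows with the number of free pages the way a linear page search does.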

API Server Improvements

Efficient Node Heartbeats

In production, kubelet reports a 15 KB heartbeat every 10 seconds, causing two major problems:

Node updates generate ~1 GB/min of etcd transaction logs in a 10k‑node cluster.

Heartbeat handling consumes >80% of API Server CPU time.
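The ~1 GB/min figure follows directly from the numbers above. A quick check, counting only the heartbeat payload and using decimal units:

```python
NODES = 10_000       # cluster size from the text
HEARTBEAT_KB = 15    # heartbeat size from the text
INTERVAL_S = 10      # heartbeat period from the text

updates_per_min = NODES * (60 // INTERVAL_S)
volume_gb_per_min = updates_per_min * HEARTBEAT_KB / 1_000_000  # KB -> GB

print(f"{updates_per_min} updates/min, ~{volume_gb_per_min:.1f} GB/min")
# prints "60000 updates/min, ~0.9 GB/min"
```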

Kubernetes introduced a built‑in Lease API, moving heartbeat data out of the node object. Kubelet now updates a small Lease object every 10 seconds, while node object updates are reduced to every 60 seconds for compatibility.

This drastically lowered API Server CPU usage and reduced etcd transaction log volume, scaling the system from ~1k to several thousand nodes (enabled by default in Kubernetes 1.14, see KEP‑0009).
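A minimal sketch of the resulting write schedule, assuming the two periods quoted above and ignoring any status‑change‑triggered updates a real kubelet may send. The function names are hypothetical.

```python
LEASE_PERIOD_S = 10        # from the text: Lease renewed every 10 s
NODE_STATUS_PERIOD_S = 60  # from the text: node object updated every 60 s


def writes_at(t):
    """Return which objects one kubelet writes at second t (sketch)."""
    writes = []
    if t % LEASE_PERIOD_S == 0:
        writes.append("lease")
    if t % NODE_STATUS_PERIOD_S == 0:
        writes.append("node")
    return writes


def writes_per_minute(start=0):
    """Count writes of each kind over one minute."""
    counts = {"lease": 0, "node": 0}
    for t in range(start, start + 60):
        for kind in writes_at(t):
            counts[kind] += 1
    return counts


print(writes_per_minute())  # {'lease': 6, 'node': 1}
```

Per node, the heavy node‑object write drops from six per minute to one, while the six liveness signals are carried by the much smaller Lease object.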

API Server Load Balancing

High‑availability clusters can suffer uneven load distribution, especially during upgrades or node failures, causing one API Server to become a bottleneck.

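The team explored adding a load balancer between the API Servers and the kubelets, but routing alone did not solve the problem: clients reuse long‑lived TLS connections, so existing connections stay pinned to whichever API Server they first reached. The solution combined three optimizations: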

API Server returns 429 when overloaded, prompting clients to back off.

Clients rebuild connections after repeated 429 responses, periodically reshuffling connections.

During upgrades, set maxSurge=3 to avoid performance spikes.

The enhanced version achieved balanced API Server load and rapid recovery after node restarts.
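The client side of the first two optimizations might look like the sketch below. It is a simplified model: the threshold of three consecutive 429 responses is a hypothetical value, and picking a different server directly stands in for closing the connection and letting the load balancer choose again.

```python
import random


class Client:
    """Sketch of a client that re-dials after repeated 429 responses."""

    def __init__(self, servers, max_429=3, rng=None):
        self.servers = servers
        self.max_429 = max_429            # hypothetical threshold
        self.rng = rng or random.Random()
        self.current = self.rng.choice(servers)
        self.consecutive_429 = 0

    def handle_response(self, status):
        """Track 429s; drop the reused connection once the threshold is hit."""
        if status != 429:
            self.consecutive_429 = 0
            return
        self.consecutive_429 += 1
        if self.consecutive_429 >= self.max_429:
            self.reconnect()

    def reconnect(self):
        """Abandon the long-lived connection and connect somewhere else."""
        others = [s for s in self.servers if s != self.current]
        self.current = self.rng.choice(others or self.servers)
        self.consecutive_429 = 0


client = Client(["apiserver-a", "apiserver-b"], rng=random.Random(0))
before = client.current
for _ in range(3):
    client.handle_response(429)
print(before != client.current)  # True: the client moved to another server
```

A production client would also back off between retries; that part is omitted here to keep the reconnect logic visible.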

List‑Watch & Cacher

List‑Watch is the core communication mechanism in Kubernetes. etcd changes are watched by the API Server via Reflector and cached for clients.

When a client disconnects, it resumes the watch from the last resourceVersion it saw. However, if the server's event queue has already discarded updates older than that version, the client receives a "too old resource version" error and must relist all data.

To mitigate this, Kubernetes introduced watch bookmarks, which act as heartbeats: the server pushes the latest resourceVersion to the client even when no events matching that client's watch have occurred. This reduced full relist frequency from 3% to a few‑hundredths, dramatically improving performance (contributed by Alibaba Cloud and landed in Kubernetes 1.15).
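The effect of a bookmark can be shown with a small simulation. The classes below are hypothetical stand‑ins for the API Server's bounded event window, not the real implementation; the window size of 5 is arbitrary.

```python
from collections import deque


class WatchCache:
    """Keeps only the most recent `capacity` events, like a bounded window."""

    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)  # (resource_version, key)
        self.latest_rv = 0

    def add(self, key):
        self.latest_rv += 1
        self.events.append((self.latest_rv, key))

    def resume(self, since_rv):
        """Return events after since_rv, or raise if the window has moved on."""
        oldest = self.events[0][0] if self.events else self.latest_rv + 1
        if since_rv + 1 < oldest:
            raise LookupError("too old resource version: relist required")
        return [e for e in self.events if e[0] > since_rv]


def needs_relist(cache, client_rv):
    try:
        cache.resume(client_rv)
        return False
    except LookupError:
        return True


cache = WatchCache(capacity=5)
cache.add("pod-on-my-node")          # rv 1: the only event this client sees
for i in range(10):
    cache.add(f"pod-elsewhere-{i}")  # rv 2..11: filtered out for this client

print(needs_relist(cache, 1))                # True: rv 1 fell out of the window
print(needs_relist(cache, cache.latest_rv))  # False: a bookmark carried rv 11
```

Without bookmarks, a client watching a quiet subset of objects is stuck at an old version while unrelated events push it out of the window. With a bookmark, its version tracks the server's and a reconnect resumes cleanly.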

Cacher & Indexing

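For consistency, the API Server answers many queries with quorum reads directly against etcd. etcd has no index by namespace, node name, or label, so the API Server must fetch large amounts of data and filter it afterwards, causing massive data transfers and high memory usage.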

Alibaba introduced a cache‑coordinated mechanism: the API Server first obtains the current etcd version, waits for its internal reflector to catch up, and then serves the request from cache. This enables namespace/node‑name/label indexing, reducing a typical “describe node” operation from 5 seconds to 0.3 seconds in a 10k‑node cluster and providing order‑of‑magnitude gains for other queries.
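The three steps (read the current etcd revision, wait for the cache to reach it, serve from an indexed cache) can be sketched as below. The class, the node‑name index, and the 3‑second timeout are all hypothetical choices made for illustration.

```python
import threading
from collections import defaultdict


class IndexedCache:
    """In-memory cache with a node-name index, fed by a reflector (sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self.revision = 0
        self.by_node = defaultdict(dict)  # node name -> {pod name: pod}

    def apply(self, revision, pod_name, node_name):
        """Called for every watched change, in revision order."""
        with self._cond:
            self.by_node[node_name][pod_name] = {"name": pod_name, "node": node_name}
            self.revision = revision
            self._cond.notify_all()

    def wait_until(self, revision, timeout=3.0):
        """Block until the cache has caught up to `revision`, or time out."""
        with self._cond:
            return self._cond.wait_for(lambda: self.revision >= revision, timeout)

    def pods_on(self, node_name):
        with self._cond:
            return sorted(self.by_node.get(node_name, {}))


def list_pods_on_node(cache, current_etcd_revision, node_name):
    target = current_etcd_revision()  # cheap call: revision only, no objects
    if not cache.wait_until(target):
        raise TimeoutError("cache did not catch up")
    return cache.pods_on(node_name)   # index lookup instead of a full scan


cache = IndexedCache()
cache.apply(1, "web-1", "node-a")
cache.apply(2, "db-1", "node-b")
cache.apply(3, "web-2", "node-a")
print(list_pods_on_node(cache, lambda: 3, "node-a"))  # ['web-1', 'web-2']
```

Waiting for the cache to reach the revision etcd reported at request time is what keeps the answer as fresh as a direct read, while the data itself never leaves the API Server's memory.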

Context‑Aware

API Server requests to external services (etcd, webhooks, authentication) should be cancelled when the client request ends. Without this, lingering goroutines consume resources, leading to OOM or crashes, especially during high‑frequency client aborts.

Alibaba engineers made API Server fully context‑aware (adopted in Kubernetes 1.16), improving performance and throughput.
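The API Server does this with Go's request contexts. The same principle can be illustrated with Python's asyncio, where cancelling a request task propagates into the backend call it is awaiting. The function names are hypothetical and the sleep stands in for a slow etcd or webhook call.

```python
import asyncio


async def call_backend(log):
    """Stand-in for an etcd or webhook call made on behalf of a request."""
    try:
        await asyncio.sleep(10)  # slow downstream work
        log.append("backend finished")
    except asyncio.CancelledError:
        log.append("backend cancelled")  # resources are released promptly
        raise


async def handle_request(log):
    # Awaiting directly ties the backend call's lifetime to the request's.
    await call_backend(log)


async def demo():
    log = []
    request = asyncio.create_task(handle_request(log))
    await asyncio.sleep(0.01)  # the client gives up almost immediately
    request.cancel()
    try:
        await request
    except asyncio.CancelledError:
        pass
    return log


print(asyncio.run(demo()))  # ['backend cancelled']
```

Without propagation, the backend call would keep running for the full ten seconds after the client had gone, which is the kind of lingering work the section describes.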

Requests Flood Prevention

API Server lacks robust self‑protection against request floods beyond max‑inflight limits. Overload can cause OOM or crashes, especially during controller restarts or DaemonSet bugs that generate massive List Pod requests.

Two mitigation strategies were applied:

During startup, API Server rejects heavy List requests with 429 until its cache is ready.

Dynamic rate‑limiting based on User‑Agent identifies and throttles abusive components without costly identity checks.

These measures prevent cascading failures during upgrades or bugs.
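Both strategies can be modelled as a single admission gate. This is a sketch under stated assumptions: the token‑bucket algorithm, the rate and burst values, and the User‑Agent string are hypothetical, as the article does not describe how its limits are computed.

```python
import time


class TokenBucket:
    """Simple token bucket: `rate` tokens per second, up to `burst`."""

    def __init__(self, rate, burst, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens = burst
        self.stamp = now()

    def allow(self):
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.stamp) * self.rate)
        self.stamp = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class Gate:
    """Admission gate: startup protection plus per-User-Agent throttling."""

    def __init__(self, limits, now=time.monotonic):
        self.cache_ready = False
        self.buckets = {ua: TokenBucket(r, b, now) for ua, (r, b) in limits.items()}

    def admit(self, verb, user_agent):
        """Return 200 to proceed, or 429 to ask the client to back off."""
        if verb == "list" and not self.cache_ready:
            return 429  # heavy lists wait until the cache is warm
        bucket = self.buckets.get(user_agent)
        if bucket is not None and not bucket.allow():
            return 429  # this component is over its budget
        return 200


clock = [0.0]  # fake clock so the example is deterministic
gate = Gate({"buggy-daemonset/v1": (1, 2)}, now=lambda: clock[0])
print(gate.admit("list", "kubectl"))  # 429: cache not ready yet
gate.cache_ready = True
print([gate.admit("list", "buggy-daemonset/v1") for _ in range(3)])  # [200, 200, 429]
```

Keying the limit on the User‑Agent header means the gate can run before authentication, which is what keeps the check cheap.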

Controller Failover

In a 10k‑node cluster, controllers hold nearly a million objects, so rebuilding that state after a restart can take minutes. To avoid this, a hot‑standby controller starts its informers in advance and pre‑loads the data. During an upgrade, the primary controller actively releases its leader lease, so the standby can take over immediately.

This reduces controller downtime to under 2 seconds during upgrades and to ~15 seconds after crashes, also benefiting the scheduler.
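The gap between the two figures comes from how the standby learns that the leader is gone. A minimal model, assuming a 15‑second lease duration chosen to match the ~15 seconds quoted above; the class and function names are hypothetical.

```python
class LeaderLease:
    """Leader-election lease that expires if not renewed (sketch)."""

    def __init__(self, duration_s=15):  # assumed duration, see lead-in
        self.duration_s = duration_s
        self.holder = None
        self.renewed_at = None

    def try_acquire(self, who, now):
        expired = self.renewed_at is None or now - self.renewed_at >= self.duration_s
        if self.holder in (None, who) or expired:
            self.holder, self.renewed_at = who, now
            return True
        return False

    def release(self, who):
        """Give the lease up explicitly, as the primary does on upgrade."""
        if self.holder == who:
            self.holder, self.renewed_at = None, None


def takeover_delay(lease, graceful, step=1):
    """Seconds until the standby leads after the primary stops at t=0."""
    lease.try_acquire("primary", now=0)
    if graceful:
        lease.release("primary")
    t = 0
    while not lease.try_acquire("standby", now=t):
        t += step
    return t


print(takeover_delay(LeaderLease(), graceful=True))   # 0
print(takeover_delay(LeaderLease(), graceful=False))  # 15
```

In both cases the standby's informers are already populated, so the delay is only the time needed to win the lease, not the time needed to reload objects.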

Customized Scheduler

Alibaba’s custom scheduler employs two key ideas:

Equivalence classes: batch‑process similar pending requests to reduce predicate/priority evaluations.

Relaxed randomization: stop evaluating nodes once enough candidates are found, sacrificing exactness for speed.
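Both ideas fit in a short sketch. The pod and node fields, the equivalence key, and the `enough` threshold are hypothetical simplifications; a real scheduler would also invalidate cached results as nodes fill up.

```python
import random


def equivalence_key(pod):
    """Pods with identical scheduling-relevant fields share a class."""
    return (pod["cpu"], pod["mem"], tuple(sorted(pod.get("labels", {}).items())))


def fits(pod, node):
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def feasible_sample(pod, nodes, enough, rng):
    """Scan from a random offset and stop once `enough` nodes fit."""
    start = rng.randrange(len(nodes))
    found = []
    for i in range(len(nodes)):
        node = nodes[(start + i) % len(nodes)]
        if fits(pod, node):
            found.append(node["name"])
            if len(found) >= enough:
                break  # relaxed: good enough, not the global best
    return found


def candidates_for(pods, nodes, enough=2, rng=None):
    """Return candidate nodes per pod and the number of node scans done."""
    rng = rng or random.Random()
    cache, scans, result = {}, 0, {}
    for pod in pods:
        key = equivalence_key(pod)
        if key not in cache:  # one scan per equivalence class
            cache[key] = feasible_sample(pod, nodes, enough, rng)
            scans += 1
        result[pod["name"]] = cache[key]
    return result, scans


nodes = [{"name": f"n{i}", "free_cpu": 4, "free_mem": 8} for i in range(100)]
pods = [{"name": f"web-{i}", "cpu": 1, "mem": 2} for i in range(50)]
pods.append({"name": "db-0", "cpu": 4, "mem": 8})
result, scans = candidates_for(pods, nodes, rng=random.Random(0))
print(scans)  # 2: one scan for the 50 web pods, one for db-0
```

The random starting offset spreads placements across the cluster even though each scan stops early.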

Summary

etcd storage capacity and performance were dramatically increased via data sharding, external storage, and a segmented hashmap algorithm, enabling single etcd clusters to support massive Kubernetes deployments.

Lightweight node heartbeats, improved HA load balancing, watch bookmarks, and cache‑based indexing eliminated major List‑Watch bottlenecks, making stable ten‑thousand‑node clusters feasible.

Hot‑standby controller/scheduler designs reduced service interruption to seconds, enhancing overall availability.

Alibaba’s custom scheduler achieved further gains through equivalence‑class batching and relaxed randomization.

These enhancements allowed Alibaba and Ant Financial to run their core e‑commerce workloads on Kubernetes clusters exceeding ten thousand nodes and successfully handle the 2019 Tmall 618 promotion.
