How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale
This article explains how Alibaba Cloud Container Service (ACK) designs a unit‑based, tiered management system, capacity planning model, global observability architecture, and pluggable components to reliably operate more than ten thousand diverse Kubernetes clusters during the massive Double‑11 shopping event.
Background
During the 2019 Double‑11 shopping festival, Alibaba Cloud Container Service (ACK) managed more than 10,000 Kubernetes clusters worldwide, supporting internal core systems and Alibaba Cloud products.
Challenges
Multiple cluster types (standard, serverless, AI, bare‑metal, edge, Windows) with distinct parameters and hosting requirements.
Cluster sizes ranging from a few nodes to tens of thousands of nodes, with rapid growth in the number of services.
Security and compliance across regions (e.g., GDPR in Europe, Chinese regulatory tiers).
Continuous evolution of Kubernetes versions and features.
Design Goals
Unit‑based tiered management with capacity planning and water‑level (utilization threshold) control.
Global deployment, release, disaster recovery, and observability.
Pluggable, customizable, modular architecture for continuous evolution.
Unit‑Based Tiered Management
ACK treats a regional meta‑cluster as a “unit”. Each unit aggregates thousands of guest clusters. Masters are distributed across multiple data centers to achieve same‑city multi‑active resilience and millisecond‑level inter‑master latency.
A tiered capacity model assigns each guest cluster a “grade” based on resource‑type factors, which lets a unit scale intelligently while keeping roughly 40% headroom.
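As a rough illustration of the grading idea, the sketch below scores a guest cluster from a few weighted resource factors and checks whether a unit has started to eat into its headroom. The factor names, weights, reference sizes, and grade thresholds are all hypothetical; the article gives only the 40% figure and does not publish the actual model.

```python
from dataclasses import dataclass

# Hypothetical weights per resource-type factor (not from the article).
WEIGHTS = {"nodes": 0.5, "pods": 0.3, "services": 0.2}
# Hypothetical reference sizes used to normalise each factor to the range 0..1.
REFERENCE = {"nodes": 1000, "pods": 30000, "services": 5000}
HEADROOM = 0.40  # keep roughly 40% of a unit's capacity free


@dataclass
class GuestCluster:
    name: str
    nodes: int
    pods: int
    services: int


def score(cluster: GuestCluster) -> float:
    """Weighted, normalised size score; each factor is capped at 1.0."""
    sizes = {"nodes": cluster.nodes, "pods": cluster.pods, "services": cluster.services}
    return sum(WEIGHTS[k] * min(sizes[k] / REFERENCE[k], 1.0) for k in WEIGHTS)


def grade(cluster: GuestCluster) -> str:
    """Map a score to a coarse grade (thresholds are illustrative)."""
    s = score(cluster)
    if s < 0.1:
        return "small"
    if s < 0.5:
        return "medium"
    return "large"


def needs_scale_out(used: float, capacity: float) -> bool:
    """True when usage has started to consume the reserved headroom."""
    return used > capacity * (1.0 - HEADROOM)
```

With these made‑up numbers, a 200‑node cluster with 6,000 pods and 1,000 services grades as "medium", and a unit is flagged for scale‑out once it passes about 60% of its capacity.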
Capacity Planning
Networking uses Terway, Alibaba Cloud's high‑performance container network plugin, with elastic network interfaces (ENIs) for VPC connectivity. Separate IP ranges are allocated for nodes, pods, and services. A multi‑factor calculation (cost, density, performance, quota, grade ratio) determines how many guest clusters a meta‑cluster can host.
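Two parts of this planning step can be sketched in a few lines: checking that the node, pod, and service ranges do not overlap, and taking the tightest factor as the hosting limit. The CIDR values and per‑factor limits in the usage note are invented, and treating the result as a simple minimum over factors is an assumption; the article names the factors but not the formula.

```python
import ipaddress


def overlapping_ranges(cidrs: dict) -> list:
    """Return sorted pairs of range names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def max_guest_clusters(limits: dict) -> int:
    """Assume the most restrictive factor bounds the number of guest clusters."""
    return min(limits.values())
```

For example, `overlapping_ranges({"nodes": "10.0.0.0/16", "pods": "10.1.0.0/16", "services": "172.21.0.0/20"})` returns an empty list, while `max_guest_clusters({"quota": 500, "density": 320, "cost": 400})` returns 320 because density is the binding factor in that hypothetical case.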
Global Observability
Prometheus federation is deployed across 20 regions in three tiers. An edge Prometheus runs inside each meta‑cluster, a cascading Prometheus aggregates data for each large region, and a dual‑active central Prometheus provides a global view and alerting.
Collected metrics cover OS resources, master components, kube‑state‑metrics, cAdvisor, and etcd. Alertmanager forwards alerts to DingTalk, email, SMS, and other channels.
Monitoring traffic is exposed via LoadBalancer services to keep API‑server load low.
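In a federated setup, an upper‑tier Prometheus pulls selected series from a lower tier through the `/federate` endpoint, passing one `match[]` parameter per series selector. The helper below builds such a URL as a minimal sketch; the host name in the usage note is a placeholder for whatever address the LoadBalancer service exposes, and the selectors are examples only.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list) -> str:
    """Build a /federate URL that requests only series matching the selectors."""
    # One match[] parameter per selector; urlencode percent-encodes the brackets.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"
```

Calling `federate_url("http://edge.example:9090/", ["up"])` yields `http://edge.example:9090/federate?match%5B%5D=up`. Narrow selectors keep the amount of data crossing regions small.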
Pluggable Architecture
Components are modular. OpenKruise’s BroadcastJob runs a pod on every node for tasks such as rolling upgrades or health checks; it resembles a DaemonSet, except that its pods run to completion instead of running indefinitely.
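To make the BroadcastJob idea concrete, the sketch below assembles a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The field names follow the OpenKruise documentation as best understood here and should be checked against the installed CRD version; the job name, image, and command are placeholders.

```python
import json


def broadcast_job(name: str, image: str, command: list, ttl_seconds: int = 300) -> dict:
    """Build a BroadcastJob manifest that runs one pod to completion on each node."""
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "BroadcastJob",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": "main", "image": image, "command": command}
                    ],
                    # Pods must not restart, so that the job can finish.
                    "restartPolicy": "Never",
                }
            },
            # Mark the job complete once every pod is done, then clean it up.
            "completionPolicy": {
                "type": "Always",
                "ttlSecondsAfterFinished": ttl_seconds,
            },
        },
    }


if __name__ == "__main__":
    manifest = broadcast_job("node-health-check", "busybox", ["sh", "-c", "uptime"])
    print(json.dumps(manifest, indent=2))
```

The finite lifespan mentioned above shows up in `completionPolicy`: unlike a DaemonSet, the job ends and is garbage‑collected after the TTL.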
Multiple cluster profiles and reusable templates allow users to select configurations that match specific scenarios (standard, serverless, AI, bare‑metal, edge, Windows).
Cluster Templates and Profiles
ACK provides a library of cluster profiles and templates that encapsulate best‑practice settings for different workloads, simplifying provisioning and ensuring consistency across thousands of clusters.
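One simple way to think about profiles is as layered settings: shared defaults, a per‑profile overlay, and optional per‑cluster overrides. The sketch below shows that layering only; the setting keys and values are hypothetical and are not ACK's actual profile schema.

```python
from typing import Optional

# Hypothetical shared defaults and per-profile overlays (not ACK's real schema).
BASE = {"node_os": "linux", "gpu": False, "managed_nodes": True}
PROFILES = {
    "standard": {},
    "ai": {"gpu": True},
    "windows": {"node_os": "windows"},
}


def render_profile(profile: str, overrides: Optional[dict] = None) -> dict:
    """Merge defaults, the named profile, and overrides; later layers win."""
    if profile not in PROFILES:
        raise KeyError(f"unknown profile: {profile}")
    return {**BASE, **PROFILES[profile], **(overrides or {})}
```

Keeping the overlays small makes it easier to see how a given cluster type differs from the default, which matters when the same template is applied to thousands of clusters.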
Monitoring Optimizations
Separate monitoring traffic from API‑server traffic by exposing Prometheus via LoadBalancer services.
Central Prometheus scrapes only required metrics to reduce network load.
Labels on cascading Prometheus instances identify region and meta‑cluster; unnecessary labels are omitted to save bandwidth.
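The filtering and labelling described in this list can be sketched as a small function over samples. In a real deployment this would normally be expressed in Prometheus configuration (for example `external_labels` and relabeling rules) and not in application code; the allowlist, the dropped label names, and the `region` and `meta_cluster` label keys below are assumptions for illustration.

```python
from typing import Optional

# Hypothetical allowlist of metrics the central tier needs.
KEEP_METRICS = {"up", "apiserver_request_total", "etcd_server_has_leader"}
# Hypothetical high-cardinality labels that are not needed globally.
DROP_LABELS = {"instance", "pod_template_hash"}


def prune(sample: dict, region: str, meta_cluster: str) -> Optional[dict]:
    """Drop unneeded metrics and labels, then tag the sample with its origin."""
    if sample["name"] not in KEEP_METRICS:
        return None  # not required centrally, so never sent upstream
    labels = {k: v for k, v in sample["labels"].items() if k not in DROP_LABELS}
    labels["region"] = region
    labels["meta_cluster"] = meta_cluster
    return {"name": sample["name"], "labels": labels, "value": sample["value"]}
```

A sample such as `{"name": "up", "labels": {"job": "etcd", "instance": "10.0.0.5:2379"}, "value": 1.0}` comes out with only the `job`, `region`, and `meta_cluster` labels, while a metric outside the allowlist is discarded.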
Conclusion
The unit‑based, tiered, and globally observable design enables ACK to reliably manage over 10,000 Kubernetes clusters, supporting large‑scale workloads with high availability, extensibility, and automated lifecycle management.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
