Cloud Native 13 min read

How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale

This article explains how Alibaba Cloud Container Service (ACK) designs a unit‑based, tiered management system, capacity planning model, global observability architecture, and pluggable components to reliably operate more than ten thousand diverse Kubernetes clusters during the massive Double‑11 shopping event.

Alibaba Cloud Native

Nov 30, 2019

How Alibaba Cloud Manages Over 10,000 Kubernetes Clusters at Double‑11 Scale

Background

During the 2019 Double‑11 shopping festival, Alibaba Cloud Container Service (ACK) managed more than 10,000 Kubernetes clusters worldwide, supporting internal core systems and Alibaba Cloud products.

Challenges

Multiple cluster types (standard, serverless, AI, bare‑metal, edge, Windows) with distinct parameters and hosting requirements.

Cluster sizes ranging from a few nodes to tens of thousands, with rapid growth in the number of services.

Security and compliance across regions (e.g., GDPR in Europe, Chinese regulatory tiers).

Continuous evolution of Kubernetes versions and features.

Design Goals

Unit‑based tiered management with capacity planning and water‑level control.

Global deployment, release, disaster recovery, and observability.

Pluggable, customizable, modular architecture for continuous evolution.

Unit‑Based Tiered Management

ACK treats a regional meta‑cluster as a “unit”. Each unit aggregates thousands of guest clusters. Masters are distributed across multiple data centers to achieve same‑city multi‑active resilience and millisecond‑level inter‑master latency.

A tiered capacity model assigns a “grade” to each guest cluster based on resource‑type factors, enabling intelligent scaling with roughly 40 % headroom.

Capacity Planning

The network uses Alibaba‑developed high‑performance container network Terway with ENI for VPC connectivity. Separate IP ranges are allocated for nodes, pods, and services. Multi‑factor calculations (cost, density, performance, quota, grade ratio) determine how many guest clusters a meta‑cluster can host.

Global Observability

Prometheus Federation is deployed across 20 regions. Edge Prometheus runs inside each meta‑cluster, cascading Prometheus aggregates data per large region, and a dual‑active central Prometheus provides a global view and alerting.

Collected metrics include OS resources, master components, kubernetes‑state‑metrics, cAdvisor, and etcd. AlertManager forwards alerts to DingTalk, email, SMS, etc.

Monitoring traffic is exposed via LoadBalancer services to keep API‑server load low.

Pluggable Architecture

Components are modular. OpenKruise’s BroadcastJob enables rolling upgrades or health checks on every node (similar to DaemonSet but with a finite lifespan).

Multiple cluster profiles and reusable templates allow users to select configurations that match specific scenarios (standard, serverless, AI, bare‑metal, edge, Windows).

Cluster Templates and Profiles

ACK provides a library of cluster profiles and templates that encapsulate best‑practice settings for different workloads, simplifying provisioning and ensuring consistency across thousands of clusters.

Monitoring Optimizations

Separate monitoring traffic from API‑server traffic by exposing Prometheus via LoadBalancer services.

Central Prometheus scrapes only required metrics to reduce network load.

Labels on cascading Prometheus instances identify region and meta‑cluster; unnecessary labels are omitted to save bandwidth.

Conclusion

The unit‑based, tiered, and globally observable design enables ACK to reliably manage over 10,000 Kubernetes clusters, supporting large‑scale workloads with high availability, extensibility, and automated lifecycle management.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Observability Kubernetes Prometheus Alibaba Cloud cluster management ACK

Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.