How Alibaba Cloud ACK Guarantees Kubernetes Cluster Stability at Massive Scale
This article explains the stability challenges of large‑scale Kubernetes clusters, outlines ACK's high‑availability architecture and component optimizations, and details product features such as Prometheus, AIOps and managed node pools that together ensure reliable, performant cloud‑native workloads.
Introduction
In July 2023, Alibaba Cloud Container Service for Kubernetes (ACK) became one of the first products to pass the China Academy of Information and Communications Technology’s "Cloud Service Stable Operation Capability – Container Cluster Stability" assessment and earned the "Advanced" certification. As ACK adoption grows, ensuring cluster stability has become a fundamental requirement.
Typical Stability Pain Points
Control‑plane interruptions: During traffic spikes, control‑plane services may respond intermittently or become completely unavailable, especially when the control plane cannot scale automatically.
Node NotReady cascades: A batch of nodes entering NotReady at once can trigger a snowball effect in which business workloads restart repeatedly.
Slow image pulls: During high‑traffic periods, pod image pulls can take minutes, which hurts service availability.
Operational complexity: Managing resources, tuning parameters, and upgrading control‑plane components all require extensive analysis and automation.
Kubernetes Cluster Architecture
A cloud‑native Kubernetes cluster consists of a control plane, a data plane, and the underlying cloud resources that host them. The control plane includes kube‑apiserver, etcd, kube‑scheduler, kube‑controller‑manager (KCM), and cloud‑controller‑manager (CCM). The data plane comprises node management, pod lifecycle, services, and auxiliary components such as logging, monitoring, and security agents.
Any bottleneck in these components or in the cloud‑resource links can degrade overall cluster stability.
Stability Risks and Challenges in Large‑Scale Scenarios
Massive resource counts: Clusters may exceed 10,000 nodes or hold more than 100,000 Namespaces, ConfigMaps, and Secrets.
Control‑plane pressure: Components cache large portions of the cluster state, so excessive LIST requests can exhaust memory and cause out‑of‑memory (OOM) kills.
Data‑plane pressure and sync bottlenecks: Overloaded nodes slow the kubelet down or push the node into NotReady, and network saturation hampers node‑to‑control‑plane synchronization.
Cloud‑resource limits: SLB connections, bandwidth, and ECS memory and CPU can all become saturated, so components need high‑availability deployment across zones.
ACK Stability Governance and Optimization Strategies
1. High‑Availability Architecture
Control‑plane components are deployed across multiple Availability Zones (AZs) with zone‑level redundancy. In a three‑AZ region, ACK’s SLA for the control plane is 99.95%; in single‑AZ regions, it is 99.5%.
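To put the two SLA figures in perspective, the short sketch below converts each into a downtime budget. It assumes a 30‑day measurement period purely for illustration; the actual SLA terms define the period and what counts as downtime.

```python
# Downtime budget implied by an availability SLA.
# Assumption: a 30-day measurement period (not taken from the SLA text).
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(sla_percent: float,
                            period_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the maximum downtime, in minutes, that still meets the SLA."""
    return period_minutes * (1 - sla_percent / 100)


for label, sla in [("three-AZ region", 99.95), ("single-AZ region", 99.5)]:
    print(f"{label}: {downtime_budget_minutes(sla):.1f} min per 30-day month")
```

Under that assumption, the three‑AZ figure allows roughly a tenth of the downtime that the single‑AZ figure does.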
2. Kubernetes Component Optimizations
ACK optimizes core components (kube‑apiserver, etcd, KCM, kube‑scheduler, kubelet, kube‑proxy) through parameter tuning, automatic scaling, and version upgrades.
3. Capacity Planning and Auto‑Elasticity
Encourage clients to use the Informer mechanism (one LIST followed by a WATCH) instead of repeated raw LIST requests.
Prefer protobuf over JSON for non‑CRD resources to reduce traffic.
Apply VPA/HPA to control‑plane pods based on load.
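The Informer recommendation can be illustrated with a minimal in‑memory sketch. `FakeAPIServer` and `Informer` are hypothetical stand‑ins written for this example, not the client‑go types; the point is only that the client issues one full LIST, then keeps a local cache current from incremental watch events, so later reads never touch the server.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAPIServer:
    """Hypothetical in-memory stand-in for an API server (not a real client)."""
    objects: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (resource_version, type, name, obj)
    resource_version: int = 0
    list_calls: int = 0

    def apply(self, name, obj):
        self.resource_version += 1
        kind = "MODIFIED" if name in self.objects else "ADDED"
        self.objects[name] = obj
        self.events.append((self.resource_version, kind, name, obj))

    def delete(self, name):
        self.resource_version += 1
        obj = self.objects.pop(name)
        self.events.append((self.resource_version, "DELETED", name, obj))

    def list_objects(self):
        """Expensive call: returns the full state plus its resource version."""
        self.list_calls += 1
        return dict(self.objects), self.resource_version

    def watch_events(self, since):
        """Cheap call: returns only the changes after the given version."""
        return [e for e in self.events if e[0] > since]


class Informer:
    def __init__(self, server):
        self.server = server
        # One full LIST seeds the cache and the starting resource version.
        self.cache, self.resource_version = server.list_objects()

    def sync(self):
        """Apply watch events received since the last known version."""
        for rv, kind, name, obj in self.server.watch_events(self.resource_version):
            if kind == "DELETED":
                self.cache.pop(name, None)
            else:
                self.cache[name] = obj
            self.resource_version = rv

    def get(self, name):
        """Reads are served from the local cache, not from the server."""
        return self.cache.get(name)


server = FakeAPIServer()
server.apply("pod-a", {"phase": "Pending"})
informer = Informer(server)
server.apply("pod-a", {"phase": "Running"})
server.apply("pod-b", {"phase": "Pending"})
server.delete("pod-b")
informer.sync()
print(informer.get("pod-a"), informer.get("pod-b"), "LIST calls:", server.list_calls)
```

A client that polled with raw LIST requests would instead transfer the full object set on every poll, which is the load pattern the recommendation is meant to avoid.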
4. System and User Component Optimizations
Automatic scaling of control‑plane components.
Soft load‑balancing via the HTTP/2 GOAWAY feature, which occasionally asks clients to reconnect so that traffic spreads evenly across API server instances.
Expose monitoring and alerts for managed components.
Clean up unused resources (ConfigMaps, Secrets, PVCs) promptly.
5. Quality Inspection, Fault Drills, and Stress‑Testing
ACK provides automated cluster inspections, regular fault‑injection drills, and a robust stress‑testing framework.
6. Data‑Plane Optimizations
Node auto‑operations (auto‑upgrade, self‑healing, security patches).
Image acceleration using DADI‑based on‑demand loading and P2P distribution.
Detailed Component Optimizations
APIServer
Automatic elasticity based on request pressure and cluster size.
Soft load‑balancing (HTTP/2 GOAWAY) to avoid OOM on overloaded instances.
Observability dashboards that highlight non‑standard LIST requests.
Periodic cleanup of unused Kubernetes objects.
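The effect of GOAWAY‑based soft load‑balancing can be seen in a small simulation. Upstream kube‑apiserver exposes this behavior through the `--goaway-chance` flag; whether ACK configures it this way is not stated here, so the probability, client count, and server count below are illustrative assumptions. The sketch starts from the worst case, where every long‑lived connection is pinned to one instance.

```python
import random
from collections import Counter


def simulate_goaway(num_servers=3, num_clients=300, requests_per_client=1000,
                    goaway_chance=0.01, seed=42):
    """Return how many client connections end up on each server."""
    rng = random.Random(seed)
    # Worst case: every long-lived connection starts on server 0.
    connection = [0] * num_clients
    for _ in range(requests_per_client):
        for client in range(num_clients):
            # With a small probability the server answers with GOAWAY,
            # so the client must open a new connection.
            if rng.random() < goaway_chance:
                # The load balancer picks a backend for the new connection.
                connection[client] = rng.randrange(num_servers)
    return Counter(connection)


before = Counter([0] * 300)
after = simulate_goaway()
print("before:", dict(before))
print("after: ", dict(sorted(after.items())))
```

With the chance set to zero the connections never move, which is how a single instance can end up carrying most of the load and running out of memory.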
Etcd
Separate Data and Event etcd clusters to isolate workloads.
Apply VPA to etcd pods based on resource usage.
AutoDefrag operator monitors DB size and triggers defragmentation.
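The article does not describe how the AutoDefrag operator decides when to act, so the following is only a sketch of one plausible policy: defragment members whose backend file is large and mostly free space. The thresholds, the `EtcdMemberStatus` type, and the function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class EtcdMemberStatus:
    """Hypothetical snapshot of one member's storage metrics."""
    name: str
    db_size_bytes: int         # physical size of the backend file
    db_size_in_use_bytes: int  # bytes that actually hold live data


def fragmentation_ratio(status: EtcdMemberStatus) -> float:
    """Fraction of the backend file that is reclaimable free space."""
    if status.db_size_bytes == 0:
        return 0.0
    return 1 - status.db_size_in_use_bytes / status.db_size_bytes


def members_to_defrag(members, min_db_size_bytes=1 << 30, min_ratio=0.5):
    """Pick members worth defragmenting, most fragmented first.

    Assumed thresholds: at least 1 GiB on disk and at least 50% reclaimable.
    """
    candidates = [
        m for m in members
        if m.db_size_bytes >= min_db_size_bytes and fragmentation_ratio(m) >= min_ratio
    ]
    return sorted(candidates, key=fragmentation_ratio, reverse=True)


GIB = 1 << 30
members = [
    EtcdMemberStatus("etcd-0", 4 * GIB, 1 * GIB),    # 75% reclaimable
    EtcdMemberStatus("etcd-1", 4 * GIB, 3 * GIB),    # 25% reclaimable
    EtcdMemberStatus("etcd-2", GIB // 2, GIB // 10), # fragmented but small
]
for member in members_to_defrag(members):
    print(f"defrag {member.name}: {fragmentation_ratio(member):.0%} reclaimable")
```

Because defragmentation blocks reads and writes on the member while it runs, a real operator would process the returned members one at a time rather than in parallel.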
Scheduler / KCM / CCM
Increase QPS/Burst limits for large‑scale environments.
Issue LIST requests with resourceVersion=0 so that they are served from the API server's cache instead of reading from etcd.
Standardize protobuf serialization.
Prefer Informer mechanisms to reduce control‑plane load.
Throttle client request frequency.
Introduce a horizontally scalable relay (e.g., Poseidon) for high‑traffic DaemonSet or ECI pods.
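The QPS and Burst settings mentioned above follow token‑bucket semantics: Burst is the bucket capacity and QPS is the refill rate. The class below is a simplified sketch of that idea, not client‑go's limiter; the clock is passed in explicitly so the behavior is easy to reason about, and the values 5 and 10 are used only as an example.

```python
class TokenBucket:
    """Simplified token-bucket limiter: `burst` capacity, `qps` refill rate."""

    def __init__(self, qps: float, burst: int, now: float = 0.0):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def try_acquire(self, now: float) -> bool:
        """Return True if a request may be sent at time `now` (in seconds)."""
        elapsed = now - self.last
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(qps=5, burst=10)
sent_at_start = sum(bucket.try_acquire(0.0) for _ in range(20))
sent_one_second_later = sum(bucket.try_acquire(1.0) for _ in range(20))
print(sent_at_start, sent_one_second_later)
```

Raising both values lets a controller in a large cluster work through its queue faster, at the cost of more load on the API server, which is why the list pairs higher limits with client‑side throttling.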
ACK Stability Product Features and Best Practices
Prometheus for ACK Pro
Provides end‑to‑end observability from application to infrastructure, with interactive dashboards covering global resources, node health, core component metrics, cluster events, and eBPF‑based non‑intrusive application monitoring.
Container AIOps Suite
Leverages a knowledge base and large language models to deliver:
Full‑stack inspections that surface resource quotas, watermarks, and remediation steps.
Pre‑upgrade checks that detect deprecated APIs and insufficient resources.
Intelligent diagnostics for Pods, Nodes, Ingress, Services, network, and memory anomalies.
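A pre‑upgrade check for deprecated APIs can be sketched as a lookup against a table of removed API versions. The table below lists a few removals documented in the upstream Kubernetes deprecation guide; how the AIOps suite actually performs the check is not described here, so the function name and the manifest format are assumptions.

```python
# (apiVersion, kind) -> first Kubernetes version in which the API is no longer served.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): (1, 26),
}


def find_removed_apis(objects, target_version):
    """Report objects whose apiVersion is not served at `target_version`.

    `objects` is a list of manifest dictionaries; `target_version` is a
    (major, minor) tuple such as (1, 25).
    """
    findings = []
    for obj in objects:
        key = (obj.get("apiVersion"), obj.get("kind"))
        removed = REMOVED_IN.get(key)
        if removed is not None and target_version >= removed:
            name = obj.get("metadata", {}).get("name", "<unnamed>")
            findings.append(
                f"{key[1]}/{name} uses {key[0]}, removed in v{removed[0]}.{removed[1]}"
            )
    return findings


manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly"}},
    {"apiVersion": "batch/v1", "kind": "CronJob", "metadata": {"name": "hourly"}},
    {"apiVersion": "autoscaling/v2beta2", "kind": "HorizontalPodAutoscaler",
     "metadata": {"name": "web"}},
]
for finding in find_removed_apis(manifests, target_version=(1, 25)):
    print(finding)
```

Running the same scan against a later target version would also flag the HorizontalPodAutoscaler, which is why such checks are tied to the specific version being upgraded to.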
Managed Node Pool
Offers four key capabilities:
Automatic kubelet and node‑component upgrades.
Self‑healing of NotReady nodes.
Security patching and kernel hardening.
Rapid elasticity (e.g., 1,000 nodes added in about 55 seconds at the 90th percentile, P90).
Outlook
ACK’s stability platform will continue to evolve, delivering ongoing improvements in security, performance, cost efficiency, and overall reliability for cloud‑native workloads.
Alibaba Cloud Native