
How Alibaba Cloud ACK Guarantees Kubernetes Cluster Stability at Massive Scale

This article explains the stability challenges of large‑scale Kubernetes clusters, outlines ACK's high‑availability architecture and component optimizations, and details product features such as Prometheus, AIOps and managed node pools that together ensure reliable, performant cloud‑native workloads.

Alibaba Cloud Native

Introduction

In July 2023, Alibaba Cloud Container Service for Kubernetes (ACK) became one of the first products to pass the China Academy of Information and Communications Technology’s "Cloud Service Stable Operation Capability – Container Cluster Stability" assessment and earned the "Advanced" certification. As ACK adoption grows, ensuring cluster stability has become a fundamental requirement.

Typical Stability Pain Points

Control‑plane interruptions: During traffic spikes, control‑plane services may become intermittently or completely unavailable, especially when they cannot scale automatically.

Node NotReady cascades: A batch of nodes entering NotReady can trigger a snowball effect that causes repeated workload restarts.

Slow image pulls: During high‑traffic periods, pod image pulls can take minutes, which hurts service availability.

Operational complexity: Managing resources, tuning parameters, and upgrading control‑plane components require extensive analysis and automation.

Kubernetes Cluster Architecture

A cloud‑native Kubernetes cluster consists of a control plane, a data plane, and the underlying cloud resources that host them. The control plane includes kube‑apiserver, etcd, kube‑scheduler, kube‑controller‑manager, and cloud‑controller‑manager. The data plane comprises node management, pod lifecycle, services, and auxiliary components such as logging, monitoring, and security agents.

Any bottleneck in these components or in the cloud‑resource links can degrade overall cluster stability.

Stability Risks and Challenges in Large‑Scale Scenarios

Massive resource counts: Clusters may exceed 10,000 nodes or hold more than 100,000 Namespaces, ConfigMaps, and Secrets.

Control‑plane pressure: Components cache large portions of the cluster state; excessive LIST requests can exhaust memory and trigger OOM kills.

Data‑plane pressure and sync bottlenecks: Overloaded nodes cause the kubelet to slow down or report NotReady; network saturation hampers node‑to‑control‑plane synchronization.

Cloud‑resource limits: SLB connections, bandwidth, and ECS memory/CPU can become saturated, so these resources need to be deployed for high availability across zones.

ACK Stability Governance and Optimization Strategies

1. High‑Availability Architecture

Control‑plane components are deployed across multiple Availability Zones (AZs) with zone‑level redundancy. In a three‑AZ region, ACK’s SLA for the control plane is 99.95%; in single‑AZ regions, it is 99.5%.
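To make the two SLA figures concrete, the sketch below converts an availability percentage into a downtime budget. The 30‑day measurement window and the function name are assumptions for illustration; the article does not state how ACK measures its SLA period.

```python
def downtime_budget_minutes(sla_percent: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, that still meets the SLA over the window."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.95, 99.5):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} minutes per 30 days")
```

Under these assumptions, 99.95% allows roughly 21.6 minutes of downtime in 30 days, while 99.5% allows roughly 216 minutes, a tenfold difference.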

2. Kubernetes Component Optimizations

ACK optimizes core components (APIServer, etcd, KCM, Scheduler, Kubelet, Kube‑proxy) through parameter tuning, automatic scaling, and version upgrades.

3. Capacity Planning and Auto‑Elasticity

Use Informers (LIST + WATCH) instead of repeated raw LIST requests.

Prefer protobuf over JSON for non‑CRD resources to reduce traffic.

Apply VPA/HPA to control‑plane pods based on load.
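The Informer recommendation above can be illustrated with a small in‑memory model: one initial LIST seeds a local cache, and subsequent WATCH events keep it current, so reads never go back to the server. `FakeAPIServer` and `Informer` here are hypothetical stand‑ins written for this sketch, not the client‑go types.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAPIServer:
    """Hypothetical in-memory API server that counts LIST calls."""
    objects: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # (resource_version, type, name, obj)
    resource_version: int = 0
    list_calls: int = 0

    def apply(self, event_type, name, obj=None):
        """Record a change and bump the resource version."""
        self.resource_version += 1
        if event_type == "DELETED":
            self.objects.pop(name, None)
        else:
            self.objects[name] = obj
        self.events.append((self.resource_version, event_type, name, obj))

    def list_objects(self):
        """Full LIST: returns every object plus the current resource version."""
        self.list_calls += 1
        return dict(self.objects), self.resource_version

    def watch_events(self, since):
        """WATCH: returns only the changes newer than `since`."""
        return [e for e in self.events if e[0] > since]


class Informer:
    """LIST once, then keep a local cache current from WATCH events."""

    def __init__(self, server):
        self.server = server
        self.cache, self.resource_version = server.list_objects()

    def sync(self):
        for rv, event_type, name, obj in self.server.watch_events(self.resource_version):
            if event_type == "DELETED":
                self.cache.pop(name, None)
            else:
                self.cache[name] = obj
            self.resource_version = rv

    def get(self, name):
        return self.cache.get(name)  # served from the local cache, no API call


server = FakeAPIServer()
server.apply("ADDED", "pod-a", {"phase": "Pending"})
informer = Informer(server)
server.apply("MODIFIED", "pod-a", {"phase": "Running"})
server.apply("ADDED", "pod-b", {"phase": "Pending"})
server.apply("DELETED", "pod-b")
informer.sync()
print(informer.get("pod-a"), informer.get("pod-b"), server.list_calls)
```

However many reads the client performs, the server in this model sees a single LIST; only incremental changes cross the wire afterwards.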

4. System and User Component Optimizations

Automatic scaling of control‑plane components.

Soft load balancing via the HTTP/2 GOAWAY mechanism, which prompts clients to reconnect so that traffic spreads evenly across API server instances.

Expose monitoring and alerts for managed components.

Clean up unused resources (ConfigMaps, Secrets, PVCs) promptly.
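As a rough illustration of automatic scaling, the sketch below applies the replica formula documented for the upstream Kubernetes Horizontal Pod Autoscaler, including its default 10% tolerance band. Whether ACK's managed control plane uses this formula or these values is not stated in the article, so treat the numbers as assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric, tolerance=0.1):
    """HPA-style rule: scale replicas in proportion to metric / target."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target: do nothing
    return math.ceil(current_replicas * ratio)


# Three replicas at 90% utilisation against a 60% target.
print(desired_replicas(3, 90, 60))
```

The tolerance band is what prevents replica counts from flapping when load hovers near the target.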

5. Quality Inspection, Fault Drills, and Stress‑Testing

ACK provides automated cluster inspections, regular fault‑injection drills, and a robust stress‑testing framework.

6. Data‑Plane Optimizations

Node auto‑operations (auto‑upgrade, self‑healing, security patches).

Image acceleration using DADI‑based on‑demand loading and P2P distribution.

Detailed Component Optimizations

APIServer

Automatic elasticity based on request pressure and cluster size.

Soft load‑balancing (Goaway) to avoid OOM on overloaded instances.

Observability dashboards that highlight non‑standard LIST requests.

Periodic cleanup of unused Kubernetes objects.
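The effect of GOAWAY‑based soft load balancing can be seen in a small simulation. Upstream kube‑apiserver exposes the behaviour through its `--goaway-chance` flag; the probability, instance count, and connection count below are illustrative assumptions, not ACK settings. All long‑lived connections start on one instance, and each request has a small chance of receiving GOAWAY, after which the client reconnects to a randomly chosen instance.

```python
import random


def simulate_goaway(instances=3, connections=300, requests_per_conn=2000,
                    goaway_chance=0.002, seed=42):
    """Return the number of connections on each instance after the run."""
    rng = random.Random(seed)
    placement = [0] * connections  # worst case: everything pinned to instance 0
    for _ in range(requests_per_conn):
        for conn in range(connections):
            if rng.random() < goaway_chance:
                # GOAWAY: the client reconnects and the load balancer
                # picks an instance at random.
                placement[conn] = rng.randrange(instances)
    return [placement.count(i) for i in range(instances)]


print("without GOAWAY:", simulate_goaway(goaway_chance=0.0))
print("with GOAWAY:   ", simulate_goaway())
```

Without GOAWAY, the connections stay pinned to the first instance indefinitely; with it, they drift toward an even split without any hard disconnects.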

Etcd

Separate Data and Event etcd clusters to isolate workloads.

Apply VPA to etcd pods based on resource usage.

AutoDefrag operator monitors DB size and triggers defragmentation.
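A defragmentation trigger of this kind can be sketched as a simple decision function. The inputs correspond to the etcd metrics `etcd_mvcc_db_total_size_in_bytes` and `etcd_mvcc_db_total_size_in_use_in_bytes`; the thresholds and the function name are hypothetical, since the article does not describe the AutoDefrag operator's actual policy.

```python
MIB = 1024 * 1024


def should_defrag(db_total_bytes, db_in_use_bytes,
                  min_reclaim_ratio=0.3, min_total_bytes=512 * MIB):
    """Defragment only when the DB is large and mostly reclaimable space."""
    if db_total_bytes < min_total_bytes:
        return False  # small databases are not worth the pause
    reclaimable = db_total_bytes - db_in_use_bytes
    return reclaimable / db_total_bytes >= min_reclaim_ratio


print(should_defrag(4096 * MIB, 2048 * MIB))  # half the file is free space
print(should_defrag(4096 * MIB, 4000 * MIB))  # almost nothing to reclaim
```

Because defragmentation blocks the member while it runs, an operator would typically also process members one at a time; that sequencing is left out of this sketch.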

Scheduler / KCM / CCM

Increase QPS/Burst limits for large‑scale environments.

Send LIST requests with resourceVersion=0 so that the API server answers from its watch cache instead of reading from etcd.

Standardize protobuf serialization.

Prefer Informer mechanisms to reduce control‑plane load.

Throttle client request frequency.

Introduce a horizontally scalable relay (e.g., Poseidon) for high‑traffic DaemonSet or ECI pods.
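The QPS/Burst and throttling items above both rest on token‑bucket rate limiting, which is how client‑go enforces its QPS and Burst settings. The class below is a minimal sketch of that idea, not the client‑go implementation; it takes the current time as an argument so that the behaviour is deterministic.

```python
class TokenBucket:
    """Allow short bursts up to `burst`, refilling at `qps` tokens per second."""

    def __init__(self, qps, burst):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0

    def allow(self, now):
        """Return True if a request at time `now` (seconds) may proceed."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(qps=5, burst=10)
at_start = sum(bucket.allow(0.0) for _ in range(15))
one_second_later = sum(bucket.allow(1.0) for _ in range(15))
print(at_start, one_second_later)
```

Raising Burst lets a component absorb a spike of requests, while QPS caps the sustained rate; in a large cluster both usually need to grow together.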

ACK Stability Product Features and Best Practices

Prometheus for ACK Pro

Provides end‑to‑end observability from application to infrastructure, with interactive dashboards covering global resources, node health, core component metrics, cluster events, and eBPF‑based non‑intrusive application monitoring.

Container AIOps Suite

Leverages a knowledge base and large language models to deliver:

Full‑stack inspections that surface resource quotas, watermarks, and remediation steps.

Pre‑upgrade checks that detect deprecated APIs and insufficient resources.

Intelligent diagnostics for Pods, Nodes, Ingress, Services, network, and memory anomalies.
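A pre‑upgrade check for deprecated APIs can be approximated by comparing each manifest's apiVersion and kind against a table of removals. The table below lists a few removals from upstream Kubernetes releases and is deliberately incomplete; the function names and output format are hypothetical, and this is not how the AIOps suite is implemented.

```python
# (apiVersion, kind) -> Kubernetes version in which the API is no longer served.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("networking.k8s.io/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
}


def parse_version(version):
    """Turn "1.25" into (1, 25) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_removed_apis(manifests, target_version):
    """List manifests whose API is gone in the target version."""
    target = parse_version(target_version)
    findings = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        if removed_in is not None and target >= parse_version(removed_in):
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            findings.append(f"{key[1]}/{name}: {key[0]} is removed in v{removed_in}")
    return findings


manifests = [
    {"apiVersion": "batch/v1beta1", "kind": "CronJob", "metadata": {"name": "nightly"}},
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
]
for finding in find_removed_apis(manifests, "1.26"):
    print(finding)
```

Running the same check against a target of 1.24 reports nothing, because `batch/v1beta1` CronJob is still served in that release.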

Managed Node Pool

Offers four key capabilities:

Automatic kubelet and node‑component upgrades.

Self‑healing of NotReady nodes.

Security patching and kernel hardening.

Rapid elasticity (e.g., scaling out 1,000 nodes completes in about 55 seconds at P90).
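Self‑healing has to avoid amplifying the NotReady cascades described earlier. One way to reason about that is a selection rule that repairs only nodes that have been unhealthy past a grace period, and backs off entirely when too much of the pool is unhealthy at once. The sketch below assumes a simplified node record and illustrative thresholds; it is not ACK's repair logic.

```python
def pick_nodes_to_repair(nodes, now, grace_seconds=300, max_unhealthy_ratio=0.3):
    """Return names of nodes that look safe to repair automatically.

    nodes: list of dicts with "name", "ready" ("True", "False" or "Unknown",
    as in the node Ready condition) and "since" (epoch seconds of the last
    transition). All thresholds are illustrative assumptions.
    """
    unhealthy = [
        n["name"] for n in nodes
        if n["ready"] != "True" and now - n["since"] >= grace_seconds
    ]
    # If a large share of the pool is unhealthy at once, the cause is probably
    # systemic (network, control plane); mass restarts could make it worse.
    if nodes and len(unhealthy) / len(nodes) > max_unhealthy_ratio:
        return []
    return unhealthy


pool = [
    {"name": "node-1", "ready": "True", "since": 0},
    {"name": "node-2", "ready": "False", "since": 100},
    {"name": "node-3", "ready": "Unknown", "since": 950},
    {"name": "node-4", "ready": "True", "since": 0},
]
print(pick_nodes_to_repair(pool, now=1000))
```

Here only node-2 qualifies: node-3 has been unhealthy for just 50 seconds, which is still inside the grace period.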

Outlook

ACK’s stability platform will continue to evolve, delivering ongoing improvements in security, performance, cost efficiency, and overall reliability for cloud‑native workloads.

