
Best Practices for High Availability and Stability in Alibaba Cloud Container Service for Kubernetes (ACK)

This article presents a comprehensive overview of high‑availability design patterns and best‑practice recommendations for Alibaba Cloud Container Service for Kubernetes (ACK), covering common error scenarios, single‑cluster and multi‑cluster architectures, workload resilience, monitoring, and real‑world case studies.

Alibaba Cloud Infrastructure

The talk introduces the importance of high‑availability (HA) architectures in cloud‑native environments and explains how Kubernetes serves as the foundation for building resilient services. Using Alibaba Cloud Container Service for Kubernetes (ACK) as an example, the speaker outlines common HA pitfalls such as single‑zone node deployment, missing pod anti‑affinity rules, insufficient health‑monitoring alerts, and the complexity of managing multiple clusters.

ACK’s single‑cluster HA design is described in detail. The control plane runs as pods in a meta‑cluster spread across multiple availability zones (AZs), which gives zone‑level redundancy, while the data‑plane resources (ECS, SLB, ECI) reside in the user’s VPC. Control‑plane placement follows the AZs that ECS offers in each region, and in regions with three AZs the control plane carries an SLA of up to 99.95%.
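To put the 99.95% figure in operational terms, the short sketch below converts an availability percentage into a downtime budget. The 30‑day measurement period and the function name `allowed_downtime_minutes` are assumptions made for illustration; the actual SLA terms define how availability is measured.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLA.

    period_days is an assumed measurement window, not a value from the SLA text.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.95)
    print(f"99.95% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under that assumption, 99.95% leaves roughly 21.6 minutes of control‑plane downtime per 30 days, which is a useful baseline when setting availability targets for the workloads that run on top of it.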

Practical workload‑level HA techniques are presented, including topology spread constraints, pod anti‑affinity, Pod Disruption Budgets, and health probes (liveness, readiness, startup). The guide also covers node‑, deployment‑set‑, and AZ‑level pod distribution strategies to ensure fault isolation.
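The techniques listed above can be combined in a single workload definition. The following is a minimal sketch, not the configuration shown in the talk: it builds a Deployment and a PodDisruptionBudget as plain Python dictionaries using standard Kubernetes fields and prints them as JSON, which `kubectl apply -f -` accepts. The app name `web`, the image `nginx:1.25`, the probe path, and all thresholds are hypothetical placeholders.

```python
import json

# Hypothetical label used to select the pods of this example workload.
APP_LABELS = {"app": "web"}


def http_probe(path="/", port=80, period=10, failures=3):
    """Build an HTTP probe; path, port, and thresholds are illustrative."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period,
        "failureThreshold": failures,
    }


def build_deployment(name="web", image="nginx:1.25", replicas=3):
    """Deployment spread evenly across zones and, preferably, across nodes."""
    pod_spec = {
        # Keep the per-zone pod count within a skew of 1.
        "topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": APP_LABELS},
        }],
        # Prefer not to co-locate two replicas on the same node.
        "affinity": {"podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": 100,
                "podAffinityTerm": {
                    "topologyKey": "kubernetes.io/hostname",
                    "labelSelector": {"matchLabels": APP_LABELS},
                },
            }],
        }},
        "containers": [{
            "name": name,
            "image": image,
            "ports": [{"containerPort": 80}],
            # Startup probe tolerates a slow boot before the others take over.
            "startupProbe": http_probe(period=5, failures=30),
            "readinessProbe": http_probe(),
            "livenessProbe": http_probe(),
        }],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": APP_LABELS},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": APP_LABELS},
            "template": {"metadata": {"labels": APP_LABELS}, "spec": pod_spec},
        },
    }


def build_pdb(name="web-pdb", min_available=2):
    """PodDisruptionBudget that keeps min_available pods during voluntary evictions."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": APP_LABELS},
        },
    }


if __name__ == "__main__":
    manifest = {"apiVersion": "v1", "kind": "List",
                "items": [build_deployment(), build_pdb()]}
    print(json.dumps(manifest, indent=2))
```

The zone constraint is hard (`DoNotSchedule`) while the node‑level anti‑affinity is only preferred, so a shortage of nodes in one zone does not block scheduling outright. Whether to make either rule strict is a trade‑off between fault isolation and schedulability that depends on cluster size.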

For enterprise users, the article explains two HA configurations for the Container Image Service: AZ‑level disaster recovery using same‑city redundant OSS buckets, and cross‑region replication across image service instances in multiple regions. It also covers the steps for custom domain setup and image sync rules.

Cloud‑resource HA is illustrated through Kubernetes Service annotations that map load balancers (CLB/NLB/ALB) to specific AZs, ensuring consistent network performance. Monitoring recommendations include using kube‑state‑metrics and Prometheus alerts for unavailable replica counts and unhealthy node percentages, with example alert rules.
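As an illustration of the two alert conditions mentioned above, the sketch below generates a Prometheus rule group from Python. These are not the rules from the talk: the alert names, the 10% threshold, the `5m` hold time, and the severity labels are assumptions, while the metric names `kube_deployment_status_replicas_unavailable` and `kube_node_status_condition` are standard kube‑state‑metrics series.

```python
import json

# Series that is 1 for each node whose Ready condition is true, else 0.
READY = 'kube_node_status_condition{condition="Ready",status="true"}'


def build_alert_rules(unhealthy_node_pct=10.0, for_duration="5m"):
    """Return a Prometheus rule group with two availability alerts.

    Thresholds and names are illustrative and should be tuned per cluster.
    """
    unhealthy_expr = (
        f"100 * (1 - sum({READY}) / count({READY})) > {unhealthy_node_pct:g}"
    )
    return {"groups": [{
        "name": "ha-availability",
        "rules": [
            {
                "alert": "DeploymentReplicasUnavailable",
                "expr": "kube_deployment_status_replicas_unavailable > 0",
                "for": for_duration,
                "labels": {"severity": "warning"},
                "annotations": {
                    "summary": "Deployment {{ $labels.namespace }}/"
                               "{{ $labels.deployment }} has unavailable replicas",
                },
            },
            {
                "alert": "UnhealthyNodePercentageHigh",
                "expr": unhealthy_expr,
                "for": for_duration,
                "labels": {"severity": "critical"},
                "annotations": {
                    "summary": "Share of nodes that are not Ready exceeds threshold",
                },
            },
        ],
    }]}


if __name__ == "__main__":
    # JSON is also valid YAML, so the output can generally be saved as a
    # rule file; validating it with `promtool check rules` is advisable.
    print(json.dumps(build_alert_rules(), indent=2))
```

The node expression reports the percentage of nodes whose Ready condition is not true, so a single failed node in a small cluster can already cross a low threshold. Scaling the threshold with cluster size avoids noisy alerts.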

Multi‑cluster HA is achieved via ACK One Fleet, which provides a unified control plane for application distribution, traffic control, security policies, global monitoring, and cluster management across public‑cloud and on‑premises data center (IDC) environments. The Fleet approach enables cross‑AZ and cross‑region disaster recovery, as well as GitOps‑driven multi‑cluster deployments.

A real‑world case study of Xiaohongshu demonstrates how ACK Pro clusters were configured with zone, deployment‑set, and node‑level HA to meet stringent stability requirements.

The conclusion emphasizes that ACK’s HA architecture and best‑practice guidance, validated by thousands of production clusters, form a solid foundation for reliable cloud‑native services, and the Alibaba Cloud container team continues to improve security, stability, performance, and cost efficiency.
