Best Practices for High Availability and Stability in Alibaba Cloud Container Service for Kubernetes (ACK)

This article presents a comprehensive overview of high‑availability design patterns and best‑practice recommendations for Alibaba Cloud Container Service for Kubernetes (ACK), covering common error scenarios, single‑cluster and multi‑cluster architectures, workload resilience, monitoring, and real‑world case studies.

Alibaba Cloud Infrastructure

Sep 30, 2024

The talk introduces the importance of high‑availability (HA) architectures in cloud‑native environments and explains how Kubernetes serves as the foundation for building resilient services. Using Alibaba Cloud Container Service for Kubernetes (ACK) as an example, the speaker outlines common HA pitfalls such as single‑zone node deployment, missing pod anti‑affinity rules, insufficient health‑monitoring alerts, and the complexity of managing multiple clusters.

ACK’s single‑cluster HA design is described in detail: the control plane runs as pods in a meta‑cluster spread across multiple availability zones (AZs), achieving zone‑level redundancy, while the data plane resources (ECS, SLB, ECI) reside in the user VPC. Control‑plane components align with Alibaba Cloud ECS AZ capabilities, delivering up to 99.95% SLA in three‑AZ regions.
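To put the 99.95% figure in perspective, the short sketch below converts an availability percentage into a downtime budget. The 30-day measurement window is an assumption made for illustration; the actual SLA terms define their own measurement period and exclusions.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability SLA.

    `days` is an assumed measurement window, not a value taken from the SLA.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.95)
    print(f"{budget:.1f} minutes of downtime per 30-day window")
```

Under that assumption, 99.95% availability leaves roughly 21.6 minutes of control-plane downtime per 30-day window.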

Practical workload‑level HA techniques are presented, including topology spread constraints, pod anti‑affinity, Pod Disruption Budgets, and health probes (liveness, readiness, startup). The guide also covers node‑, deployment‑set‑, and AZ‑level pod distribution strategies to ensure fault isolation.
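The techniques above can be combined in a single workload definition. The following is a minimal sketch, not the configuration shown in the talk: it builds a Deployment and a Pod Disruption Budget as Python dictionaries and prints them as JSON, which `kubectl apply -f` accepts. The application name, image, port, `/healthz` path, replica count, and all thresholds are hypothetical placeholders.

```python
import json

APP_LABELS = {"app": "web"}  # hypothetical label used to select the pods


def build_deployment(name="web", image="registry.example.com/web:1.0", replicas=6):
    """Deployment with zone spreading, node anti-affinity, and three probes."""
    http_check = {"httpGet": {"path": "/healthz", "port": 8080}}  # assumed endpoint
    pod_spec = {
        # Keep the per-zone pod count within 1 of each other.
        "topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": APP_LABELS},
        }],
        # Prefer placing replicas on different nodes.
        "affinity": {"podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": 100,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": APP_LABELS},
                    "topologyKey": "kubernetes.io/hostname",
                },
            }],
        }},
        "containers": [{
            "name": name,
            "image": image,
            "ports": [{"containerPort": 8080}],
            # startupProbe gives slow starters up to 30 * 5s before liveness applies.
            "startupProbe": {**http_check, "periodSeconds": 5, "failureThreshold": 30},
            "readinessProbe": {**http_check, "periodSeconds": 5},
            "livenessProbe": {**http_check, "periodSeconds": 10, "failureThreshold": 3},
        }],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": APP_LABELS},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": APP_LABELS},
            "template": {"metadata": {"labels": APP_LABELS}, "spec": pod_spec},
        },
    }


def build_pdb(name="web-pdb", min_available="50%"):
    """PodDisruptionBudget limiting voluntary evictions of the same pods."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {"minAvailable": min_available, "selector": {"matchLabels": APP_LABELS}},
    }


if __name__ == "__main__":
    manifest = {"apiVersion": "v1", "kind": "List",
                "items": [build_deployment(), build_pdb()]}
    print(json.dumps(manifest, indent=2))
```

The sketch uses a hard constraint (`DoNotSchedule`) for zone spreading and a soft preference for node anti-affinity, so a zone imbalance blocks scheduling while a shortage of nodes does not. Whether that trade-off fits depends on the workload.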

For enterprise users, the article explains two HA configurations for the Container Image Service: AZ‑level disaster recovery, which uses same‑city redundant OSS buckets, and cross‑region replication, which uses multi‑region image service instances. It also covers the steps for setting up a custom domain and image sync rules.

Cloud‑resource HA is illustrated through Kubernetes Service annotations that map load balancers (CLB/NLB/ALB) to specific AZs, ensuring consistent network performance. Monitoring recommendations include using kube‑state‑metrics and Prometheus alerts for unavailable replica counts and unhealthy node percentages, with example alert rules.
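The summary does not reproduce the talk's alert rules, so the following is only a sketch of what such rules might look like. It builds a Prometheus rule group as a Python dictionary from kube‑state‑metrics series (`kube_deployment_status_replicas_unavailable`, `kube_node_status_condition`, `kube_node_info`). The alert names, the 10% threshold, and the `for` durations are assumptions chosen for illustration.

```python
import json

# Assumed threshold: alert when more than 10% of nodes are not Ready.
UNHEALTHY_NODE_PERCENT_THRESHOLD = 10


def unhealthy_node_percent(ready_nodes: int, total_nodes: int) -> float:
    """Plain-Python mirror of the PromQL expression below, for sanity checks."""
    if total_nodes == 0:
        return 0.0
    return 100.0 * (1 - ready_nodes / total_nodes)


def build_rule_group() -> dict:
    """Return a Prometheus rule group with two illustrative HA alerts."""
    node_expr = (
        '100 * (1 - sum(kube_node_status_condition{condition="Ready",status="true"})'
        " / count(kube_node_info)) > %d" % UNHEALTHY_NODE_PERCENT_THRESHOLD
    )
    return {
        "name": "ha-availability",  # hypothetical group name
        "rules": [
            {
                "alert": "DeploymentReplicasUnavailable",
                "expr": "kube_deployment_status_replicas_unavailable > 0",
                "for": "5m",
                "labels": {"severity": "warning"},
            },
            {
                "alert": "UnhealthyNodePercentageHigh",
                "expr": node_expr,
                "for": "5m",
                "labels": {"severity": "critical"},
            },
        ],
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be saved as a Prometheus rule file.
    print(json.dumps({"groups": [build_rule_group()]}, indent=2))
```

For example, `unhealthy_node_percent(9, 10)` evaluates to about 10, which sits at the assumed threshold and would not fire the strict `>` comparison.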

Multi‑cluster HA is achieved via ACK One Fleet, which provides a unified control plane for application distribution, traffic control, security policies, global monitoring, and cluster management across public‑cloud and IDC environments. The One Fleet approach enables cross‑AZ and cross‑region disaster recovery, as well as GitOps‑driven multi‑cluster deployments.

A real‑world case study of Xiaohongshu demonstrates how ACK Pro clusters were configured with zone, deployment‑set, and node‑level HA to meet stringent stability requirements.

The conclusion emphasizes that ACK’s HA architecture and best‑practice guidance, validated by thousands of production clusters, form a solid foundation for reliable cloud‑native services, and the Alibaba Cloud container team continues to improve security, stability, performance, and cost efficiency.
