Designing Highly Available Cloud‑Native Applications on Alibaba Cloud ACK

This article explains how to build robust, highly available cloud‑native applications on Alibaba Cloud Container Service for Kubernetes (ACK) by covering architecture principles, multi‑zone cluster design, Kubernetes HA features such as topology spread constraints and pod anti‑affinity, storage strategies, load‑balancing, virtual nodes, health probes, monitoring, and multi‑cluster deployment patterns.

Alibaba Cloud Native

Introduction

As cloud-native technologies spread, high availability (HA) has become critical to the reliability, stability, and security of enterprise services. Alibaba Cloud Container Service for Kubernetes (ACK) provides the foundation for building HA architectures.

Application HA Design Principles

An HA architecture for cloud-native applications should address the following aspects:

Cluster design : Deploy control‑plane and data‑plane components across multiple nodes and zones. ACK Pro offers multi‑zone control‑plane HA with SLA 99.95% (≥3 zones) or 99.50% (≤2 zones).

Container design : Use Deployments, StatefulSets, or OpenKruise CRDs to run multiple replicas and configure auto‑scaling policies.

Resource scheduling : Leverage Kubernetes scheduler, node/zone affinity, and topology spread constraints to distribute Pods across nodes, zones, and topology domains.

Storage design : Attach persistent volumes (PV/PVC) to avoid data loss; use StatefulSets for stateful workloads.

Failure recovery : Enable liveness probes and automatic restart/re‑scheduling.

Network design : Expose services via Service and Ingress.

Monitoring & alerting : Use Prometheus, Thanos, Alertmanager, etc., to detect and react to failures.

Full‑stack HA : Ensure every component—from infrastructure to application code—has redundancy and fault‑tolerance.
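To make the SLA figures in the cluster design item concrete, the sketch below converts an availability percentage into a downtime budget. It assumes a 30-day measurement window, which is an illustrative choice and not a statement of how ACK measures its SLA; the function name is hypothetical.

```python
from fractions import Fraction

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumed window: 43,200 minutes


def downtime_budget_minutes(sla_percent: str) -> float:
    """Return the downtime allowed per 30-day window for an SLA percentage."""
    # Fractions avoid floating-point drift such as 21.599999999999998.
    unavailable = (Fraction(100) - Fraction(sla_percent)) / 100
    return float(unavailable * MINUTES_PER_30_DAY_MONTH)


for sla in ("99.95", "99.50"):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.1f} minutes per 30 days")
```

Under that assumption, 99.95% allows about 21.6 minutes of control-plane downtime per 30 days, while 99.50% allows 216 minutes (3.6 hours).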

Kubernetes HA Techniques and ACK Implementations

Multi‑zone Control‑plane and Data‑plane

Kubernetes clusters achieve HA by deploying control‑plane components (etcd, kube‑apiserver, controller‑manager, scheduler) and data‑plane nodes in different availability zones (AZs). ACK automates this deployment and provides SLA guarantees.

Topology Spread Constraints

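Topology spread constraints distribute Pods evenly across topology domains such as zones or nodes. Key fields:

maxSkew : Maximum allowed difference in matching Pod count between the most and least populated domains.

topologyKey : Node label that identifies the domain (e.g., topology.kubernetes.io/zone).

whenUnsatisfiable : Action when the constraint cannot be met: DoNotSchedule (hard) or ScheduleAnyway (soft).

labelSelector : Selects the Pods that are counted in each domain; it normally matches the workload's own Pod labels.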

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-run-per-zone
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-run-per-zone
  template:
    metadata:
      labels:
        app: app-run-per-zone
    spec:
      containers:
        - name: app-container
          image: app-image
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: "topology.kubernetes.io/zone"
          whenUnsatisfiable: DoNotSchedule
          # Count only this Deployment's Pods when computing skew;
          # without a labelSelector the constraint matches no Pods.
          labelSelector:
            matchLabels:
              app: app-run-per-zone
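The effect of maxSkew can be illustrated with a small sketch. It is a simplified model of the skew check, not the scheduler's implementation: it ignores minDomains, node affinity and taint policies, and resource fit. The function name and the zone counts are hypothetical.

```python
from __future__ import annotations


def zones_allowed(matching_pods_per_zone: dict[str, int], max_skew: int) -> list[str]:
    """Return the zones where one more Pod can land without exceeding max_skew."""
    global_minimum = min(matching_pods_per_zone.values())
    allowed = []
    for zone, count in sorted(matching_pods_per_zone.items()):
        # Skew if the new Pod lands here: count after placement minus the global minimum.
        skew_after_placement = (count + 1) - global_minimum
        if skew_after_placement <= max_skew:
            allowed.append(zone)
    return allowed


# Two replicas already run in zones a and b; the third must go to zone c.
print(zones_allowed({"cn-hangzhou-a": 1, "cn-hangzhou-b": 1, "cn-hangzhou-c": 0}, max_skew=1))
# With one replica per zone, any zone can take the next Pod.
print(zones_allowed({"cn-hangzhou-a": 1, "cn-hangzhou-b": 1, "cn-hangzhou-c": 1}, max_skew=1))
```

With DoNotSchedule, a Pod whose only feasible zones fall outside this list stays Pending; with ScheduleAnyway the scheduler would only deprioritize those zones.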

Pod Anti‑Affinity

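Pod anti-affinity keeps Pods that match a label selector out of the same topology domain. With topologyKey set to kubernetes.io/hostname, no two matching Pods share a node, which improves fault isolation. Two policies are available:

requiredDuringSchedulingIgnoredDuringExecution : Hard rule; a Pod stays Pending if the rule cannot be satisfied.

preferredDuringSchedulingIgnoredDuringExecution : Soft preference; the scheduler tries to honor it but still places the Pod if it cannot.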

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-run-per-node
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-run-per-node
  template:
    metadata:
      labels:
        app: app-run-per-node
    spec:
      containers:
        - name: app-container
          image: app-image
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - app-run-per-node
              topologyKey: "kubernetes.io/hostname"

Multi‑Replica Strategies

Applications can adopt:

Active‑active (multi‑active) : All replicas receive traffic; scale via HPA.

Active‑standby (master‑slave) : One primary replica handles traffic; others standby.
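For the active-active case, the scaling decision made by the HorizontalPodAutoscaler (HPA) follows the formula published in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio is within a tolerance (0.1 by default). The sketch below reproduces that arithmetic only; the function name and the min/max bounds are illustrative assumptions, and real HPA behavior also involves stabilization windows and readiness handling that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas at 90% average CPU with a 60% target scale out to 5.
print(desired_replicas(3, current_metric=90, target_metric=60))
```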

Pod Disruption Budget (PDB)

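A PodDisruptionBudget (PDB) limits how many replicas can be taken down at once by voluntary disruptions such as node drains and upgrades, so that a minimum number stays available during maintenance. It does not protect against involuntary failures such as node crashes.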

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-pdb
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-with-pdb
  template:
    metadata:
      labels:
        app: app-with-pdb
    spec:
      containers:
        - name: app-container
          image: app-container-image
---
# policy/v1beta1 was removed in Kubernetes 1.25; use policy/v1.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-for-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: app-with-pdb
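The budget in this manifest can be reasoned about with simple arithmetic. The sketch below mirrors the idea behind the PDB's disruptionsAllowed status field for an integer minAvailable; it is a simplification (percentages and maxUnavailable are not handled) and the function name is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Return how many Pods may be evicted right now without breaking the budget."""
    return max(0, healthy_pods - min_available)


# 3 healthy replicas, minAvailable: 2 -> one Pod may be evicted at a time.
print(allowed_disruptions(healthy_pods=3, min_available=2))
# If one replica is already down, further evictions are refused.
print(allowed_disruptions(healthy_pods=2, min_available=2))
```

In practice this means a node drain evicts the Deployment's Pods one at a time and waits for a replacement to become ready before evicting the next.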

Health Probes & Restart Policies

Kubernetes supports three probe types to monitor container health:

Liveness probe : Restarts a container when it fails.

Readiness probe : Removes a container from service traffic until it passes.

Startup probe : Delays other probes until the container has started.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-probe
spec:
  containers:
    - name: app-container
      image: app-image
      livenessProbe:
        httpGet:
          path: /health
          port: 80
        initialDelaySeconds: 10
        periodSeconds: 5
      readinessProbe:
        tcpSocket:
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 10
      startupProbe:
        exec:
          command:
            - cat
            - /tmp/ready
        initialDelaySeconds: 20
        periodSeconds: 15
  restartPolicy: Always
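The startup probe above does not set failureThreshold, so the Kubernetes default of 3 applies. A rough upper bound on how long the container may take to start before it is killed and restarted is initialDelaySeconds + failureThreshold × periodSeconds. The sketch below computes that bound; it ignores probe timeouts and kubelet scheduling jitter, and the function name is hypothetical.

```python
def startup_budget_seconds(
    initial_delay_seconds: int = 0,  # Kubernetes default
    period_seconds: int = 10,        # Kubernetes default
    failure_threshold: int = 3,      # Kubernetes default
) -> int:
    """Approximate the time a container has to pass its startup probe."""
    return initial_delay_seconds + failure_threshold * period_seconds


# Values from the manifest above: initialDelaySeconds 20, periodSeconds 15.
print(startup_budget_seconds(initial_delay_seconds=20, period_seconds=15))
```

If the application can take longer than this to create /tmp/ready, raising failureThreshold on the startup probe is usually safer than lengthening the liveness probe's delays, because liveness and readiness checks only begin once the startup probe has succeeded.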

Storage & Data Decoupling

Use PersistentVolume (PV) and PersistentVolumeClaim (PVC) to abstract storage. Choose appropriate storage class, capacity, and access mode based on workload requirements. Example of a topology‑aware cloud disk storage class and PVC:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-topology-essd
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
  fstype: ext4
  zoneId: "cn-hangzhou-a,cn-hangzhou-b,cn-hangzhou-c"
  performanceLevel: PL1
# volumeBindingMode is a top-level StorageClass field, not a parameter.
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: topology-disk-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: alicloud-disk-topology-essd

Load‑Balancing

Specify master and slave zones for SLB/CLB via Service annotations to keep traffic within the same zone as the node pool.

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-master-zoneid: "cn-hangzhou-a"
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-slave-zoneid: "cn-hangzhou-b"
  name: nginx
spec:
  ports:
    - port: 80
      protocol: TCP
      targetPort: 80
  selector:
    run: nginx
  type: LoadBalancer

Virtual Nodes (Serverless)

ACK Serverless provides virtual nodes backed by Elastic Container Instance (ECI). Multi‑zone virtual nodes are configured via the eci-profile ConfigMap, allowing pod requests to be spread across vSwitches in different AZs.

# Open the ConfigMap for editing with: kubectl -n kube-system edit cm eci-profile
apiVersion: v1
kind: ConfigMap
metadata:
  name: eci-profile
  namespace: kube-system
data:
  vswitchIds: vsw-xxx,vsw-yyy,vsw-zzz
  regionId: cn-hangzhou
  securitygroupId: sg-xxx
  vpcId: vpc-xxx

Monitoring & Alerting

Use kube‑state‑metrics, Prometheus, and Alertmanager to monitor replica health, node health per zone, and other HA metrics. Example alerts for unavailable replicas and low healthy‑node percentage:

# Alert for Deployment with unavailable replicas
- alert: SystemPodReplicasUnavailable
  expr: kube_deployment_status_replicas_unavailable{namespace=~"kube-system|monitoring"} > 0
  for: 1m
  labels:
    severity: L1
  annotations:
    summary: "Deployment {{ $labels.deployment }} has unavailable replicas"

# Alert when healthy node percentage in a zone is at or below 80%
- alert: HealthyNodePercentagePerZoneLessThan80
  expr: node_collector_zone_health <= 80
  for: 5m
  labels:
    severity: L1
  annotations:
    summary: "Zone {{ $labels.zone }} healthy node percentage <= 80%"

Single‑Cluster and Multi‑Cluster HA

Within a single ACK cluster, the techniques above provide HA. For higher resilience, deploy multiple clusters across zones or regions, expose services via SLB, and use DNS or Global Traffic Manager (GTM) for traffic routing. ACK One can centrally manage multi‑region clusters, offering unified observability, security, and deployment pipelines.
