Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM

This guide explains how to design for zone-level disaster recovery on Alibaba Cloud: deploy workloads across multiple AZs in an ACK cluster, use Service Mesh ASM for per-zone observability and traffic shifting, and use Prometheus metrics and alerts to detect and isolate a failing zone. It includes step-by-step instructions and a sample YAML manifest.

Alibaba Cloud Infrastructure

Jan 8, 2025

Zone‑level failures can render workloads in an entire availability zone (AZ) unavailable, causing service disruption or data errors. Common causes include power outages, infrastructure faults, resource exhaustion, and human error. To mitigate these risks, both the cloud infrastructure and the application itself must be designed for resilience.

1. Multi‑AZ High Availability with Alibaba Cloud Managed Components

Alibaba Cloud Container Service for Kubernetes (ACK) and Service Mesh (ASM) run their managed components as multiple replicas spread across AZs. Control-plane components, worker nodes, and elastic container instances are balanced across zones, so that the failure of a single AZ does not take the whole cluster down.

2. Deploying Applications Across AZs

When creating an ACK cluster, select node pools that span multiple AZs and use balanced scaling policies. Use topology spread constraints or node selectors (e.g., topology.kubernetes.io/zone) to distribute workloads evenly. Refer to the ACK high‑availability architecture guide for details.
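As a minimal sketch of the topology spread approach, the following writes a single Deployment that asks the scheduler to keep its replicas balanced across zones. The Deployment name `mocka-spread`, the file name, and the `maxSkew` and `whenUnsatisfiable` values are illustrative assumptions, not settings taken from the ACK guide; the image is the one used in this article's sample manifest.

```shell
# Write an illustrative Deployment (hypothetical name: mocka-spread) that
# spreads its pods across zones instead of pinning them with nodeSelector.
cat > mocka-spread.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-spread
  labels:
    app: mocka-spread
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mocka-spread
  template:
    metadata:
      labels:
        app: mocka-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                               # zones may differ by at most one pod
        topologyKey: topology.kubernetes.io/zone # standard zone label on nodes
        whenUnsatisfiable: DoNotSchedule         # assumption: prefer Pending over imbalance
        labelSelector:
          matchLabels:
            app: mocka-spread
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        ports:
        - containerPort: 8000
EOF

# Review the file, then apply it to the cluster:
#   kubectl apply -f mocka-spread.yaml
```

With `DoNotSchedule`, a pod stays Pending if placing it would exceed the allowed skew; `ScheduleAnyway` trades strict balance for availability, which may be the better choice while a zone is down.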

3. Observability and Fault Detection

ASM’s sidecar proxies expose metrics such as request counts, latency, and response codes. By adding a locality dimension (for example, one populated from xds.node.locality.zone) to these Prometheus metrics, you can monitor each AZ separately: PromQL queries grouped by that dimension show request rate and latency per AZ.
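The per-AZ queries might look like the sketch below. It assumes the custom dimension was registered under the label name `locality` (the label name is a choice made when adding the dimension, not something fixed by ASM) and uses the standard Istio metrics `istio_requests_total` and `istio_request_duration_milliseconds_bucket`. `PROM_URL` is a hypothetical variable pointing at a Prometheus-compatible query endpoint; if it is unset, the script only prints the queries.

```shell
# Request rate per zone and service, as seen by the receiving sidecar.
RATE_BY_ZONE='sum by (locality, destination_service_name) (rate(istio_requests_total{reporter="destination"}[1m]))'

# P99 latency (milliseconds) per zone and service.
P99_BY_ZONE='histogram_quantile(0.99, sum by (le, locality, destination_service_name) (rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])))'

# Print the queries so they can be pasted into a Prometheus or Grafana console.
printf '%s\n' "$RATE_BY_ZONE" "$P99_BY_ZONE"

# Optionally run the first query through the Prometheus HTTP API.
# PROM_URL is an assumption, e.g. http://localhost:9090
if [ -n "${PROM_URL:-}" ]; then
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${RATE_BY_ZONE}"
fi
```

Because the queries filter on `reporter="destination"`, the `locality` value is the zone of the pod that served the request, which is the view needed to spot a zone whose workloads are degrading.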

4. Alerting Based on Metrics

Custom Prometheus alerts can be defined on latency or on non‑200 response codes, grouped by service and AZ. For example, one alert can fire when mockb latency exceeds 3 ms, and another when mocka returns any status other than 200.
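The two example alerts could be expressed in the standard Prometheus rule-file format roughly as follows. This is a sketch under several assumptions: the zone label is named `locality`, "latency" is interpreted as P99, and the alert names, `for` durations, and severities are invented for illustration. How the rules are loaded into the Prometheus instance attached to the cluster is not covered here.

```shell
# Write an illustrative rule file; names, durations, and severities are assumptions.
cat > az-alerts.yaml <<'EOF'
groups:
- name: az-level-alerts
  rules:
  - alert: MockbHighLatencyInZone
    # Assumption: latency means P99 in milliseconds; the 3 ms threshold is from the article.
    expr: |
      histogram_quantile(0.99,
        sum by (le, locality, destination_service_name) (
          rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service_name="mockb"}[1m])
        )
      ) > 3
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "mockb P99 latency above 3 ms in zone {{ $labels.locality }}"
  - alert: MockaNon200InZone
    # Fires when mocka returns any response code other than 200 in a zone.
    expr: |
      sum by (locality, destination_service_name) (
        rate(istio_requests_total{reporter="destination", destination_service_name="mocka", response_code!="200"}[1m])
      ) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "mocka is returning non-200 responses in zone {{ $labels.locality }}"
EOF

# If promtool is installed, validate the file before loading it:
#   promtool check rules az-alerts.yaml
```

Grouping by `locality` means each zone raises its own alert instance, so an alert that fires for one zone only is a strong hint that the problem is zone-scoped rather than service-wide.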

5. Traffic Isolation During an AZ Failure

When an AZ becomes unhealthy, isolate it in three steps. First, add a taint to the nodes in that zone so that no new pods are scheduled there. Second, use the NLB/ALB DNS removal operation to take the zone's address out of DNS resolution and stop inbound traffic. Third, use ASM’s AZ‑traffic‑transfer feature to redirect east‑west traffic away from the affected zone. After isolation, verify with Prometheus queries that only the healthy AZs are handling traffic.
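The node-tainting step can be sketched with `kubectl taint` and a zone label selector. The taint key `az-isolation` is a hypothetical name chosen for this example, and the script prints the commands by default rather than running them; set `KUBECTL=kubectl` to execute against a cluster. The NLB/ALB DNS removal and the ASM traffic transfer are performed through their own consoles or APIs and are not shown.

```shell
# Print the commands by default; set KUBECTL=kubectl to run them for real.
KUBECTL="${KUBECTL:-echo kubectl}"

# Hypothetical taint key used to mark an isolated zone.
TAINT_KEY="az-isolation"

isolate_zone() {
  zone="$1"
  # NoSchedule keeps new pods off every node in the zone; running pods stay put.
  $KUBECTL taint nodes -l "topology.kubernetes.io/zone=${zone}" \
    "${TAINT_KEY}=true:NoSchedule" --overwrite
  # List the zone's nodes with their taints to check the result.
  $KUBECTL get nodes -l "topology.kubernetes.io/zone=${zone}" \
    -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
}

isolate_zone cn-hangzhou-h
```

`NoSchedule` only blocks new placements. Whether pods already running in the zone should also be evicted (for example with a `NoExecute` taint) depends on whether the zone is degraded or fully down, and is a decision this sketch leaves open.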

6. Recovery Procedure

To restore service, remove the node taints, delete the custom service‑discovery range, and re‑enable DNS resolution for the NLB. This returns traffic to the repaired AZ and resumes normal operation.
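Removing the taint is the mirror image of adding it: `kubectl taint` deletes a taint when the key and effect are followed by a trailing `-`. In this sketch the key `az-isolation` is hypothetical and should be replaced with whatever key was applied during isolation. As a precaution the script prints the commands by default; set `KUBECTL=kubectl` to run them. Deleting the service-discovery range and re-enabling DNS happen outside kubectl and are not shown.

```shell
# Print the commands by default; set KUBECTL=kubectl to run them for real.
KUBECTL="${KUBECTL:-echo kubectl}"

# Hypothetical taint key; use the key that was applied during isolation.
TAINT_KEY="az-isolation"

restore_zone() {
  zone="$1"
  # The trailing "-" removes the taint with this key and effect.
  $KUBECTL taint nodes -l "topology.kubernetes.io/zone=${zone}" \
    "${TAINT_KEY}:NoSchedule-"
  # Show where pods are running; existing pods do not move back on their own.
  $KUBECTL get pods -o wide
}

restore_zone cn-hangzhou-h
```

Once the taint is gone, only newly created pods can land in the repaired zone, so a rollout restart or scale event may be needed before workloads are balanced across zones again.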

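The sample manifest below (abridged) creates three services (mocka, mockb, mockc). Each service runs as two single‑replica Deployments, one per AZ (cn-hangzhou-h and cn-hangzhou-k), pinned to its zone with a nodeSelector. An Istio Gateway and VirtualService route ingress traffic to mocka. Only the mocka Service and its cn-hangzhou-h Deployment are shown in full.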

kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: mocka
  labels:
    app: mocka
    service: mocka
spec:
  ports:
  - port: 8000
    name: http
  selector:
    app: mocka
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
  labels:
    app: mocka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
        locality: cn-hangzhou-h
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: cn-hangzhou-h
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        imagePullPolicy: IfNotPresent
        env:
        - name: version
          value: cn-hangzhou-h
        - name: app
          value: mocka
        - name: upstream_url
          value: "http://mockb:8000/"
        ports:
        - containerPort: 8000
---
# ... (additional deployments for mocka-cn-hangzhou-k, mockb, mockc and corresponding services) ...
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mocka
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: test
      number: 80
      protocol: HTTP
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: demoapp-vs
  namespace: default
spec:
  gateways:
  - mocka
  hosts:
  - '*'
  http:
  - name: test
    route:
    - destination:
        host: mocka
        port:
          number: 8000
EOF

By following these steps, you can build a robust multi‑AZ deployment on Alibaba Cloud, detect AZ‑level incidents quickly, isolate the affected zone, and recover services with minimal impact.
