Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM
This guide explains how to achieve zone-level disaster recovery on Alibaba Cloud: deploy multi-AZ ACK clusters, configure Service Mesh ASM for observability and traffic shifting, and use Prometheus-based metrics and alerts to detect and isolate failures. It includes step-by-step instructions and sample YAML manifests.
Zone‑level failures can render workloads in an entire availability zone (AZ) unavailable, causing service disruption or data errors. Common causes include power outages, infrastructure faults, resource exhaustion, and human error. To mitigate these risks, both the cloud infrastructure and the application itself must be designed for resilience.
1. Multi‑AZ High Availability with Alibaba Cloud Managed Components
Alibaba Cloud Container Service for Kubernetes (ACK) and Service Mesh (ASM) deploy all managed components across multiple replicas and AZs. The control plane, worker nodes, and elastic container instances are spread evenly, ensuring that a single AZ failure does not affect the overall cluster.
2. Deploying Applications Across AZs
When creating an ACK cluster, select node pools that span multiple AZs and use balanced scaling policies so that nodes stay evenly distributed as the pool grows or shrinks. Use topology spread constraints or node selectors keyed on the zone label (e.g., topology.kubernetes.io/zone) to distribute workloads evenly across zones. Refer to the ACK high-availability architecture guide for details.
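As a minimal sketch of the topology spread approach, the Deployment below asks the scheduler to keep replicas balanced across zones. The name mocka matches a service mentioned in this guide, but the labels, the image, and the choice of maxSkew and whenUnsatisfiable are illustrative assumptions, not the article's actual manifest.

```yaml
# Sketch only: labels and image are placeholders, not taken from the article.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
    spec:
      topologySpreadConstraints:
        # Allow at most one replica of difference between any two zones.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          # DoNotSchedule keeps a pod Pending rather than unbalancing zones;
          # ScheduleAnyway would treat the constraint as a preference.
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: mocka
      containers:
        - name: mocka
          image: registry.example.com/mocka:latest  # hypothetical image
```

With two replicas and two zones, this places one pod per zone. Whether DoNotSchedule or ScheduleAnyway is the better choice depends on whether strict balance matters more than getting every replica running during a zone outage.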
3. Observability and Fault Detection
ASM's sidecar proxies expose metrics such as request counts, latency, and response codes. By adding a locality dimension (e.g., one whose value is xds.node.locality.zone) to the Prometheus metrics, you can monitor each AZ separately. With that dimension in place, PromQL queries can break down request rate and latency per AZ.
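The per-AZ breakdown could be expressed as Prometheus recording rules along the following lines. This is a sketch under assumptions: it presumes the custom dimension surfaces as a label named locality (the actual label name depends on how the dimension is configured in ASM), and the group name, record names, 1-minute window, and p99 quantile are all illustrative. istio_requests_total and istio_request_duration_milliseconds are standard Istio metrics.

```yaml
# Sketch only: the "locality" label name and all rule names are assumptions.
groups:
  - name: az-traffic.rules
    rules:
      # Requests per second for each service, split by zone.
      - record: zone:istio_requests:rate1m
        expr: |
          sum by (destination_service_name, locality) (
            rate(istio_requests_total{reporter="destination"}[1m])
          )
      # 99th-percentile request latency in milliseconds, split by zone.
      - record: zone:istio_request_duration_ms:p99_1m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, destination_service_name, locality) (
              rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])
            )
          )
```

The expr bodies can also be pasted directly into a Prometheus query console to inspect per-zone traffic ad hoc.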
4. Alerting Based on Metrics
Custom Prometheus alerts can be defined for latency or non‑200 response codes, grouped by service and AZ. Example alerts trigger when mockb latency exceeds 3 ms or when mocka returns any status other than 200.
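The two example alerts might look like the rule group below. It is a hedged sketch rather than the article's configuration: the zone label name locality, the alert names, the p99 quantile, the 1-minute window and for duration, and the severity values are assumptions, and a managed Prometheus service may expect alerts to be defined through its console instead of a rule file.

```yaml
# Sketch only: label name "locality", alert names, and durations are assumptions.
groups:
  - name: az-alerts
    rules:
      # Fires when mockb's p99 latency in any zone stays above 3 ms.
      - alert: MockbHighLatency
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, destination_service_name, locality) (
              rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service_name="mockb"}[1m])
            )
          ) > 3
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "mockb p99 latency above 3 ms in zone {{ $labels.locality }}"
      # Fires when mocka returns any response code other than 200 in any zone.
      - alert: MockaNon200Responses
        expr: |
          sum by (destination_service_name, locality) (
            rate(istio_requests_total{reporter="destination", destination_service_name="mocka", response_code!="200"}[1m])
          ) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "mocka is returning non-200 responses in zone {{ $labels.locality }}"
```

Because both expressions group by the zone label, each alert instance identifies the affected AZ, which is what the isolation step needs as input.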
5. Traffic Isolation During an AZ Failure
When an AZ becomes unhealthy, isolate its nodes by adding a taint that makes them unschedulable, and use the DNS removal operation on the NLB/ALB to stop inbound traffic to that zone. ASM's AZ-traffic-transfer feature can also redirect east-west traffic away from the affected zone. After isolation, use Prometheus queries to verify that all traffic is confined to the healthy AZs.
6. Recovery Procedure
To restore service once the AZ is repaired, reverse the isolation steps: remove the node taints, delete the custom service-discovery range configured in ASM, and re-enable DNS for the NLB. Traffic then returns to the repaired AZ and normal operation resumes.
The article includes a complete YAML manifest, applied with kubectl apply -f-, that creates three services (mocka, mockb, mockc) with two replicas each, distributed across two AZs, along with the corresponding Istio Gateway and VirtualService definitions.
By following these steps, you can build a robust multi-AZ deployment on Alibaba Cloud, detect AZ-level incidents quickly, isolate the affected zone, and recover services with minimal impact.
Alibaba Cloud Infrastructure
For uninterrupted computing services