Designing AZ‑Level Disaster Recovery with Alibaba Cloud ACK and Service Mesh ASM
This guide explains how to achieve zone‑level disaster recovery on Alibaba Cloud: deploy multi‑AZ ACK clusters, configure Service Mesh (ASM) for observability and traffic shifting, and use Prometheus‑based metrics and alerts to detect and isolate failures. It includes step‑by‑step instructions and a sample YAML manifest.
Zone‑level failures can render workloads in an entire availability zone (AZ) unavailable, causing service disruption or data errors. Common causes include power outages, infrastructure faults, resource exhaustion, and human error. To mitigate these risks, both the cloud infrastructure and the application itself must be designed for resilience.
1. Multi‑AZ High Availability with Alibaba Cloud Managed Components
Alibaba Cloud Container Service for Kubernetes (ACK) and Service Mesh (ASM) deploy all managed components across multiple replicas and AZs. The control plane, worker nodes, and elastic container instances are spread evenly, ensuring that a single AZ failure does not affect the overall cluster.
2. Deploying Applications Across AZs
When creating an ACK cluster, select node pools that span multiple AZs and use balanced scaling policies. Use topology spread constraints or node selectors (e.g., topology.kubernetes.io/zone) to distribute workloads evenly. Refer to the ACK high‑availability architecture guide for details.
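As an alternative to pinning one Deployment per zone with a node selector, a single Deployment can be spread across zones with a topology spread constraint. The following is a minimal sketch under stated assumptions: the Deployment name and the `app: spread-demo` label are hypothetical, the image is reused from the sample manifest in this article, and the `maxSkew` and `whenUnsatisfiable` values are illustrative choices rather than ACK recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spread-demo            # hypothetical name, not part of the original sample
  labels:
    app: spread-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spread-demo
  template:
    metadata:
      labels:
        app: spread-demo
    spec:
      topologySpreadConstraints:
      # Keep the per-zone pod count within 1 of every other zone.
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # DoNotSchedule leaves a pod Pending rather than violating the spread;
        # ScheduleAnyway would treat the constraint as a preference instead.
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: spread-demo
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        ports:
        - containerPort: 8000
```

With two replicas and two zones, this places one pod in each zone, provided both zones have schedulable nodes.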
3. Observability and Fault Detection
ASM’s sidecar proxies expose metrics such as request counts, latency, and response codes. Adding a locality dimension (e.g., one whose value is xds.node.locality.zone) to these metrics lets Prometheus break them down per AZ, so request rate and latency can be queried for each zone separately.
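The per‑AZ breakdown could be expressed as Prometheus recording rules along the following lines. This is a sketch, not the article's original queries. `istio_requests_total` and `istio_request_duration_milliseconds_bucket` are standard Istio metrics, but the label name `locality` is an assumption about how the custom dimension was named when it was added, and the rule names are hypothetical.

```yaml
groups:
- name: az-traffic             # hypothetical group name
  rules:
  # Requests per second for each service, split by the zone of the reporting sidecar.
  # Assumes the custom dimension was added under the label name "locality".
  - record: zone_service:istio_requests:rate1m
    expr: |
      sum by (destination_service_name, locality) (
        rate(istio_requests_total{reporter="destination"}[1m])
      )
  # 99th-percentile request latency in milliseconds, per service and zone.
  - record: zone_service:istio_request_duration_ms:p99_1m
    expr: |
      histogram_quantile(0.99,
        sum by (le, destination_service_name, locality) (
          rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[1m])
        )
      )
```

The `expr` values can also be pasted directly into a Prometheus query console. Filtering on `reporter="destination"` counts each request once, as seen by the receiving sidecar.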
4. Alerting Based on Metrics
Custom Prometheus alerts can be defined for latency or non‑200 response codes, grouped by service and AZ. Example alerts trigger when mockb latency exceeds 3 ms or when mocka returns any status other than 200.
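The two example alerts might look like the sketch below, written in the standard Prometheus rule‑file format. Several details are assumptions: the `locality` label name, the choice of the 99th percentile as the latency statistic (the text only says "latency"), the one‑minute window and `for` duration, and the alert names and severity labels. How rules are loaded depends on the Prometheus setup in use.

```yaml
groups:
- name: az-alerts              # hypothetical group name
  rules:
  # Fires when mockb's p99 latency in any zone stays above 3 ms for one minute.
  - alert: MockbHighLatency
    expr: |
      histogram_quantile(0.99,
        sum by (le, destination_service_name, locality) (
          rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service_name="mockb"}[1m])
        )
      ) > 3
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "mockb p99 latency above 3 ms in zone {{ $labels.locality }}"
  # Fires when mocka returns any response code other than 200 in a zone.
  - alert: MockaNon200Responses
    expr: |
      sum by (destination_service_name, locality) (
        rate(istio_requests_total{reporter="destination", destination_service_name="mocka", response_code!="200"}[1m])
      ) > 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "mocka is returning non-200 responses in zone {{ $labels.locality }}"
```

Grouping by `locality` means each zone raises its own alert, which makes it easier to tell a single‑AZ incident from a service‑wide one.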
5. Traffic Isolation During an AZ Failure
When an AZ becomes unhealthy, isolate its nodes by adding a taint so that no new pods are scheduled onto them, and remove the zone from NLB/ALB DNS resolution to stop inbound traffic. ASM’s AZ‑traffic‑transfer feature can also redirect east‑west traffic away from the affected zone. After isolation, use Prometheus queries to verify that all traffic is being served from healthy AZs.
6. Recovery Procedure
To restore service, remove node taints, delete the custom service‑discovery range, and re‑enable DNS for the NLB. This returns traffic to the repaired AZ and resumes normal operation.
The sample manifest below (abridged) creates three services (mocka, mockb, mockc). Each service is backed by two single‑replica Deployments, one per AZ, so that every service has two replicas spread across two AZs. The manifest also defines the corresponding Istio Gateway and VirtualService.
kubectl apply -f- <<EOF
apiVersion: v1
kind: Service
metadata:
  name: mocka
  labels:
    app: mocka
    service: mocka
spec:
  ports:
  - port: 8000
    name: http
  selector:
    app: mocka
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
  labels:
    app: mocka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
        locality: cn-hangzhou-h
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: cn-hangzhou-h
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        imagePullPolicy: IfNotPresent
        env:
        - name: version
          value: cn-hangzhou-h
        - name: app
          value: mocka
        - name: upstream_url
          value: "http://mockb:8000/"
        ports:
        - containerPort: 8000
---
# ... (additional deployments for mocka-cn-hangzhou-k, mockb, mockc and corresponding services) ...
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mocka
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: test
      number: 80
      protocol: HTTP
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: demoapp-vs
  namespace: default
spec:
  gateways:
  - mocka
  hosts:
  - '*'
  http:
  - name: test
    route:
    - destination:
        host: mocka
        port:
          number: 8000
EOF
By following these steps, you can build a robust multi‑AZ deployment on Alibaba Cloud, detect AZ‑level incidents quickly, isolate the affected zone, and recover services with minimal impact.
