Regional Disaster Recovery Architecture Using ASM Service Mesh and GTM
This guide explains how to design and implement a multi-region disaster-recovery solution on Alibaba Cloud: deploy identical Kubernetes clusters, configure ASM ingress gateways with Global Traffic Manager (GTM) for automatic failover, enable intra-cluster traffic retention, and validate the setup with a load-testing tool.
Regional‑level failures can occur due to natural disasters, network outages, human error, or security incidents, causing all zones in a region to lose connectivity, data, or workload availability.
ASM can deploy an ingress gateway in each Kubernetes cluster (or on ECI). Combined with Alibaba Cloud DNS and Global Traffic Manager (GTM), the gateways share traffic across the two regions under normal conditions; when a region fails, GTM automatically removes that region's gateway IP from DNS resolution so that all traffic is redirected to the healthy region.
To validate this approach, prepare two Kubernetes clusters (e.g., cluster-1 and cluster-2) in different regions, deploy identical cloud-native services in both, expose each cluster's ASM gateway through a public CLB IP, and configure DNS to resolve a single domain name to both IPs.
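The failover behavior described above can be pictured with a small model. The following is a minimal sketch, not the GTM API: the class names, the `failure_threshold` of three consecutive failed probes, and the two documentation-range IP addresses are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """One ingress gateway address in a hypothetical GTM address pool."""
    region: str
    ip: str
    failures: int = 0      # consecutive failed health checks
    healthy: bool = True


@dataclass
class AddressPool:
    """Toy model of health-check-based DNS failover (not the real GTM API)."""
    endpoints: list
    failure_threshold: int = 3  # assumed number of failed probes before removal

    def record_probe(self, ip: str, ok: bool) -> None:
        """Update one endpoint's state from a single health-check result."""
        for ep in self.endpoints:
            if ep.ip == ip:
                ep.failures = 0 if ok else ep.failures + 1
                ep.healthy = ep.failures < self.failure_threshold

    def resolve(self) -> list:
        """Return the IPs a DNS answer would contain right now."""
        return [ep.ip for ep in self.endpoints if ep.healthy]


# Placeholder addresses from the documentation ranges, one per region.
pool = AddressPool([
    Endpoint("region-a", "203.0.113.10"),
    Endpoint("region-b", "198.51.100.20"),
])
print(pool.resolve())   # ['203.0.113.10', '198.51.100.20']

# Simulate the region-a gateway failing three probes in a row.
for _ in range(3):
    pool.record_probe("203.0.113.10", ok=False)
print(pool.resolve())   # ['198.51.100.20']
```

In this model a single successful probe restores the address, which is a simplification; a real health-check service may require several consecutive successes before re-adding an IP.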
Step 1: Create two ACK clusters with EIP-exposed API servers in separate regions.
Step 2: Build a multi-master control-plane service mesh by creating two ASM instances (mesh-1 and mesh-2) and joining each cluster to its respective mesh.
Step 3: Deploy an ASM ingress gateway and the Bookinfo demo application in each cluster, then create gateway rules and virtual services to expose the app.
Step 4: Enable the ASM intra-cluster traffic-retention feature so that traffic stays within a cluster unless the whole region fails.
Step 5: Configure GTM to perform health-check-based IP failover, ensuring that when one region's gateway is removed, all traffic is routed to the remaining healthy gateway.
Step 6 (optional): Apply a local rate-limiting policy to each ingress gateway using the following YAML:
apiVersion: istio.alibabacloud.com/v1beta1
kind: ASMLocalRateLimiter
metadata:
  name: ingressgateway
  namespace: istio-system
spec:
  configs:
    - limit:
        fill_interval:
          seconds: 1
        quota: 100
      match:
        vhost:
          name: '*'
          port: 80
          route:
            name_match: gw-to-productage
  isGateway: true
  workloadSelector:
    labels:
      istio: ingressgateway

To test the disaster-recovery flow, use the fortio load-testing tool to generate traffic against the domain, then simulate a regional failure by deleting the ingress gateway workload in one cluster. The test shows most requests succeed and traffic is automatically shifted to the healthy region, confirming the ASM-GTM integration works as intended.
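When reading load-test results, it helps to know how the optional limiter from Step 6 treats traffic above its quota. The sketch below models a token bucket with the policy's values (`quota: 100`, `fill_interval` of 1 second). It assumes the bucket is refilled to the full quota once per interval, which is how Envoy-style local rate limiting is generally understood to behave; the class and function names are hypothetical and nothing here calls ASM itself.

```python
class TokenBucket:
    """Toy token bucket: `quota` tokens, refilled to full every fill interval."""

    def __init__(self, quota: int, fill_interval_s: float):
        self.quota = quota
        self.fill_interval_s = fill_interval_s
        self.tokens = quota
        self.last_fill = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        # Refill to the full quota once at least one fill interval has elapsed.
        elapsed_intervals = int((now - self.last_fill) // self.fill_interval_s)
        if elapsed_intervals >= 1:
            self.tokens = self.quota
            self.last_fill += elapsed_intervals * self.fill_interval_s
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def simulate(rate_per_s: int, duration_s: int,
             quota: int = 100, fill_interval_s: float = 1.0):
    """Send evenly spaced requests and count (allowed, rejected)."""
    bucket = TokenBucket(quota, fill_interval_s)
    allowed = rejected = 0
    for i in range(rate_per_s * duration_s):
        if bucket.allow(i / rate_per_s):
            allowed += 1
        else:
            rejected += 1
    return allowed, rejected


# 150 requests per second for 2 seconds against a 100-per-second quota.
print(simulate(150, 2))   # (200, 100)
```

Under this assumption, a healthy gateway that absorbs both regions' traffic after failover may reject the excess (Envoy's local rate limit filter answers with HTTP 429 by default), so rate-limited responses in a test run should be distinguished from failover errors.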
Health checks in GTM automatically remove the failed IP, and alerts can be configured to notify operators for manual intervention if needed.
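How quickly clients stop reaching the failed region depends on the health-check settings and on DNS caching. As a rough estimate, assuming the IP is removed after a fixed number of consecutive failed probes and that resolvers honor the record's TTL, the worst case is approximately the detection time plus the TTL. The function name and the sample values below are illustrative assumptions, not GTM defaults.

```python
def worst_case_failover_seconds(check_interval_s: float,
                                failure_threshold: int,
                                dns_ttl_s: float) -> float:
    """Upper-bound estimate of how long clients may still resolve the failed IP.

    detection: time for `failure_threshold` consecutive probes to fail
    dns_ttl_s: time a cached DNS answer may outlive the record's removal
    """
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s


# Illustrative values only: 15 s probe interval, 3 failures, 60 s TTL.
print(worst_case_failover_seconds(15, 3, 60))   # 105
```

Requests sent during this window can still fail, which is consistent with a test in which most, but not all, requests succeed.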
Alibaba Cloud Infrastructure
For uninterrupted computing services