Service-Level Disaster Recovery with Alibaba Cloud Service Mesh (ASM) across Multi-Cluster and Multi-Region Deployments
This guide explains how to handle service‑level failures in Kubernetes by using Alibaba Cloud Service Mesh (ASM) to automatically detect faults, shift traffic based on geographic priority, and implement various multi‑cluster, multi‑region, and multi‑cloud topologies for high availability.
Service‑level failures in cloud‑native environments occur when one or more workloads in a Kubernetes cluster become unavailable or degraded, typically because of infrastructure issues, cluster misconfiguration, or application bugs.
The article describes how Alibaba Cloud Service Mesh (ASM) can detect such failures and perform automatic, location‑aware traffic shifting to healthy workloads across zones, regions, or clouds.
Two main design dimensions are covered: fault detection and traffic‑shifting mechanisms, and application deployment topologies. ASM continuously monitors response codes, connection errors, and timeouts; when error thresholds are exceeded, it ejects the faulty endpoint and routes traffic according to a priority order (same zone → same region, different zone → other region).
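The zone → region priority order described above maps onto Istio‑style locality load balancing, which ASM supports through the DestinationRule API. The fragment below is a minimal sketch of the idea, not a resource taken from the article: the `mocka-locality` name and the cn-hangzhou → cn-shanghai failover pair are illustrative assumptions, and in practice these settings would typically be merged into the service's existing DestinationRule rather than applied as a second resource for the same host.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka-locality   # illustrative name
spec:
  host: mocka
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer endpoints in cn-shanghai when cn-hangzhou becomes unhealthy
        # (assumed region pair for illustration).
        failover:
        - from: cn-hangzhou
          to: cn-shanghai
    # Outlier detection must be configured for locality failover to trigger.
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 30s
      baseEjectionTime: 5m
```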
Several deployment topologies are compared:
Single‑cluster multi‑AZ
Multi‑cluster single‑region
Multi‑cluster multi‑region
Multi‑cluster multi‑region multi‑cloud
Pros and cons of each topology are listed, highlighting trade‑offs between availability, operational complexity and cost.
The practical implementation steps include:
Provision two Kubernetes clusters in different regions with multi‑AZ node pools.
Create two ASM mesh instances (multi‑master control plane) and add the clusters.
Deploy ASM ingress gateways bound to network load balancers (NLB) and enable sidecar injection.
Deploy a sample three‑service application (mocka, mockb, mockc) using the following Kubernetes manifests:
apiVersion: v1
kind: Service
metadata:
  name: mocka
  labels:
    app: mocka
    service: mocka
spec:
  ports:
  - port: 8000
    name: http
  selector:
    app: mocka
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
  labels:
    app: mocka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
        locality: cn-hangzhou-h
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: cn-hangzhou-h
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        env:
        - name: version
          value: cn-hangzhou-h
        - name: upstream_url
          value: "http://mockb:8000/"
        ports:
        - containerPort: 8000
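To send test traffic through the ASM ingress gateway to the sample services, Gateway and VirtualService resources along these lines can be applied. This is a hedged sketch rather than the article's exact configuration: the resource names, the wildcard host, the `istio: ingressgateway` selector label, and the port numbers are assumptions based on standard Istio conventions.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mocka-gateway   # illustrative name
spec:
  selector:
    istio: ingressgateway   # assumed ingress gateway label
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mocka
spec:
  hosts:
  - "*"
  gateways:
  - mocka-gateway
  http:
  - route:
    - destination:
        host: mocka
        port:
          number: 8000   # matches the mocka Service port above
```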
DestinationRule resources with outlierDetection are applied to enable host‑level circuit breaking:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka
spec:
  host: mocka
  trafficPolicy:
    outlierDetection:
      splitExternalLocalOriginErrors: true
      consecutiveLocalOriginFailures: 1
      baseEjectionTime: 5m
      consecutive5xxErrors: 1
      interval: 30s
      maxEjectionPercent: 100

After deployment, curl requests demonstrate that traffic stays within the same zone under normal conditions and automatically fails over to another zone or region when a workload becomes unhealthy.
Optional steps show how to expose outlier‑detection metrics via the sidecar proxyStatsMatcher configuration, collect them with Prometheus, and configure alerting rules for rapid failure notification.
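As one possible shape for that optional step, proxyStatsMatcher can be set per workload through the `proxy.istio.io/config` pod annotation so that Envoy's outlier‑detection counters are included in the sidecar's Prometheus output. This is a sketch: the annotation mechanism and inclusionRegexps field follow standard Istio conventions, and the resulting metric names are assumed to surface with the `envoy_cluster_outlier_detection_` prefix.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
spec:
  template:
    metadata:
      annotations:
        # Include Envoy's outlier-detection stats (e.g. active and enforced
        # ejections) in the sidecar's /stats/prometheus output.
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionRegexps:
            - ".*outlier_detection.*"
```

With the stats exposed, a Prometheus scrape of the sidecars can then drive alerting rules on ejection counts, which is what the article's optional monitoring steps build toward.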