Service-Level Disaster Recovery with Alibaba Cloud Service Mesh (ASM) across Multi-Cluster and Multi-Region Deployments
This guide explains how to handle service‑level failures in Kubernetes by using Alibaba Cloud Service Mesh (ASM) to automatically detect faults, shift traffic based on geographic priority, and implement various multi‑cluster, multi‑region, and multi‑cloud topologies for high availability.
Service‑level failures in cloud‑native environments occur when one or more workloads in a Kubernetes cluster become unavailable or degraded, typically because of infrastructure issues, cluster misconfiguration, or application bugs.
The article describes how Alibaba Cloud Service Mesh (ASM) can detect such failures and perform automatic, location‑aware traffic shifting to healthy workloads across zones, regions, or clouds.
Two main design dimensions are covered: fault detection and traffic‑shifting mechanisms, and application deployment topologies. ASM continuously monitors response codes, connection errors, and timeouts; when error thresholds are exceeded, it ejects the faulty endpoint and routes traffic according to a priority order (same zone → same region, different zone → other region).
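The zone → region priority order described above maps onto Istio‑style locality load balancing, which ASM supports through the DestinationRule API. The fragment below is a minimal sketch of the idea, not a resource taken from the article: the `mocka-locality` name and the cn-hangzhou → cn-shanghai failover pair are illustrative assumptions, and in practice these settings would typically be merged into the service's existing DestinationRule rather than applied as a second resource for the same host.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka-locality   # illustrative name
spec:
  host: mocka
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # Prefer endpoints in cn-shanghai when cn-hangzhou becomes unhealthy
        # (assumed region pair for illustration).
        failover:
        - from: cn-hangzhou
          to: cn-shanghai
    # Outlier detection must be configured for locality failover to trigger.
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 30s
      baseEjectionTime: 5m
```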
Several deployment topologies are compared:
Single‑cluster multi‑AZ
Multi‑cluster single‑region
Multi‑cluster multi‑region
Multi‑cluster multi‑region multi‑cloud
Pros and cons of each topology are listed, highlighting trade‑offs between availability, operational complexity and cost.
The practical implementation steps include:
Provision two Kubernetes clusters in different regions with multi‑AZ node pools.
Create two ASM mesh instances (multi‑master control plane) and add the clusters.
Deploy ASM ingress gateways bound to network load balancers (NLB) and enable sidecar injection.
Deploy a sample three‑service application (mocka, mockb, mockc) using the following Kubernetes manifests:
apiVersion: v1
kind: Service
metadata:
  name: mocka
  labels:
    app: mocka
    service: mocka
spec:
  ports:
  - port: 8000
    name: http
  selector:
    app: mocka
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
  labels:
    app: mocka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mocka
  template:
    metadata:
      labels:
        app: mocka
        locality: cn-hangzhou-h
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: cn-hangzhou-h
      containers:
      - name: default
        image: registry-cn-hangzhou.ack.aliyuncs.com/ack-demo/go-http-sample:tracing
        env:
        - name: version
          value: cn-hangzhou-h
        - name: upstream_url
          value: "http://mockb:8000/"
        ports:
        - containerPort: 8000
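To send test traffic through the ASM ingress gateway to the sample services, Gateway and VirtualService resources along these lines can be applied. This is a hedged sketch rather than the article's exact configuration: the resource names, the wildcard host, the `istio: ingressgateway` selector label, and the port numbers are assumptions based on standard Istio conventions.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: mocka-gateway   # illustrative name
spec:
  selector:
    istio: ingressgateway   # assumed ingress gateway label
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: mocka
spec:
  hosts:
  - "*"
  gateways:
  - mocka-gateway
  http:
  - route:
    - destination:
        host: mocka
        port:
          number: 8000   # matches the mocka Service port above
```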
DestinationRule resources with outlierDetection are applied to enable host‑level circuit breaking:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: mocka
spec:
  host: mocka
  trafficPolicy:
    outlierDetection:
      splitExternalLocalOriginErrors: true
      consecutiveLocalOriginFailures: 1
      baseEjectionTime: 5m
      consecutive5xxErrors: 1
      interval: 30s
      maxEjectionPercent: 100

After deployment, curl requests demonstrate that traffic stays within the same zone under normal conditions and automatically fails over to another zone or region when a workload becomes unhealthy.
Optional steps show how to expose outlier‑detection metrics via the sidecar proxyStatsMatcher configuration, collect them with Prometheus, and configure alerting rules for rapid failure notification.
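As one possible shape for that optional step, proxyStatsMatcher can be set per workload through the `proxy.istio.io/config` pod annotation so that Envoy's outlier‑detection counters are included in the sidecar's Prometheus output. This is a sketch: the annotation mechanism and inclusionRegexps field follow standard Istio conventions, and the resulting metric names are assumed to surface with the `envoy_cluster_outlier_detection_` prefix.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mocka-cn-hangzhou-h
spec:
  template:
    metadata:
      annotations:
        # Include Envoy's outlier-detection stats (e.g. active and enforced
        # ejections) in the sidecar's /stats/prometheus output.
        proxy.istio.io/config: |
          proxyStatsMatcher:
            inclusionRegexps:
            - ".*outlier_detection.*"
```

With the stats exposed, a Prometheus scrape of the sidecars can then drive alerting rules on ejection counts, which is what the article's optional monitoring steps build toward.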