How to Build Multi‑Site High Availability with AHAS‑MSHA: Real‑World E‑Commerce Cases

This article explains the challenges of achieving high availability in unreliable environments, introduces disaster‑tolerance concepts and RPO/RTO metrics, describes Alibaba Cloud's AHAS‑MSHA multi‑site solution and its key features, and walks through two e‑commerce case studies that demonstrate implementation steps, fault‑injection drills, and recovery verification.

Alibaba Cloud Native

Dec 21, 2020

Introduction

Complex external conditions and unreliable hardware make high availability a major challenge for internet services. Outages cause economic loss and reputational damage, and even national‑level applications are not immune. Building a disaster‑tolerance architecture is therefore essential for digital enterprises.

Core Concepts

What is Disaster Tolerance?

Disaster tolerance means deploying two or more functionally identical systems at geographically separated sites, with health monitoring and failover between them, so that if one site fails due to fire, flood, earthquake, or sabotage, the application can continue operating from another site.

How to Evaluate Disaster‑Tolerance Capability?

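RPO (Recovery Point Objective): the maximum amount of data loss that can be tolerated, expressed as a length of time. For example, an RPO of 5 seconds means that at most the last 5 seconds of writes before the failure may be lost.

RTO (Recovery Time Objective): the maximum allowable downtime before the service must be restored, also expressed as a length of time.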
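To make the two metrics concrete, here is a minimal Python sketch of how a drill result could be compared against RPO and RTO targets. The timeline values and function names are hypothetical and chosen only for illustration; they are not taken from the case studies below.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, failed_at: datetime) -> timedelta:
    """Data-loss window: writes made after the last replicated point are lost."""
    return failed_at - last_replicated_at


def achieved_rto(failed_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime: time from the failure until the service is usable again."""
    return restored_at - failed_at


def meets_objectives(loss: timedelta, downtime: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """A drill passes only if both measured values are within their targets."""
    return loss <= rpo_target and downtime <= rto_target


# Hypothetical drill timeline (assumed values, not measurements).
failed_at = datetime(2020, 12, 21, 10, 0, 0)
last_replicated_at = failed_at - timedelta(seconds=3)
restored_at = failed_at + timedelta(minutes=2)

loss = achieved_rpo(last_replicated_at, failed_at)
downtime = achieved_rto(failed_at, restored_at)
passed = meets_objectives(loss, downtime,
                          rpo_target=timedelta(seconds=5),
                          rto_target=timedelta(minutes=5))
```

Both values are durations, so they can be compared directly against targets expressed in the same unit.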

AHAS‑MSHA Overview

MSHA (Multi‑Site High Availability) is a multi‑active disaster‑tolerance solution that decouples business recovery from fault recovery, enabling rapid restoration under failure scenarios. The architecture isolates redundant logical data centers called units, keeping traffic within a unit and limiting the fault‑explosion radius to a single unit.

Key Features

Fast Fault Recovery: follows a “recover first, locate later” principle, separating business recovery time from fault recovery time.

Cross‑Region Capacity Expansion: allows rapid horizontal scaling by deploying additional units in other regions.

Traffic Distribution & Error Correction: validates traffic at each layer and redirects calls that violate routing rules, keeping the fault radius within a unit.

Dirty‑Write Protection: prevents writes to the wrong unit and protects data during synchronization delays.
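The error‑correction and dirty‑write features above can be pictured as a small guard placed in front of each unit. The Python sketch below is only an illustration of the idea, not the MSHA agent's actual API: `UnitGuard`, `DirtyWriteError`, the unit names, and the routing table are all hypothetical.

```python
from dataclasses import dataclass, field


class DirtyWriteError(Exception):
    """Raised when a write would land in a unit that must not accept it."""


@dataclass
class UnitGuard:
    local_unit: str
    # Hypothetical routing table: userId -> unit that owns the user's data.
    owner_of: dict
    # Users whose writes are paused, e.g. while replication catches up.
    write_forbidden: set = field(default_factory=set)

    def correct_route(self, user_id: int) -> str:
        """Error correction: return the unit that should serve this call."""
        return self.owner_of[user_id]

    def check_write(self, user_id: int) -> None:
        """Dirty-write protection: reject writes this unit must not accept."""
        if user_id in self.write_forbidden:
            raise DirtyWriteError(f"writes are paused for user {user_id}")
        owner = self.owner_of[user_id]
        if owner != self.local_unit:
            raise DirtyWriteError(
                f"user {user_id} belongs to {owner}, not {self.local_unit}")


# Usage with assumed unit names.
guard = UnitGuard(local_unit="hangzhou",
                  owner_of={50: "hangzhou", 6000: "beijing"})
guard.check_write(50)                      # owned locally, so it is allowed
redirect_to = guard.correct_route(6000)    # the call should go to "beijing"
try:
    guard.check_write(6000)
    rejected = False
except DirtyWriteError:
    rejected = True                        # write to the wrong unit is blocked
```

Checking at write time, rather than only at the entry gateway, means a misrouted call is still stopped before it can modify data in a unit that does not own it.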

Applicable Scenarios

Read‑Many‑Write‑Few Business: typical for content or product browsing services where reads dominate and occasional write unavailability is acceptable.

Transaction‑Heavy (order and transaction record) Business: e‑commerce order processing where reads and writes are tightly coupled and strong consistency is required.

Case Study 1 – Read‑Many‑Write‑Few

The e‑commerce platform was initially deployed in a single region, so a failure of the product service caused a complete outage. The goal was to achieve “cross‑region multi‑read” for the storefront.

Migration Steps

Partition traffic by userId as the routing key.

Deploy the frontend and product services in two regions.

Configure multi‑active resources in the MSHA console.
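The first migration step, routing by userId, can be sketched as a lookup from a userId slot to a unit. The article does not state the actual rule, so the one below is an assumption: slots are `userId % 10000`, slots 0 to 1999 go to Hangzhou, and the rest go to Beijing. It was chosen only because it agrees with the two sample users in the drill that follows (userId=50 in Hangzhou, userId=6000 in Beijing).

```python
from bisect import bisect_right

SLOTS = 10000  # assumed number of routing slots

# Hypothetical rule: each entry is (first_slot, unit); a range runs up to
# the next entry's first_slot. Must be sorted and start at slot 0.
RULE = [(0, "hangzhou"), (2000, "beijing")]


def route(user_id: int, rule=RULE) -> str:
    """Return the unit that serves the given user under the rule."""
    slot = user_id % SLOTS
    starts = [start for start, _ in rule]
    # bisect_right finds the first range starting after the slot;
    # the range just before it is the one that contains the slot.
    return rule[bisect_right(starts, slot) - 1][1]


sample = {uid: route(uid) for uid in (50, 1999, 2000, 6000)}
```

Because the rule is plain data, changing which unit serves a range of users is a configuration change rather than a code change, which is what makes a fast cut‑over possible.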

Fault‑Injection Drill

Weak Dependency Test: a fault was injected into the cart service; the storefront remained functional, as expected.

Strong Dependency Test: a fault was injected into the product service in the Beijing unit; users with userId=6000 saw errors, confirming that the fault affected that unit.

Explosion‑Radius Verification: users with userId=50 were routed to the Hangzhou unit and were unaffected by the Beijing fault.

Cut‑Over Recovery

Using MSHA’s traffic‑cut‑over feature, the affected user (userId=6000) was switched to the Hangzhou unit, restoring normal storefront access.
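The drill and cut‑over above can be replayed in a small in‑memory simulation. This is a sketch under stated assumptions, not the MSHA console or the AHAS‑Chaos API: the `MultiSite` class, the slot‑based rule, and the unit names are hypothetical.

```python
class MultiSite:
    """Toy model of two units behind a userId-based routing rule."""

    def __init__(self, rule, slots=10000):
        self.slots = slots
        self.failed_units = set()
        self.cut_over(rule)

    def cut_over(self, new_rule):
        """Replace the routing rule; entries are (first_slot, unit)."""
        new_rule = sorted(new_rule)
        if not new_rule or new_rule[0][0] != 0:
            raise ValueError("rule must start at slot 0")
        self.rule = new_rule

    def unit_for(self, user_id):
        slot = user_id % self.slots
        unit = None
        for start, name in self.rule:
            if slot >= start:
                unit = name  # keep the last range that starts at or before slot
        return unit

    def inject_fault(self, unit):
        self.failed_units.add(unit)

    def request(self, user_id):
        return "error" if self.unit_for(user_id) in self.failed_units else "ok"


site = MultiSite([(0, "hangzhou"), (2000, "beijing")])

# Strong-dependency drill: fail the Beijing unit.
site.inject_fault("beijing")
during_fault = {uid: site.request(uid) for uid in (50, 6000)}

# Cut-over: send all slots to Hangzhou without repairing Beijing first.
site.cut_over([(0, "hangzhou")])
after_cut_over = {uid: site.request(uid) for uid in (50, 6000)}
```

In this model the Beijing unit is still marked as failed after the cut‑over, yet both users are served, which mirrors the “recover first, locate later” principle.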

Case Study 2 – Transaction‑Heavy Business

After stabilizing the read‑many‑write‑few scenario, the order service suffered a large‑scale failure, prompting a multi‑active redesign for the order flow.

Migration Steps

Deploy the order service and its database in two regions.

Install the MSHA agent on the order service to enable non‑intrusive SpringCloud RPC cross‑unit routing and dirty‑write protection.

Fault‑Injection Drill

A fault was injected into the Beijing unit’s order service; users with userId=6000 experienced order failures, as expected.

Explosion‑radius verification: users with userId=50 were routed to Hangzhou and remained unaffected.

Cut‑Over Recovery

MSHA’s cut‑over moved userId=6000 traffic to the Hangzhou unit, after which orders for that user succeeded again, confirming that the order flow had recovered.
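For write traffic, a cut‑over has to respect replication delay, otherwise the new unit could serve stale order data. The article does not describe the exact sequence MSHA uses, so the Python sketch below is one plausible ordering under assumptions: writes for the moving user are paused, the target is allowed to catch up, the route is switched, and writes are re‑enabled. It also assumes the old unit's database is still replicating (the injected fault was in the order service, not the database). All names are hypothetical.

```python
class OrderCutOver:
    """Toy model of a write-safe cut-over for a single user."""

    def __init__(self):
        self.owner = {6000: "beijing"}   # hypothetical routing entry
        self.write_forbidden = set()
        self.source_pos = 0              # writes committed in the old unit
        self.target_pos = 0              # writes replicated to the new unit
        self.events = []

    def write(self, user_id):
        """Accept a write unless writes are paused for this user."""
        if user_id in self.write_forbidden:
            return False
        self.source_pos += 1
        return True

    def replicate_step(self):
        """Apply one pending write on the target, if any."""
        if self.target_pos < self.source_pos:
            self.target_pos += 1

    def cut_over(self, user_id, new_unit, max_steps=100):
        self.write_forbidden.add(user_id)
        self.events.append("forbid_writes")
        steps = 0
        while self.target_pos < self.source_pos:
            if steps >= max_steps:
                # Give up and restore the previous state.
                self.write_forbidden.discard(user_id)
                raise TimeoutError("replication did not catch up")
            self.replicate_step()
            steps += 1
        self.events.append("synced")
        self.owner[user_id] = new_unit
        self.events.append("switch_route")
        self.write_forbidden.discard(user_id)
        self.events.append("allow_writes")


flow = OrderCutOver()
for _ in range(3):
    flow.write(6000)                     # three orders not yet replicated
flow.cut_over(6000, "hangzhou")
```

Pausing writes before switching trades a brief write outage for the guarantee that no order is written against out‑of‑date data in the new unit.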

Conclusion

The article demonstrates how AHAS‑MSHA provides a multi‑site high‑availability solution covering both read‑dominant and transaction‑heavy e‑commerce scenarios, and shows how to use AHAS‑Chaos for realistic fault‑injection drills to validate RPO/RTO targets. It emphasizes that disaster tolerance is a systematic engineering effort requiring careful assessment of business needs, technology stack, and budget.
