How to Build Multi‑Site High Availability with AHAS‑MSHA: Real‑World E‑Commerce Cases
This article explains the challenges of achieving high availability in unreliable environments, introduces disaster‑tolerance concepts and RPO/RTO metrics, describes Alibaba Cloud's AHAS‑MSHA multi‑site solution and its key features, and walks through two e‑commerce case studies that demonstrate implementation steps, fault‑injection drills, and recovery verification.
Introduction
Complex external conditions and unreliable hardware make high availability for internet services a major challenge; outages can cause economic loss and damage to reputation, even for national‑level applications. Building a disaster‑tolerance architecture is therefore essential for digital enterprises.
Core Concepts
What is Disaster Tolerance?
Disaster tolerance means deploying two or more functionally identical systems at geographically separated sites, with health monitoring and failover between them, so that if one site fails because of fire, flood, earthquake, or sabotage, the application can continue operating from another site.
How to Evaluate Disaster‑Tolerance Capability?
RPO (Recovery Point Objective): the maximum amount of data loss tolerated, expressed in time.
RTO (Recovery Time Objective): the maximum allowable downtime before the service must be restored, also expressed in time.
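As a rough illustration of how the two metrics are read off an incident timeline, the sketch below computes the achieved RPO and RTO from three timestamps and compares them with targets. The function names, timestamps, and target values are illustrative assumptions, not figures from the case studies.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated_at: datetime, failure_at: datetime) -> timedelta:
    """Data written after the last successful replication is lost on failover."""
    return failure_at - last_replicated_at


def achieved_rto(failure_at: datetime, restored_at: datetime) -> timedelta:
    """Downtime runs from the failure until the service is usable again."""
    return restored_at - failure_at


# Hypothetical incident timeline (all values are made up for illustration).
last_replicated_at = datetime(2024, 1, 1, 12, 0, 0)
failure_at = datetime(2024, 1, 1, 12, 0, 5)
restored_at = datetime(2024, 1, 1, 12, 3, 0)

rpo = achieved_rpo(last_replicated_at, failure_at)  # 5 seconds of writes at risk
rto = achieved_rto(failure_at, restored_at)         # 2 min 55 s of downtime

# Hypothetical targets: lose at most 10 s of data, recover within 5 minutes.
rpo_ok = rpo <= timedelta(seconds=10)
rto_ok = rto <= timedelta(minutes=5)
```

Both values are durations, so a drill passes only when each measured duration is at or below its target.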
AHAS‑MSHA Overview
MSHA (Multi‑Site High Availability) is a multi‑active disaster‑tolerance solution that decouples business recovery from fault recovery, enabling rapid restoration under failure scenarios. The architecture partitions the system into redundant, isolated logical data centers called units, keeps each request's traffic within a single unit, and thereby limits the explosion radius of a fault to that unit.
Key Features
Fast Fault Recovery: follows a “recover first, locate later” principle, separating business recovery time from fault recovery time.
Cross‑Region Capacity Expansion: allows rapid horizontal scaling by deploying additional units in other regions.
Traffic Distribution & Error Correction: validates traffic at each layer and redirects calls that violate routing rules, keeping the fault radius within a unit.
Dirty‑Write Protection: prevents writes to the wrong unit and protects data during synchronization delays.
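The last two features can be pictured as a check that every layer runs before handling a call. The sketch below is a minimal model under stated assumptions: the `Decision` type, the `check_call` function, and the `writes_paused` flag are hypothetical names chosen for illustration and do not reflect the MSHA agent's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str       # "serve", "redirect", or "reject"
    target_unit: str  # unit that should handle the call ("" when rejected)


def check_call(local_unit: str, owning_unit: str, is_write: bool,
               writes_paused: bool = False) -> Decision:
    """Decide what a unit should do with an incoming call.

    owning_unit is the unit the routing rule assigns to the caller.
    writes_paused models a window in which data synchronization has not
    caught up, so accepting a write could overwrite newer data.
    """
    if is_write and writes_paused:
        # Dirty-write protection: refuse rather than risk inconsistent data.
        return Decision("reject", "")
    if local_unit != owning_unit:
        # Error correction: the call violates the routing rule, so send it
        # to the unit that owns this user instead of serving it here.
        return Decision("redirect", owning_unit)
    return Decision("serve", local_unit)
```

For example, `check_call("beijing", "hangzhou", is_write=True)` yields a redirect to `hangzhou`, while the same call with `writes_paused=True` is rejected outright.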
Applicable Scenarios
Read‑Many‑Write‑Few Business: typical for content or product browsing services where reads dominate and occasional write unavailability is acceptable.
Transaction‑Heavy (transaction‑record) Business: e‑commerce order processing where reads and writes are tightly coupled and strong consistency is required.
Case Study 1 – Read‑Many‑Write‑Few
The e‑commerce platform was initially deployed in only a single region, so a failure of the product service caused a complete outage. The goal was to achieve “cross‑region multi‑read” for the storefront.
Migration Steps
Partition traffic by userId as the routing key.
Deploy the frontend and product services in two regions.
Configure multi‑active resources in the MSHA console.
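The first step, partitioning traffic by userId, amounts to a lookup from userId to unit. The sketch below assumes a range-based rule in which userIds 0 to 4999 belong to the Hangzhou unit and userIds from 5000 upward belong to the Beijing unit. These boundaries are an assumption chosen to agree with the drill results (userId=50 in Hangzhou, userId=6000 in Beijing); the actual rule is configured in the MSHA console and is not given in the article.

```python
from bisect import bisect_right

# Hypothetical rule table: RANGE_STARTS[i] is the first userId owned by
# RANGE_UNITS[i]. The real ranges are not stated in the article.
RANGE_STARTS = [0, 5000]
RANGE_UNITS = ["hangzhou", "beijing"]


def route(user_id: int) -> str:
    """Return the unit that owns this userId under the assumed range rule."""
    if user_id < 0:
        raise ValueError("userId must be non-negative")
    # bisect_right finds how many range starts are <= user_id.
    index = bisect_right(RANGE_STARTS, user_id) - 1
    return RANGE_UNITS[index]
```

Keeping the range starts sorted lets the lookup stay logarithmic as more units and ranges are added.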
Fault‑Injection Drill
Weak Dependency Test: inject a fault into the cart service; the storefront remained functional as expected.
Strong Dependency Test: inject a fault into the product service in the Beijing unit; users with userId=6000 experienced errors, confirming the fault impact.
Explosion‑Radius Verification: users with userId=50 were routed to the Hangzhou unit and were unaffected by the Beijing fault.
Cut‑Over Recovery
Using MSHA’s traffic‑cut‑over feature, the affected user (userId=6000) was switched to the Hangzhou unit, restoring normal storefront access.
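Conceptually, a cut-over of this kind replaces the routing rule so that ranges owned by the failed unit point at a healthy one. The sketch below models that as a pure function over a rule table. The rule format, the range boundaries, and the `cut_over` helper are hypothetical and only illustrate the idea; they are not the MSHA console's data model.

```python
from typing import List, Tuple

Rule = Tuple[int, int, str]  # (first userId, last userId, unit)


def route(rules: List[Rule], user_id: int) -> str:
    """Return the unit whose range contains user_id."""
    for first, last, unit in rules:
        if first <= user_id <= last:
            return unit
    raise LookupError(f"no rule covers userId={user_id}")


def cut_over(rules: List[Rule], failed_unit: str, healthy_unit: str) -> List[Rule]:
    """Return a new rule table with the failed unit's ranges reassigned."""
    return [(first, last, healthy_unit if unit == failed_unit else unit)
            for first, last, unit in rules]


# Assumed ranges, chosen to match the drill (50 in Hangzhou, 6000 in Beijing).
before = [(0, 4999, "hangzhou"), (5000, 9999, "beijing")]
after = cut_over(before, failed_unit="beijing", healthy_unit="hangzhou")
```

Because `cut_over` returns a new table instead of mutating the old one, the previous rule remains available for switching traffic back once the Beijing unit is repaired.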
Case Study 2 – Transaction‑Heavy Business
After the read‑many‑write‑few scenario had been stabilized, the order service suffered a large‑scale failure, which prompted a multi‑active redesign of the order flow.
Migration Steps
Deploy the order service and its database in two regions.
Install the MSHA agent on the order service to enable non‑intrusive Spring Cloud RPC cross‑unit routing and dirty‑write protection.
Fault‑Injection Drill
Inject a fault into the Beijing unit’s order service; users with userId=6000 experienced order failures as expected.
Verify explosion radius: users with userId=50 were routed to Hangzhou and remained unaffected.
Cut‑Over Recovery
MSHA’s cut‑over moved traffic for userId=6000 to the Hangzhou unit, after which that user could place orders normally again.
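For write traffic, the order of operations during a cut-over matters, because data written in the old unit may not yet have reached the new one. The toy model below shows one common sequencing: pause writes, wait for replication to drain, switch routing, then resume writes. It is a sketch under assumptions, not a description of MSHA internals: the class, its fields, and the idea that the replication backlog is a known counter are all hypothetical, and it presumes the old unit's database is still reachable so that the backlog can drain.

```python
class OrderCutOver:
    """Toy model of moving order writes from a failed unit to a healthy one."""

    def __init__(self, pending_changes: int, active_unit: str = "beijing"):
        # Changes written in the old unit but not yet replicated (assumed known).
        self.pending_changes = pending_changes
        self.active_unit = active_unit
        self.writes_allowed = True
        self.log = []

    def replicate_one(self) -> None:
        """Simulate one change arriving in the target unit."""
        if self.pending_changes > 0:
            self.pending_changes -= 1

    def run(self, target_unit: str) -> None:
        # 1. Pause writes so no new data lands in the old unit.
        self.writes_allowed = False
        self.log.append("writes paused")
        # 2. Let replication drain so the target unit sees every change.
        while self.pending_changes > 0:
            self.replicate_one()
        self.log.append("replication caught up")
        # 3. Switch routing only once the data is consistent.
        self.active_unit = target_unit
        self.log.append(f"routing switched to {target_unit}")
        # 4. Resume writes in the new unit.
        self.writes_allowed = True
        self.log.append("writes resumed")
```

Calling `OrderCutOver(pending_changes=3).run("hangzhou")` walks through the four steps in order. The time spent in the paused state is the part of the cut-over during which affected users cannot place orders.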
Conclusion
The article demonstrates how AHAS‑MSHA provides a powerful multi‑site high‑availability solution, covering both read‑dominant and transaction‑heavy e‑commerce scenarios, and shows how to use AHAS‑Chaos for realistic fault‑injection drills to validate RPO/RTO targets. It emphasizes that disaster‑tolerance is a systematic engineering effort requiring careful assessment of business needs, technology stack, and budget.
Alibaba Cloud Native
