How Internet Giants Achieve Multi‑Region High Availability: Strategies and Trade‑offs
This article examines the evolution of high‑availability architectures—from cold backup to multi‑active deployments—detailing the technical trade‑offs, design patterns, and real‑world implementations used by large internet companies such as Alibaba, Eleme, and Taobao.
Introduction
Multi‑region active‑active (often called "multi‑active") has become the preferred high‑availability deployment model for large‑scale internet companies. Companies like Alibaba, Tencent, Baidu, NetEase, and Sina have already rebuilt their systems to support this architecture.
Stateful vs. Stateless Services
Backend services are either stateless or stateful. Stateless services achieve high availability easily behind a load balancer (e.g., F5) because no state needs to be synchronized between instances. The remainder of this article focuses on stateful services, which persist data on disk or in memory in stores such as MySQL and Redis.
Evolution of HA Solutions
Cold backup
Hot standby (dual‑machine)
Same‑city active‑active
Cross‑city active‑active
Cross‑city multi‑active
Understanding earlier solutions helps explain why later designs emerged.
Cold Backup
Cold backup copies data files while the database is stopped. It is simple, backs up quickly, and restores easily: copy the files back or point the database at the backup directory, and the data returns to its state at the moment of the backup. However, it requires service downtime, loses every write made after the most recent backup, and consumes a lot of storage because each backup is a full copy.
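The file-copy approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names and the timestamped folder layout are assumptions, and the sketch assumes the database has already been stopped by the caller.

```python
import shutil
import time
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole data directory into a new timestamped folder.

    Assumes the caller has already stopped the database, so the files
    on disk do not change while they are being copied.
    """
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    shutil.copytree(data_dir, target)  # full copy: every backup costs the full data size
    return target


def cold_restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace the data directory with the contents of one backup."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)  # writes made after the backup are not recovered
```

Restoring brings the data back to the state captured by the chosen backup, which is why anything written between that backup and the failure is lost.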
Hot Standby (Active/Standby)
Hot standby synchronizes data from a primary node to a backup node without stopping service. When the primary fails, the standby is promoted to primary. Synchronization can be software‑based (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware‑based (disk mirroring, data‑level disaster recovery). Dual‑machine mutual backup (active/active) is also possible: each machine acts as primary for some businesses and as standby for the others, so every individual business still relies on a primary/secondary relationship.
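To make the promotion step concrete, here is a toy in-memory model of a primary/standby pair. It is a sketch under stated assumptions: the `HotStandbyPair` class, its list-based log (a stand-in for something like a binlog), and the explicit `replicate()` call are all hypothetical and do not correspond to any real database API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries this node has applied


class HotStandbyPair:
    """Toy primary/standby pair with log-based replication."""

    def __init__(self) -> None:
        self.primary = Node("primary")
        self.standby = Node("standby")
        self.log: list[tuple[str, str]] = []  # stand-in for a replication log

    def write(self, key: str, value: str) -> None:
        # The primary serves the write and appends it to the log.
        self.primary.data[key] = value
        self.log.append((key, value))
        self.primary.applied = len(self.log)

    def replicate(self) -> None:
        # The standby replays the entries it has not applied yet.
        for key, value in self.log[self.standby.applied:]:
            self.standby.data[key] = value
        self.standby.applied = len(self.log)

    def failover(self) -> int:
        """Promote the standby and return how many writes it never received."""
        lost = len(self.log) - self.standby.applied
        self.primary, self.standby = self.standby, Node("replacement")
        self.log = self.log[: self.primary.applied]  # keep only what the new primary has
        return lost
```

A write that reaches the primary but is not replicated before the failure is missing on the promoted node, which is the window that asynchronous replication leaves open.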
Same‑City Active‑Active
Same‑city active‑active extends hot standby across a metropolitan area using dedicated fiber links. It provides disaster recovery for an entire IDC (power outage, network cut) while keeping latency low. The architecture is similar to hot standby; the main benefit is better resource utilization through read‑write separation.
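Read‑write separation across two IDCs might look like the following sketch. The `ReadWriteRouter` class and the IDC names are illustrative assumptions; a real deployment would do this in a proxy or data access layer and would also have to account for replication lag on the read side.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the primary IDC and spread reads over all IDCs."""

    def __init__(self, primary: str, replicas: list[str]) -> None:
        self.primary = primary
        # Reads rotate over the primary and every replica in turn.
        self._readers = itertools.cycle([primary, *replicas])

    def route(self, is_write: bool) -> str:
        """Return the name of the IDC that should serve this request."""
        return self.primary if is_write else next(self._readers)
```

With one primary and one replica, reads alternate between the two sites while every write still lands on the primary, so the second site does useful work instead of sitting idle.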
Cross‑City Active‑Active (Two‑City Three‑Center)
In this model, two nearby data centers in the same city host the primary services, and a third, distant center serves as a disaster‑recovery site. Traffic is load‑balanced to the local IDC; if one IDC fails, traffic fails over to the other IDC in the same city, and only when both local IDCs are down does traffic shift to the remote center. Diagrams in the original article illustrate the flow of requests and data synchronization.
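The failover order (local IDC first, then the nearby IDC, then the remote disaster‑recovery site) can be expressed as a simple preference list. The IDC names and the `FAILOVER_ORDER` table below are hypothetical, and health checking is reduced to a set of names passed in by the caller.

```python
from typing import Optional

# Hypothetical topology: two IDCs in city A and one disaster-recovery IDC in city B.
FAILOVER_ORDER = {
    "idc-a1": ["idc-a1", "idc-a2", "idc-b-dr"],
    "idc-a2": ["idc-a2", "idc-a1", "idc-b-dr"],
}


def pick_idc(local_idc: str, healthy: set[str]) -> Optional[str]:
    """Return the first healthy IDC in the failover order for this client."""
    for idc in FAILOVER_ORDER[local_idc]:
        if idc in healthy:
            return idc
    return None  # every site is down
```

For example, `pick_idc("idc-a1", {"idc-a2", "idc-b-dr"})` selects `idc-a2`, and the remote site is chosen only when neither local IDC is in the healthy set.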
Cross‑City Multi‑Active
In a multi‑active deployment every node carries traffic in all directions (the source describes each node as having four inbound and outbound links), so the failure of any single node does not interrupt service. The trade‑off is higher write latency caused by long‑distance synchronization, which reduces throughput and makes conflicting writes to the same data more likely. Distributed locks or distributed transactions can mitigate conflicts but add complexity.
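A small sketch shows how such a data conflict arises when two zones accept writes to the same record before replication catches up. The version check used here is a simple optimistic scheme chosen for illustration; it is an assumption, not the mechanism used by any company named in this article, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    value: str
    version: int = 0  # incremented on every applied change


@dataclass(frozen=True)
class Update:
    base_version: int  # version the writer saw before changing the value
    value: str


def write_local(record: Record, value: str) -> Update:
    """Apply a write in the local zone and return the update to replicate."""
    update = Update(record.version, value)
    record.value = value
    record.version += 1
    return update


def apply_remote(record: Record, update: Update) -> bool:
    """Apply a replicated update; return False when it conflicts."""
    if update.base_version != record.version:
        return False  # the local copy changed since the remote writer read it
    record.value = update.value
    record.version += 1
    return True


# Both zones start from the same state and accept a write concurrently.
zone_a, zone_b = Record("pending"), Record("pending")
update_a = write_local(zone_a, "paid")
update_b = write_local(zone_b, "cancelled")
conflict_in_a = not apply_remote(zone_a, update_b)
conflict_in_b = not apply_remote(zone_b, update_a)
```

Both zones reject the other's update, so the two copies stay different until something resolves the conflict. Avoiding this situation is the motivation for routing all writes for a given piece of data to a single zone.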
Eleme’s “Global Zone” solution enforces strong consistency by directing all writes to a master zone while allowing reads from any zone, leveraging a custom DAL (Data Access Layer) to keep the application unaware of the underlying routing.
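The routing idea behind such a data access layer can be sketched as follows. This is not Eleme's implementation: the `GlobalZoneDAL` class, the dictionary‑backed zones, and the explicit `replicate_from_master` step are assumptions made to keep the example self‑contained.

```python
class GlobalZoneDAL:
    """Toy data access layer: writes go to the master zone, reads stay local."""

    def __init__(self, zones: dict[str, dict], master: str, local: str) -> None:
        self.zones = zones    # zone name -> key/value store (stand-in for a database)
        self.master = master  # the only zone that accepts writes
        self.local = local    # the zone this application instance runs in

    def write(self, key: str, value: str) -> None:
        # The application calls write() as usual; the DAL picks the master zone.
        self.zones[self.master][key] = value

    def read(self, key: str):
        # Reads are served from the local zone's copy.
        return self.zones[self.local].get(key)


def replicate_from_master(zones: dict[str, dict], master: str) -> None:
    """Copy the master's data to every other zone (hypothetical replication step)."""
    for name, store in zones.items():
        if name != master:
            store.update(zones[master])
```

Because only one zone accepts writes, two zones can never produce conflicting versions of the same record. In this sketch a read issued before replication runs returns the older local data; how a real system bounds that lag is outside the scope of the example.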
Industry Implementations
Alibaba’s ideal multi‑active architecture shards data by province/city, handling each shard's write traffic in its local zone and replicating the data to other zones for reads. Taobao adopts a unit‑based approach: a central unit handles complex business and synchronizes bidirectionally with regional units, while regional units handle simpler workloads and can scale independently.
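Sharding by province/city amounts to mapping each request to a home zone that owns its writes. The mapping table, zone names, and default zone below are invented for illustration and do not describe any real deployment.

```python
# Hypothetical mapping from province to the zone that owns its writes.
PROVINCE_TO_ZONE = {
    "Zhejiang": "zone-east",
    "Guangdong": "zone-south",
    "Beijing": "zone-north",
}
DEFAULT_ZONE = "zone-east"  # assumed fallback for provinces not listed above


def home_zone(province: str) -> str:
    """Return the zone that owns writes for users in this province."""
    return PROVINCE_TO_ZONE.get(province, DEFAULT_ZONE)


def route_write(province: str, entry_zone: str) -> tuple[str, bool]:
    """Return (target zone, whether the request must be forwarded across zones)."""
    target = home_zone(province)
    return target, target != entry_zone
```

A write that enters through its own home zone is served locally, while one that enters elsewhere is forwarded, which keeps every shard's writes in a single zone at the cost of an extra cross‑zone hop.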
These designs require extensive code refactoring, distributed transaction handling, cache invalidation strategies, and robust testing/ops pipelines. The complexity often limits adoption to large enterprises.
Thought Questions
If you deploy Eleme‑style multi‑active with sharding by province/city, how would you handle a user located at the intersection of four cities?
Which of your current services can realistically adopt multi‑active, and which cannot?
Is multi‑active necessary for all services, or only for core business functions?
References
Eleme “Multi‑Active Technical Implementation (Part 1) – Overview” https://zhuanlan.zhihu.com/p/32009822
Eleme Framework Tools Blog https://zhuanlan.zhihu.com/eleme-arch
Alibaba “Cross‑Region Multi‑Active and Same‑City Active‑Active Architecture Evolution” https://www.sohu.com/a/158859741_444159
Alibaba Cloud “Database Cross‑Region Multi‑Active Solution” https://help.aliyun.com/document_detail/72721.html
“Cross‑Region Multi‑Active Is Not That Hard” https://wely.iteye.com/blog/2313293
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
