How Internet Giants Achieve Multi‑Region High Availability: Strategies and Trade‑offs

This article examines the evolution of high‑availability architectures—from cold backup to multi‑active deployments—detailing the technical trade‑offs, design patterns, and real‑world implementations used by large internet companies such as Alibaba, Eleme, and Taobao.

Architects' Tech Alliance

Introduction

Multi‑region active‑active (often called "multi‑active") has become the preferred high‑availability deployment model for large‑scale internet companies. Companies like Alibaba, Tencent, Baidu, NetEase, and Sina have already rebuilt their systems to support this architecture.

Stateful vs. Stateless Services

Backend services are either stateless or stateful. Stateless services achieve high availability easily behind a load balancer (e.g., F5), because no state needs to be synchronized between instances. The remainder of this article focuses on stateful services, which keep data on disk or in memory in data stores such as MySQL and Redis.
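As a rough illustration of why stateless services are the easy case, the sketch below rotates requests across interchangeable backends and skips any that are marked unhealthy. The class name, backend names, and health-marking methods are assumptions made for this example; they do not describe F5 or any other real load balancer.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotates requests across backends, skipping ones marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend available")


# Hypothetical backend names; any instance can serve any request.
lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
picks = [lb.pick() for _ in range(4)]
print(picks)
```

Because no instance holds state, removing "app-2" costs nothing beyond capacity. Stateful services have no such shortcut, which is what the rest of the article addresses.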

Evolution of HA Solutions

Cold backup

Hot standby (dual‑machine)

Same‑city active‑active

Cross‑city active‑active

Cross‑city multi‑active

Understanding earlier solutions helps explain why later designs emerged.

Cold Backup

Cold backup copies the data files while the database is stopped. It is simple and fast, and recovery consists of copying the files back or pointing the database at the backup's data directory, which restores the state as of the time the backup was taken. However, it requires service downtime, loses every write made after the backup, and consumes a lot of storage because each backup is a full copy.
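A minimal sketch of the cold-backup cycle, assuming the database process has already been stopped and its files live in a single directory. The directory layout, file name, and backup label are hypothetical; a temporary directory stands in for real storage.

```python
import shutil
import tempfile
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path, label: str) -> Path:
    """Take a full copy of data_dir. Assumes the database is stopped."""
    target = backup_root / label
    shutil.copytree(data_dir, target)  # creates missing parent directories
    return target


def cold_restore(backup: Path, data_dir: Path) -> None:
    """Replace data_dir with the backup; later writes are lost."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup, data_dir)


# Demo in a temporary directory (all names are placeholders).
work = Path(tempfile.mkdtemp())
data = work / "data"
data.mkdir()
(data / "table.db").write_text("row1\n")

snapshot = cold_backup(data, work / "backups", "2024-01-01")
(data / "table.db").write_text("row1\nrow2\n")  # written after the backup

cold_restore(snapshot, data)
restored = (data / "table.db").read_text()
print(repr(restored))
```

The restored file no longer contains "row2", which is the data-loss window described above.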

Hot Standby (Active/Standby)

Hot standby synchronizes data from a primary node to a backup node without stopping service. When a failure occurs, the standby becomes primary. Synchronization can be software‑based (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware‑based (disk mirroring, data‑level disaster recovery). Dual‑machine mutual backup (active/active) is also possible but still relies on a primary/secondary relationship for each business.
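The toy model below illustrates log-based hot standby under simplifying assumptions: an in-memory list plays the role of the change log (the binlog, in MySQL terms), replication is triggered by hand, and failover simply promotes the standby. All class and method names are invented for the example and do not correspond to any database API.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.applied = 0  # number of log entries applied so far


class HotStandbyPair:
    def __init__(self):
        self.primary = Node("primary")
        self.standby = Node("standby")
        self.log = []  # ordered change log

    def write(self, key, value):
        # Writes are served by the primary without stopping the service.
        self.log.append((key, value))
        self.primary.data[key] = value
        self.primary.applied = len(self.log)

    def replicate(self):
        # The standby replays the entries it has not applied yet.
        for key, value in self.log[self.standby.applied:]:
            self.standby.data[key] = value
        self.standby.applied = len(self.log)

    def failover(self):
        # Promote the standby; entries it never received are lost.
        lost = len(self.log) - self.standby.applied
        self.log = self.log[: self.standby.applied]
        self.primary = self.standby
        self.standby = Node("rebuilt-standby")  # needs a full resync
        return lost


pair = HotStandbyPair()
pair.write("a", 1)
pair.replicate()
pair.write("b", 2)  # primary fails before this entry is replicated
lost = pair.failover()
print(pair.primary.name, pair.primary.data, "lost writes:", lost)
```

The gap between the last replicated entry and the failure is the replication lag; with asynchronous replication those writes are what a failover can lose.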

Same‑City Active‑Active

Same‑city active‑active extends hot standby across a metropolitan area using dedicated fiber links. It provides disaster recovery for an entire IDC (power outage, network cut) while keeping latency low. The architecture is similar to hot standby; the main benefit is better resource utilization through read‑write separation.
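Read-write separation can be sketched as a small router that sends writes to the primary and spreads reads across replicas, which is how the second IDC does useful work instead of sitting idle. The node names are hypothetical and the SQL classification is deliberately naive (for instance, it would send a locking read such as SELECT ... FOR UPDATE to a replica, which a real router must not do).

```python
class ReadWriteRouter:
    """Sends plain SELECTs to replicas in rotation, everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._next = 0

    def route(self, sql):
        words = sql.split()
        if not words:
            raise ValueError("empty statement")
        if words[0].upper() == "SELECT" and self.replicas:
            target = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return target
        return self.primary


# Hypothetical topology: primary in IDC A, one replica in each IDC.
router = ReadWriteRouter("idc-a-primary", ["idc-a-replica", "idc-b-replica"])
statements = [
    "SELECT * FROM orders WHERE id = 1",
    "UPDATE orders SET status = 'paid' WHERE id = 1",
    "select count(*) from orders",
]
targets = [router.route(s) for s in statements]
print(targets)
```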

Cross‑City Active‑Active (Two‑City Three‑Center)

In this model, two data centers (IDCs) in the same city host the primary services, and a third, distant center serves as a disaster‑recovery site. Traffic is load‑balanced to the local IDCs; if one local IDC fails, traffic fails over to the other, and only when both local IDCs are down does traffic shift to the remote center.
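The failover order in this model amounts to a priority list: the remote center is used only when neither local IDC is healthy. The sketch below assumes health status is supplied from elsewhere, and the IDC names are placeholders.

```python
def pick_idc(failover_order, healthy):
    """Return the first healthy IDC in priority order."""
    for idc in failover_order:
        if idc in healthy:
            return idc
    raise RuntimeError("no healthy IDC available")


# Hypothetical names: two IDCs in city A, one disaster-recovery IDC in city B.
ORDER = ["city-a-idc-1", "city-a-idc-2", "city-b-dr"]

all_up = pick_idc(ORDER, {"city-a-idc-1", "city-a-idc-2", "city-b-dr"})
one_down = pick_idc(ORDER, {"city-a-idc-2", "city-b-dr"})
both_down = pick_idc(ORDER, {"city-b-dr"})
print(all_up, one_down, both_down)
```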

Cross‑City Multi‑Active

In a multi‑active topology every node has four inbound and four outbound links to its peers, so the failure of any single node does not affect service. The trade‑off is higher write latency caused by long‑distance synchronization, which reduces throughput and can lead to conflicting writes to the same data in different regions. Distributed locks or distributed transactions can mitigate the conflicts, at the cost of added complexity.
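To make the conflict problem concrete, the sketch below shows two zones accepting a write to the same record before either has seen the other's change. It resolves the conflict with last-write-wins, which is a technique chosen here purely for illustration; the article itself names distributed locks and transactions, not last-write-wins. The zone names, timestamps, and values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # assumes reasonably synchronized clocks
    zone: str


def merge_lww(a: Versioned, b: Versioned) -> Versioned:
    # Later timestamp wins; the zone name breaks ties so that
    # every zone deterministically picks the same winner.
    return max(a, b, key=lambda v: (v.timestamp, v.zone))


# The same user record is updated in two zones at nearly the same time.
north = Versioned("address=Beijing", 100.0, "zone-north")
east = Versioned("address=Shanghai", 100.5, "zone-east")

# Each zone merges its own version with the one replicated from its peer.
winner_in_north = merge_lww(north, east)
winner_in_east = merge_lww(east, north)
print(winner_in_north.value, winner_in_east.value)
```

Both zones converge on the same value, but the losing write is silently discarded. For data where that is unacceptable, the locks or transactions mentioned above, or routing all writes for a record to a single zone, are the usual alternatives.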

Eleme’s “Global Zone” solution enforces strong consistency by directing all writes to a master zone while allowing reads from any zone, leveraging a custom DAL (Data Access Layer) to keep the application unaware of the underlying routing.
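The idea of hiding zone routing behind a data access layer can be sketched as follows. This is not Eleme's DAL: the class, method names, and zone names are assumptions, dictionaries stand in for databases, and replication is simulated by copying each write to the other zones synchronously, since the text above does not describe the real mechanism.

```python
class GlobalZoneDAL:
    """Application-facing data layer: callers never name a zone."""

    def __init__(self, local_zone, master_zone, stores):
        self.local_zone = local_zone
        self.master_zone = master_zone
        self.stores = stores  # zone name -> dict standing in for a database

    def put(self, key, value):
        # Every write goes to the master zone, whatever zone we run in.
        self.stores[self.master_zone][key] = value
        # Stand-in for replication from the master to the other zones.
        for zone, store in self.stores.items():
            if zone != self.master_zone:
                store[key] = value

    def get(self, key):
        # Reads are served from the zone the application runs in.
        return self.stores[self.local_zone].get(key)


# Hypothetical zones sharing one master.
stores = {"zone-master": {}, "zone-a": {}, "zone-b": {}}
dal_a = GlobalZoneDAL("zone-a", "zone-master", stores)
dal_b = GlobalZoneDAL("zone-b", "zone-master", stores)

dal_a.put("shop:42", "open")      # application code in zone A
seen_in_b = dal_b.get("shop:42")  # application code in zone B
print(seen_in_b)
```

Because a single zone orders all writes, there is nothing to reconcile between zones; the price is that every write from a non-master zone pays the cross-region round trip.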

Industry Implementations

Alibaba’s ideal multi‑active architecture shards data by province/city, handling each shard's writes in its local zone and replicating the data to other zones for reads. Taobao adopts a unit‑based approach: a central unit handles complex business and synchronizes bidirectionally with regional units, while regional units handle simpler workloads and can scale independently.
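A unit-based layout with geographic sharding might route requests roughly as sketched below: services that have been made unit-capable are routed by the user's province, and everything else goes to the central unit. The service names, province-to-unit mapping, and unit names are all hypothetical and are not taken from Alibaba's or Taobao's systems.

```python
CENTRAL_UNIT = "central"

# Hypothetical shard map: which regional unit owns which province.
REGION_UNITS = {
    "zhejiang": "unit-east",
    "guangdong": "unit-south",
}

# Hypothetical set of services simple enough to run inside regional units.
UNITIZED_SERVICES = {"cart", "order"}


def route(service: str, province: str) -> str:
    """Pick the unit that should handle a request."""
    if service not in UNITIZED_SERVICES:
        return CENTRAL_UNIT  # complex business stays in the central unit
    # Provinces without a regional unit fall back to the central unit.
    return REGION_UNITS.get(province.lower(), CENTRAL_UNIT)


print(route("order", "Zhejiang"))
print(route("settlement", "Zhejiang"))
print(route("order", "Hainan"))
```

Keeping the shard key stable per user means that all of a user's writes land in one unit, which is what keeps cross-unit write conflicts rare in this design.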

These designs require extensive code refactoring, distributed transaction handling, cache invalidation strategies, and robust testing/ops pipelines. The complexity often limits adoption to large enterprises.

Thought Questions

If you deploy Eleme‑style multi‑active with sharding by province/city, how would you handle a user located at the intersection of four cities?

Which of your current services can realistically adopt multi‑active, and which cannot?

Is multi‑active necessary for all services, or only for core business functions?

References

Eleme “Multi‑Active Technical Implementation (Part 1) – Overview” https://zhuanlan.zhihu.com/p/32009822

Eleme Framework Tools Blog https://zhuanlan.zhihu.com/eleme-arch

Alibaba “Cross‑Region Multi‑Active and Same‑City Active‑Active Architecture Evolution” https://www.sohu.com/a/158859741_444159

Alibaba Cloud “Database Cross‑Region Multi‑Active Solution” https://help.aliyun.com/document_detail/72721.html

“Cross‑Region Multi‑Active Is Not That Hard” https://wely.iteye.com/blog/2313293

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
