Why Geo‑Active‑Active Architecture Is the Key to Ultra‑High System Availability

This article explains the principles behind geo‑active‑active (multi‑active) architectures, covering system availability metrics, redundancy strategies from single‑node backups to same‑city and cross‑city active‑active deployments, data‑sync challenges, routing and sharding techniques, and how these designs dramatically improve reliability and scalability.

Efficient Ops

Oct 28, 2021

System Availability

Good architecture rests on three principles: high performance, high availability, and easy scalability. Geo‑active‑active addresses the second. Availability is derived from two metrics, MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair): Availability = MTBF / (MTBF + MTTR) × 100%. Raising availability therefore means either failing less often or recovering faster.
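As a quick illustration of the formula, the sketch below computes availability from MTBF and MTTR and converts an availability figure into expected downtime per year. The function names and the sample numbers are made up for this example; they do not come from the article.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction in [0, 1]: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability fraction."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative numbers only: one failure every 1,000 hours, 1 hour to repair.
sample = availability(mtbf_hours=1000, mttr_hours=1)
print(f"availability: {sample:.4%}")
print(f"downtime per year: {annual_downtime_minutes(sample):.0f} minutes")

# Downtime budget for a few common availability targets.
for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {annual_downtime_minutes(target):.1f} min/year")
```

Halving MTTR has roughly the same effect on the result as doubling MTBF, which is why the rest of the article concentrates on how quickly a system can recover.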

Single‑Machine Architecture

A single‑node deployment is easy to understand, but if the only database fails, its data is lost for good. Periodic backups limit the loss, yet restoring from a backup takes a long time and everything written since the last copy is gone, so high‑availability targets remain out of reach.

Master‑Slave Replication

Adding a replica (slave) that replicates the master's data in real time improves data integrity, fault tolerance, and read performance: if the master fails, the replica can be promoted, and reads can be served from either node. Running multiple stateless application instances behind a load balancer (e.g., Nginx) further reduces single points of failure.
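The read‑performance benefit comes from sending reads to replicas while writes still go to the master. A minimal sketch of that split follows; the class name, node names, and the naive "starts with SELECT" check are assumptions for illustration, not something the article prescribes.

```python
import itertools


class ReadWriteRouter:
    """Sends writes to the master and spreads reads across replicas."""

    def __init__(self, master: str, replicas: list[str]):
        self.master = master
        # Fall back to the master for reads when no replica is configured.
        self._read_cycle = itertools.cycle(replicas or [master])

    def pick(self, sql: str) -> str:
        """Return the node name that should execute this statement."""
        # Naive classification: anything that is not a SELECT counts as a write.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._read_cycle) if is_read else self.master


router = ReadWriteRouter(master="db-master", replicas=["db-replica-1", "db-replica-2"])
print(router.pick("SELECT * FROM orders WHERE id = 1"))
print(router.pick("UPDATE orders SET status = 'paid' WHERE id = 1"))
```

A real router would also need to handle replication lag (a read right after a write may not see it on a replica) and statements such as SELECT ... FOR UPDATE, which this sketch would misroute.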

Redundancy

Redundancy—deploying multiple instances of services and hardware—addresses failures at the server, rack, and data‑center levels.
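To see why redundancy helps, consider a rough model in which each copy fails independently and one surviving copy is enough to serve traffic. Both conditions are simplifying assumptions of this sketch, and the function name is hypothetical.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant instances.

    Assumes failures are independent and that a single surviving
    instance can carry the full load.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


# Each instance is assumed to be 99% available on its own.
for n in (1, 2, 3):
    print(f"{n} instance(s): {combined_availability(0.99, n):.6f}")
```

In practice failures are correlated: instances in the same rack or data center tend to fail together. That is the reason redundancy has to be spread across racks, data centers, and eventually cities.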

Same‑City Disaster Recovery

To protect against data‑center failures, a second data‑center in the same city is added. Two strategies exist:

Cold backup: periodic data copies to the secondary site; recovery requires manual activation.

Hot backup: real‑time master‑slave replication to the secondary site, so that site can take over quickly once its replicas are promoted.

Same‑City Active‑Active (双活)

Both data‑centers serve live traffic simultaneously, sharing load and providing immediate switchover without downtime. This requires full redundancy of applications, databases, and a synchronized load‑balancing layer.

Two‑City Three‑Center (两地三中心)

Two active data‑centers remain in one city while a third, geographically distant center provides cold backup for disaster recovery, protecting against city‑wide incidents.

True Cross‑Region Active‑Active (异地双活)

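Both regions must host independent master databases and synchronize data bidirectionally. Data‑sync middleware handles the replication for each store, for example Canal for MySQL, RedisShake for Redis, and MongoShake for MongoDB. Because both sides accept writes, the same record can be modified in two regions at once. Such conflicts are either merged automatically after the fact, typically by timestamp (the newer write wins), or avoided at the source by routing each user's requests to a single region.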
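Timestamp‑based merging is commonly implemented as last‑write‑wins. The sketch below shows one possible form of it; the record layout, field names, and region names are assumptions for illustration and are not taken from any of the tools mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    updated_at_ms: int  # writer's wall-clock timestamp, in milliseconds
    region: str         # region that accepted the write


def merge_lww(local: Versioned, remote: Versioned) -> Versioned:
    """Last-write-wins: keep the version with the newer timestamp.

    Ties are broken on the region name so that both regions pick the
    same winner regardless of which copy they see as "local".
    """
    return max(local, remote, key=lambda v: (v.updated_at_ms, v.region))


a = Versioned("alice@old.example", updated_at_ms=1_000, region="region-a")
b = Versioned("alice@new.example", updated_at_ms=1_005, region="region-b")

# Both regions converge on the same result, whichever side runs the merge.
print(merge_lww(a, b) == merge_lww(b, a))
print(merge_lww(a, b).value)
```

This approach depends on reasonably synchronized clocks across regions, and the losing write is silently discarded, which is one reason avoiding conflicts through routing is usually preferred.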

Routing and Sharding

A routing layer directs user requests to a specific region, preventing cross‑region writes. Common sharding methods are:

Business‑type sharding – different services run in different regions.

Hash sharding – user IDs are hashed to assign a region.

Geographic sharding – users are routed based on physical location.

These approaches ensure that a user’s entire workflow stays within one data‑center, eliminating cross‑region latency and write conflicts.
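Hash sharding with a failover path could look roughly like the sketch below. The region names, function names, and the choice of SHA‑256 are assumptions made for this example. A stable hash is used because Python's built‑in hash() for strings is randomized per process by default and would route the same user differently after a restart.

```python
import hashlib

REGIONS = ("region-a", "region-b")  # hypothetical region names


def region_for_user(user_id: str, regions: tuple[str, ...] = REGIONS) -> str:
    """Map a user ID to one region using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[bucket]


def route(user_id: str, healthy: set[str]) -> str:
    """Send the user to their home region, or to a surviving region on failure."""
    home = region_for_user(user_id)
    if home in healthy:
        return home
    survivors = tuple(r for r in REGIONS if r in healthy)
    if not survivors:
        raise RuntimeError("no healthy region available")
    return region_for_user(user_id, survivors)


# Normal operation: every request from the same user lands in the same region.
print(route("user-42", healthy={"region-a", "region-b"}))
# region-a is down: all users are served from region-b.
print(route("user-42", healthy={"region-b"}))
```

Before traffic is moved to a surviving region, data written in the failed region must already have been replicated there; otherwise the rerouted user may read stale data or create the kind of conflict the routing layer was meant to prevent.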

Implementation Tips

Key steps include pre‑deploying applications and load‑balancers in the standby region, establishing bidirectional data sync, and designing routing rules that keep related services together.

Summary

Good architecture follows high performance, high availability, and easy scalability.

Fast recovery, not just fault avoidance, defines true high availability; geo‑active‑active is an effective method.

Redundancy—backup, master‑slave, same‑city disaster recovery, active‑active, two‑city three‑center, cross‑region active‑active, and multi‑region active‑active—progressively improves reliability.

Active‑active setups require synchronized data, routing layers, and conflict‑free sharding to achieve seamless failover and better performance.

Overall, the article provides a comprehensive roadmap from basic single‑node setups to sophisticated multi‑region active‑active architectures, emphasizing redundancy, data synchronization, routing, and conflict avoidance to achieve ultra‑high availability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
