How Does Geo‑Active Multi‑Active Architecture Achieve Near‑Zero Downtime?
This article explains the principles, evolution steps, and concrete techniques behind geo‑active multi‑active (distributed) system architectures, covering high‑availability metrics, redundancy patterns from single‑node to multi‑data‑center designs, data‑sync middleware, routing strategies, and the trade‑offs of each approach.
1. System Availability
Modern software is commonly held to three core architectural goals: high performance, high availability, and easy scalability. High performance means handling massive traffic with low latency (e.g., 10⁵ requests per second at a 5 ms response time). High availability is measured with MTBF (Mean Time Between Failures, how long the system runs between outages) and MTTR (Mean Time To Repair, how long it takes to recover) and is expressed as a percentage: Availability = MTBF / (MTBF + MTTR) × 100%. The result is often described in "nines" (e.g., four nines = 99.99%).
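To make the formula concrete, the following minimal Python sketch computes availability from MTBF and MTTR and converts a "nines" target into a yearly downtime budget. The function names and the 365‑day year are assumptions chosen for illustration; they do not come from the original article.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction in [0, 1]: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(availability_fraction: float) -> float:
    """Downtime allowed per 365-day year at the given availability."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability_fraction) * minutes_per_year


# Print the downtime budget for two to five nines.
for nines in (2, 3, 4, 5):
    target = 1.0 - 10.0 ** (-nines)
    budget = yearly_downtime_minutes(target)
    print(f"{nines} nines ({target:.5%}): {budget:.2f} minutes of downtime per year")
```

Under these assumptions, four nines leaves roughly 53 minutes of downtime per year, which is why recovery time (MTTR) matters as much as failure frequency.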
2. Single‑Machine Architecture
A naïve deployment consists of a single application server directly accessing a single‑node database. While simple, the database is a single point of failure: disk loss, OS crash, or accidental deletion leads to total data loss.
Backup (periodically copying the database files to another machine, e.g., with cp) mitigates data loss but introduces two problems: long recovery time (the service is down while the backup is restored) and data staleness (everything written after the last backup is lost). With every failure causing a long outage, availability may not reach even one nine (90%).
3. Master‑Slave Replication
Deploy a second database instance as a real‑time replica of the primary (master). This improves data integrity, fault tolerance (the slave can be promoted to master), and read performance (read traffic can be offloaded to the slave).
Stateless business services can also be duplicated across machines, and a load‑balancer (e.g., Nginx or LVS) distributes requests, ensuring that if one machine fails the others continue serving traffic.
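The failover behaviour of a load‑balancer can be sketched in a few lines of Python. This is a simplified round‑robin illustration, not how Nginx or LVS are implemented; the class name, the backend names, and the manual mark_down call (standing in for a real health check) are all hypothetical.

```python
class RoundRobinBalancer:
    """Hypothetical round-robin balancer that skips backends marked as down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        # Stand-in for a failed health check.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call, starting at the cursor.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend available")


lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                 # simulate one machine failing
print([lb.pick() for _ in range(4)])  # requests keep flowing to the other two
```

Because the business services are stateless, any healthy machine can take any request, which is what makes this kind of skipping safe.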
4. Risks at the Data‑Center Level
Even with redundant machines, a failure of the entire data‑center (e.g., power outage, network switch failure, natural disaster) can still bring the service down. Historical incidents (the Alipay fiber cut, the Bilibili (B‑Station) outage, the Futu Securities power loss) illustrate this risk.
5. Same‑City Disaster Recovery (Cold & Hot Backup)
Deploy a second data‑center in the same city and connect it via a dedicated line. Two backup modes exist:
Cold backup – only data is replicated (periodic copy). The standby site does not serve traffic, leading to long recovery times.
Hot backup – the standby site maintains a real‑time replica and a ready‑to‑serve load‑balancer. Traffic can be switched over quickly (for example by updating DNS, subject to record TTLs and client‑side caching), keeping downtime close to zero.
6. Same‑City Active‑Active (Dual‑Active)
Both data‑centers serve live traffic simultaneously. Reads can be served by either site, but to avoid write conflicts all writes must still go to the database in the primary site. This read‑write split can be implemented in the application or enforced by middleware.
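A minimal Python sketch of this read‑write split follows. The site names "site-a" and "site-b", the function name, and the choice of "site-a" as primary are assumptions made for illustration only.

```python
PRIMARY_SITE = "site-a"  # assumption: site-a holds the writable master database


def choose_database_site(operation: str, local_site: str) -> str:
    """Return the site whose database should handle this operation."""
    if operation == "write":
        # All writes go to the primary site to avoid write conflicts.
        return PRIMARY_SITE
    if operation == "read":
        # Reads are served by the replica in the caller's own site.
        return local_site
    raise ValueError(f"unknown operation: {operation!r}")


print(choose_database_site("read", "site-b"))   # handled locally in site-b
print(choose_database_site("write", "site-b"))  # sent to the primary site
```

One consequence worth keeping in mind is that a read served by the non‑primary site may briefly lag behind a write that was just sent to the primary, because replication is not instantaneous.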
Gradual traffic migration (10 % → 30 % → 100 %) allows validation of the standby site before full cut‑over.
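One way to implement such a ramp is to assign each user a stable bucket from 0 to 99 and send every bucket below the current percentage to the second site. The sketch below assumes this bucketing scheme and uses hypothetical site names; the article does not say how the split was actually implemented.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_site(user_id: str, second_site_percent: int) -> str:
    """Send roughly `second_site_percent` percent of users to site-b."""
    if not 0 <= second_site_percent <= 100:
        raise ValueError("second_site_percent must be between 0 and 100")
    return "site-b" if bucket(user_id) < second_site_percent else "site-a"


users = [f"user-{i}" for i in range(10_000)]
for percent in (10, 30, 100):
    moved = sum(pick_site(u, percent) == "site-b" for u in users)
    print(f"{percent}% target: {moved / len(users):.1%} of users on site-b")
```

Because the bucket depends only on the user ID, a user who was moved at 10 % stays on the second site at 30 % and 100 %, so nobody bounces between sites while the percentage is raised.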
7. Two‑City Three‑Center Architecture
This design runs two active sites in one city plus a third, geographically distant site used only for disaster backup. It mitigates city‑level disasters, but because the third site does not normally serve traffic, activating it still takes time.
8. Pseudo‑Geo‑Active Dual‑Active
Simply connecting two city‑level active sites with a cross‑city link leads to unacceptable latency (30 ms–100 ms round‑trip) and potential packet loss, making the architecture unsuitable for real‑time workloads.
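A quick calculation shows why this latency matters. The sketch below assumes a request whose local processing takes 5 ms and which makes a number of sequential cross‑city calls; the count of 10 calls is a hypothetical figure, while the 30 ms round‑trip is the lower bound quoted above.

```python
def request_latency_ms(local_ms: float, cross_site_calls: int, rtt_ms: float) -> float:
    """Total latency when each cross-site call waits one full round trip."""
    return local_ms + cross_site_calls * rtt_ms


# Same request, served entirely locally versus with 10 cross-city calls.
local_only = request_latency_ms(local_ms=5, cross_site_calls=0, rtt_ms=30)
cross_city = request_latency_ms(local_ms=5, cross_site_calls=10, rtt_ms=30)
print(f"local only: {local_only} ms, with cross-city calls: {cross_city} ms")
```

Under these assumptions a 5 ms request grows to 305 ms, and that is before accounting for retries caused by packet loss.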
9. True Geo‑Active Dual‑Active
Both sites maintain independent master databases and synchronize data bidirectionally using middleware (e.g., Alibaba Canal, RedisShake, MongoShake) or custom solutions. This eliminates cross‑site reads/writes, but introduces conflict‑resolution challenges when the same record is updated concurrently.
One option is to resolve conflicts by timestamp (the latest write wins), but this requires tightly synchronized clocks across sites and is therefore fragile.
A better approach is to prevent conflicts in the first place: route each user to a single site (sharding) so that a user’s writes never span sites.
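The fragility of timestamp‑based resolution can be illustrated with a small last‑write‑wins sketch. The Version type, the site names, and the 50 ms clock skew are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # wall-clock time as reported by the writing site
    site: str


def last_write_wins(a: Version, b: Version) -> Version:
    """Keep the version with the larger timestamp; break ties by site name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.site))


# Real order of events: Beijing writes at t=1000, Shanghai writes later at t=1020.
# Assumption: Beijing's clock runs 50 ms fast, so it stamps its write as 1050.
beijing_write = Version(value="old address", timestamp_ms=1000 + 50, site="beijing")
shanghai_write = Version(value="new address", timestamp_ms=1020, site="shanghai")

winner = last_write_wins(beijing_write, shanghai_write)
print(winner.value)  # the earlier Beijing write survives because of the skew
```

Here the update that actually happened last is silently discarded, which is the kind of failure that conflict prevention through routing avoids.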
10. Sharding Strategies for Conflict Avoidance
Three common routing rules:
Business‑type sharding – different services are bound to specific sites.
Hash‑based sharding – user ID hash determines the target site.
Geographic sharding – users are routed based on their physical location (e.g., Beijing users to Beijing site, Shanghai users to Shanghai site).
These ensure that a single user’s requests stay within one data‑center, eliminating cross‑site write conflicts.
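The hash‑based and geographic rules can be sketched as follows. The site list, the region table, and the fallback site are illustrative assumptions, and a real routing layer would sit in a gateway or middleware rather than in a standalone function.

```python
import hashlib

SITES = ["beijing", "shanghai"]  # assumption: two sites, as in the example above


def site_by_hash(user_id: str, sites=SITES) -> str:
    """Hash-based sharding: the same user ID always maps to the same site."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return sites[int.from_bytes(digest[:8], "big") % len(sites)]


# Hypothetical mapping from a user's region to the site that serves it.
REGION_TO_SITE = {"beijing": "beijing", "shanghai": "shanghai"}


def site_by_region(region: str, default: str = "beijing") -> str:
    """Geographic sharding: route by region, falling back to a default site."""
    return REGION_TO_SITE.get(region.lower(), default)


print(site_by_hash("user-42"))     # stable across calls and processes
print(site_by_region("Shanghai"))
print(site_by_region("Hangzhou"))  # unmapped region falls back to the default
```

Note that plain modulo hashing reassigns many users when the number of sites changes, so adding a site with this rule would need a migration plan or a more stable scheme.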
11. Geo‑Active Multi‑Active
Geo‑active multi‑active extends true geo‑active dual‑active to an arbitrary number of sites. Two synchronization models exist:
Mesh – every site replicates to every other site (complex as the number of sites grows).
Star – each site replicates only to a central hub, which then propagates changes to the others, simplifying synchronization at the cost of a single point of high load.
Both achieve near‑zero downtime, high scalability, and the ability to add new sites by simply defining routing rules.
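The difference in synchronization complexity between the two models can be quantified by counting one‑way replication channels. The sketch below assumes that the star hub is a dedicated node separate from the business sites and that every channel is counted once per direction; both are modelling assumptions rather than details from the article.

```python
def mesh_links(sites: int) -> int:
    """One-way channels when every site replicates to every other site."""
    return sites * (sites - 1)


def star_links(sites: int) -> int:
    """One-way channels when each site only talks to a dedicated hub."""
    return 2 * sites  # one channel to the hub and one back, per site


for n in (2, 3, 5, 10):
    print(f"{n} sites: mesh={mesh_links(n)} channels, star={star_links(n)} channels")
```

Under these assumptions the mesh grows quadratically while the star grows linearly, so the star model starts to pay off from four sites upward, at the price of concentrating all replication traffic on the hub.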
In summary, achieving high availability in large‑scale systems relies on redundancy at every layer: backups, master‑slave replication, same‑city disaster recovery, same‑city active‑active, two‑city three‑center, true geo‑active dual‑active, and finally geo‑active multi‑active architectures.
dbaplus Community