
Mastering Distributed High Availability: From Single‑Node to Multi‑Active Architecture

This comprehensive guide explains why modern software systems need geo‑distributed multi‑active architectures, walks through the evolution from basic single‑node setups to master‑slave replication, same‑city disaster recovery, dual‑active, two‑city three‑center, and true multi‑active designs, and highlights the key principles, risks, and implementation strategies for achieving ultra‑high availability.

Ops Development Stories

Mar 10, 2022


01 System Availability

To understand geo‑distributed multi‑active, we start with the three core principles of good software architecture: high performance, high availability, and easy scalability.

High performance means handling heavy traffic with low latency, e.g., serving 100,000 requests per second with a response time of about 5 ms.

Easy scalability means the system can expand with minimal cost, adding capacity without code changes.

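High availability is measured with two quantities: MTBF (Mean Time Between Failures), the average time the system runs between failures, and MTTR (Mean Time To Repair), the average time needed to restore service after a failure. Availability is computed as:

Availability = MTBF / (MTBF + MTTR) × 100%

The result is usually expressed as a number of "nines" (99.9% is three nines, 99.99% is four nines).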

Reaching four nines (99.99%) leaves a downtime budget of about 8.64 seconds per day, or roughly 52.6 minutes per year.
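The arithmetic behind these figures can be checked with a short sketch. The function names `availability` and `downtime_budget` are made up for this illustration, and the year is assumed to have 365 days.

```python
from typing import Tuple

SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction (0..1)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget(nines: int) -> Tuple[float, float]:
    """Allowed downtime for N nines: (seconds per day, minutes per year)."""
    unavailable = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return unavailable * SECONDS_PER_DAY, unavailable * SECONDS_PER_YEAR / 60


for n in (2, 3, 4, 5):
    per_day, per_year = downtime_budget(n)
    print(f"{n} nines: {per_day:9.3f} s/day, {per_year:9.2f} min/year")
```

For example, a system that runs 999 hours between failures and takes 1 hour to repair has `availability(999, 1)` of 0.999, which is three nines.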

Failures fall into three categories: hardware, software bugs, and force‑majeure events. The faster a system can recover, the higher its availability.

02 Single‑Machine Architecture

A typical start‑up system uses a single application server and a single‑node database. This design is easy to build, but the database is a single point of failure: a disk crash or an accidental deletion loses all data.

Backup copies mitigate data loss but introduce two problems: recovery time (service downtime) and data staleness (backups are not real‑time).

03 Master‑Slave Replication

Deploy a second database instance as a real‑time replica (slave) of the primary (master). This improves data safety (a second, near‑real‑time copy always exists), fault tolerance (the slave can be promoted if the master fails), and read performance (read traffic can be offloaded to the slave).
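A minimal sketch of read/write splitting and promotion is shown below. The `ReadWriteRouter` class, the node names, and the rule "statements starting with SELECT are reads" are assumptions made for illustration, not part of any specific database driver.

```python
import itertools


class ReadWriteRouter:
    """Send writes to the master and spread reads across the slaves."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        self._reset_cycle()

    def _reset_cycle(self):
        # Round-robin iterator over replicas; None when there are no replicas.
        self._cycle = itertools.cycle(self.replicas) if self.replicas else None

    def route(self, sql):
        """Return the node that should execute this statement."""
        is_read = sql.lstrip().upper().startswith("SELECT")  # simplistic check
        if is_read and self._cycle is not None:
            return next(self._cycle)
        return self.primary

    def promote(self, replica):
        """Failover: the chosen replica becomes the new master."""
        self.replicas.remove(replica)
        self.primary = replica
        self._reset_cycle()


router = ReadWriteRouter("db-master", ["db-slave-1"])
print(router.route("SELECT * FROM users"))      # read goes to the slave
print(router.route("UPDATE users SET age = 1"))  # write goes to the master
```

When replication is asynchronous, a read sent to the slave may briefly return stale data, so reads that must see the latest write are usually kept on the master.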

Similarly, stateless application servers can be duplicated, and a load balancer (e.g., Nginx or LVS) distributes requests, providing redundancy.
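The idea of spreading requests over redundant application servers can be sketched as a round‑robin picker that skips servers marked unhealthy. This is only an illustration of the concept, not how Nginx or LVS are implemented; the class and server names are hypothetical.

```python
class RoundRobinBalancer:
    """Rotate through backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, or raise if none is available."""
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backend available")


lb = RoundRobinBalancer(["app-1", "app-2"])
print([lb.pick() for _ in range(4)])  # alternates between the two servers
lb.mark_down("app-1")
print([lb.pick() for _ in range(3)])  # only app-2 while app-1 is down
```

Because the application servers are stateless, any healthy server can handle any request, which is what makes this kind of redundancy straightforward.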

04 Uncontrollable Risks

Even with multiple servers, if they reside in the same rack or cabinet, a single switch or router failure can bring the whole service down. Distributing servers across different cabinets reduces this risk, but the entire data center remains a single failure domain.

05 Same‑City Disaster Recovery

To protect against data‑center failures, build a second data center in the same city (B site) and connect it with a dedicated line. Data is regularly backed up from A to B (cold standby). However, cold standby suffers from the same recovery latency as simple backups.

Using master‑slave replication between A and B creates a hot standby: B holds a live replica and can take over quickly, reducing downtime dramatically.

06 Same‑City Dual‑Active

Instead of keeping B as a passive standby, deploy services in B as well and route a portion of traffic to it. This provides real‑time load sharing and ensures B is fully exercised, so when A fails, traffic can be switched to B with minimal impact.

07 Two‑City Three‑Center

To survive city‑level disasters, add a third data center (C) in a different city. C typically runs as a cold backup, receiving periodic data copies from A and B. This three‑center model is common in finance and government projects.

08 Pseudo Geo Dual‑Active

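Simply moving the second data center to another city while keeping the same master‑slave relationship does not work well. Services in the remote site must still read from and write to the master database in the other city, and every cross‑city round trip costs roughly 30‑100 ms, so a request that touches the database several times becomes noticeably slower for users.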
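A rough worked example shows how quickly this adds up. The figure of 10 sequential database calls per request and the 1 ms same‑site round trip are assumptions chosen for illustration; only the 30‑100 ms cross‑city range comes from the text.

```python
def db_time_ms(sequential_calls, round_trip_ms):
    """Total time spent waiting on the database for one request."""
    return sequential_calls * round_trip_ms


CALLS_PER_REQUEST = 10  # assumed number of sequential database calls

scenarios = [
    ("same data center (assumed 1 ms)", 1),
    ("cross-city, low end (30 ms)", 30),
    ("cross-city, high end (100 ms)", 100),
]
for label, rtt in scenarios:
    total = db_time_ms(CALLS_PER_REQUEST, rtt)
    print(f"{label}: {total} ms of database wait per request")
```

Under these assumptions the database wait grows from about 10 ms to between 300 ms and 1 s per request, which is why this layout is only "pseudo" dual‑active.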

09 Real Geo Dual‑Active

Both data centers must host independent master databases and synchronize data bidirectionally. This eliminates cross‑city read/write latency but introduces conflict resolution challenges when the same record is updated simultaneously in both locations.

Conflict handling can be done either by automatic merge based on timestamps (requiring tightly synchronized clocks) or by preventing conflicts through request routing.
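The timestamp‑based merge is often called "last write wins". A minimal sketch follows; the `Version` record, its field names, and the tie‑break on site name are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # wall-clock time of the write at the originating site
    site: str          # data center that accepted the write


def last_write_wins(a: Version, b: Version) -> Version:
    """Keep the version with the newer timestamp.

    Ties are broken by site name so that both data centers pick the
    same winner regardless of the order in which they see the updates.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.site))


from_a = Version("address: Beijing", timestamp_ms=1_000, site="A")
from_b = Version("address: Shanghai", timestamp_ms=1_005, site="B")
print(last_write_wins(from_a, from_b).value)
```

The weakness is visible in the key: if site A's clock runs a few milliseconds ahead, an older write from A can overwrite a newer one from B, which is why this approach depends on tightly synchronized clocks and why routing requests to avoid conflicts is usually preferred.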

10 How to Implement Geo Dual‑Active

Introduce a routing layer that directs users to a specific data center based on business type, hash of user ID, or geographic location. This ensures a user's requests stay within one data center, avoiding cross‑city data access.

Three common routing strategies:

Business‑type sharding – different services run in different data centers.

Hash sharding – user IDs are hashed to decide the target data center.

Geographic sharding – users are routed to the nearest data center.
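The three strategies above can be sketched as three small routing functions. The data center names and the mapping tables are hypothetical; a real routing layer would also need rules for failover and for moving users between sites.

```python
import hashlib

DATA_CENTERS = ["dc-a", "dc-b"]                             # hypothetical sites
BUSINESS_MAP = {"orders": "dc-a", "community": "dc-b"}      # hypothetical services
REGION_MAP = {"north": "dc-a", "south": "dc-b"}             # hypothetical regions


def route_by_business(service):
    """Business-type sharding: each service lives in one data center."""
    return BUSINESS_MAP[service]


def route_by_hash(user_id):
    """Hash sharding: a stable hash of the user ID picks the data center.

    hashlib is used instead of the built-in hash(), whose result for
    strings changes between Python processes.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return DATA_CENTERS[int.from_bytes(digest[:8], "big") % len(DATA_CENTERS)]


def route_by_region(region):
    """Geographic sharding: users go to the data center serving their region."""
    return REGION_MAP[region]


print(route_by_business("orders"))
print(route_by_hash("user-42"))
print(route_by_region("south"))
```

Whichever rule is used, the key property is that it is deterministic: the same user always lands in the same data center, so that user's data is written in only one place.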

Even with routing, global data that cannot be partitioned by user (e.g., configuration, inventory) may still need a single primary for writes, with read replicas in the other data centers.

11 Geo Multi‑Active

Extending geo dual‑active to multiple data centers creates a mesh or star topology. In a star topology, a central hub data center synchronizes data from all edge sites, simplifying synchronization at the cost of higher hub reliability requirements.
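The trade‑off between the two topologies can be made concrete by counting synchronization links, assuming every pair of sites in a mesh syncs directly and every edge site in a star syncs only with the hub. The function names are made up for this illustration.

```python
def mesh_links(sites):
    """Full mesh: every pair of data centers synchronizes directly."""
    return sites * (sites - 1) // 2


def star_links(sites):
    """Star: every edge data center synchronizes only with the hub."""
    return max(sites - 1, 0)


for n in (2, 3, 5, 10):
    print(f"{n} sites: mesh={mesh_links(n)} links, star={star_links(n)} links")
```

With 10 sites a full mesh needs 45 bidirectional links while a star needs 9, but in the star every synchronization path passes through the hub.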

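The result is a truly geo‑distributed multi‑active system: capacity can be expanded by adding data centers, and the failure of any single site affects only the traffic routed to it, which can then be switched to the remaining sites.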

Summary

A good software architecture follows high performance, high availability, and easy scalability. High availability becomes critical as scale grows, and redundancy—through backups, master‑slave, same‑city disaster recovery, dual‑active, two‑city three‑center, geo dual‑active, and geo multi‑active—is the key to achieving it.

Cold standby only stores copies of the data; hot standby keeps a live replica that is ready for fast failover. Dual‑active serves real traffic from both sites, improving performance and availability. Two‑city three‑center adds a remote cold backup for city‑level disasters. Geo dual‑active provides active‑active service across cities, while geo multi‑active scales this model to many locations.

Postscript

The article presents the high‑level concepts of geo‑multi‑active architecture. Implementing it requires extensive infrastructure: routing, data‑sync middleware, conflict resolution, monitoring, and operational processes.

For deeper technical details, readers can follow the author's public account and request additional resources.

I am Kaito, a senior backend engineer who not only explains what a technology is but also why it should be used, and tries to distill these thoughts into reusable methodologies.


Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
