
Understanding Distributed System High Availability: From Single‑Node to Multi‑Active Architecture

This article explains the principles, evolution, and implementation details of high‑availability architectures—from basic single‑node setups to multi‑active, cross‑region deployments—covering redundancy, disaster recovery, data synchronization, routing strategies, and the challenges of achieving true geo‑distributed active‑active systems.


01 System Availability

To understand distributed multi‑active architectures, we must start with the basic principles of system design.

Modern software systems are expected to meet three core principles: high performance, high availability, and easy scalability.

High performance means handling large traffic with low latency, e.g., processing 100k requests per second with 5 ms response time.

Easy scalability means the system can expand with minimal cost, adding capacity without code changes.

High availability is measured by two metrics: MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair). Availability = MTBF / (MTBF + MTTR) × 100%.

Achieving four‑9s (99.99%) availability allows at most about 8.64 seconds of downtime per day (roughly 52.6 minutes per year).
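To make the formula concrete, here is a minimal sketch that computes availability from MTBF/MTTR and the downtime budget implied by an availability target (the example failure rates are illustrative, not from any real system):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def daily_downtime_budget_seconds(target: float) -> float:
    """Maximum allowed downtime per day for a given availability target."""
    return (1.0 - target) * 24 * 3600

# A system that fails once every 1000 hours and takes ~6 minutes to repair:
print(round(availability(1000, 0.1), 5))               # 0.9999 (four nines)
print(round(daily_downtime_budget_seconds(0.9999), 2)) # 8.64 seconds per day
```

Note how the formula rewards shrinking MTTR just as much as stretching MTBF: halving repair time buys the same availability as doubling time between failures.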

Failures can stem from hardware, software bugs, or force majeure, and rapid recovery is essential.

02 Single‑Machine Architecture

We start with the simplest case: a single‑machine deployment where the client talks directly to a single database instance.

This design is vulnerable because a single point of failure (disk crash, OS crash, data loss) can wipe all data.

Backup (periodic copy to another machine) mitigates data loss but introduces two problems: recovery time (service downtime) and data staleness (backups are not real‑time).

Therefore, backup alone cannot meet high‑availability requirements.

03 Master‑Slave Replication

Deploy a second database instance as a real‑time replica of the primary (master‑slave). Benefits include higher data integrity, fault tolerance (failover to slave), and read‑performance improvement.

Application servers can also be replicated across machines, and a load balancer (e.g., Nginx or LVS) distributes traffic, ensuring continuity if one server fails.
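The load-balancing idea can be sketched in a few lines. This is a toy round-robin balancer with failover, not how Nginx or LVS is implemented; the server names are invented for illustration:

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin load balancer that skips unhealthy servers."""
    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        """Health check failed: stop routing traffic to this server."""
        self.healthy.discard(server)

    def pick(self):
        """Return the next healthy server in rotation."""
        for _ in range(len(self.servers)):
            s = next(self._cycle)
            if s in self.healthy:
                return s
        raise RuntimeError("no healthy servers")

lb = RoundRobinBalancer(["app1", "app2", "app3"])
lb.mark_down("app2")
print([lb.pick() for _ in range(4)])  # "app2" never appears
```

Real balancers add active health checks, connection draining, and weighted algorithms, but the core contract is the same: a failed server is removed from rotation and traffic continues uninterrupted.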

04 Uncontrollable Risks

Even with multiple machines, if they share the same rack or switch, a single hardware failure can still cause outage.

Distributing machines across different racks reduces risk, but the entire data center remains a single point of failure.

Historical incidents (fiber cuts, data‑center outages) illustrate that data‑center‑level failures can affect millions of users.

05 Same‑City Disaster Recovery

To protect against data‑center failures, deploy a second data center in the same city and synchronize data via backup (cold standby) or real‑time replication (hot standby).

Cold standby stores data copies but does not serve traffic; hot standby keeps a synchronized copy ready for immediate failover.

06 Same‑City Active‑Active

Both data centers serve live traffic simultaneously, providing load sharing and instant failover.

However, the secondary data center’s storage is typically a read‑only replica (slave), so write traffic must still be routed to the primary data center.
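The resulting read/write split can be sketched as a tiny router. The data-center names are hypothetical; the point is that reads stay local while writes always cross to the primary:

```python
class SameCityRouter:
    """Sketch: route reads to the local DC, writes to the primary DC."""
    def __init__(self, primary_dc: str, local_dc: str):
        self.primary_dc = primary_dc
        self.local_dc = local_dc

    def route(self, operation: str) -> str:
        # Writes must reach the master database in the primary DC;
        # reads can be served by the local read-only replica.
        return self.primary_dc if operation == "write" else self.local_dc

router = SameCityRouter(primary_dc="dc-a", local_dc="dc-b")
print(router.route("read"))   # dc-b
print(router.route("write"))  # dc-a
```

This is why same-city active-active improves read throughput more than write throughput: every write still pays the cost of a single master.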

07 Two‑City Three‑Center

To survive city‑wide disasters, add a third data center in a different city (cold standby) while the first two remain active‑active.

08 Pseudo‑Active‑Active Across Cities

Simply extending same‑city active‑active to different cities introduces high network latency (30‑100 ms) and potential packet loss, making cross‑city reads/writes impractical.

09 True Active‑Active Across Cities

Both data centers must host independent master databases and synchronize data bidirectionally using middleware (e.g., Canal, RedisShake, MongoShake) to avoid cross‑city latency.
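One subtlety of bidirectional replication is loop prevention: each change must carry its origin, or the two replicators will echo writes back and forth forever. A minimal sketch of the idea (the change format and names are illustrative, not the actual protocol of Canal or similar tools):

```python
# Tag each change with its origin DC so a bidirectional replicator
# does not apply, and re-ship, a change that started locally.
def replicate(change: dict, local_dc: str, apply):
    if change["origin"] == local_dc:
        return False          # originated here; skip to break the loop
    apply(change)
    return True

applied = []
# A change from the remote DC is applied locally:
replicate({"origin": "dc-east", "key": "u1", "value": 1}, "dc-west", applied.append)
# A change that originated locally and came back around is dropped:
replicate({"origin": "dc-west", "key": "u2", "value": 2}, "dc-west", applied.append)
print(len(applied))  # 1
```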

Conflict resolution is required when concurrent writes occur; solutions include automatic merge based on timestamps (requires synchronized clocks) or preventing conflicts by routing users to a single data center.
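The timestamp-based merge mentioned above is usually called last-writer-wins (LWW). A sketch, assuming reasonably synchronized clocks, with a deterministic tie-breaker so both data centers converge to the same value:

```python
def last_writer_wins(local: dict, remote: dict) -> dict:
    """Merge two conflicting versions of a record by timestamp (LWW).

    Assumes clocks are reasonably synchronized. Ties are broken by the
    lexically larger data-center id, so both sides pick the same winner.
    """
    if remote["ts"] != local["ts"]:
        return remote if remote["ts"] > local["ts"] else local
    return remote if remote["dc"] > local["dc"] else local

a = {"value": "nick_a", "ts": 100, "dc": "dc-east"}
b = {"value": "nick_b", "ts": 105, "dc": "dc-west"}
print(last_writer_wins(a, b)["value"])  # nick_b
```

LWW silently discards the losing write, which is why the article's second option, preventing conflicts by pinning each user to one data center, is often preferred for business-critical data.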

10 Implementing Active‑Active

Three common routing/sharding strategies ensure a user’s requests stay within one data center:

Business‑type sharding (different services bound to specific data centers).

Hash‑based sharding (user ID modulo routing table).

Geographic sharding (route users based on location).
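The second strategy is the simplest to sketch. A hash-based router that pins every user to one data center, so all of that user's reads and writes stay within a single DC (data-center names invented for the example):

```python
def route_user(user_id: int, data_centers: list[str]) -> str:
    """Hash-based sharding: the same user always lands in the same DC."""
    return data_centers[user_id % len(data_centers)]

dcs = ["dc-east", "dc-west", "dc-north"]
print(route_user(1001, dcs))  # 1001 % 3 == 2 -> dc-north
print(route_user(1002, dcs))  # 1002 % 3 == 0 -> dc-east
```

Plain modulo reshuffles most users whenever a data center is added or removed; production routing tables usually map hash buckets to DCs through a level of indirection so that only the moved buckets migrate.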

Global data that requires strong consistency (e.g., configuration, inventory) may still use a primary‑slave model.

11 Multi‑Active (Geo‑Distributed)

Extend true active‑active to multiple regions. A star topology can simplify synchronization by using a central hub data center that aggregates writes and propagates them to all others.
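The star topology's appeal is that each data center synchronizes with only the hub instead of with every peer (N links instead of N×(N−1)/2). A toy sketch of the fan-out, with invented DC names:

```python
class HubReplicator:
    """Star-topology sketch: writes flow through a central hub DC,
    which aggregates them and propagates them to every other DC."""
    def __init__(self, hub: str, spokes: list[str]):
        self.hub = hub
        self.spokes = spokes
        self.logs = {dc: [] for dc in [hub] + spokes}

    def write(self, origin: str, change):
        self.logs[self.hub].append(change)       # hub sees every write
        for dc in self.spokes:
            if dc != origin:                     # origin already has it
                self.logs[dc].append(change)

r = HubReplicator("dc-hub", ["dc-a", "dc-b", "dc-c"])
r.write("dc-a", "x=1")
print(r.logs["dc-b"])  # ['x=1']
print(r.logs["dc-a"])  # []  (the origin applied it locally first)
```

The trade-off is that the hub becomes its own critical dependency, so real deployments back it with a standby hub or fall back to pairwise links if it fails.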

This architecture offers unlimited scalability, rapid failover, and high availability.

Summary

1. Good software architecture follows high performance, high availability, and easy scalability.

2. Rapid recovery after failures is the essence of high availability; multi‑active architectures are effective.

3. Redundancy (backup, master‑slave, disaster recovery, active‑active, multi‑active) is the core technique.

4. Same‑city disaster recovery includes cold standby (data only) and hot standby (data + ready service).

5. Same‑city active‑active improves both availability and performance.

6. Two‑city three‑center adds a remote disaster‑recovery site but requires activation time.

7. True active‑active across cities eliminates single‑city failure risk but adds complexity.

8. Multi‑active builds on active‑active, allowing unlimited geographic expansion and maximum availability.

Tags: distributed systems, system architecture, high availability, disaster recovery, data replication, active-active
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies.