
Mastering High Availability: From Cold Backup to Multi‑Region Active‑Active

This article analyzes various high‑availability strategies for stateful backend services—covering cold backup, dual‑machine hot standby, same‑city active‑active, remote active‑active, and multi‑region active‑active architectures—detailing their benefits, limitations, and practical implementation considerations.

IT Architects Alliance

Background

Backend services can be stateless or stateful. Stateless services achieve high availability (HA) easily with load balancers (e.g., F5). Stateful services must preserve state on disks (MySQL, etc.) or memory stores (Redis, JVM), which requires explicit HA mechanisms.

Evolution of HA Solutions

Cold backup

Dual‑machine hot standby (active/standby)

Same‑city active‑active

Remote active‑active

Remote multi‑active

Cold Backup

A cold backup creates a point‑in‑time snapshot by copying the data files while the service is stopped (e.g., with cp on Linux). Advantages: it is simple, backup and restore are fast, and you can roll back to a specific moment, which is useful for incident recovery. Drawbacks: it requires service downtime, every snapshot consumes full‑volume storage, and any writes made between the last snapshot and a failure are lost.
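The copy-and-restore cycle can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names cold_backup and restore and the timestamped directory layout are assumptions made for this example, and the sketch assumes the service that owns the data directory has already been stopped.

```python
import shutil
import time
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped directory under backup_root.

    Assumes the owning service is stopped, so files do not change mid-copy.
    """
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_root / f"{data_dir.name}-{stamp}"
    # copytree creates missing parent directories and fails if target exists.
    shutil.copytree(data_dir, target)
    return target


def restore(snapshot: Path, data_dir: Path) -> None:
    """Replace data_dir with the snapshot (a full-volume restore)."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(snapshot, data_dir)
```

Restoring always returns the data to the moment of the snapshot, which is exactly why writes made after that moment are lost.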

Dual‑Machine Hot Standby

Also called active/standby. One primary node serves traffic while a secondary continuously replicates data. Replication may be software‑based (MySQL master/slave via binlog, SQL Server transactional replication) or hardware‑based (disk mirroring). Failover is performed by promoting the standby to active.

Active/Standby mode: Master serves requests; standby receives binlog or block‑level copies and acts as a disaster‑recovery backup.

Dual‑machine mutual backup: Each machine acts as master for different services, enabling read‑write separation and better resource utilization, but the two cannot serve the same business simultaneously.
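The promotion step in active/standby mode can be sketched as a heartbeat check. The Node type, the role strings, and the 5-second timeout below are hypothetical choices for illustration. A real failover also has to confirm that the standby has applied all replicated data and fence off the old primary, which this sketch deliberately leaves out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "primary", "standby", or "failed"
    last_heartbeat: float  # seconds on a shared monotonic clock


def failover_if_needed(primary: Node, standby: Node, now: float,
                       timeout: float = 5.0) -> Node:
    """Return the node that should serve traffic.

    The standby is promoted only when the primary's last heartbeat
    is older than `timeout` seconds.
    """
    if now - primary.last_heartbeat > timeout:
        primary.role = "failed"
        standby.role = "primary"
        return standby
    return primary
```

The timeout is the main tuning knob: too short and a brief network hiccup triggers an unnecessary promotion, too long and the outage lasts longer than it needs to.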

Same‑City Active‑Active

Two active nodes reside in the same city, connected by high‑speed dedicated lines. Both handle traffic, eliminating the standby resource waste while still providing disaster resilience. Because the two sites must keep their data synchronized, latency becomes a concern as the distance between them grows.

Remote Active‑Active

Two data centers (IDC1, IDC2) host active clusters; a third site (IDC3) holds a backup. Load balancers route traffic to the nearest active site; failover occurs when a site becomes unavailable. This mitigates single‑site failures but introduces network latency to the remote site, potentially degrading user experience.


Diagram: Simple remote active‑active setup.
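The routing rule described above (send traffic to the nearest active site, and fail over when that site is unavailable) can be sketched as follows. The Site type and its latency_ms and healthy fields are assumptions for this example; in practice these values would come from health checks and latency probes run by the load balancer or DNS layer.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Site:
    name: str
    latency_ms: float  # measured latency from the client to this site
    healthy: bool      # result of the most recent health check


def pick_site(sites: Sequence[Site]) -> Site:
    """Choose the lowest-latency site among those currently healthy."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda s: s.latency_ms)
```

When the nearest site fails its health check it simply drops out of the candidate list, so the same call returns the remote site. That is the moment users start paying the extra network latency.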

Remote Multi‑Active

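Extends remote active‑active to a three‑center topology where each node connects to four others. The failure of any single node does not interrupt service. The cost is cross‑site replication lag: write latency increases, and two sites can accept writes to the same record before either has seen the other's change, which produces data conflicts. Conflict mitigation techniques include distributed locks, sharding (so that a given record is written at only one site), and eventual consistency mechanisms.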


Diagram: Remote multi‑active architecture.
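One common eventual-consistency mechanism is last-write-wins: when two sites hold different versions of the same record, every site keeps the version with the latest timestamp. The article does not say which mechanism any particular company uses, so the sketch below is only an illustration of the idea. The Version type is hypothetical, and the approach assumes reasonably synchronized clocks across sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp_ms: int  # write time at the originating site
    site: str          # originating site, used only to break ties


def merge(a: Version, b: Version) -> Version:
    """Last-write-wins merge.

    The site name breaks timestamp ties, so every site picks the same
    winner regardless of the order in which versions arrive.
    """
    return max(a, b, key=lambda v: (v.timestamp_ms, v.site))
```

The trade-off is that the losing write is silently discarded, which is acceptable for some data (a profile field) and unacceptable for others (an account balance). The latter is where locks or single-site sharding come in.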

Strong Consistency Example (Global Zone)

For applications requiring strict consistency, a Global Zone directs all writes to a single master data center while reads can be served from any slave. This provides strong consistency without exposing complexity to the business layer.

Source: "Ele.me Multi‑Region Active‑Active Technical Implementation (Part 1): Overview"
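The Global Zone routing rule can be sketched as a small router: writes always go to the single master data center, and reads go to a replica in the caller's own site when one exists. The class name, the endpoint strings, and the fallback to the master for sites without a replica are assumptions for this example, not details taken from the cited source.

```python
from typing import Mapping


class GlobalZoneRouter:
    """Route writes to one master site and reads to a local replica."""

    def __init__(self, master: str, replicas: Mapping[str, str]) -> None:
        self.master = master      # endpoint of the single write master
        self.replicas = replicas  # site name -> local replica endpoint

    def endpoint_for(self, operation: str, local_site: str) -> str:
        if operation == "write":
            return self.master
        if operation == "read":
            # Fall back to the master if this site has no replica.
            return self.replicas.get(local_site, self.master)
        raise ValueError(f"unknown operation: {operation!r}")
```

Keeping this decision inside a router (or a database proxy) is what lets the business layer stay unaware of where the master lives. Reads served from a replica can still lag the master by the replication delay, so a read that must observe the latest write would need to be sent to the master as well.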

Practical Insights

Large internet companies (e.g., Ele.me, Alibaba) typically migrate through the stages: dual‑machine hot standby → same‑city active‑active → remote active‑active → remote multi‑active. Each transition demands extensive code refactoring, sharding, distributed transactions, and robust testing pipelines.

For e‑commerce platforms with complex inter‑service dependencies, a unit‑based sharding strategy (as used by Taobao) isolates critical business units while allowing elastic scaling of peripheral services.
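The core of a unit-based design is a stable mapping from a sharding key to a unit, so that all requests for the same user land in the same unit. The sketch below uses a hash of the user ID for this. The unit names and the choice of user ID as the key are hypothetical, and the article does not describe Taobao's actual routing rule.

```python
import hashlib
from typing import Sequence

# Hypothetical unit names, for illustration only.
UNITS = ("unit-a", "unit-b", "unit-c")


def unit_for_user(user_id: str, units: Sequence[str] = UNITS) -> str:
    """Map a user ID to a unit with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process and would route the same user differently on each host.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]
```

Plain modulo hashing reshuffles most users whenever the number of units changes, so a real deployment would more likely use a lookup table or consistent hashing for the mapping.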

Implementing remote multi‑active at scale requires strong foundational capabilities: reliable data transfer, integrity verification, and simplified client‑side write/sync controls.

Conclusion

Multi‑active architectures deliver superior fault tolerance but introduce higher latency, conflict‑resolution complexity, and operational overhead. Organizations must balance these trade‑offs against business continuity requirements and invest in the necessary infrastructure to support robust disaster recovery.
