Mastering High Availability: From Cold Backups to Multi‑Region Active‑Active Architectures
This article examines high‑availability strategies for stateful backend services, covering cold backups, active‑standby, same‑city and cross‑city active‑active, and multi‑active designs, while discussing their trade‑offs, implementation details, and real‑world enterprise examples.
Preface
Backend services can be divided into stateless and stateful. Stateless services achieve high availability easily through load balancers, while this article focuses on stateful services.
State is typically kept on disk or in memory: in a database such as MySQL, in an in‑memory store such as Redis, or in JVM memory, where it usually has a relatively short lifetime.
High Availability
1. Common HA Solutions
High‑availability has evolved through several stages:
Cold backup
Active/Standby (dual‑machine hot standby)
Same‑city active‑active
Cross‑city active‑active
Cross‑city multi‑active
Understanding earlier solutions helps explain later designs.
Cold Backup
A cold backup copies the data files while the database is stopped, often with a plain file‑copy command (e.g., cp on Linux). It can be run manually or from a scheduled script, and it offers the following advantages:
Simple
Fast backup compared with other methods
Quick recovery by copying files back or adjusting configuration
Ability to restore to the point in time of any retained backup
However, cold backup has drawbacks for modern services:
Requires service downtime, which is unacceptable for 24/7 global applications
Data written between the last backup and the failure is lost unless logs are replayed manually
Every backup is a full copy, which wastes storage and takes time
Copying becomes impractical for large data volumes, and the backup cannot be limited to selected data
Balancing these pros and cons is a business decision.
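The cold‑backup procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names cold_backup and restore, the timestamped directory layout, and the stop_service and start_service callbacks are hypothetical stand‑ins for whatever actually stops and starts the database.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_root: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> Path:
    """Stop the service, copy its data directory, then start it again."""
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    stop_service()                          # downtime begins here
    try:
        shutil.copytree(data_dir, target)   # full copy, similar to `cp -r`
    finally:
        start_service()                     # restart even if the copy failed
    return target


def restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace the data directory with a backup (service must be stopped)."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

The sketch makes the main drawbacks visible: the service is down for the whole copy, every run copies everything, and anything written after the backup is absent from the restored directory.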
Active/Standby (Dual‑Machine Hot Standby)
Hot standby replicates data while the service remains online, although failover still requires a brief outage. This article does not cover shared‑disk approaches.
Active/Standby Mode
One primary node serves traffic while a backup node synchronizes data via software (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware (disk mirroring). Software replication is often called application‑level disaster recovery; hardware replication is data‑level disaster recovery.
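To make the idea of log‑based software replication concrete, here is a toy in‑memory model. It is not MySQL's replication protocol; the Primary and Standby classes and the position counter are hypothetical names used only to show how a standby catches up by replaying an ordered change log.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)   # ordered change log, in the spirit of a binlog

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        self.log.append((key, value))


@dataclass
class Standby:
    data: dict = field(default_factory=dict)
    position: int = 0                          # index of the next log entry to apply

    def sync(self, primary: Primary) -> int:
        """Apply log entries not yet seen and return how many were applied."""
        pending = primary.log[self.position:]
        for key, value in pending:
            self.data[key] = value
        self.position += len(pending)
        return len(pending)
```

The gap between len(primary.log) and standby.position is the replication lag: writes in that gap are the ones at risk if the primary fails before the next sync.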
Active‑Active Mutual Backup
Each machine is the primary for a different service and the standby for the other. This improves resource utilization and allows read‑write separation, but the two machines cannot act as primary for the same service at the same time.
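A minimal sketch of mutual backup, assuming two hypothetical nodes (node-a, node-b) and two hypothetical services (orders, users): each service has exactly one primary, and the peer takes over only when that primary is unhealthy.

```python
NODES = ("node-a", "node-b")

# Each service has exactly one primary; the other node is its standby.
PRIMARY_FOR = {"orders": "node-a", "users": "node-b"}


def peer(node: str) -> str:
    """Return the other node of the pair."""
    return NODES[1] if node == NODES[0] else NODES[0]


def serving_node(service: str, healthy: set[str]) -> str:
    """Return the node that should serve `service` right now."""
    primary = PRIMARY_FOR[service]
    if primary in healthy:
        return primary
    backup = peer(primary)
    if backup in healthy:
        return backup          # the standby takes over for this service
    raise RuntimeError(f"no healthy node for {service}")
```

With both nodes healthy, orders and users are served by different machines, so neither machine sits idle.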
Same‑City Active‑Active
Deploying two data centers within the same city mitigates a single‑site failure (power outage, network loss). The architecture resembles hot standby but with greater distance; latency remains low.
With support in the application code, both sites can accept reads and writes (true active‑active), though not every application can be adapted this way.
Many companies adopt a “two‑site, three‑center” model: two active data centers in one city plus a disaster‑recovery center in a remote city. The remote center only stores data and takes over when both active data centers fail.
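The takeover rule of this model could be sketched as follows. The site names and the modulo‑based spreading of requests are assumptions made for illustration; a real deployment would usually make this decision in DNS or a global load balancer.

```python
ACTIVE_SITES = ["city-a-dc1", "city-a-dc2"]   # hypothetical active sites
DR_SITE = "city-b-dr"                          # hypothetical remote disaster-recovery site


def choose_site(request_id: int, up: set[str]) -> str:
    """Spread requests over live active sites; use the DR site only as a last resort."""
    live = [site for site in ACTIVE_SITES if site in up]
    if live:
        return live[request_id % len(live)]
    if DR_SITE in up:
        return DR_SITE          # both active sites are down
    raise RuntimeError("no site is available")
```

Note that the disaster‑recovery site receives no traffic as long as at least one active site is up, which matches its role as a data‑only standby.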
Cross‑City Active‑Active
When a large‑scale outage occurs, traffic can be switched to a remote city, sacrificing user experience but maintaining service continuity.
Many large internet companies adopt cross‑city active‑active, despite the higher latency and the potential for data conflicts.
Cross‑City Multi‑Active
Extending the active‑active concept, each node connects to a local database cluster; failover to a remote cluster occurs only when the local cluster is completely unavailable.
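The local‑first rule can be illustrated with a small sketch. The region names, node names, and the CLUSTERS table are hypothetical; the point is only that a remote cluster is consulted after every node of the local cluster has been ruled out.

```python
# Hypothetical topology: each region has its own database cluster.
CLUSTERS = {
    "beijing": ["bj-db-1", "bj-db-2"],
    "shanghai": ["sh-db-1", "sh-db-2"],
}


def pick_database(region: str, healthy: set[str]) -> tuple[str, bool]:
    """Return (database node, used_remote) for an application node in `region`."""
    for node in CLUSTERS[region]:
        if node in healthy:
            return node, False              # stay inside the local cluster
    for other_region, nodes in CLUSTERS.items():
        if other_region == region:
            continue
        for node in nodes:
            if node in healthy:
                return node, True           # local cluster is completely down
    raise RuntimeError("no database cluster is available")
```

Returning the used_remote flag lets the caller log or alert on cross‑region access, which is the case where latency rises sharply.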
Longer synchronization times increase throughput loss and data conflicts. Solutions include distributed locks, eventual consistency, sharding, and specialized architectures such as “Global Zone” where writes are directed to a single master data center.
For applications with strict consistency requirements, a “Global Zone” provides cross‑region read‑write separation, routing all writes to a master data center while reads can be served locally. —《Ele.me Multi‑Region Active‑Active Technical Implementation (Part 1)》
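The routing rule described in the quotation can be sketched as below. This is not Ele.me's implementation; MASTER_DC, route_query, and the data center names are hypothetical, and the sketch ignores replication delay between the master and the local replicas.

```python
MASTER_DC = "dc-master"   # hypothetical data center that accepts all Global Zone writes


def route_query(operation: str, local_dc: str) -> str:
    """Return the data center that should handle a Global Zone query."""
    if operation == "write":
        return MASTER_DC          # every write goes to the single master
    if operation == "read":
        return local_dc           # reads are served from the caller's own data center
    raise ValueError(f"unknown operation: {operation}")
```

Because there is only one writer, write conflicts cannot occur; the price is cross‑region latency on every write issued outside the master data center, and local reads may briefly return stale data.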
Multi‑active is a stepping stone toward architectures that support horizontal scaling and reduce conflict risk.
Multi‑Active Architectures in Large Enterprises
Examples from Alibaba and Taobao illustrate how business units are sharded, with a central unit handling complex transactions and peripheral units handling simpler workloads. This requires extensive code refactoring, distributed transaction handling, and robust testing and operations.
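One way to picture this unit‑based sharding is a routing function like the one below. It is a sketch under assumptions: the source does not say which key is used for sharding, so the user ID, the CRC32 hash, the unit names, and the needs_central flag are all hypothetical.

```python
import zlib

UNITS = ["unit-0", "unit-1", "unit-2"]   # hypothetical peripheral units
CENTRAL_UNIT = "central"                 # hypothetical central unit


def unit_for(user_id: str, needs_central: bool = False) -> str:
    """Pick the unit that should handle a request for `user_id`."""
    if needs_central:
        return CENTRAL_UNIT              # complex transactions go to the central unit
    # A stable hash keeps every request of one user in the same unit.
    return UNITS[zlib.crc32(user_id.encode("utf-8")) % len(UNITS)]
```

Keeping all of a user's requests in one unit means that user's data is written in only one place, which is what reduces cross‑unit conflicts.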
Implementing such disaster‑recovery levels demands strong foundational capabilities: data transfer, verification, and a simplified data access layer.
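As an illustration of the verification capability, the sketch below compares two copies of a table held as plain dictionaries. The function names and the dictionary representation are assumptions; a production tool would typically checksum rows in chunks directly in the database.

```python
import hashlib


def table_digest(rows: dict[str, str]) -> str:
    """Hash all rows in key order so that equal tables give equal digests."""
    digest = hashlib.sha256()
    for key in sorted(rows):
        digest.update(repr((key, rows[key])).encode("utf-8"))
    return digest.hexdigest()


def diff_keys(source: dict[str, str], replica: dict[str, str]) -> list[str]:
    """List keys that are missing on one side or hold different values."""
    return sorted(
        key for key in source.keys() | replica.keys()
        if source.get(key) != replica.get(key)
    )
```

Comparing digests first is cheap; the per‑key diff only needs to run when the digests disagree.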
Source: https://blog.dogchao.cn/?p=299
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
