Mastering High Availability: From Cold Backup to Multi‑Active Architecture
This article examines high‑availability strategies for stateful backend services, covering cold backup, dual‑machine hot standby, same‑city active‑active, and remote multi‑active solutions, while discussing their benefits, trade‑offs, and architectural patterns for resilient distributed systems.
Preface
Backend services can be classified as stateless or stateful. High availability is straightforward for stateless applications, which only need a load balancer or proxy in front of several identical instances. The following discussion therefore focuses on stateful services.
High Availability
High‑Availability Solutions
High availability has evolved through several stages:
Cold backup
Dual‑machine hot standby
Same‑city active‑active
Remote active‑active
Remote multi‑active
Understanding earlier solutions helps explain the design rationale of later architectures.
Cold Backup
Cold backup copies data files while the database is offline, often using simple file copy commands (e.g., cp on Linux). It can be triggered manually or via scheduled scripts. Its main advantages are:
Simple to implement
Fast to take, since it is a plain file copy
Quick restoration by copying files back or adjusting configuration
Recovery to the point in time at which the backup was taken
However, cold backup has significant drawbacks in modern environments:
Requires service downtime, which is unacceptable for globally available applications
Data written after the last backup is lost on restore unless logs or requests are replayed manually
Full‑volume backups waste storage and are time‑consuming
Infeasible to back up terabytes of data daily with portable media
Balancing these pros and cons is essential for each business.
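The cold backup workflow described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names `cold_backup` and `restore` and the `backup-<timestamp>` directory layout are assumptions made for the example, and the sketch assumes the database has already been stopped before the copy starts.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped directory under backup_root.

    Assumes the caller has already stopped the database, so the files
    on disk do not change while they are being copied.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_root / f"backup-{stamp}"
    shutil.copytree(data_dir, target)  # full copy: every file, every time
    return target


def restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace data_dir with the contents of an earlier backup."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

Anything written between `cold_backup` and `restore` is gone after the restore, which is the data‑loss drawback listed above.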
Dual‑Machine Hot Standby
Hot standby performs backup while the service remains online, but restoration still requires downtime. This discussion excludes shared‑disk approaches.
Active/Standby Mode
One primary node serves traffic while a secondary node acts as a backup. Data is synchronized from primary to secondary via software (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware (disk mirroring). Software‑level replication is often called application‑level disaster recovery; hardware mirroring is data‑level disaster recovery.
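The switch from primary to secondary can be sketched as a small monitor loop. The names `Node`, `FailoverMonitor`, and `max_failures` are hypothetical, and the health check is passed in as a plain function; a real deployment would also have to confirm that replication has caught up and fence the old primary, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    role: str  # "primary" or "standby"


class FailoverMonitor:
    """Promote the standby after several consecutive failed health checks."""

    def __init__(self, primary: Node, standby: Node,
                 is_healthy: Callable[[Node], bool],
                 max_failures: int = 3) -> None:
        self.primary = primary
        self.standby = standby
        self.is_healthy = is_healthy
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> Node:
        """Run one health check and return the node that should serve traffic."""
        if self.is_healthy(self.primary):
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self._promote_standby()
        return self.primary

    def _promote_standby(self) -> None:
        # Swap roles first, then swap the references held by the monitor.
        self.primary.role, self.standby.role = "standby", "primary"
        self.primary, self.standby = self.standby, self.primary
        self.failures = 0
```

Requiring several consecutive failures before switching is a common way to avoid failing over on a single dropped probe.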
Dual‑Machine Mutual Standby
Each machine is the active node for one service and the standby for the other, so neither sits idle. This improves resource utilization and makes read‑write separation possible. The pattern can be extended with database deployment modes such as MySQL master‑master, MHA, Redis master/slave, Sentinel, or Cluster.
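As a rough sketch of mutual standby, the assignment can be written as a table that maps each service to an (active, standby) pair, with the two machines swapping roles between services. The service and node names below are made up for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical layout: each node is active for one service
# and standby for the other.
ASSIGNMENTS: Dict[str, Tuple[str, str]] = {
    "orders": ("node-a", "node-b"),
    "users": ("node-b", "node-a"),
}


def serving_node(service: str, healthy: Set[str],
                 assignments: Dict[str, Tuple[str, str]] = ASSIGNMENTS) -> str:
    """Return the node that should serve `service` given the healthy nodes."""
    active, standby = assignments[service]
    if active in healthy:
        return active
    if standby in healthy:
        return standby
    raise RuntimeError(f"no healthy node for service {service!r}")
```

When both nodes are healthy the load is split between them; when one fails, the survivor carries both services.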
Same‑City Active‑Active
This approach extends hot standby across data centers within the same city, protecting against the failure of an entire data center (IDC), for example a power outage. It is similar to dual‑machine hot standby but with greater geographic distance, typically using dedicated city‑level links.
Some systems achieve true active‑active operation with dual masters handling both reads and writes, provided conflict resolution is carefully managed.
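The text does not say which conflict‑resolution strategy such systems use. One common option is last‑write‑wins, sketched below under the assumption that every write carries a timestamp and the name of the site that accepted it. `Version` and `merge` are names invented for this example.

```python
from typing import Dict, NamedTuple


class Version(NamedTuple):
    timestamp: int  # assumed to come from reasonably synchronized clocks
    site: str       # which master accepted the write; used as a tie-breaker
    value: str


def merge(local: Dict[str, Version],
          remote: Dict[str, Version]) -> Dict[str, Version]:
    """Merge two masters' key/value states with last-write-wins."""
    merged = dict(local)
    for key, theirs in remote.items():
        ours = merged.get(key)
        # Newer timestamp wins; equal timestamps fall back to the site name
        # so that both masters pick the same winner.
        if ours is None or (theirs.timestamp, theirs.site) > (ours.timestamp, ours.site):
            merged[key] = theirs
    return merged
```

Last‑write‑wins is simple and makes both sides converge, but it silently discards the losing write and depends on clock quality, so it only suits data where that loss is acceptable.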
Remote Active‑Active
Same‑city active‑active cannot survive a disaster that affects the whole city. Remote active‑active therefore deploys front‑end entry points and applications in a second city. When the primary city fails, traffic is redirected to the secondary city, at the cost of higher latency and a degraded user experience.
Many internet companies adopt remote active‑active for disaster resilience.
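The redirect and its latency cost can be modeled with a small function that picks the lowest‑latency healthy city for a user. This is only one possible way to express the idea; the city names and latency figures are hypothetical.

```python
from typing import Dict, Set


def choose_entry(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Pick the healthy city with the lowest latency for this user."""
    candidates = {city: ms for city, ms in latency_ms.items() if city in healthy}
    if not candidates:
        raise RuntimeError("no healthy entry point available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements for one user located near city-a.
user_latency = {"city-a": 8.0, "city-b": 35.0}
normal = choose_entry(user_latency, {"city-a", "city-b"})
during_outage = choose_entry(user_latency, {"city-b"})
```

Under normal conditions the user is served from the nearby city; during an outage the same user is still served, but from the more distant one.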
Remote Multi‑Active
Building on remote active‑active, remote multi‑active adds additional nodes to form a mesh where any node can fail without impacting service. This introduces challenges such as increased synchronization latency, data conflicts, and the need for distributed locks or eventual consistency mechanisms.
For applications with strict consistency requirements, a Global Zone solution directs all writes to a single master data center while allowing reads from any replica, achieving strong consistency without exposing complexity to the business layer (source: "Ele.me Remote Multi‑Active Technical Implementation (Part 1)").
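The Global Zone routing rule can be sketched as follows. This is a simplified illustration, not Ele.me's implementation: `GlobalZoneRouter` and the data center names are assumptions, and the sketch covers only the routing decision, not how replicas are kept in sync.

```python
from typing import List


class GlobalZoneRouter:
    """Send every write to one master; serve reads from a local replica."""

    def __init__(self, master: str, replicas: List[str]) -> None:
        self.master = master
        self.readable = {master, *replicas}

    def route(self, operation: str, local_dc: str) -> str:
        if operation == "write":
            # All writes go to the single master data center.
            return self.master
        if operation == "read":
            # Read locally when the caller's data center holds a replica.
            return local_dc if local_dc in self.readable else self.master
        raise ValueError(f"unknown operation: {operation!r}")
```

Because the routing lives in one place, business code can issue reads and writes without knowing which data center holds the master.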
In practice, remote multi‑active often evolves to add sharding and unit‑based partitioning, as illustrated by Alibaba’s and Taobao’s architectures.
These designs demand powerful underlying capabilities such as high‑throughput data transfer, robust data validation, and simplified client‑side write/sync control.
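A minimal sketch of unit‑based partitioning, assuming users are the sharding dimension: each user ID is hashed to a shard, and each shard is owned by exactly one unit, so all of a user's writes land in the same place. The function names and unit names are hypothetical, and the hashing scheme is an assumption rather than what Alibaba or Taobao use.

```python
import hashlib
from typing import List


def shard_of(user_id: str, shard_count: int) -> int:
    """Map a user ID to a shard with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def unit_of(user_id: str, shard_to_unit: List[str]) -> str:
    """Return the unit that owns this user's shard.

    shard_to_unit[i] names the unit responsible for shard i.
    """
    return shard_to_unit[shard_of(user_id, len(shard_to_unit))]


# Hypothetical layout: four shards spread over two units.
LAYOUT = ["unit-east", "unit-east", "unit-north", "unit-north"]
owner = unit_of("user-1001", LAYOUT)
```

Python's built‑in `hash` is randomized per process for strings, so the sketch uses `hashlib` to keep the mapping identical on every node.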
MaGe Linux Operations
