High‑Availability Strategies for Stateful Backend Services: Cold Backup, Dual‑Machine Active/Standby, Same‑City and Cross‑City Active‑Active, and Multi‑Active Architectures
The article explains various high‑availability solutions for stateful backend services, comparing cold backup, dual‑machine active/standby, same‑city active‑active, cross‑city active‑active, and cross‑city multi‑active approaches, and discusses their trade‑offs, implementation details, and real‑world examples from large internet companies.
The author, a senior architect, shares practical insights on achieving high availability for stateful backend services, emphasizing that while stateless services are easy to keep highly available, stateful services require more sophisticated strategies.
High Availability
1. Some HA Solutions
High availability has evolved through several stages:
Cold backup
Dual‑machine hot standby
Same‑city active‑active
Cross‑city active‑active
Cross‑city multi‑active
Cold Backup
Cold backup stops the database service and then copies its data files (e.g., using cp on Linux). Its advantages are:
Simple to operate
Faster to back up than other methods
Fast to restore: copy the files back or point the service at the backup data directory
Can restore to the exact point in time at which a backup was taken
However, cold backup requires service downtime, can lose data between the backup point and the failure, performs full backups that waste storage, and is impractical for large‑scale, always‑online services.
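The stop, copy, restart cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the timestamped directory layout, and the stop_service/start_service callbacks are hypothetical and stand in for whatever actually controls the database.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(
    data_dir: Path,
    backup_root: Path,
    stop_service: Callable[[], None],
    start_service: Callable[[], None],
) -> Path:
    """Stop the service, copy the whole data directory, then restart it."""
    # Hypothetical layout: one timestamped directory per full backup.
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    stop_service()  # downtime begins here
    try:
        shutil.copytree(data_dir, target)  # full copy, comparable to `cp -r`
    finally:
        start_service()  # restart even if the copy failed
    return target


def cold_restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace the data directory with a backup; the service must be stopped."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

Everything written between the backup and a later failure is lost on restore, and every run copies the full data set, which matches the drawbacks listed above.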
Dual‑Machine Hot Standby
Hot standby avoids downtime during backup, but failover still causes a brief interruption while traffic is switched to the standby. It is essentially a primary‑secondary (active/standby) setup in which data is continuously synchronized from the primary to the standby.
Active/Standby Mode
The primary node serves traffic while the standby acts as a backup. Data can be synchronized at the software level (e.g., MySQL master/slave, SQL Server replication) or at the hardware level (disk mirroring). The discussion focuses on software‑level (application‑level) disaster recovery.
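One common way to decide when the standby should take over is to count missed heartbeats from the primary. The sketch below is a minimal illustration of that idea, not a description of any particular product; the class name, the threshold of three misses, and the "primary"/"standby" labels are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Promote the standby after several consecutive missed heartbeats."""

    max_missed: int = 3       # assumed threshold; tune per deployment
    active: str = "primary"   # node currently serving traffic
    missed: int = 0           # consecutive failed heartbeat probes

    def observe(self, heartbeat_ok: bool) -> str:
        """Record one heartbeat probe and return the node that should serve."""
        if self.active != "primary":
            return self.active  # already failed over; failback is manual here
        if heartbeat_ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = "standby"
        return self.active
```

Requiring several consecutive misses trades a slower failover for fewer false promotions caused by a single dropped probe.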
Dual‑Machine Mutual Backup
Each machine acts as the primary for a different set of services and as the standby for the other's, which allows read‑write separation and better resource utilization. The two machines still cannot both act as primary for the same service at the same time.
Other HA options include various database deployment modes such as MySQL master‑slave, dual‑master, MHA, Redis master‑slave, Sentinel, and Cluster.
2. Same‑City Active‑Active
This solution replicates services across two data centers within the same city, protecting against an entire IDC failure. It is similar to dual‑machine hot standby but with greater distance; traffic is load‑balanced between the two sites.
Some businesses achieve true active‑active (both sites handling reads and writes) by handling conflict resolution in the application layer.
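The article does not say which conflict-resolution rule such businesses use. As one illustrative possibility, the sketch below applies last-write-wins in the application layer; the Versioned type, its fields, and the site names are hypothetical, and the approach assumes reasonably synchronized clocks across both sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # write time reported by the originating site
    site: str          # tie-breaker so both sites pick the same winner


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: the newer timestamp wins, the site name breaks ties."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.site))


def merge(local: dict[str, Versioned], remote: dict[str, Versioned]) -> dict[str, Versioned]:
    """Merge two sites' key/value state into one converged view."""
    merged = dict(local)
    for key, theirs in remote.items():
        mine = merged.get(key)
        merged[key] = theirs if mine is None else resolve(mine, theirs)
    return merged
```

Because the rule is deterministic, both sites converge on the same state regardless of merge order. The cost is that the losing write is silently discarded, so this only suits data where that is acceptable.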
Many companies adopt a “two‑site three‑center” model: two active sites in the same city and a remote disaster‑recovery site that only stores data and takes over when both active sites fail.
When a city‑wide outage occurs, the remote site preserves data, and traffic can be switched to it, though latency may increase.
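The traffic-switching rule of the two-site three-center model can be written down as a small selection function. This is a sketch under assumptions: the site names are made up, and real deployments would drive the switch through DNS or a global load balancer rather than a Python function.

```python
ACTIVE_SITES = ("city-idc-a", "city-idc-b")  # hypothetical same-city sites
DR_SITE = "remote-dr"                        # hypothetical remote DR site


def serving_sites(healthy: set[str]) -> list[str]:
    """Return the sites that should receive traffic given the healthy set."""
    active = [site for site in ACTIVE_SITES if site in healthy]
    if active:
        return active  # load-balance across whichever active sites are up
    if DR_SITE in healthy:
        return [DR_SITE]  # city-wide outage: switch to the remote site
    raise RuntimeError("no healthy site available")
```

The remote site receives traffic only when both same-city sites are down, which is why the higher latency it adds is paid only during a city-wide outage.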
3. Cross‑City Active‑Active
Cross‑city active‑active extends same‑city active‑active to geographically distant locations, providing stronger disaster resilience but incurring higher network latency and potential data conflicts.
One design uses a star topology with a central hub that synchronizes with all cities, reducing the impact of any single node failure.
However, this introduces heavy synchronization traffic and conflict resolution challenges, often requiring distributed locks or sharding strategies.
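A sharding strategy avoids most write conflicts by giving each record exactly one city that is allowed to write it. The following sketch shards by user ID; the city list, the function names, and the choice of a SHA-256 hash with modulo placement are assumptions made for illustration.

```python
import hashlib

CITIES = ("beijing", "shanghai", "guangzhou")  # hypothetical city list


def home_city(user_id: str, cities: tuple[str, ...] = CITIES) -> str:
    """Map a user to the single city allowed to accept that user's writes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return cities[int.from_bytes(digest[:8], "big") % len(cities)]


def route_write(user_id: str, local_city: str) -> str:
    """Return 'local' if this city owns the user's shard, else the owning city."""
    owner = home_city(user_id)
    return "local" if owner == local_city else owner
```

Since only one city writes a given user's data, cross-city synchronization becomes one-directional per shard. Note that plain modulo placement reassigns most users whenever the city list changes, so a production design would likely use a more stable mapping.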
Global Zone (Strong Consistency) Solution
For applications with strict consistency requirements, a Global Zone directs all writes to a single master data center while allowing reads from any replica, achieving strong consistency with minimal impact on the business layer. (Source: "Ele.me Cross‑City Multi‑Active Technical Implementation (Part 1): Overview", original title 《饿了么异地多活技术实现(一)总体介绍》.)
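The routing rule of a Global Zone can be illustrated with a small router. This is a sketch, not Ele.me's implementation: the class name, the data center names, and the require_latest flag are assumptions. The flag reflects the general fact that a replica may lag the master unless replication is synchronous, so a read that must see the latest write can be sent to the master instead.

```python
import itertools


class GlobalZoneRouter:
    """Send all writes to one master data center; spread reads over replicas."""

    def __init__(self, master: str, replicas: list[str]) -> None:
        self.master = master
        # Fall back to the master when no replica is configured.
        self._replicas = itertools.cycle(replicas or [master])

    def route(self, is_write: bool, require_latest: bool = False) -> str:
        """Return the data center that should handle this request."""
        if is_write or require_latest:
            return self.master  # single writer keeps the data consistent
        return next(self._replicas)  # round-robin over read replicas
```

For example, with a master "dc-master" and replicas "dc-a" and "dc-b", writes always go to "dc-master" while ordinary reads alternate between the two replicas.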
Most large internet companies, such as Ele.me, adopt multi‑active architectures that combine these patterns, balancing consistency, latency, and operational complexity.
In summary, cross‑city multi‑active requires robust infrastructure for data transfer, verification, and client‑side control, and is typically only feasible for enterprises with significant resources.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.