
High‑Availability Strategies for Stateful Backend Services: Cold Backup, Dual‑Machine Active/Standby, Same‑City and Cross‑City Active‑Active, and Multi‑Active Architectures

The article explains various high‑availability solutions for stateful backend services, comparing cold backup, dual‑machine active/standby, same‑city active‑active, cross‑city active‑active, and cross‑city multi‑active approaches, and discusses their trade‑offs, implementation details, and real‑world examples from large internet companies.

Top Architect

The author, a senior architect, shares practical insights on achieving high availability for stateful backend services, emphasizing that while stateless services are easy to keep highly available, stateful services require more sophisticated strategies.

High Availability

1. Some HA Solutions

High availability has evolved through several stages:

Cold backup

Dual‑machine hot standby

Same‑city active‑active

Cross‑city active‑active

Cross‑city multi‑active

Cold Backup

Cold backup copies the database's data files (for example, with cp on Linux) while the database service is stopped. Its advantages:

Simple

Fast backup compared with other methods

Fast restore: just copy the files back or switch the data directory

Can restore to the point in time at which a backup was taken

However, cold backup requires service downtime, loses any data written between the last backup and the failure, takes a full copy every time (which wastes storage), and is impractical for large‑scale, always‑online services.
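As a rough illustration of the copy-and-restore cycle described above, here is a minimal Python sketch. The function names cold_backup and cold_restore and the timestamped folder layout are assumptions made for this example, not part of any database tool, and the sketch assumes the caller has already stopped the database service.

```python
import shutil
import time
from pathlib import Path


def cold_backup(data_dir: str, backup_root: str) -> Path:
    """Copy the whole data directory into a new timestamped folder.

    Assumption: the database service was stopped before this is called.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"backup-{stamp}"
    shutil.copytree(data_dir, target)  # full copy, comparable to `cp -r`
    return target


def cold_restore(backup_dir: str, data_dir: str) -> None:
    """Replace the data directory with the contents of one backup."""
    shutil.rmtree(data_dir, ignore_errors=True)
    shutil.copytree(backup_dir, data_dir)
```

Every call produces a complete copy, which makes the storage cost of full backups visible: ten backups of a 100 GB data directory occupy roughly 1 TB.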

Dual‑Machine Hot Standby

Hot standby removes the need to stop the service for backup, but failover still causes a short interruption while traffic is switched to the standby. It is essentially a primary‑secondary (active/standby) setup in which data is synchronized from the primary to the standby.

Active/Standby Mode

The primary node serves traffic while the standby acts as a backup. Data can be synchronized at the software level (e.g., MySQL master/slave, SQL Server replication) or at the hardware level (disk mirroring). The discussion focuses on software‑level (application‑level) disaster recovery.
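The switch from primary to standby can be sketched as a small health-check loop. This is only an illustration under stated assumptions: Node, FailoverMonitor, and the max_misses threshold are hypothetical names, the healthy flag stands in for a real probe, and the sketch ignores data synchronization and split-brain protection, which a production setup would need.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str            # "primary" or "standby"
    healthy: bool = True  # stand-in for a real health probe


class FailoverMonitor:
    """Promote the standby after the primary misses several health checks."""

    def __init__(self, primary: Node, standby: Node, max_misses: int = 3):
        self.primary = primary
        self.standby = standby
        self.max_misses = max_misses
        self.misses = 0

    def check(self) -> Node:
        """Run one health check and return the node that should serve traffic."""
        if self.primary.healthy:
            self.misses = 0
            return self.primary
        self.misses += 1
        if self.misses >= self.max_misses and self.standby.healthy:
            # Swap roles: the old standby becomes the new primary.
            self.standby.role, self.primary.role = "primary", "standby"
            self.primary, self.standby = self.standby, self.primary
            self.misses = 0
        return self.primary
```

Requiring several consecutive misses before promoting is a common way to avoid failing over on a single dropped heartbeat. The checks that fail before the threshold is reached correspond to the brief interruption that failover causes.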

Dual‑Machine Mutual Backup

Each machine acts as the primary for a different service and as the standby for the other's service. This allows read‑write separation and better resource utilization, but the two machines cannot both serve as primary for the same business at the same time.

Other HA options include various database deployment modes such as MySQL master‑slave, dual‑master, MHA, Redis master‑slave, Sentinel, and Cluster.

2. Same‑City Active‑Active

This solution replicates services across two data centers within the same city, protecting against an entire IDC failure. It is similar to dual‑machine hot standby but with greater distance; traffic is load‑balanced between the two sites.

Some businesses achieve true active‑active (both sites handling reads and writes) by handling conflict resolution in the application layer.
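The source does not say which conflict-resolution rule these businesses use. One simple rule, shown here purely as an example, is last-write-wins: when both sites have changed the same key, keep the version with the newer timestamp. The Versioned record and the merge_lww function are hypothetical names, and the sketch assumes the two sites have reasonably synchronized clocks.

```python
from typing import Dict, NamedTuple


class Versioned(NamedTuple):
    value: str
    updated_at: float  # write timestamp in seconds
    site: str          # name of the site that accepted the write


def merge_lww(local: Dict[str, Versioned],
              remote: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two sites' records, keeping the most recent write per key."""
    merged = dict(local)
    for key, theirs in remote.items():
        ours = merged.get(key)
        # Newer timestamp wins; the site name breaks exact ties so that
        # both sites reach the same result regardless of merge direction.
        if ours is None or (theirs.updated_at, theirs.site) > (ours.updated_at, ours.site):
            merged[key] = theirs
    return merged
```

Last-write-wins silently discards the older of two concurrent writes, so it only suits data where losing one update is acceptable. Other data needs business-specific merge logic.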

Many companies adopt a “two‑location, three‑center” model: two active data centers in the same city plus a remote disaster‑recovery data center in another city, which only stores data and takes over when both active data centers fail.

When a city‑wide outage occurs, the remote site preserves data, and traffic can be switched to it, though latency may increase.
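The takeover rule for this three-center layout can be written down in a few lines. The site names idc_a, idc_b, and remote_dr and the pick_sites function are made up for this sketch; health is assumed to come from some external monitoring system.

```python
from typing import Dict, List

ACTIVE_SITES = ("idc_a", "idc_b")  # hypothetical same-city data centers
REMOTE_SITE = "remote_dr"          # hypothetical remote disaster-recovery center


def pick_sites(health: Dict[str, bool]) -> List[str]:
    """Return the sites that should receive traffic, given their health."""
    active = [site for site in ACTIVE_SITES if health.get(site, False)]
    if active:
        return active              # load-balance across healthy active sites
    if health.get(REMOTE_SITE, False):
        return [REMOTE_SITE]       # city-wide outage: fall back to the remote site
    return []                      # nothing is available
```

For example, pick_sites({"idc_a": False, "idc_b": False, "remote_dr": True}) returns ["remote_dr"], which is the city-wide outage case described above.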

3. Cross‑City Active‑Active

Cross‑city active‑active extends same‑city active‑active to geographically distant locations, providing stronger disaster resilience but incurring higher network latency and potential data conflicts.

One design uses a star topology: every city synchronizes with a central hub, which relays changes to the other cities, so the failure of any one city does not interrupt synchronization among the rest.

However, this introduces heavy synchronization traffic and conflict resolution challenges, often requiring distributed locks or sharding strategies.
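One way to read the sharding strategy mentioned above: give every user a single home city and send all of that user's writes there, so that two cities never write the same record concurrently. The sketch below is an assumption-laden illustration, not a method taken from the source. The city names and the home_city function are hypothetical, and hash-modulo assignment is just the simplest possible placement rule.

```python
import hashlib
from typing import Sequence

CITIES = ("city_a", "city_b", "city_c")  # hypothetical deployment


def home_city(user_id: str, cities: Sequence[str] = CITIES) -> str:
    """Map a user to exactly one city so all of that user's writes land there."""
    # Use a stable hash: Python's built-in hash() of a str varies per process.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cities)
    return cities[index]
```

With plain modulo placement, adding or removing a city changes the home of most users, so real deployments usually prefer a placement rule that moves fewer users, such as routing by geographic region or consistent hashing.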

Global Zone (Strong Consistency) Solution

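For applications with strict consistency requirements, a Global Zone directs all writes to a single master data center while allowing reads from any replica. Because every write goes through one place, writes from different cities cannot conflict, which keeps this data strongly consistent with minimal impact on the business layer. Source: 《饿了么异地多活技术实现(一)总体介绍》 (Ele.me Cross‑City Multi‑Active Technical Implementation, Part 1: Overview).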
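The routing rule behind a Global Zone is small enough to sketch directly. GlobalZoneRouter and the data center names are hypothetical and are not taken from the Ele.me article. The sketch only shows where each kind of request goes, not how replication to the replicas is done.

```python
class GlobalZoneRouter:
    """Send writes to the single master data center, reads to the local one."""

    def __init__(self, master_dc: str, local_dc: str):
        self.master_dc = master_dc  # the only data center that accepts writes
        self.local_dc = local_dc    # the data center nearest to this caller

    def route(self, operation: str) -> str:
        """Return the data center that should handle the operation."""
        if operation == "write":
            return self.master_dc
        if operation == "read":
            return self.local_dc
        raise ValueError(f"unknown operation: {operation!r}")
```

A caller in a non-master city therefore pays cross-city latency on every write but reads locally. If replication to the local replica is asynchronous, a read issued right after a write may briefly return the older value.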

Most large internet companies, such as Ele.me, adopt multi‑active architectures that combine these patterns, balancing consistency, latency, and operational complexity.

In summary, cross‑city multi‑active requires robust infrastructure for data transfer, verification, and client‑side control, and is typically only feasible for enterprises with significant resources.
