
Mastering High Availability: From Cold Backups to Multi‑Active Disaster Recovery

This article traces the evolution of high‑availability strategies for stateful backend services, comparing cold backups, active/standby, same‑city active‑active, cross‑city active‑active, and cross‑city multi‑active setups, and discusses the trade‑offs, design considerations, and real‑world implementations of each architecture.


Preface

Backend services can be classified as stateless or stateful. High availability is straightforward for stateless applications, which can rely on load balancers like F5, but the following discussion focuses on stateful services.

State is typically persisted on disk (as in MySQL) or held in memory (as in Redis, or in JVM heap memory, which has a short lifecycle).

High Availability

Some HA Solutions

High availability has evolved through several stages:

Cold backup

Active/standby (dual‑machine hot backup)

Same‑city active‑active

Cross‑city active‑active

Cross‑city multi‑active

Before discussing cross‑city multi‑active, it helps to understand earlier solutions.

Cold Backup

Cold backup copies data files while the database is offline, often using simple file copy commands (e.g., cp on Linux). Its benefits:

Simple

Fast backup compared to other methods

Fast recovery – copy files back or adjust configuration; even two mv commands can restore service instantly

Point‑in‑time recovery – useful for incidents like coupon exploits
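The backup-and-swap workflow above can be sketched in a few lines. This is a minimal illustration, not a production tool: the paths and function names are hypothetical, and Python's shutil stands in for the cp/mv commands mentioned in the text.

```python
import shutil
from pathlib import Path

def cold_backup(data_dir: str, backup_dir: str) -> None:
    """Copy the data files while the database is stopped (cold backup)."""
    shutil.copytree(data_dir, backup_dir)  # equivalent of `cp -r data backup`

def cold_restore(data_dir: str, backup_dir: str) -> None:
    """Restore by swapping directories -- the two-mv trick from the text."""
    Path(data_dir).rename(data_dir + ".broken")  # mv data data.broken
    Path(backup_dir).rename(data_dir)            # mv backup data
```

The restore is just two renames, which is why recovery from a cold backup can be nearly instantaneous once the service is stopped.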

However, cold backup has drawbacks:

Requires service downtime, which is unacceptable for 24/7 global services

Data written after the last backup is lost; replaying logs manually to recover it is labor‑intensive

A full copy consumes excessive disk space and time

Impractical for large data volumes (multiple terabytes), with no support for selective backup

Balancing these pros and cons is essential for each business.

Active/Standby (Dual‑Machine Hot Backup)

Hot backup allows continuous service while backing up data, but restoration still requires downtime. This discussion excludes shared‑disk approaches.

Active/Standby Mode

One primary node serves traffic while a secondary node acts as backup. Data is synchronized via software (e.g., MySQL master/slave binlog, SQL Server replication) or hardware (disk mirroring). Software‑level sync is often called application‑level disaster recovery; hardware‑level sync is data‑level disaster recovery.
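The failover decision at the heart of active/standby mode can be sketched as follows. The class and node names are hypothetical; real deployments delegate this to tools such as MHA or keepalived, and rely on the replication described above to make the standby's data current.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    healthy: bool = True

class ActiveStandby:
    """Route all traffic to the primary; promote the standby on failure."""

    def __init__(self, primary: Node, standby: Node):
        self.primary, self.standby = primary, standby

    def route(self) -> str:
        if not self.primary.healthy and self.standby.healthy:
            # Failover: the standby takes over as the new primary.
            self.primary, self.standby = self.standby, self.primary
        if not self.primary.healthy:
            raise RuntimeError("no healthy node available")
        return self.primary.name
```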

Dual‑Machine Mutual Backup

Essentially Active/Standby with the roles also reversed: each machine is primary for one service and standby for the other, improving resource utilization and enabling read‑write separation when different services are deployed on each machine.

Other HA options include various MySQL deployment modes (master‑slave, master‑master, MHA) and Redis setups (master‑slave, Sentinel, Cluster).

Same‑City Active‑Active

This extends previous solutions across an entire data center, protecting against a single IDC failure (e.g., power outage). It resembles dual‑machine hot backup but with greater distance; latency remains low due to dedicated links.

Some applications achieve true active‑active with conflict resolution, though not all workloads can support it.

Industry practice often adopts a “two‑site three‑center” model: two local data centers provide primary service, while a remote center serves as disaster‑recovery only. Traffic is load‑balanced, and failover switches to the remote center when a local site fails, though latency may increase.

In the “two‑site three‑center” diagram, traffic is distributed via load balancers to IDC1 and IDC2; both sync data to IDC3. If any IDC fails, traffic is redirected to the remaining site.
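That routing behavior can be sketched as a tiny load balancer (the IDC names and class are hypothetical): traffic alternates across the two local sites, and only when both are down does it fall back to the remote disaster‑recovery center.

```python
import itertools

class TwoSiteThreeCenter:
    """Round-robin across local IDCs; fall back to the remote DR center."""

    def __init__(self):
        self.local = {"IDC1": True, "IDC2": True}  # primary service sites
        self.remote = "IDC3"                       # disaster-recovery only
        self._rr = itertools.cycle(sorted(self.local))

    def route(self) -> str:
        for _ in range(len(self.local)):
            idc = next(self._rr)
            if self.local[idc]:
                return idc
        # Both local sites are down: fail over to the remote center,
        # accepting the higher latency mentioned in the text.
        return self.remote
```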

The diagram shows a master‑slave based three‑center architecture, where two local sites act as master‑slave and the remote site as backup.

Cross‑City Active‑Active

Same‑city active‑active handles most disaster scenarios, but large‑scale outages (e.g., natural disasters) still cause service interruption. Extending the architecture across cities allows traffic to fail over to another city, albeit with degraded user experience.

Most internet companies adopt cross‑city active‑active.

The simple cross‑city active‑active diagram shows load balancers directing traffic to two city clusters, each with its own local database cluster. Failover occurs only when the local databases become unavailable.

Cross‑city synchronization introduces higher latency, reducing throughput and increasing conflict risk. Solutions include distributed locks, eventual consistency, sharding, and intermediate states with retries.
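Of those mitigations, sharding is the easiest to illustrate: give each record a single "home" region that owns its writes, so two cities never modify the same row concurrently. A minimal sketch, with hypothetical region names:

```python
import zlib

REGIONS = ["city-a", "city-b"]  # hypothetical cross-city deployments

def home_region(user_id: str) -> str:
    """Pin each user's writes to one region to avoid cross-city conflicts.

    Reads can still be served from the nearest replica; only this user's
    writes are forwarded to the returned region.
    """
    return REGIONS[zlib.crc32(user_id.encode()) % len(REGIONS)]
```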

For strict consistency requirements, Ele.me uses a “Global Zone” design: writes are directed to a single master data center, while reads can be served locally, ensuring strong consistency.

For applications demanding strong consistency, we provide a Global Zone solution that centralizes writes to a master data center while allowing reads from any slave, based on our Database Access Layer (DAL), making the process transparent to business logic. —《Ele.me Cross‑Region Multi‑Active Technical Implementation (Part 1) Overview》
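At the data access layer, the Global Zone idea reduces to "route writes to the master data center, serve reads locally." A toy sketch of that rule follows; the class, data-center names, and verb detection are hypothetical, not Ele.me's actual DAL.

```python
class GlobalZoneRouter:
    """Toy Global-Zone-style routing: writes to the master DC, reads local."""

    WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}

    def __init__(self, local_dc: str, master_dc: str):
        self.local_dc = local_dc    # nearest replica, serves reads
        self.master_dc = master_dc  # the single writable data center

    def endpoint(self, sql: str) -> str:
        verb = sql.lstrip().split()[0].upper()
        return self.master_dc if verb in self.WRITE_VERBS else self.local_dc
```

Because the routing lives in the access layer, business code issues plain SQL and never needs to know which data center answers, which is the transparency the quoted design emphasizes.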

Cross‑city active‑active is essentially a temporary step toward cross‑city multi‑active, which offers better scalability but introduces more complexity.

Cross‑City Multi‑Active

The diagram illustrates a mesh topology where each node connects to four others, providing resilience against any single node failure. However, the increased distance for write operations leads to higher latency and more conflicts.

Optimizing the mesh into a star topology reduces synchronization overhead:

In this star layout, each city can fail without affecting data integrity; traffic is rerouted to the nearest city. The central node bears higher reliability requirements (fast recovery, complete backups).
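The savings are easy to quantify: a full mesh needs a replication channel for every pair of cities, while a star needs only one per leaf. A quick calculation for the five-node topologies described above:

```python
def replication_channels(n: int, topology: str) -> int:
    """Number of bidirectional replication channels among n cities."""
    if topology == "mesh":
        return n * (n - 1) // 2  # every pair of cities syncs directly
    if topology == "star":
        return n - 1             # each leaf syncs only with the center
    raise ValueError(f"unknown topology: {topology}")

print(replication_channels(5, "mesh"))  # 10 channels in a 5-city mesh
print(replication_channels(5, "star"))  # 4 channels in a 5-city star
```

The gap widens quadratically as cities are added, which is why the star layout scales while concentrating reliability requirements on the central node.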

Alibaba’s envisioned multi‑active architecture places writes in a single city while reads are distributed, similar to the “Global Zone” concept.

Large e‑commerce platforms like Taobao adopt a unit‑based split: transactional units synchronize bidirectionally with a central unit, while non‑transactional data syncs unidirectionally, allowing elastic scaling for business units and robust stability for the central unit.
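The unit-based split can be sketched as a deterministic user-to-unit mapping plus a per-data-class sync policy. The unit names and functions below are hypothetical illustrations of the description above, not Taobao's actual implementation.

```python
UNITS = ["unit-1", "unit-2", "central"]  # hypothetical deployment units

def route_to_unit(user_id: int) -> str:
    """Pin each user's transactional traffic to one unit.

    The central unit takes no direct user traffic here; it exists to
    aggregate and redistribute data across units.
    """
    return UNITS[user_id % (len(UNITS) - 1)]

def sync_direction(data_class: str) -> str:
    """Transactional data syncs both ways with the center; the rest one way."""
    return "bidirectional" if data_class == "transactional" else "one-way"
```

Pinning a user to one unit keeps that user's writes conflict-free, while the sync policy keeps the central unit authoritative for shared, non-transactional data.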

Implementing such architectures requires extensive code refactoring, distributed transaction handling, cache invalidation, and sophisticated testing and operations pipelines.

In summary, cross‑city multi‑active demands strong foundational capabilities such as data transfer, verification, and a simplified client‑side write/sync layer.

Source: https://blog.dogchao.cn/?p=299

Tags: backend architecture, high availability, disaster recovery, multi-active, active-active, active/standby
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together along the way.
