
Mastering High Availability: From Cold Backups to Multi‑Region Active‑Active

This article examines backend service high‑availability strategies, comparing cold backups, hot standby, same‑city and cross‑city active‑active designs, and explains the trade‑offs, architectural patterns, and practical considerations for building resilient distributed systems.


Preface

Backend services can be classified as stateful or stateless. High availability is straightforward for stateless services using load balancers, but the following analysis focuses on stateful services.

State is typically persisted on disk (e.g., MySQL) or held in memory (e.g., Redis); JVM heap memory can also hold state, but its lifecycle is short.

High Availability

Common HA Solutions

High‑availability solutions have evolved through several stages:

Cold backup

Dual‑machine hot standby

Same‑city active‑active

Cross‑city active‑active

Cross‑city multi‑active

Understanding earlier solutions helps explain the rationale behind later designs.

Cold Backup

Cold backup copies data files while the database service is stopped, often using simple file copy commands (e.g., cp on Linux). It can be performed manually or via scheduled scripts and offers several benefits:

Simple implementation

Fast backup compared with other methods

Rapid restoration by copying files back or adjusting configuration; two mv commands can complete the restore almost instantly

Point‑in‑time recovery, useful for undoing incidents such as coupon‑exploitation bugs

However, cold backup has significant drawbacks for modern workloads:

Service downtime is required, making it unsuitable for 24/7 global applications

Potential data loss between the backup point and the restoration time, requiring manual log replay or business‑log replay

Full‑volume backups waste disk space and are time‑consuming; selective table backups are not possible

Copying terabytes of data to external media is impractical

Balancing these pros and cons is essential for each business.
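The copy-based backup and rename-based restore described above can be sketched as a small script. This is a minimal illustration, assuming the database service has already been stopped and that paths are on the same filesystem (so renames are atomic); it is not a production backup tool.

```python
import shutil
from pathlib import Path

def cold_backup(data_dir: str, backup_dir: str) -> None:
    # Copy the (stopped) database's data directory to the backup location.
    shutil.copytree(data_dir, backup_dir)

def cold_restore(data_dir: str, backup_dir: str) -> None:
    # The "two mv commands" restore: move the damaged data directory
    # aside, then move the backup into its place.
    Path(data_dir).rename(data_dir + ".broken")   # mv data data.broken
    Path(backup_dir).rename(data_dir)             # mv backup data
```

The restore is fast precisely because it only renames directories; no data is copied at restore time.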

Dual‑Machine Hot Standby

Hot standby avoids downtime during backup but still requires a pause for restoration. The discussion excludes shared‑disk approaches.

Active/Standby Mode

One primary node serves traffic while a secondary node acts as a backup. Data is synchronized from primary to secondary via software (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware (disk mirroring, sector interception). Software‑level sync is often called application‑level disaster recovery; hardware‑level sync is data‑level disaster recovery.

Dual‑Machine Mutual Standby

Both machines act as primary for different services, enabling read‑write separation and better resource utilization. For example, service A runs on node P (primary) with node Q as standby, while service B runs on Q with P as standby.
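The mutual-standby routing above can be sketched as a small lookup: each service prefers its own primary node and falls back to the other node when a health check fails. The topology table and node names here are hypothetical, and `is_healthy` stands in for whatever probe (ping, TCP check, replication-lag check) a real proxy would use.

```python
# Mutual standby: each service has a preferred primary and a standby,
# and the two nodes back each other up. Names are illustrative.
TOPOLOGY = {
    "service_a": {"primary": "node_p", "standby": "node_q"},
    "service_b": {"primary": "node_q", "standby": "node_p"},
}

def route(service: str, is_healthy) -> str:
    # Return the node that should serve `service`; `is_healthy` is a
    # callable health-check probe supplied by the caller.
    nodes = TOPOLOGY[service]
    if is_healthy(nodes["primary"]):
        return nodes["primary"]
    return nodes["standby"]   # fail over to the standby node
```

In normal operation both nodes carry traffic (one service each), which is what improves resource utilization over plain active/standby.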

Other HA options include various database deployment modes such as MySQL master‑slave, master‑master, MHA, and Redis master‑slave, sentinel, or cluster.

Same‑City Active‑Active

This pattern extends HA across two data centers within the same city, protecting against an entire IDC failure (e.g., power outage). It is similar to dual‑machine hot standby but with greater distance; latency remains low due to dedicated city‑level links.

Some applications achieve true active‑active operation with custom conflict‑resolution logic, though not all workloads can support this.

Industry practice often adopts a “two‑site three‑center” model: two primary data centers (IDC1, IDC2) and a remote backup center (IDC3). Traffic is load‑balanced to the primary sites; if one fails, traffic fails over to the other site, and the remote center serves as a disaster‑recovery backup.

The diagram shows load balancers directing Service A to IDC1 and Service B to IDC2, with synchronous replication between same‑city sites and asynchronous replication to the remote IDC3. If an IDC fails, traffic is redirected to the surviving same‑city site; if both same‑city sites fail, the remote site takes over, albeit with higher latency.

This illustrates the master‑slave topology of the “two‑site three‑center” architecture.

Cross‑City Active‑Active

Same‑city active‑active cannot handle large‑scale disasters such as regional power outages. Extending the architecture to another city provides a fallback, but user experience degrades significantly due to increased latency.

Most internet companies adopt cross‑city active‑active solutions.

The diagram shows load balancers distributing traffic to two city‑level clusters; each cluster accesses its local database cluster, and only when all local databases are unavailable does traffic fail over to the remote cluster.

Cross‑city synchronization incurs higher latency, reducing throughput and increasing the chance of data conflicts. Solutions include distributed locks, eventual consistency with retry mechanisms, or sharding data to minimize cross‑city writes.
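The sharding approach mentioned above can be sketched as follows: give each user a deterministic "home" region so that, in normal operation, no write crosses city boundaries, and accept cross-city failover only during an outage. The region names and modulo sharding rule are illustrative assumptions.

```python
# Route each user's writes to a home region so writes stay local.
# Region names and the modulo shard rule are hypothetical.
REGIONS = ["city_east", "city_west"]

def home_region(user_id: int) -> str:
    # Deterministically assign a user to one region.
    return REGIONS[user_id % len(REGIONS)]

def write_target(user_id: int, available: set) -> str:
    # Prefer the user's home region; fail over cross-city only when
    # it is unavailable, accepting the extra latency and conflict risk.
    home = home_region(user_id)
    if home in available:
        return home
    return next(r for r in REGIONS if r in available)
```

Because a given user's writes normally land in exactly one region, most write conflicts are avoided rather than resolved.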

For strict consistency requirements, some companies (e.g., Ele.me) use a “Global Zone” design: all writes go to a single master data center, while reads can be served from local slaves, ensuring strong consistency without cross‑region write conflicts.

For applications demanding high consistency, a strong‑consistency solution (Global Zone) directs all writes to a master data center, while reads may be served locally, leveraging a database access layer that hides the complexity from the business logic. —《Ele.me Cross‑Region Multi‑Active Technical Implementation (Part 1) Overview》
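The Global Zone idea reduces to a simple routing rule inside the data-access layer: writes always go to the master data center, reads stay local. The sketch below only illustrates that rule under assumed names; it is not Ele.me's actual API.

```python
# Global Zone sketch: a single write point gives strong consistency,
# while reads are served from the local replica. Names are hypothetical.
MASTER_DC = "dc_master"

def pick_endpoint(operation: str, local_dc: str) -> str:
    # The data-access layer applies this rule so business logic
    # never needs to know where the master lives.
    if operation == "write":
        return MASTER_DC      # all writes funnel to one data center
    return local_dc           # reads stay local for low latency
```

The cost is write latency for users far from the master; the benefit is that cross-region write conflicts cannot occur at all.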

Cross‑Region Multi‑Active

Building on the cross‑city active‑active concept, a multi‑active architecture links every node to every other node in a full mesh (with five nodes, each node synchronizes with the other four), so any single node failure does not affect the service. However, writes must propagate over greater distances, which adds latency and raises the risk of conflicts.

Transforming a mesh topology into a star topology by introducing a central hub reduces synchronization overhead. The central hub bears higher reliability requirements, while peripheral nodes can fail without service impact.

In this star topology, traffic is load‑balanced to the nearest city; only the central node handles full data synchronization, simplifying consistency management.
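The saving from the star topology can be made concrete: with N nodes, a full mesh needs N·(N−1)/2 synchronization links, while a star needs only N−1, all terminating at the hub. The sketch below shows the hub fanning a write out to every node except its origin; city names are illustrative.

```python
# Star-topology replication sketch: peripheral nodes send writes only
# to the hub, which fans them out to all other nodes. Names hypothetical.
class Hub:
    def __init__(self, peripherals):
        # One replication log per peripheral node.
        self.peripherals = {name: [] for name in peripherals}

    def replicate(self, origin: str, record: dict) -> None:
        # Fan the write out to every node except the one that produced it.
        for name, log in self.peripherals.items():
            if name != origin:
                log.append(record)
```

Each peripheral node maintains exactly one sync channel (to the hub), which is why the hub must meet a much higher reliability bar than the peripheral nodes.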

Many large‑scale services (e.g., ride‑hailing, e‑commerce) adopt similar patterns, often combining sharding, micro‑service decomposition, and selective data replication to achieve both availability and performance.

Implementing such architectures requires extensive changes to code, testing, and operations, including distributed transaction handling, cache invalidation, and automated disaster‑recovery drills.

Cross‑region multi‑active demands strong underlying capabilities such as reliable data transfer, verification, and a simplified client‑side data‑operation layer.

Tags: high availability, disaster recovery, cold backup, active standby, cross‑region multi‑active
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
