Mastering High Availability: From Cold Backups to Multi‑Region Active‑Active Architectures

This article examines high‑availability strategies for stateful backend services, covering cold backups, active‑standby, same‑city and cross‑city active‑active, and multi‑active designs, while discussing their trade‑offs, implementation details, and real‑world enterprise examples.

Open Source Linux

Oct 8, 2022

Preface

Backend services can be divided into stateless and stateful. Stateless services achieve high availability easily through load balancers, while this article focuses on stateful services.

State is typically persisted on disk or in memory, in stores such as MySQL and Redis, or held in JVM memory, which has a relatively short lifetime.

High Availability

1. Common HA Solutions

High‑availability has evolved through several stages:

Cold backup

Active/Standby (dual‑machine hot standby)

Same‑city active‑active

Cross‑city active‑active

Cross‑city multi‑active

Understanding earlier solutions helps explain later designs.

Cold Backup

Cold backup copies the data files while the database is stopped, often with a plain file‑copy command (e.g., cp on Linux). It can be run manually or from a scheduled script. Its advantages are:

Simple

Fast backup compared with other methods

Quick recovery by copying files back or adjusting configuration

Ability to restore to a specific point in time (the moment a given backup was taken)

However, cold backup has drawbacks for modern services:

Requires service downtime, which is unacceptable for 24/7 global applications

Data written between the last backup and the restore is lost unless the logs are replayed manually

Every backup is a full‑volume copy, which wastes storage and takes longer as the data grows

With large data volumes, copying becomes impractical, and the backup cannot be limited to selected data

Balancing these pros and cons is a business decision.
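The cold‑backup procedure described above can be sketched in Python as follows. This is a minimal illustration, not a production tool: the function names and the stop_service and start_service callbacks are hypothetical placeholders for whatever actually stops and starts the database in a given environment.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_root: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> Path:
    """Stop the service, copy the whole data directory, then restart the service."""
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    stop_service()                          # downtime starts here
    try:
        shutil.copytree(data_dir, target)   # full-volume file copy, like `cp -r`
    finally:
        start_service()                     # bring the service back even if the copy fails
    return target


def restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace the data directory with a backup. The service must already be stopped."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)
```

The try/finally makes the cost visible: the service is down for the entire duration of the copy, and a restore returns the data to the moment of that backup, discarding anything written afterwards.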

Active/Standby (Dual‑Machine Hot Standby)

Hot standby replicates data while the service stays online, although failover still causes a brief outage. Shared‑disk approaches are out of scope for this article.

Active/Standby Mode

One primary node serves traffic while a backup node synchronizes data via software (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware (disk mirroring). Software replication is often called application‑level disaster recovery; hardware replication is data‑level disaster recovery.
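To make log‑based software replication concrete, here is a toy in‑memory model. It is a sketch under simplifying assumptions and does not use MySQL's real replication protocol or API: the Node class and its log list are hypothetical stand‑ins for a database instance and its binlog.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered change log, a stand-in for a binlog
    applied: int = 0                          # number of primary log entries applied so far

    def write(self, key, value) -> None:
        """Apply a write locally and append it to the change log."""
        self.data[key] = value
        self.log.append((key, value))

    def replicate_from(self, primary: "Node") -> int:
        """Apply the primary's log entries this node has not seen yet."""
        pending = primary.log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)


primary, standby = Node("primary"), Node("standby")
primary.write("a", 1)
primary.write("b", 2)
standby.replicate_from(primary)   # standby catches up on both entries
primary.write("c", 3)             # written just before a hypothetical crash
# If the standby is promoted now, it holds "a" and "b" but not "c".
```

The last write shows why asynchronous replication can lose recent data on failover: whatever the standby has not yet applied is missing after promotion.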

Active‑Active Mutual Backup

Both machines act as primary, each for a different service, which enables read‑write separation and better resource utilization. However, the two machines cannot both serve the same service at the same time.
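A small routing sketch illustrates the mutual‑backup idea. It assumes that when one machine fails, its peer takes over that machine's service; the machine names, service names, and the route function are all hypothetical.

```python
def route(service: str, primaries: dict[str, str], healthy: set[str]) -> str:
    """Return the machine that should serve `service`."""
    preferred = primaries[service]
    if preferred in healthy:
        return preferred                     # normal case: each service has its own primary
    others = sorted(healthy - {preferred})
    if others:
        return others[0]                     # the peer takes over the failed machine's service
    raise RuntimeError("no healthy machine available")


primaries = {"orders": "machine-a", "billing": "machine-b"}
route("orders", primaries, {"machine-a", "machine-b"})   # served by machine-a
route("orders", primaries, {"machine-b"})                # machine-a is down, machine-b takes over
```

In normal operation each service is still written on exactly one machine, which is why this mode does not yet count as true active‑active for a single service.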

Same‑City Active‑Active

Deploying two data centers within the same city mitigates a single‑site failure (power outage, network loss). The architecture resembles hot standby but with greater distance; latency remains low.

With support built into the application code, true active‑active can offer reads and writes at both sites, though not every application can be adapted to it.

Many companies adopt a “two‑site three‑center” model: two active data centers in the same city plus a remote disaster‑recovery center in another city, which only stores data and takes over when both active centers fail.
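The takeover rule of this model can be written down in a few lines. This is only a sketch of the decision logic; the data center names are hypothetical, and health detection is assumed to happen elsewhere.

```python
ACTIVE_CENTERS = ("city1-dc-a", "city1-dc-b")  # hypothetical same-city data centers
DR_CENTER = "city2-dr"                          # hypothetical remote disaster-recovery center


def serving_centers(healthy: set[str]) -> list[str]:
    """Return the data centers that should receive traffic right now."""
    active = [dc for dc in ACTIVE_CENTERS if dc in healthy]
    if active:
        return active          # one or both same-city centers serve traffic
    if DR_CENTER in healthy:
        return [DR_CENTER]     # both active centers failed: the DR center takes over
    raise RuntimeError("no data center available")
```

The disaster‑recovery center never appears in the result while at least one same‑city center is healthy, which matches its role of storing data only until both active centers fail.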

Cross‑City Active‑Active

When a large‑scale outage occurs, traffic can be switched to a remote city, sacrificing user experience but maintaining service continuity.

Many internet companies adopt cross‑city active‑active, accepting higher latency and the risk of data conflicts in exchange.

Cross‑City Multi‑Active

Extending the active‑active concept, each node connects to a local database cluster; failover to a remote cluster occurs only when the local cluster is completely unavailable.

Greater distance lengthens synchronization time, which reduces throughput and makes data conflicts more likely. Mitigations include distributed locks, eventual consistency, sharding, and specialized architectures such as a “Global Zone”, where all writes are directed to a single master data center.

“For applications with strict consistency requirements, a ‘Global Zone’ provides cross‑region read‑write separation, routing all writes to a master data center while reads can be served locally.” (Source: “Ele.me Multi‑Region Active‑Active Technical Implementation (Part 1)”)
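The read‑write separation described in the quote can be illustrated with a tiny router. This is an assumption‑laden sketch rather than Ele.me's implementation: the class name, the data center names, and the two‑value operation argument are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalZoneRouter:
    master_dc: str   # the single data center that accepts writes
    local_dc: str    # the data center closest to the caller

    def target(self, operation: str) -> str:
        """Pick the data center for a 'read' or a 'write'."""
        if operation == "write":
            return self.master_dc    # all writes go to one place, so they cannot conflict
        if operation == "read":
            return self.local_dc     # reads stay local for low latency
        raise ValueError(f"unknown operation: {operation!r}")


router = GlobalZoneRouter(master_dc="dc-shanghai", local_dc="dc-beijing")
router.target("write")   # routed to dc-shanghai
router.target("read")    # routed to dc-beijing
```

The trade‑off is that writes from remote regions pay cross‑region latency, and local reads may briefly lag behind the master until replication catches up.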

Multi‑active is a stepping stone toward architectures that support horizontal scaling and reduce conflict risk.

Multi‑Active Architectures in Large Enterprises

Examples from Alibaba and Taobao illustrate how business units are sharded, with a central unit handling complex transactions and peripheral units handling simpler workloads. This requires extensive code refactoring, distributed transaction handling, and robust testing and operations.
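As a rough illustration of unit‑based sharding, the sketch below maps each user to a fixed unit and sends complex transactions to a central unit. The source does not specify the sharding key or routing rule, so the user‑ID hash, the unit names, and the complex_transaction flag are all assumptions.

```python
import zlib

UNITS = ("unit-1", "unit-2", "unit-3")   # hypothetical peripheral units
CENTER = "center-unit"                    # hypothetical central unit


def unit_for(user_id: str, complex_transaction: bool = False) -> str:
    """Choose the unit that should handle a request for this user."""
    if complex_transaction:
        return CENTER                             # complex transactions go to the central unit
    shard = zlib.crc32(user_id.encode("utf-8"))   # stable hash, identical across processes
    return UNITS[shard % len(UNITS)]              # same user always lands in the same unit
```

A stable hash such as CRC32 is used here instead of Python's built‑in hash(), because the built‑in string hash is randomized per process and would route the same user to different units on different servers.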

Implementing such disaster‑recovery levels demands strong foundational capabilities: data transfer, verification, and a simplified data access layer.
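Data verification between sites can be approximated by comparing per‑row digests. The following is a minimal sketch that assumes rows fit in memory as dictionaries keyed by primary key; the function names are hypothetical, and real systems typically compare checksums over chunks of rows rather than one row at a time.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Hash a row in a canonical field order so equal rows give equal digests."""
    canonical = repr(sorted(row.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def find_mismatches(source: dict, replica: dict) -> list:
    """Return the keys whose rows are missing on one side or differ in content."""
    keys = source.keys() | replica.keys()
    return sorted(
        key for key in keys
        if key not in source
        or key not in replica
        or row_digest(source[key]) != row_digest(replica[key])
    )


source = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
replica = {1: {"name": "a"}, 2: {"name": "B"}}
find_mismatches(source, replica)   # → [2, 3]
```

Row 2 differs in content and row 3 is missing from the replica, so both are reported for repair.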

Source: https://blog.dogchao.cn/?p=299
