Mastering High Availability: From Cold Backup to Multi‑Active Architecture

This article examines high‑availability strategies for stateful backend services, covering cold backup, dual‑machine hot standby, same‑city active‑active, and remote multi‑active solutions, while discussing their benefits, trade‑offs, and architectural patterns for resilient distributed systems.

MaGe Linux Operations

Preface

Backend services are either stateless or stateful. Stateless applications achieve high availability easily by sitting behind a load balancer or proxy, so this article focuses on stateful services.

High Availability

High‑Availability Solutions

High availability has evolved through several stages:

Cold backup

Dual‑machine hot standby

Same‑city active‑active

Remote active‑active

Remote multi‑active

Understanding earlier solutions helps explain the design rationale of later architectures.

Cold Backup

Cold backup copies data files while the database is offline, typically with plain file‑copy commands (e.g., cp on Linux). It can be run manually or by a scheduled script. Its advantages:

Simple to implement

Fast backup compared to other methods

Quick restoration by copying files back or adjusting configuration

Restores the data to the exact point in time at which the backup was taken

However, cold backup has significant drawbacks in modern environments:

Requires service downtime, which is unacceptable for globally available applications

Data written between the last backup and the restore is lost unless it is recovered by manually replaying logs or requests

Full‑volume backups waste storage and are time‑consuming

Infeasible to back up terabytes of data daily with portable media

Balancing these pros and cons is essential for each business.
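The cold backup procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: the function names cold_backup and cold_restore and the stop_service / start_service callbacks are hypothetical stand-ins for whatever actually stops and starts the database.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_root: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> Path:
    """Copy data_dir into a timestamped folder while the service is stopped."""
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    stop_service()                  # files must not change during the copy
    try:
        shutil.copytree(data_dir, target)   # full copy, like `cp -r`
    finally:
        start_service()             # restart even if the copy failed
    return target


def cold_restore(backup_dir: Path, data_dir: Path,
                 stop_service: Callable[[], None],
                 start_service: Callable[[], None]) -> None:
    """Replace data_dir with the contents of a previous backup."""
    stop_service()
    try:
        if data_dir.exists():
            shutil.rmtree(data_dir)         # discard the current data files
        shutil.copytree(backup_dir, data_dir)
    finally:
        start_service()
```

The sketch makes the drawbacks visible: the service is down for the whole copy, every run copies the full data set, and anything written after the backup is gone after a restore.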

Dual‑Machine Hot Standby

Hot standby performs backup while the service remains online, but restoration still requires downtime. This discussion excludes shared‑disk approaches.

Active/Standby Mode

One primary node serves traffic while a secondary node acts as a backup. Data is synchronized from primary to secondary via software (e.g., MySQL master/slave binlog replication, SQL Server transactional replication) or hardware (disk mirroring). Software‑level replication is often called application‑level disaster recovery; hardware mirroring is data‑level disaster recovery.
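To make the primary‑to‑secondary synchronization concrete, here is a toy in‑memory model of log‑based replication. It is only a sketch under simplifying assumptions: the Node class, replicate, and replication_lag are hypothetical names, and the list‑based log merely imitates the role a binlog plays; it is not how MySQL or SQL Server implement replication.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered change log (primary side)
    applied: int = 0                          # entries applied so far (standby side)

    def write(self, key, value):
        """Apply a write locally and record it in the change log."""
        self.data[key] = value
        self.log.append((key, value))


def replicate(primary: Node, standby: Node, max_entries=None) -> int:
    """Apply log entries the standby has not seen yet; return how many."""
    pending = primary.log[standby.applied:]
    if max_entries is not None:
        pending = pending[:max_entries]       # simulate a slow or lagging link
    for key, value in pending:
        standby.data[key] = value
    standby.applied += len(pending)
    return len(pending)


def replication_lag(primary: Node, standby: Node) -> int:
    """Number of primary writes the standby is still missing."""
    return len(primary.log) - standby.applied
```

If the primary fails while replication_lag is greater than zero, the standby takes over without those writes, which is why asynchronous replication trades some durability for lower write latency.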

Dual‑Machine Mutual Standby

Each machine is the primary for one service and the standby for another, which enables read‑write separation and better resource utilization. This pattern can be extended with database deployment modes such as MySQL master‑master, MHA, Redis master/slave, Sentinel, or Cluster.
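The mutual standby arrangement can be illustrated with a small ownership table. In this hedged sketch, assign_owners and the service and machine names are hypothetical: each service lists a preferred primary and a standby, and the function decides who serves it given the set of healthy machines.

```python
def assign_owners(preferred: dict, healthy: set) -> dict:
    """Map each service to the machine that should serve it right now.

    preferred maps service name -> (primary machine, standby machine).
    A service with no healthy machine maps to None.
    """
    owners = {}
    for service, (primary, standby) in preferred.items():
        if primary in healthy:
            owners[service] = primary      # normal operation
        elif standby in healthy:
            owners[service] = standby      # peer takes over
        else:
            owners[service] = None         # both machines are down
    return owners


# Machine A is primary for "orders", machine B is primary for "users".
layout = {"orders": ("A", "B"), "users": ("B", "A")}
normal = assign_owners(layout, {"A", "B"})
after_a_fails = assign_owners(layout, {"B"})
```

In normal operation both machines do useful work; when one fails, the survivor carries both services, so each machine needs enough headroom for the combined load.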

Same‑City Active‑Active

This approach extends hot standby across data centers within the same city, protecting against an entire IDC failure (e.g., power outage). It is similar to dual‑machine hot standby but with greater geographic distance, typically using dedicated city‑level links.

Some systems achieve true active‑active operation with dual masters handling both reads and writes, provided conflict resolution is carefully managed.
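One common way to resolve conflicts between two masters is last‑write‑wins (LWW). The sketch below is an illustration of that general technique, not the method of any system named in this article; Versioned and merge are hypothetical names, and the site name is used only as a tie‑breaker for equal timestamps.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: float   # time of the write, as recorded by the writing site
    site: str          # which master accepted the write


def merge(local: dict, remote: dict) -> dict:
    """Merge two key -> Versioned maps, keeping the newest write per key."""
    merged = dict(local)
    for key, theirs in remote.items():
        ours = merged.get(key)
        # Compare (timestamp, site) so both sides pick the same winner
        # even when two writes carry identical timestamps.
        if ours is None or (theirs.timestamp, theirs.site) > (ours.timestamp, ours.site):
            merged[key] = theirs
    return merged
```

Because the comparison is deterministic, both data centers converge to the same state regardless of merge order. The cost is that the losing write is silently discarded and the result depends on reasonably synchronized clocks, so LWW suits data where that loss is acceptable.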

Remote Active‑Active

Same‑city active‑active cannot survive a disaster that affects the whole city, so remote active‑active deploys front‑end entry points and applications in a second city. When the primary city fails, traffic is redirected to the secondary city, at the cost of higher latency and a degraded user experience.
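The redirect from the primary city to the secondary city can be sketched as a small router driven by health probes. This is an assumption‑laden illustration: FailoverRouter, report_probe, and the threshold of consecutive failures are hypothetical, and failing back on the first successful probe is a deliberate simplification.

```python
class FailoverRouter:
    """Send traffic to the primary city unless it keeps failing health probes."""

    def __init__(self, primary: str, secondary: str, threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold   # consecutive failures before failing over
        self.failures = 0
        self.active = primary

    def report_probe(self, ok: bool) -> str:
        """Record one health-probe result for the primary; return the active city."""
        if ok:
            self.failures = 0
            self.active = self.primary        # simplification: immediate failback
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = self.secondary  # primary considered down
        return self.active
```

Requiring several consecutive failures avoids redirecting all traffic to a slower, more distant city because of a single lost probe.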

Most internet companies adopt remote active‑active for disaster resilience.

Remote Multi‑Active

Building on remote active‑active, remote multi‑active adds additional nodes to form a mesh where any node can fail without impacting service. This introduces challenges such as increased synchronization latency, data conflicts, and the need for distributed locks or eventual consistency mechanisms.

For applications with strict consistency requirements, a Global Zone solution directs all writes to a single master data center while allowing reads from any replica, achieving strong consistency without exposing complexity to the business layer. (Source: "Ele.me Remote Multi‑Active Technical Implementation (Part 1)")
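The routing rule behind a Global Zone can be sketched as follows. This is a hedged illustration of the idea only: GlobalZoneRouter and the data center names are hypothetical, and the sketch says nothing about how the cited implementation replicates data. Whether a replica read sees the latest write depends on the replication setup, which is outside this sketch.

```python
class GlobalZoneRouter:
    """Route writes to one master data center and reads to a local replica."""

    def __init__(self, master_dc: str, replica_dcs):
        self.master_dc = master_dc
        # The master can also serve reads, so include it among the replicas.
        self.replica_dcs = set(replica_dcs) | {master_dc}

    def route(self, operation: str, local_dc: str) -> str:
        """Return the data center that should handle the operation."""
        if operation == "write":
            return self.master_dc             # single writer: no write conflicts
        if operation == "read":
            if local_dc in self.replica_dcs:
                return local_dc               # serve the read close to the caller
            return self.master_dc             # no local replica: fall back
        raise ValueError(f"unknown operation: {operation!r}")
```

Keeping this decision inside a routing layer is what lets business code issue ordinary reads and writes without knowing which data center is the master.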

In practice, remote multi‑active designs often evolve to add sharding and unit‑based partitioning, as illustrated by the Alibaba and Taobao architectures.

These designs demand powerful underlying capabilities such as high‑throughput data transfer, robust data validation, and simplified client‑side write/sync control.
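Unit‑based partitioning pins each user's data and traffic to one unit. The sketch below shows one simple way to do that with a stable hash; it is an assumption for illustration and not the routing scheme Alibaba or Taobao actually use. unit_for, dispatch, and the unit names are hypothetical.

```python
import hashlib


def unit_for(user_id: str, units: list) -> str:
    """Map a user to one unit with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(units)
    return units[index]


def dispatch(user_id: str, receiving_unit: str, units: list) -> str:
    """Decide whether the receiving unit handles the request or forwards it."""
    owner = unit_for(user_id, units)
    if owner == receiving_unit:
        return "local"                 # this unit owns the user's data
    return "forward:" + owner          # hand the request to the owning unit
```

Because every write for a given user lands in the same unit, cross‑unit write conflicts for that user's data are avoided. Plain modulo hashing remaps many users when the number of units changes, so a real system would need a more careful mapping.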

Figure: Two‑city three‑center diagram
Figure: Two‑city three‑center master‑slave mode
Figure: Simple remote active‑active diagram
Figure: Remote multi‑active diagram
Figure: Alibaba ideal remote multi‑active architecture
Figure: Taobao unit‑based remote multi‑active architecture