
How to Achieve High Availability for Stateful Backend Services: From Cold Backup to Multi‑Active

This article explains the evolution of high‑availability strategies for stateful backend services, comparing cold backup, dual‑machine hot standby, same‑city active‑active, cross‑city active‑active and multi‑active solutions, and discusses their trade‑offs, implementation details, and practical considerations.

ITPUB

Stateful Service High‑Availability Overview

Backend services are either stateless or stateful. Stateless services achieve high availability (HA) easily behind a load balancer (e.g., F5). The focus here is on stateful services: those that keep data on disk (MySQL, PostgreSQL) or in memory (Redis, Memcached), as well as those that hold short‑lived state in JVM memory.

Cold Backup

A cold backup stops the database, copies the data files (typically with cp on Linux), and stores them in a backup location. It can be triggered manually or via scheduled scripts.

Simple to implement.

Fast backup compared with many incremental methods.

Rapid restore by copying files back or by moving the data directory with mv.

Supports restoring to the moment of any retained backup (e.g., a backup taken before a known incident).

Drawbacks for always‑online services:

Requires service downtime, which rules out high "nines" availability targets.

Data loss can occur between the backup point and the failure; manual log replay (redo logs, business logs) is often needed.

Full‑volume backups consume a lot of disk space and take a long time on large data sets; backing up only selected tables is not feasible.
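
The cold‑backup procedure described in this section (stop, copy, restart) can be sketched as follows. This is a minimal illustration, not a production tool: the stop_service and start_service hooks are hypothetical placeholders for whatever actually stops and starts the database, and the timestamped directory naming is an assumption.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(
    data_dir: Path,
    backup_root: Path,
    stop_service: Callable[[], None],
    start_service: Callable[[], None],
) -> Path:
    """Stop the service, copy its data directory, then restart it.

    stop_service / start_service are hypothetical hooks supplied by the
    caller (for example, wrappers around the init system).
    """
    target = backup_root / time.strftime("backup-%Y%m%d-%H%M%S")
    stop_service()  # downtime starts here
    try:
        shutil.copytree(data_dir, target)  # full copy, comparable to `cp -r`
    finally:
        start_service()  # restart even if the copy failed
    return target


def restore(backup: Path, data_dir: Path) -> None:
    """Replace the data directory with a backup; the service must be stopped."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup, data_dir)
```

The try/finally block keeps the downtime window as short as the copy itself, which is the main cost the drawbacks above refer to.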

Dual‑Machine Hot Standby

Hot standby keeps the primary service running while replicating data to a standby node. Failover still causes a brief outage.

Active/Standby Mode

One primary node serves traffic; a backup node receives synchronized data. When the primary fails, the standby becomes active. Synchronization can be:

Software‑level: MySQL master‑slave via binlog, SQL Server transactional replication, etc.

Hardware‑level: Disk mirroring or sector‑level interception (data‑level disaster recovery).
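
The failover decision in active/standby mode can be sketched with a simple heartbeat check. This is an illustrative sketch under stated assumptions: the Node type, the 3‑second timeout, and the "failed" role label are all hypothetical, and real deployments typically add fencing so that a slow primary cannot keep accepting writes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str              # "primary", "standby", or "failed"
    last_heartbeat: float  # timestamp of the last heartbeat, in seconds


def failover_if_needed(primary: Node, standby: Node, now: float,
                       timeout: float = 3.0) -> Node:
    """Return the node that should serve traffic at time `now`.

    The standby is promoted only when the primary has missed its heartbeat
    window and the standby itself is still alive.
    """
    if now - primary.last_heartbeat <= timeout:
        return primary
    if now - standby.last_heartbeat > timeout:
        raise RuntimeError("both primary and standby are unresponsive")
    primary.role = "failed"
    standby.role = "primary"  # promotion: standby starts serving traffic
    return standby
```

The gap between the primary's last heartbeat and the promotion is the brief outage that hot standby cannot fully eliminate.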

Dual‑Machine Mutual Backup

Essentially two active/standby pairs with reversed roles for different workloads, enabling read‑write separation and better resource utilization across two machines.
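
One way to picture mutual backup is as a routing table in which each machine is primary for one workload and standby for the other. The workload and machine names below are hypothetical, chosen only for illustration.

```python
from typing import AbstractSet

# workload -> (primary machine, standby machine); roles are reversed
# between the two workloads so both machines do useful work.
ASSIGNMENTS = {
    "orders": ("machine-a", "machine-b"),
    "users": ("machine-b", "machine-a"),
}


def serving_host(workload: str, healthy: AbstractSet[str]) -> str:
    """Return the machine that should serve `workload` right now."""
    primary, standby = ASSIGNMENTS[workload]
    if primary in healthy:
        return primary
    if standby in healthy:
        return standby  # the surviving machine takes over this workload too
    raise RuntimeError(f"no healthy machine for workload {workload!r}")
```

When one machine fails, the survivor carries both workloads, so each machine needs enough headroom for that case.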

Same‑City Active‑Active

Extends HA across two data centers within the same city, protecting against an entire IDC failure (power loss, network outage). With proper application design, both sites can read and write concurrently, though not all workloads support true active‑active operation.

Cross‑City Active‑Active

When a single city cannot guarantee continuity (e.g., during large‑scale power outages or natural disasters), traffic can be redirected to a distant backup city. This adds latency and degrades the user experience but provides stronger disaster resilience.

Cross‑city active‑active diagram
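
The redirect behavior can be sketched as choosing the lowest‑latency healthy site, which naturally falls back to the distant city only when the nearby sites are down. The Site fields and the round‑trip figures are assumptions made for illustration, not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Site:
    name: str
    city: str
    healthy: bool
    rtt_ms: float  # assumed round-trip time from the user to this site


def pick_site(sites: Iterable[Site]) -> Site:
    """Pick the healthy site with the lowest round-trip time."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda s: s.rtt_ms)
```

With both local data centers healthy, users stay in their own city; only a city‑wide failure sends them to the remote site and its higher latency.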

Cross‑City Multi‑Active

Multi‑active expands the active‑active concept to more than two locations. Each node connects to multiple peers so that any single node failure does not affect service. The trade‑offs are higher write latency and increased data‑conflict risk, requiring strategies such as distributed locks, sharding, or eventual consistency.
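
Of the strategies just listed, sharding is the simplest to illustrate: if every record has exactly one "home" zone that accepts its writes, two zones never write the same record concurrently. The sketch below assumes hash‑based sharding on a user ID; the zone names and the choice of SHA‑256 are illustrative assumptions.

```python
import hashlib
from typing import Sequence

ZONES = ("zone-a", "zone-b", "zone-c")  # hypothetical zone names


def home_zone(user_id: str, zones: Sequence[str] = ZONES) -> str:
    """Map a user to the single zone allowed to accept that user's writes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(zones)
    return zones[index]


def must_forward(user_id: str, local_zone: str) -> bool:
    """True if a write received in `local_zone` belongs to another zone."""
    return home_zone(user_id) != local_zone
```

A stable hash is used rather than Python's built‑in hash(), which is randomized per process for strings and would send the same user to different zones on different machines.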

A common practical approach is to centralize writes in a single “Global Zone” (the master data center) while allowing reads from any zone, which reduces conflict risk.

For applications with strict consistency requirements, reads can also be bound to the master instead of being served locally. A database access layer handles this routing, so the application remains unaware of it.
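
The routing performed by such an access layer can be sketched as follows. The AccessLayer class, the zone names, and the idea of marking individual tables as strict are assumptions made for illustration; the source does not specify how strictness is configured.

```python
from dataclasses import dataclass


@dataclass
class AccessLayer:
    local_zone: str
    global_zone: str  # the single master data center that accepts writes
    strict_tables: frozenset = frozenset()  # tables whose reads must hit the master

    def route(self, operation: str, table: str) -> str:
        """Return the zone that should execute the operation."""
        if operation == "write":
            return self.global_zone  # all writes are centralized
        if operation == "read":
            if table in self.strict_tables:
                return self.global_zone  # strict consistency: read from master
            return self.local_zone       # otherwise serve the read locally
        raise ValueError(f"unknown operation: {operation!r}")
```

For example, AccessLayer("zone-b", "zone-a", frozenset({"accounts"})) sends every write and every read of the accounts table to zone-a, while other reads stay in zone-b.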

This Global Zone form of multi‑active is often a transitional step toward full multi‑active deployment; it still has to deal with conflict resolution and offers limited horizontal scalability.

Key Design Considerations

Latency vs. Consistency: Longer geographic distance increases write latency; choose between strong consistency (global zone) and higher throughput (eventual consistency).

Conflict Resolution: Use distributed locks, two‑phase commit, or sharding to minimize write‑write conflicts.

Resource Utilization: Dual‑machine mutual backup enables read‑write separation; active‑active can improve capacity if the workload supports concurrent writes.

Disaster Scope: Same‑city active‑active protects against IDC‑level failures; cross‑city active‑active protects against city‑wide disasters; multi‑active adds resilience against multiple site failures.

When designing HA for stateful services, evaluate the required availability level, acceptable latency, data‑consistency guarantees, and operational complexity to select the appropriate pattern—from simple cold backup to cross‑city multi‑active with a global write zone.

