How to Achieve High Availability for Stateful Backend Services: From Cold Backup to Multi‑Active
This article explains the evolution of high‑availability strategies for stateful backend services, comparing cold backup, dual‑machine hot standby, same‑city active‑active, cross‑city active‑active and multi‑active solutions, and discusses their trade‑offs, implementation details, and practical considerations.
Stateful Service High‑Availability Overview
Backend services are either stateless or stateful. Stateless services achieve high availability (HA) easily behind a load balancer (e.g., F5). This article focuses on stateful services: those that keep data on disk (MySQL, PostgreSQL) or in memory (Redis, Memcached), as well as services that hold short‑lived state in JVM memory.
Cold Backup
A cold backup stops the database, copies the data files (typically with cp on Linux), and stores them in a backup location. It can be triggered manually or via scheduled scripts.
Advantages:
Simple to implement.
Fast backup compared with many incremental methods.
Rapid restore by copying files back or by moving the data directory with mv.
Supports recovery to the point in time of any retained backup (e.g., restoring to a state from before a known incident).
Drawbacks for always‑online services:
Requires service downtime, which rules out "multiple nines" availability targets (e.g., 99.99%).
Data loss can occur between the backup point and the failure; manual log replay (redo logs, business logs) is often needed.
Each backup is a full copy of the data, so it consumes disk space and takes longer as the data set grows; backing up only selected tables is not feasible.
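The cold backup procedure described above can be sketched in a few lines. This is a minimal illustration, not a production tool: the function names `cold_backup` and `restore`, the timestamped folder layout, and the throwaway directory standing in for a database data directory are all assumptions, and stopping and restarting the database is left to the caller.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path) -> Path:
    """Copy the whole data directory into a new timestamped folder.

    The caller must stop the database before calling this and restart it
    afterwards; this sketch does not do that.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_root / f"backup-{stamp}"
    shutil.copytree(data_dir, target)  # Python counterpart of `cp -r`
    return target


def restore(backup_dir: Path, data_dir: Path) -> None:
    """Replace the data directory with the contents of one backup."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(backup_dir, data_dir)


# Demo on a throwaway directory (hypothetical stand-in for real data files).
work = Path(tempfile.mkdtemp())
data = work / "data"
data.mkdir()
(data / "table.db").write_text("rows as of backup time")

snapshot = cold_backup(data, work / "backups")
(data / "table.db").write_text("rows written after the backup")
restore(snapshot, data)  # everything written after the backup is lost
restored_text = (data / "table.db").read_text()
```

The demo also shows the data-loss drawback: the write made after the backup does not survive the restore unless it is replayed from logs.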
Dual‑Machine Hot Standby
Hot standby keeps the primary service running while replicating data to a standby node. Failover still causes a brief outage while the standby takes over.
Active/Standby Mode
One primary node serves traffic; a backup node receives synchronized data. When the primary fails, the standby becomes active. Synchronization can be:
Software‑level: MySQL master‑slave replication via binlog, SQL Server transactional replication, etc.
Hardware‑level: Disk mirroring or sector‑level interception (data‑level disaster recovery).
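The promotion step in active/standby mode is commonly driven by heartbeats. The sketch below assumes a simple timeout rule; the class `FailoverMonitor`, the node names, and the 3-second threshold are hypothetical, and it deliberately ignores data synchronization and split-brain protection (fencing), which a real deployment needs.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time of the most recent heartbeat, in seconds


class FailoverMonitor:
    """Promotes the standby when the primary stops sending heartbeats."""

    def __init__(self, primary: Node, standby: Node, timeout: float = 3.0):
        self.primary = primary
        self.standby = standby
        self.timeout = timeout  # hypothetical threshold, tune per deployment

    def _alive(self, node: Node, now: float) -> bool:
        return now - node.last_heartbeat <= self.timeout

    def check(self, now: float) -> str:
        """Return the name of the node that should serve traffic at `now`."""
        if not self._alive(self.primary, now) and self._alive(self.standby, now):
            # Swap roles: the standby becomes the new primary.
            self.primary, self.standby = self.standby, self.primary
        return self.primary.name


db_a = Node("db-a", last_heartbeat=0.0)
db_b = Node("db-b", last_heartbeat=0.0)
monitor = FailoverMonitor(primary=db_a, standby=db_b)

active_before = monitor.check(now=2.0)  # both nodes reported recently
db_b.last_heartbeat = 5.0               # only the standby keeps reporting
active_after = monitor.check(now=5.5)   # db-a has been silent for 5.5 s
```

The time between the primary's last heartbeat and the promotion is the brief outage that hot standby cannot avoid; a shorter timeout shrinks it at the cost of more false failovers.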
Dual‑Machine Mutual Backup
This is essentially two active/standby pairs with reversed roles: each machine is primary for one workload and standby for the other. It enables read‑write separation and makes better use of both machines.
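One way to picture mutual backup is as a role table with the two machines swapped between workloads. The workload names, machine names, and the `route` helper below are illustrative assumptions only.

```python
# Each machine is primary for one workload and standby for the other.
ROLES = {
    "orders": {"primary": "machine-1", "standby": "machine-2"},
    "catalog": {"primary": "machine-2", "standby": "machine-1"},
}


def route(workload: str, down: frozenset = frozenset()) -> str:
    """Return the machine that should serve `workload`, skipping failed ones."""
    roles = ROLES[workload]
    for role in ("primary", "standby"):
        if roles[role] not in down:
            return roles[role]
    raise RuntimeError(f"no machine available for {workload!r}")


normal_orders = route("orders")
normal_catalog = route("catalog")
# If machine-1 fails, machine-2 serves both workloads.
failed_orders = route("orders", down=frozenset({"machine-1"}))
```

In normal operation both machines do useful work, which is the resource-utilization gain over a plain active/standby pair.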
Same‑City Active‑Active
Extends HA across two data centers within the same city, protecting against an entire IDC failure (power loss, network outage). With proper application design, both sites can read and write concurrently, though not all workloads support true active‑active operation.
Cross‑City Active‑Active
When a single city cannot guarantee continuity (e.g., large‑scale power outages or natural disasters), traffic can be redirected to a distant backup city. This adds latency and degrades the user experience, but provides stronger disaster resilience.
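The redirect decision can be sketched as "use the lowest-latency site that is still healthy". The `Site` type, the site names, and the round-trip figures below are made up for illustration; real traffic steering is usually done at the DNS or load-balancer layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    rtt_ms: float  # round-trip time from the user, in milliseconds
    healthy: bool


def pick_site(sites: list) -> Site:
    """Choose the healthy site with the lowest round-trip time."""
    candidates = [s for s in sites if s.healthy]
    if not candidates:
        raise RuntimeError("no healthy site available")
    return min(candidates, key=lambda s: s.rtt_ms)


# Hypothetical figures: a nearby city and a distant backup city.
normal = pick_site([Site("city-a", 5.0, True), Site("city-b", 40.0, True)])
disaster = pick_site([Site("city-a", 5.0, False), Site("city-b", 40.0, True)])
```

During the outage users are served from the distant city, so they pay the higher round-trip time but the service stays up.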
Cross‑City Multi‑Active
Multi‑active expands the active‑active concept to more than two locations. Each node connects to multiple peers so that any single node failure does not affect service. The trade‑offs are higher write latency and increased data‑conflict risk, requiring strategies such as distributed locks, sharding, or eventual consistency.
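As one illustration of the eventual-consistency option, the sketch below merges two sites' copies with a last-write-wins rule. This rule is an assumption chosen for brevity, not something the article prescribes; the `Versioned` type and the sample keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # logical or wall-clock time of the write
    site: str       # tie-breaker so every site picks the same winner


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write, break ties by site name."""
    return max(a, b, key=lambda v: (v.timestamp, v.site))


def merge_replicas(local: dict, remote: dict) -> dict:
    """Combine two sites' key-value copies into one converged view."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = merge(merged[key], incoming) if key in merged else incoming
    return merged


site_a = {"cart:42": Versioned("2 items", 100, "A")}
site_b = {
    "cart:42": Versioned("3 items", 105, "B"),
    "cart:7": Versioned("1 item", 90, "B"),
}

# Merging in either direction yields the same result, so the sites converge.
merged_ab = merge_replicas(site_a, site_b)
merged_ba = merge_replicas(site_b, site_a)
```

Last-write-wins silently discards the losing write, which is why workloads that cannot tolerate lost updates tend to rely on locks, sharding, or a single write location instead.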
A common practical approach is to centralize writes in a single “Global Zone” (master data center) while allowing reads from any zone, thereby reducing conflict risk.
For applications with strict consistency requirements, the Global Zone still takes every write, while reads are either served locally or bound to the master data center by a database access layer, which keeps the application unaware of the routing.
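The routing rule for a Global Zone can be sketched as a small decision inside the database access layer. The `ZoneRouter` class, the `strict_reads` flag, and the zone names are hypothetical; the sketch only shows which zone an operation is sent to, not how connections are managed.

```python
class ZoneRouter:
    """Sends writes to the Global Zone and reads to a configurable zone."""

    def __init__(self, global_zone: str, local_zone: str, strict_reads: bool = False):
        self.global_zone = global_zone
        self.local_zone = local_zone
        # When True, reads are also bound to the Global Zone so that they
        # always observe the latest write.
        self.strict_reads = strict_reads

    def target(self, operation: str) -> str:
        """Return the zone that should handle a 'read' or 'write'."""
        if operation == "write":
            return self.global_zone
        if operation == "read":
            return self.global_zone if self.strict_reads else self.local_zone
        raise ValueError(f"unknown operation: {operation!r}")


relaxed = ZoneRouter(global_zone="zone-master", local_zone="zone-east")
strict = ZoneRouter(global_zone="zone-master", local_zone="zone-east", strict_reads=True)

relaxed_read = relaxed.target("read")
strict_read = strict.target("read")
any_write = relaxed.target("write")
```

Because the choice lives in the access layer, application code issues reads and writes as usual and never names a zone itself.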
The Global Zone approach is often a transitional step toward full multi‑active deployment; it still has to handle conflict resolution and offers limited horizontal scalability.
Key Design Considerations
Latency vs. Consistency: Longer geographic distance increases write latency; choose between strong consistency (Global Zone) and higher throughput (eventual consistency).
Conflict Resolution: Use distributed locks, two‑phase commit, or sharding to minimize write‑write conflicts.
Resource Utilization: Dual‑machine mutual backup enables read‑write separation; active‑active can improve capacity if the workload supports concurrent writes.
Disaster Scope: Same‑city active‑active protects against IDC‑level failures; cross‑city active‑active protects against city‑wide disasters; multi‑active adds resilience against multiple site failures.
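The latency side of the first consideration can be estimated with simple arithmetic. Light in optical fiber covers roughly 200 km per millisecond, so distance alone sets a floor on the round trip of a synchronous cross-site write. The distances below are hypothetical, and real paths add routing and processing delay on top of this lower bound.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber for a given distance."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_serial_writes_per_second(distance_km: float) -> float:
    """Upper bound on writes per second for one client that waits for
    each remote acknowledgement before sending the next write."""
    return 1000.0 / min_round_trip_ms(distance_km)


same_city_rtt = min_round_trip_ms(50)      # hypothetical 50 km between IDCs
cross_city_rtt = min_round_trip_ms(1000)   # hypothetical 1000 km between cities
cross_city_writes = max_serial_writes_per_second(1000)
```

Under these assumptions the cross-city round trip is at least 20 times the same-city one, which is why synchronous replication is far easier to justify within a city than across cities.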
When designing HA for stateful services, evaluate the required availability level, acceptable latency, data‑consistency guarantees, and operational complexity, then select the appropriate pattern, from simple cold backup to cross‑city multi‑active with a global write zone.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
