
How to Achieve High Availability for Stateful Backend Services?

This article explores various high‑availability strategies for stateful backend services, comparing cold backup, active/standby, same‑city active‑active, and multi‑site active‑active solutions, discussing their benefits, limitations, and practical implementation examples from large‑scale internet companies.

Programmer DD

Preface

Backend services can be divided into two categories: stateless and stateful. High availability for stateless applications is relatively simple: a load balancer such as F5, or any proxy, placed in front of several instances solves the problem. The following sections focus on stateful services.

Stateful services keep their state on disk or in memory, for example MySQL, Redis, or state held in JVM memory (which usually has a short lifecycle).

High Availability

1. Some High‑Availability Solutions

From a historical perspective, high‑availability has evolved through the following stages:

Cold backup

Active/standby (dual‑machine hot standby)

Same‑city active‑active

Cross‑city active‑active

Cross-region multi-active

Before discussing cross-region multi-active, it is useful to review the earlier solutions to understand their design motivations.

Cold Backup

Cold backup stops the database service and copies data files (e.g., using the cp command on Linux). It can be performed manually or via scheduled scripts and offers several advantages:

Simple

Fast backup compared with other methods

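Quick recovery: copy the backup files back to the working directory, or point the database configuration at the backup directory; two mv commands can complete the restore almost instantly

Restore to a known point in time: the data can be rolled back to the moment a given backup was taken, which is useful after incidents such as the Pinduoduo coupon vulnerability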

However, cold backup has significant drawbacks in modern scenarios:

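Service downtime: the database must be stopped for every backup, so availability targets measured in nines cannot be met; global services also have no low-traffic period in which to schedule the downtime

Data loss: everything written between the last backup and the restore is lost unless it is recovered manually from logs or the redo log, which is labor-intensive and error-prone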

Full‑volume backup consumes excessive disk space and time; selective table backup is not feasible

Copying terabytes of data to external storage is impractical

Balancing these pros and cons is a business‑specific decision.
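The cold backup procedure described above can be sketched in a few lines. This is a minimal illustration rather than a production tool: the `stop_service` and `start_service` callbacks, the directory layout, and the `.damaged` suffix are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_dir: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> Path:
    """Stop the service, copy its data files, then start it again."""
    stop_service()                      # downtime begins here
    try:
        # Equivalent of `cp -r data_dir backup_dir`; raises if backup_dir exists.
        shutil.copytree(data_dir, backup_dir)
    finally:
        start_service()                 # restart even if the copy failed
    return backup_dir


def cold_restore(backup_dir: Path, data_dir: Path) -> None:
    """Swap the backup into place with two renames (the two mv commands)."""
    damaged = data_dir.with_name(data_dir.name + ".damaged")
    data_dir.rename(damaged)            # mv data data.damaged
    backup_dir.rename(data_dir)         # mv backup data
```

The renames are only near-instant when the backup and the working directory live on the same file system; anything written after the backup was taken is not restored.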

Active/Standby (Dual‑Machine Hot Standby)

Hot standby differs from cold backup by allowing continuous service while backing up, though a failover still requires a brief outage. Shared‑disk approaches are excluded from this discussion.

Active/Standby Mode

This is a classic 1‑master‑1‑slave setup: the master serves traffic, the standby synchronizes data and can take over if the master fails. Synchronization can be software‑based (e.g., MySQL binlog replication, SQL Server transactional replication) or hardware‑based (disk mirroring). Software‑level is often called application‑level disaster recovery; hardware‑level is data‑level disaster recovery.
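The takeover step can be illustrated with a small health-check loop. The sketch below is one possible shape, not how any particular product does it: `check_master` and `promote_standby` are hypothetical hooks, and the threshold of three consecutive failures is an assumed value.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Promotes the standby after several consecutive failed health checks."""
    check_master: Callable[[], bool]      # hypothetical probe: True if the master is healthy
    promote_standby: Callable[[], None]   # hypothetical hook, e.g. repoint a virtual IP
    max_failures: int = 3                 # assumed threshold
    failures: int = 0
    active: str = "master"

    def tick(self) -> str:
        """Run one health check and return the node that should serve traffic."""
        if self.active == "standby":
            return self.active            # already failed over; failback is manual here
        if self.check_master():
            self.failures = 0             # a healthy check resets the counter
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.promote_standby()
                self.active = "standby"
        return self.active
```

Requiring several consecutive failures avoids promoting the standby on a single dropped probe, at the cost of a slightly longer outage before takeover.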

Dual‑Machine Mutual Backup

Essentially the same Active/Standby concept, but each machine acts as master for a different business, enabling read‑write separation and better resource utilization.

Other HA options include MySQL master‑slave, master‑master, MHA; Redis master‑slave, Sentinel, Cluster, etc.

2. Same‑City Active‑Active

Same‑city active‑active extends the previous solutions across data centers within a city, protecting against an entire IDC failure (power outage, network cut). The architecture is similar to dual‑machine hot standby but with greater distance; latency remains low.

With proper code support, true active‑active (dual‑master with conflict resolution) is possible, though not all applications can handle it.
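The article does not say which conflict-resolution strategy is meant. As one common possibility, the sketch below uses last-writer-wins: every write carries a timestamp and a site identifier, and the newer write is kept. The `Versioned` type and its field names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int   # assumed logical or wall-clock time of the write
    site: str        # assumed data-center identifier, used as a tie-breaker


def resolve(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: keep the newer write; break ties by site name."""
    return max(a, b, key=lambda v: (v.timestamp, v.site))


def merge(local: Dict[str, Versioned],
          remote: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two replicas key by key so both sides converge to the same state."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = resolve(merged[key], incoming) if key in merged else incoming
    return merged
```

Last-writer-wins silently discards the losing write, which is why it does not suit every application; counters or account balances usually need a different strategy.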

Many companies adopt a “two-site-three-center” model: two active data centers in the same city and a third backup center in a remote city for disaster recovery. Traffic is load-balanced across the two active data centers, and data is synchronized over dedicated links. If one data center fails, traffic fails over to the other one in the same city; if both fail, the remote center takes over, albeit with higher latency.

Two‑site‑three‑center diagram

When a city experiences a large‑scale outage (e.g., earthquake), the remote center preserves data, but user experience degrades due to increased latency.
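The failover order in a two-site-three-center layout can be expressed as a simple priority list. This is a rough sketch under stated assumptions: the center names are made up, and health is passed in as a plain dictionary rather than probed.

```python
from typing import Dict, List, Optional

# Hypothetical names: the two same-city centers first, the remote backup center last.
PRIORITY: List[str] = ["city-idc-1", "city-idc-2", "remote-idc"]


def pick_center(healthy: Dict[str, bool], preferred: str) -> Optional[str]:
    """Return the center that should serve a request.

    The preferred center is tried first, then the remaining centers
    in PRIORITY order. Centers missing from `healthy` count as down.
    """
    order = [preferred] + [c for c in PRIORITY if c != preferred]
    for center in order:
        if healthy.get(center, False):
            return center
    return None  # no center is healthy
```

For example, a request that prefers `city-idc-2` moves to `city-idc-1` when its own center is down, and reaches `remote-idc` only when both same-city centers are unavailable.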

Two‑site‑three‑center master‑slave mode

3. Cross‑City Active‑Active

Same-city active-active handles most disaster scenarios, but large-scale events (regional power loss, natural disasters) can still take out every data center in the city. Extending the same-city architecture to cross-city active-active allows traffic to shift to a distant city, sacrificing some user experience for continuity.

Simple cross‑city active‑active diagram

In this setup, traffic is load‑balanced to both cities; each city’s servers connect only to local databases. Only when both local databases become unavailable does traffic fail over to the remote database cluster, incurring higher latency and potential throughput loss.

Because both cities accept writes, the same data can be modified in two places at once. To mitigate such conflicts, techniques such as distributed locks, distributed transactions, sharding, or eventual consistency are employed.

For applications with strict consistency requirements, a “Global Zone” solution can be used: writes are directed to a single master data center, while reads are served from slaves or bound to the master, all transparent to the business layer. (Source: “Ele.me Cross-Region Multi-Active Technical Implementation (Part 1) Overview”)
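The routing rule behind a Global Zone can be sketched as follows. This is an illustration of the idea only, not Ele.me's implementation: the `GlobalZoneRouter` class, its field names, and the `read_your_writes` flag are assumptions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GlobalZoneRouter:
    master: str                      # the single data center that accepts writes
    local_replicas: Dict[str, str]   # zone -> read replica located in that zone

    def route(self, zone: str, is_write: bool,
              read_your_writes: bool = False) -> str:
        """Return the database endpoint a request from `zone` should use."""
        if is_write or read_your_writes:
            return self.master       # writes, and reads bound to the master
        # Ordinary reads use the local replica, falling back to the master.
        return self.local_replicas.get(zone, self.master)
```

Keeping this decision inside a data-access layer is what makes the scheme transparent to business code: callers state only whether they read or write.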

Thus, cross‑city active‑active is a stepping stone toward full cross‑region multi‑active, which provides higher resilience but introduces data‑conflict and latency challenges.

4. Cross-Region Multi-Active

Cross‑region multi‑active diagram

The design connects each node with four inbound/outbound links, ensuring that any single node failure does not affect the service. However, longer write paths increase latency and data‑conflict risk, reducing throughput. Solutions include distributed locks, retry mechanisms, or sharding to keep transactions local.

Alibaba’s “Global Zone” architecture isolates writes to a master zone to guarantee strong consistency, while reads are distributed across zones.

Alibaba Global Zone architecture

Many businesses, such as ride‑hailing, can shard by city, allowing each data center to operate independently with occasional synchronization for reporting.
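Sharding by city amounts to a lookup from city to owning data center. The sketch below assumes a hard-coded mapping and a fallback center; the city names, center names, and the `(order_id, city)` tuple shape are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical city-to-data-center mapping; a real system would load this from config.
CITY_TO_IDC: Dict[str, str] = {
    "beijing": "idc-north",
    "shanghai": "idc-east",
    "guangzhou": "idc-south",
}
FALLBACK_IDC = "idc-east"  # assumed default for cities without a mapping


def home_idc(city: str) -> str:
    """Return the data center that owns all data for the given city."""
    return CITY_TO_IDC.get(city.strip().lower(), FALLBACK_IDC)


def group_by_idc(orders: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (order_id, city) pairs by the data center that owns them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for order_id, city in orders:
        groups[home_idc(city)].append(order_id)
    return dict(groups)
```

Because a ride starts and ends within one city's shard, each write stays inside a single data center and no cross-region coordination is needed on the request path.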

E‑commerce platforms, however, have complex inter‑dependencies. Taobao’s solution partitions by business unit, with a central unit handling the most complex scenarios and peripheral units being elastic and fault‑tolerant.

Taobao unit‑based cross‑region multi‑active

Implementing such architectures requires extensive code refactoring, distributed transaction handling, cache invalidation, and robust testing and operations pipelines.

In summary, cross-region multi-active demands strong foundational capabilities: data transfer, data verification, and a simplified data-access layer that manages writes and synchronization.
