How to Achieve High Availability for Stateful Backend Services?
This article explores various high‑availability strategies for stateful backend services, comparing cold backup, active/standby, same‑city active‑active, and multi‑site active‑active solutions, discussing their benefits, limitations, and practical implementation examples from large‑scale internet companies.
Preface
Backend services can be divided into two categories: stateless and stateful. High availability for stateless applications is relatively simple—using load balancers such as F5 or any proxy can solve the problem. The following sections focus on stateful services.
Stateful services maintain their state via disk or memory, e.g., MySQL, Redis, or JVM memory (which usually has a short lifecycle).
High Availability
1. Some High‑Availability Solutions
From a historical perspective, high availability has evolved through the following stages:
Cold backup
Active/standby (dual‑machine hot standby)
Same‑city active‑active
Cross‑city active‑active
Cross‑city multi‑active
Before discussing cross‑city multi‑active, it is useful to review the earlier solutions to understand their design motivations.
Cold Backup
Cold backup stops the database service and copies data files (e.g., using the cp command on Linux). It can be performed manually or via scheduled scripts and offers several advantages:
Simple
Fast backup compared with other methods
Quick recovery—copy the backup files back to the working directory or adjust the database configuration; two mv commands can complete the restore instantly
Point‑in‑time recovery—useful for incidents such as the Pinduoduo coupon vulnerability
However, cold backup has significant drawbacks in modern scenarios:
Service downtime: availability targets measured in nines (such as 99.99%) are out of reach, and a global service has no low‑traffic window in which to schedule the downtime
Data loss: anything written after the last backup is lost unless it is recovered by manual log replay or redo‑log recovery, which is labor‑intensive and error‑prone
Full‑volume backup consumes excessive disk space and time; selective table backup is not feasible
Copying terabytes of data to external storage is impractical
Balancing these pros and cons is a business‑specific decision.
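The stop, copy, and restore steps of a cold backup can be sketched in Python. This is only an illustration under stated assumptions: `stop_service` and `start_service` are hypothetical callables standing in for whatever stops and starts the database, and the directories are placeholders. The restore here copies the backup back rather than moving it, so the backup stays available.

```python
import shutil
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_dir: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> None:
    """Stop the service, copy the data files, then restart the service."""
    stop_service()  # the service is unavailable from here on
    try:
        if backup_dir.exists():
            shutil.rmtree(backup_dir)  # keep only the latest full copy
        shutil.copytree(data_dir, backup_dir)
    finally:
        start_service()  # restart even if the copy failed


def cold_restore(data_dir: Path, backup_dir: Path,
                 stop_service: Callable[[], None],
                 start_service: Callable[[], None]) -> None:
    """Replace the working directory with the backup copy."""
    stop_service()
    try:
        if data_dir.exists():
            shutil.rmtree(data_dir)  # discard the damaged working copy
        shutil.copytree(backup_dir, data_dir)
    finally:
        start_service()
```

Everything written between `cold_backup` and `cold_restore` is lost in this sketch, which is exactly the data‑loss drawback listed above.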
Active/Standby (Dual‑Machine Hot Standby)
Hot standby differs from cold backup by allowing continuous service while backing up, though a failover still requires a brief outage. Shared‑disk approaches are excluded from this discussion.
Active/Standby Mode
This is a classic 1‑master‑1‑slave setup: the master serves traffic, the standby synchronizes data and can take over if the master fails. Synchronization can be software‑based (e.g., MySQL binlog replication, SQL Server transactional replication) or hardware‑based (disk mirroring). Software‑level is often called application‑level disaster recovery; hardware‑level is data‑level disaster recovery.
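The takeover logic of such a pair can be sketched as follows. This is a simplified illustration, not the behavior of any specific product: `Node`, `ActiveStandbyPair`, the `is_healthy` probe, and the `max_failures` threshold are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


class ActiveStandbyPair:
    """Serve from the master; promote the standby after repeated failed probes."""

    def __init__(self, master: Node, standby: Node, max_failures: int = 3):
        self.master = master
        self.standby = standby
        self.max_failures = max_failures
        self._failures = 0

    def check(self) -> Node:
        """Run one health probe and return the node that should serve traffic."""
        if self.master.is_healthy():
            self._failures = 0
            return self.master
        self._failures += 1
        if self._failures >= self.max_failures and self.standby.is_healthy():
            # Failover: swap roles so the old standby becomes the master.
            self.master, self.standby = self.standby, self.master
            self._failures = 0
        return self.master
```

Until the failure threshold is reached, `check` keeps returning the unhealthy master, which corresponds to the brief outage during failover. A real deployment would also need to confirm that the standby has applied all replicated changes and to keep the old master from accepting writes; both are omitted here.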
Dual‑Machine Mutual Backup
Essentially the same Active/Standby concept, but each machine acts as master for a different business, enabling read‑write separation and better resource utilization.
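A rough sketch of mutual backup routing, assuming a hypothetical mapping from business name to its primary machine and a set of machines currently known to be healthy:

```python
from typing import Dict, Set


def route(business: str, primaries: Dict[str, str], healthy: Set[str]) -> str:
    """Return the machine that should serve `business`."""
    primary = primaries[business]
    if primary in healthy:
        return primary
    # Fall back to another healthy machine, which holds the standby copy.
    for machine in sorted(set(primaries.values())):
        if machine != primary and machine in healthy:
            return machine
    raise RuntimeError(f"no healthy machine for {business}")
```

With `primaries = {"orders": "machine-a", "users": "machine-b"}`, both machines carry live traffic in normal operation, and either one can absorb the other's business when its peer drops out of `healthy`.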
Other HA options include MySQL master‑slave, master‑master, MHA; Redis master‑slave, Sentinel, Cluster, etc.
2. Same‑City Active‑Active
Same‑city active‑active extends the previous solutions across data centers within a city, protecting against an entire IDC failure (power outage, network cut). The architecture is similar to dual‑machine hot standby but with greater distance; latency remains low.
With proper code support, true active‑active (dual‑master with conflict resolution) is possible, though not all applications can handle it.
Many companies adopt a “two‑site‑three‑center” model: two active data centers in one city and a third, remote backup center for disaster recovery. Traffic is load‑balanced across the two active data centers, and data is synchronized via dedicated links. If one data center fails, traffic fails over to the other; if both fail, the remote center takes over, albeit with higher latency.
When a city experiences a large‑scale outage (e.g., earthquake), the remote center preserves data, but user experience degrades due to increased latency.
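The routing rule of the two‑site‑three‑center model can be sketched like this. The class name, the data center names, and the idea of passing in a `healthy` set are assumptions made for illustration; real systems would derive health from monitoring.

```python
from itertools import count
from typing import List, Set


class TwoSiteThreeCenterRouter:
    """Balance across the same-city centers; use the remote center as a last resort."""

    def __init__(self, active_centers: List[str], remote_center: str):
        self.active_centers = active_centers
        self.remote_center = remote_center
        self._counter = count()

    def pick(self, healthy: Set[str]) -> str:
        candidates = [c for c in self.active_centers if c in healthy]
        if candidates:
            # Round-robin across whichever same-city centers are still up.
            return candidates[next(self._counter) % len(candidates)]
        if self.remote_center in healthy:
            return self.remote_center  # higher latency, but service continues
        raise RuntimeError("all data centers are unavailable")
```

The remote center receives traffic only when no same‑city center is healthy, which matches its role as a disaster recovery site rather than a regular serving site.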
3. Cross‑City Active‑Active
Same‑city active‑active handles most disaster scenarios, but large‑scale events (regional power loss, natural disasters) still cause outages. Extending the same‑city architecture to cross‑city active‑active allows traffic to shift to a distant city, sacrificing some user experience for continuity.
In this setup, traffic is load‑balanced to both cities; each city’s servers connect only to local databases. Only when both local databases become unavailable does traffic fail over to the remote database cluster, incurring higher latency and potential throughput loss.
To mitigate conflicts, techniques such as distributed locks, distributed transactions, sharding, or eventual consistency are employed.
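As one illustration of the eventual‑consistency option, a last‑write‑wins merge lets two sites converge on the same value for each key. This is a common technique chosen here for the example, not necessarily what the systems discussed in this article use; `Versioned` and `merge` are hypothetical names, and the timestamps are assumed to be comparable across sites.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int  # time of the write, assumed comparable across sites
    site: str       # tie-breaker so both sites choose the same winner


def merge(local: Dict[str, Versioned],
          remote: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Keep, for every key, the write with the highest (timestamp, site)."""
    merged = dict(local)
    for key, theirs in remote.items():
        ours = merged.get(key)
        if ours is None or (theirs.timestamp, theirs.site) > (ours.timestamp, ours.site):
            merged[key] = theirs
    return merged
```

Because the comparison is deterministic, merging in either direction yields the same result. The cost is that the losing write is silently discarded, which is why stricter workloads need locks, transactions, or sharding instead.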
For applications with strict consistency requirements, a “Global Zone” solution can be used: writes are directed to a single master data center, while reads are served from slaves or bound to the master, all transparent to the business layer (source: “Ele.me Cross‑Region Multi‑Active Technical Implementation (Part 1) Overview”).
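The Global Zone routing rule can be sketched as a small data‑access helper. This is an assumption‑laden illustration rather than Ele.me's actual implementation: the class, the data center names, the `global_tables` set, and the `read_from_master` flag are all hypothetical.

```python
from typing import Set


class GlobalZoneRouter:
    """Send Global Zone writes to the master data center; serve reads locally."""

    def __init__(self, master_dc: str, local_dc: str, global_tables: Set[str]):
        self.master_dc = master_dc
        self.local_dc = local_dc
        self.global_tables = global_tables  # tables needing strict consistency

    def target(self, table: str, is_write: bool,
               read_from_master: bool = False) -> str:
        if table not in self.global_tables:
            return self.local_dc  # ordinary data stays in the local data center
        if is_write or read_from_master:
            return self.master_dc  # single writer keeps the data consistent
        return self.local_dc  # read from the local slave replica
```

Keeping this decision inside the data‑access layer is what makes the scheme transparent to business code: callers state the table and the operation, not the data center.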
Thus, cross‑city active‑active is a stepping stone toward full cross‑region multi‑active, which provides higher resilience but introduces data‑conflict and latency challenges.
4. Cross‑Region Multi‑Active
The design connects each node with four inbound/outbound links, ensuring that any single node failure does not affect the service. However, longer write paths increase latency and data‑conflict risk, reducing throughput. Solutions include distributed locks, retry mechanisms, or sharding to keep transactions local.
Alibaba’s “Global Zone” architecture isolates writes to a master zone to guarantee strong consistency, while reads are distributed across zones.
Many businesses, such as ride‑hailing, can shard by city, allowing each data center to operate independently with occasional synchronization for reporting.
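Sharding by city can be as simple as a lookup from city to data center. The mapping below is entirely hypothetical and only shows the shape of the idea.

```python
from typing import Dict

# Hypothetical assignment of cities to data centers.
CITY_TO_DATACENTER: Dict[str, str] = {
    "beijing": "dc-north",
    "tianjin": "dc-north",
    "shanghai": "dc-east",
}


def datacenter_for(city: str,
                   mapping: Dict[str, str] = CITY_TO_DATACENTER) -> str:
    """Return the data center that owns all reads and writes for `city`."""
    try:
        return mapping[city.lower()]
    except KeyError:
        raise ValueError(f"no data center assigned to city {city!r}") from None
```

Since a ride's data belongs to one city, every transaction stays inside a single data center and cross‑site write conflicts do not arise for that data.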
E‑commerce platforms, however, have complex inter‑dependencies. Taobao’s solution partitions by business unit, with a central unit handling the most complex scenarios and peripheral units being elastic and fault‑tolerant.
Implementing such architectures requires extensive code refactoring, distributed transaction handling, cache invalidation, and robust testing and operations pipelines.
In summary, cross‑region multi‑active demands strong foundational capabilities: reliable data transfer, data verification, and a simplified data‑access layer that manages writes and synchronization.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
