
Stateful Services and High‑Availability Solutions: From Cold Backup to Multi‑Region Active‑Active

This article examines stateful backend services and the high-availability strategies used to protect them: cold backup, active/standby hot backup, same-city active-active, cross-region active-active, and cross-region multi-active. It covers the advantages, limitations, and practical implementation considerations of each, with reference to real-world examples from major e-commerce platforms.

Architecture Digest

Stateful Services

Backend services fall into two categories: stateful and stateless. High availability is relatively simple for stateless applications, which can rely on load balancers or proxies. This article focuses on stateful services, which keep their state on disk or in memory in a data store (e.g., MySQL, Redis), or in the application's JVM memory, where it is typically short-lived.

High‑Availability Solutions

The evolution of high‑availability solutions includes:

Cold backup

Dual‑machine hot backup

Same‑city active‑active

Cross‑region active‑active

Cross‑region multi‑active

Understanding earlier solutions helps explain the design rationale of later architectures.

Cold Backup

Cold backup copies data while the database service is stopped, using ordinary file copy commands (e.g., cp on Linux). It is simple, quick to back up and restore, and returns the database to its exact state at the time the backup was taken. The drawbacks are that the service must be down during the copy, every change made after the backup is lost on restore, and each backup is a full copy, which costs storage and time.
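
The stop, copy, restart sequence can be sketched in a few lines of Python. This is a minimal illustration, not a production backup tool: the function names cold_backup and cold_restore are hypothetical, and the stop_service and start_service callables are placeholders for whatever actually stops and starts the database in a given environment.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


def cold_backup(data_dir: Path, backup_root: Path,
                stop_service: Callable[[], None],
                start_service: Callable[[], None]) -> Path:
    """Stop the service, copy the whole data directory, then restart it."""
    target = backup_root / f"backup-{time.strftime('%Y%m%d-%H%M%S')}"
    stop_service()                          # downtime starts here
    try:
        shutil.copytree(data_dir, target)   # full copy, comparable to `cp -r`
    finally:
        start_service()                     # restart even if the copy failed
    return target


def cold_restore(backup_dir: Path, data_dir: Path,
                 stop_service: Callable[[], None],
                 start_service: Callable[[], None]) -> None:
    """Replace the data directory with a backup; later changes are discarded."""
    stop_service()
    try:
        shutil.rmtree(data_dir, ignore_errors=True)
        shutil.copytree(backup_dir, data_dir)
    finally:
        start_service()
```

Both functions keep the service stopped for the full duration of the copy, which is exactly the downtime cost described above, and a restore discards anything written after the backup was taken.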

Dual‑Machine Hot Backup

Hot backup allows simultaneous backup and service provision, though restoration still requires downtime. Two main modes are discussed:

Active/Standby

One primary node serves traffic while a secondary node acts as a backup, synchronizing data via software (e.g., MySQL master/slave binlog) or hardware (disk mirroring). This is essentially application‑level disaster recovery.
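
A toy model of the software-based variant may help. The sketch below assumes an ordered change log that the standby replays, loosely analogous to binlog replication. It is not MySQL's actual protocol, and the class names Node and ActiveStandbyPair are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered change log, a stand-in for a binlog

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value


class ActiveStandbyPair:
    def __init__(self, primary: Node, standby: Node):
        self.primary = primary
        self.standby: Optional[Node] = standby
        self.replicated = 0  # primary log entries already replayed on the standby

    def write(self, key, value):
        self.primary.write(key, value)       # only the primary serves traffic

    def sync(self):
        """Replay primary log entries the standby has not seen yet."""
        if self.standby is None:
            return
        for key, value in self.primary.log[self.replicated:]:
            self.standby.write(key, value)
        self.replicated = len(self.primary.log)

    def failover(self):
        """Promote the standby; writes that were never synced are lost."""
        if self.standby is None:
            raise RuntimeError("no standby available")
        self.primary, self.standby = self.standby, None
        self.replicated = 0
```

If the primary fails between two sync calls, the promoted standby is missing the most recent writes. That gap is the replication lag that any asynchronous active/standby setup has to account for.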

Dual‑Machine Mutual Backup

Both machines are in service at once: each acts as the primary for a different service while backing up the other. This enables read-write separation and makes better use of hardware than an idle standby.
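
As a rough sketch of mutual backup, the placement logic below assumes two machines and two services. The service names (orders, users) and machine names are hypothetical, and a real deployment would rely on a cluster manager rather than a static table.

```python
# Each service has a primary machine and falls back to the other machine.
PRIMARY = {"orders": "machine-a", "users": "machine-b"}
STANDBY = {"orders": "machine-b", "users": "machine-a"}


def placement(healthy: set) -> dict:
    """Return which machine should currently run each service (None if neither can)."""
    result = {}
    for service, primary in PRIMARY.items():
        if primary in healthy:
            result[service] = primary
        elif STANDBY[service] in healthy:
            result[service] = STANDBY[service]   # survivor takes over this service too
        else:
            result[service] = None
    return result
```

With both machines healthy, each one carries one service. When one fails, the survivor carries both, so it needs enough headroom for the combined load.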

Same‑City Active‑Active

This extends the dual-machine solutions to two data centers (IDCs) in the same city, so that service can fail over when an entire IDC goes down. Both sites may run active masters that accept reads and writes, and because the sites are close together, network latency between them stays low.

The "two-region three-center" model places two data centers in the same city and a third, disaster-recovery center in a distant city. Load balancers route traffic to the two same-city sites, which synchronize data with each other and replicate to the remote site. If either same-city site fails, traffic fails over to the other; if both fail, the remote site takes over, albeit with higher latency.
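
The failover order in this model can be written down as a small routing function. This is a sketch under simplifying assumptions: health is a plain boolean flag, the DataCenter type and the site names used with it are hypothetical, and real load balancers would also weigh capacity and data freshness.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    city: str
    role: str            # "primary" for the same-city sites, "dr" for the remote site
    healthy: bool = True


def route(centers: list) -> list:
    """Return the data centers that should receive traffic right now."""
    primaries = [c for c in centers if c.role == "primary" and c.healthy]
    if primaries:
        return primaries                 # normal case, or one same-city site is down
    # Both same-city sites are down: fall back to the remote disaster-recovery site.
    return [c for c in centers if c.role == "dr" and c.healthy]
```

The remote site receives traffic only when no same-city site is healthy, which matches the higher-latency last resort described above.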

Cross‑Region Active‑Active

When large‑scale disasters occur, cross‑region active‑active can keep services running by routing traffic to a secondary city after failover, though user experience may degrade due to latency and reduced throughput.

Cross‑Region Multi‑Active

Multi‑active expands active‑active by connecting multiple regions in a mesh, aiming for no single point of failure. However, increased write latency and data‑conflict risk require solutions such as distributed locks, sharding, or a "Global Zone" where writes are directed to a single master region while reads are served locally.
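
The "Global Zone" idea can be illustrated with a toy in-memory store: writes always land in one master region, and reads are served from the caller's local copy. The class name GlobalZone, the explicit replicate step, and the region names used with it are assumptions made for this sketch, not details taken from any particular platform.

```python
class GlobalZone:
    def __init__(self, regions, master):
        if master not in regions:
            raise ValueError("master must be one of the regions")
        self.master = master
        self.stores = {region: {} for region in regions}  # one local copy per region
        self.pending = []                                 # writes not yet shipped out

    def write(self, client_region, key, value):
        """Every write goes to the master region, wherever the client is."""
        self.stores[self.master][key] = value
        self.pending.append((key, value))
        return self.master                                # region that handled the write

    def read(self, client_region, key):
        """Reads are served from the client's own region and may be stale."""
        return self.stores[client_region].get(key)

    def replicate(self):
        """Ship pending writes, in order, from the master to every other region."""
        for key, value in self.pending:
            for region, store in self.stores.items():
                if region != self.master:
                    store[key] = value
        self.pending.clear()
```

Because only one region accepts writes, conflicting concurrent updates cannot occur. The cost is that remote writers pay cross-region latency, and local reads can lag behind until replication catches up.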

Practical implementations often shard data by a geographic key (e.g., province or city) and use a star topology to reduce synchronization overhead: every region synchronizes only with a central node, which therefore has higher reliability requirements.
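
A short sketch shows both ideas: routing by a geographic shard key, and why a star needs fewer synchronization links than a full mesh. The shard map, the region names, and the fallback to the hub for unknown keys are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Hypothetical shard map: province -> home region.
SHARD_MAP = {
    "province-a": "region-1",
    "province-b": "region-1",
    "province-c": "region-2",
    "province-d": "region-3",
}
HUB = "region-1"   # central node of the star (assumed)


def home_region(province: str) -> str:
    """Route a request to the region that owns this province's data."""
    return SHARD_MAP.get(province, HUB)   # unknown keys fall back to the hub (assumption)


def star_links(regions, hub):
    """Synchronization links in a star: each region talks only to the hub."""
    return [(hub, r) for r in regions if r != hub]


def mesh_links(regions):
    """Synchronization links in a full mesh: every pair of regions is connected."""
    return list(combinations(regions, 2))


def sync_path(source, target, hub):
    """Path a change takes from source to target in the star."""
    if source == target:
        return [source]
    if hub in (source, target):
        return [source, target]
    return [source, hub, target]          # spoke-to-spoke changes relay through the hub
```

For n regions, the star needs n - 1 links where the mesh needs n(n - 1)/2. The trade-off is that every spoke-to-spoke change passes through the hub, which is why the central node carries the higher reliability requirement.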

Considerations and Thought Questions

The article ends with reflective questions about sharding strategies, which business modules can adopt multi‑active, and whether all services need multi‑active deployment.

References

Ele.me cross‑region multi‑active technical implementation (Part 1)

Ele.me framework tools blog

Alibaba architecture evolution for cross‑region multi‑active

Alibaba Cloud database cross‑region solution

Why cross‑region multi‑active isn’t that hard
