Stateful Services and High‑Availability Solutions: From Cold Backup to Multi‑Region Active‑Active
This article examines stateful backend services and the main high-availability strategies for them: cold backup, active/standby hot backup, same-city active-active, cross-region active-active, and cross-region multi-active. It covers the advantages, limitations, and practical implementation considerations of each, with real-world examples from major e-commerce platforms.
Stateful Services
Backend services fall into two categories: stateful and stateless. High availability is relatively simple for stateless applications, which can rely on load balancers or proxies. This article focuses on stateful services, whose state is held on disk or in memory by a data store (e.g., MySQL, Redis) or in JVM memory; state held in JVM memory typically has a short lifecycle.
High‑Availability Solutions
The evolution of high‑availability solutions includes:
Cold backup
Dual‑machine hot backup
Same‑city active‑active
Cross‑region active‑active
Cross‑region multi‑active
Understanding earlier solutions helps explain the design rationale of later architectures.
Cold Backup
Cold backup copies data by stopping the database service and copying its files (e.g., with cp on Linux). It is simple, fast to back up and restore because both are plain file copies, and can restore the database to the point in time at which any given backup was taken. Its drawbacks are that the service must be down during the copy, that data written between the last backup and the restore is lost, and that every backup is a full-volume copy, which wastes storage and time as the data grows.
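The cold-backup procedure can be sketched roughly as follows. This is a minimal illustration, not a production tool: the function names cold_backup and cold_restore and the stop_service/start_service callbacks are hypothetical stand-ins for whatever stops and starts the real database.

```python
import shutil
import time
from pathlib import Path


def cold_backup(data_dir: Path, backup_root: Path, stop_service, start_service) -> Path:
    """Stop the service, copy the whole data directory, then restart it."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_root / f"backup-{stamp}"
    stop_service()  # downtime begins: nothing may write during the copy
    try:
        shutil.copytree(data_dir, target)  # full-volume copy, comparable to `cp -r`
    finally:
        start_service()  # downtime ends even if the copy fails
    return target


def cold_restore(backup_dir: Path, data_dir: Path, stop_service, start_service) -> None:
    """Replace the data directory with a backup; later writes are discarded."""
    stop_service()
    try:
        shutil.rmtree(data_dir)
        shutil.copytree(backup_dir, data_dir)
    finally:
        start_service()
```

Both functions bracket the copy with a stop and a start, which is where the required downtime comes from, and the restore discards everything written after the backup was taken.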
Dual‑Machine Hot Backup
Hot backup, unlike cold backup, takes backups while the service keeps running, though restoration still requires downtime. Two main modes are discussed:
Active/Standby
One primary node serves traffic while a secondary node acts as a backup, synchronizing data via software (e.g., MySQL master/slave binlog) or hardware (disk mirroring). This is essentially application‑level disaster recovery.
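A toy model of log-based active/standby replication, under stated assumptions: the Node class, its in-memory log (standing in for something like a binlog), and the replicate function are all hypothetical and do not reflect MySQL's actual replication protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered change log, standing in for a binlog
    applied: int = 0  # number of the peer's log entries already replayed here

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value


def replicate(primary, standby):
    """Replay primary log entries the standby has not applied yet."""
    pending = primary.log[standby.applied:]
    for key, value in pending:
        standby.write(key, value)
    standby.applied = len(primary.log)
    return len(pending)


primary = Node("primary")
standby = Node("standby")
primary.write("order:1", "paid")
replicate(primary, standby)
primary.write("order:2", "paid")  # the primary fails before this entry is shipped

# Failover: the standby is promoted and now serves traffic.
primary, standby = standby, primary
```

After the failover the promoted node holds order:1 but not order:2, which illustrates why asynchronous synchronization can lose the most recent writes.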
Dual‑Machine Mutual Backup
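Each machine acts as the primary for one service and as the backup for the service hosted on the other machine. Both machines therefore carry live traffic, which enables read-write separation and better resource utilization.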
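One way to picture mutual backup is as a placement table with a fallback. The service names, node names, and the placement function below are hypothetical and chosen only for illustration.

```python
# Hypothetical layout: each node is primary for one service and standby for the other.
PRIMARY = {"orders": "node-a", "inventory": "node-b"}
STANDBY = {"orders": "node-b", "inventory": "node-a"}


def placement(healthy):
    """Return {service: node} given the set of healthy node names."""
    result = {}
    for service, node in PRIMARY.items():
        if node in healthy:
            result[service] = node  # normal case: the service runs on its own primary
        elif STANDBY[service] in healthy:
            result[service] = STANDBY[service]  # the peer takes over the failed node's service
        else:
            raise RuntimeError(f"no healthy node for {service}")
    return result
```

With both nodes healthy, each serves its own service; if node-a fails, placement({"node-b"}) puts both services on node-b.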
Same‑City Active‑Active
This extends the dual-machine approach across two data centers (IDCs) in the same city, so that service continues when an entire IDC fails. Both sites may run active masters that accept reads and writes, and because the sites are close together, network latency between them remains low.
The "two-region three-center" model combines two data centers in one city with a third, disaster-recovery site in a distant city. Load balancers route traffic to the two same-city data centers, which synchronize data with each other and replicate it to the remote site. If either same-city site fails, traffic fails over to the other; if both fail, the remote site takes over, albeit with higher latency.
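The failover order in the two-region three-center model can be sketched as a routing function. The site names and the route function are hypothetical, and a real load balancer would use health checks rather than a set passed in by the caller.

```python
# Hypothetical site list: two same-city primaries plus one remote disaster-recovery site.
SITES = [
    ("city-a-idc-1", "primary"),
    ("city-a-idc-2", "primary"),
    ("city-b-dr", "disaster_recovery"),
]


def route(healthy, request_id):
    """Pick a site for a request, preferring healthy same-city primaries."""
    primaries = [name for name, role in SITES if role == "primary" and name in healthy]
    if primaries:
        # Spread requests across whichever primaries are still up.
        return primaries[request_id % len(primaries)]
    remote = [name for name, role in SITES if role == "disaster_recovery" and name in healthy]
    if remote:
        return remote[0]  # last resort: the distant site, at higher latency
    raise RuntimeError("no healthy site available")
```

The remote site receives traffic only when both same-city sites are unavailable, matching the failover order described above.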
Cross‑Region Active‑Active
When a large-scale disaster takes out a whole city, cross-region active-active keeps services running by failing traffic over to a data center in a second city, though user experience may degrade because of higher latency and reduced throughput.
Cross‑Region Multi‑Active
Multi-active extends active-active by connecting multiple regions in a mesh, so that no region is a single point of failure. The cost is higher write latency and a risk of conflicting writes, which call for mitigations such as distributed locks, sharding, or a "Global Zone" in which all writes are directed to a single master region while reads are served locally.
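The "Global Zone" idea of writing to one master region and reading locally might look like the sketch below. The GlobalZone class, its method names, and the region names are assumptions made for illustration, and the explicit sync step stands in for asynchronous replication.

```python
class GlobalZone:
    """Toy model: all writes go to one master region, reads are served locally."""

    def __init__(self, master, regions):
        self.master = master
        self.stores = {region: {} for region in regions}

    def write(self, origin_region, key, value):
        # The origin region is ignored on purpose: every write lands on the
        # master, so two regions can never accept conflicting writes.
        self.stores[self.master][key] = value
        return self.master

    def sync(self):
        # Stand-in for asynchronous replication from the master to the other regions.
        for region, store in self.stores.items():
            if region != self.master:
                store.update(self.stores[self.master])

    def read(self, region, key):
        # Local read: may be stale until the next sync.
        return self.stores[region].get(key)


zone = GlobalZone("region-a", ["region-a", "region-b"])
zone.write("region-b", "price:sku1", 100)
stale = zone.read("region-b", "price:sku1")  # not yet replicated to region-b
zone.sync()
fresh = zone.read("region-b", "price:sku1")
```

The trade-off is visible in the example: a single write path avoids conflicts, but a region other than the master pays cross-region latency on writes and can read stale data until replication catches up.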
Practical implementations often shard data by a geographic key (e.g., province or city) and synchronize through a star topology rather than a full mesh to reduce synchronization overhead. The central node of the star then carries higher reliability requirements.
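A small sketch of geographic sharding and of why a star needs fewer synchronization links than a mesh. The province-to-region map, the region names, and the helper functions are hypothetical examples, not a mapping taken from any real deployment.

```python
from itertools import combinations

# Hypothetical shard map: province -> home region that owns writes for that province.
HOME_REGION = {
    "zhejiang": "east",
    "jiangsu": "east",
    "guangdong": "south",
    "beijing": "north",
}


def home_region(province):
    """Route a request to the region that owns its geographic shard."""
    return HOME_REGION[province]


def star_links(regions, hub):
    """Star topology: every region synchronizes only with the central hub."""
    return [(hub, region) for region in regions if region != hub]


def mesh_links(regions):
    """Full mesh: every pair of regions synchronizes directly."""
    return list(combinations(regions, 2))


regions = ["center", "east", "south", "north"]
star_count = len(star_links(regions, "center"))
mesh_count = len(mesh_links(regions))
```

For n regions a star needs n - 1 links while a full mesh needs n(n - 1)/2, so with four regions the counts are 3 and 6. The price is that every change passes through the hub, which is why the central node needs higher reliability.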
Considerations and Thought Questions
The article ends with reflective questions about sharding strategies, which business modules can adopt multi‑active, and whether all services need multi‑active deployment.
References
Ele.me cross‑region multi‑active technical implementation (Part 1)
Ele.me framework tools blog
Alibaba architecture evolution for cross‑region multi‑active
Alibaba Cloud database cross‑region solution
Why cross‑region multi‑active isn’t that hard
Architecture Digest