
Avoid Common Pitfalls in Geo-Active High-Availability Design

This article examines common misconceptions in designing geo-distributed active-active systems, explains why perfect real-time data synchronization is unrealistic, and offers practical strategies for reliable high availability: prioritizing core services, reducing the distance between data centers, limiting the data that must be replicated, and combining storage sync with messaging.

Alibaba Cloud Developer

1. Introduction

After taking part in designing a high‑availability solution for Alibaba Games and sharing the article "Business‑Oriented Three‑Dimensional High‑Availability Architecture Design", I found that many engineers are fascinated by "geo‑active" designs because they are essential for large‑scale internet services, yet find them extremely difficult because network, data, and transaction problems are intertwined.

Most of these difficulties stem from the pursuit of a perfect geo‑active solution, which creates many mental traps that lead to dead ends.

2. All Business Geo‑Active

Trying to make every business component geo‑active often results in unsolvable conflicts. For example, a user registration service that must guarantee unique phone numbers cannot simply redirect users to another data center when the primary center is down, because the secondary center cannot verify uniqueness without synchronized data.

Changing core business rules (e.g., allowing duplicate phone numbers) is usually unacceptable due to the high cost of redesign.

Similarly, user‑information updates can conflict when both centers modify the same record concurrently. Using timestamps to decide which write wins requires closely synchronized clocks, which cannot be guaranteed across distant data centers.
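The timestamp problem can be shown with a minimal sketch. The `Write` class, the `last_write_wins` function, and the 300 ms skew are hypothetical values chosen for illustration, not part of the original design.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    wall_clock_ms: int  # timestamp stamped by the local data center's clock


def last_write_wins(a: Write, b: Write) -> Write:
    """Resolve a conflict by keeping the write with the larger timestamp."""
    return a if a.wall_clock_ms >= b.wall_clock_ms else b


# Hypothetical scenario: center B's clock runs 300 ms behind center A's.
SKEW_B_MS = -300

true_time_a = 1_000  # the user sets the nickname to "old" via center A
true_time_b = 1_100  # 100 ms later, the user sets it to "new" via center B

write_a = Write("old", true_time_a)
write_b = Write("new", true_time_b + SKEW_B_MS)  # stamped as 800 by center B

winner = last_write_wins(write_a, write_b)
print(winner.value)  # the earlier write is kept because of the skew
```

The later update is silently discarded even though each center behaved correctly on its own, which is why timestamp comparison alone is a fragile basis for resolving conflicts.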

The practical answer is to prioritize core business for geo‑active design. In a typical user subsystem (registration, login, user info), login is the most critical service and should be made geo‑active, while registration and user‑info can tolerate occasional failures.
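One way to express this prioritization is a per-service failover policy. The sketch below is only an illustration under assumed names: `Tier`, `SERVICE_TIERS`, `route`, and the center labels are hypothetical.

```python
from enum import Enum
from typing import Optional, Set


class Tier(Enum):
    GEO_ACTIVE = "geo_active"    # must keep working from any center
    SINGLE_HOME = "single_home"  # served only by its home center


# Hypothetical classification following the user-subsystem example above.
SERVICE_TIERS = {
    "login": Tier.GEO_ACTIVE,
    "registration": Tier.SINGLE_HOME,
    "user_info": Tier.SINGLE_HOME,
}


def route(service: str, home_center: str, healthy: Set[str]) -> Optional[str]:
    """Return the center that should serve the request, or None to fail fast."""
    if home_center in healthy:
        return home_center
    if SERVICE_TIERS[service] is Tier.GEO_ACTIVE and healthy:
        # Any surviving center may serve a geo-active service.
        return sorted(healthy)[0]
    return None  # degrade: the caller asks the user to retry later


print(route("login", "A", {"B"}))         # served by the surviving center
print(route("registration", "A", {"B"}))  # rejected until center A recovers
```

With this split, an outage of the home center costs only registration and profile edits, while login continues from the other center.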

3. Real‑Time Consistency

Data replication is the core of geo‑active design, but aiming for real‑time synchronization across regions is physically impossible due to propagation delays, network outages, and other uncontrollable factors.
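The propagation-delay limit can be estimated with simple arithmetic. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s (about two thirds of its vacuum speed). The two distances are hypothetical examples, and the result is only a lower bound because real routes add detours, switching, and queuing.

```python
# Approximate speed of light in optical fiber (an assumption of this sketch).
FIBER_SPEED_KM_PER_S = 200_000


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on the round-trip time over a fiber path of this length."""
    one_way_s = distance_km / FIBER_SPEED_KM_PER_S
    return 2 * one_way_s * 1_000


# Hypothetical distances for a same-city pair and a cross-region pair.
for label, km in [("same city", 50), ("cross region", 2_000)]:
    print(f"{label}: at least {min_round_trip_ms(km):.1f} ms per round trip")
```

Even in the best case, every synchronous cross-region write pays tens of milliseconds, and no amount of engineering removes that floor.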

To mitigate the impact, three approaches are recommended:

Reduce the distance between data centers (e.g., same‑city multi‑center).

Minimize the amount of data that needs to be synchronized.

Accept eventual consistency instead of real‑time consistency.

For example, login tokens or session data can be kept locally and regenerated if a user switches centers, while core account data can be synchronized using more reliable mechanisms.

4. Relying Solely on Storage Sync

Although most storage systems provide built‑in replication (MySQL master‑slave, Redis Cluster, Elasticsearch), depending only on these mechanisms can be a trap, especially under extreme conditions where replication lag becomes significant or full‑sync operations block service.

Therefore, combine storage sync with other techniques such as message queues, secondary reads, back‑source reads, or even regenerating data on demand.
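Running two channels in parallel is safe only if applying the same record twice is harmless. The following is a minimal sketch under the assumption that records are insert-only (for example, newly created accounts). `AccountStore`, the account ID, and the phone value are hypothetical stand-ins, not a real storage API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AccountStore:
    """In-memory stand-in for one data center's account table."""
    rows: Dict[str, dict] = field(default_factory=dict)

    def apply(self, account_id: str, row: dict) -> bool:
        """Insert the row if absent; return True when something was written.

        Because rows are never updated in this sketch, a duplicate delivery
        is simply ignored, so several channels may deliver the same record.
        """
        if account_id in self.rows:
            return False
        self.rows[account_id] = row
        return True


remote = AccountStore()
new_account = ("u1001", {"phone": "555-0100"})

# Channel 1: the message queue usually delivers first.
first = remote.apply(*new_account)
# Channel 2: storage replication delivers the same row later.
second = remote.apply(*new_account)

print(first, second, len(remote.rows))
```

If either channel stalls, the other still delivers the record, and the idempotent apply keeps the result identical regardless of arrival order.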

For the user subsystem, possible synchronization methods include:

Message‑queue propagation for account data (which is append‑only).

Secondary read: if local data is missing, fetch it from the remote center.

Native storage replication for low‑frequency data like passwords.

Back‑source read for large session data.

Regenerate session data when it cannot be retrieved from either center.
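The read-side methods in this list can be chained into a single lookup. The sketch below uses in-memory stand-ins. `Center`, `load_session`, and the session fields are hypothetical, and in a real system regeneration would typically mean asking the user to log in again.

```python
import secrets
from typing import Dict, Optional


class Center:
    """In-memory stand-in for one data center's session store."""

    def __init__(self, name: str, reachable: bool = True) -> None:
        self.name = name
        self.reachable = reachable
        self.sessions: Dict[str, dict] = {}

    def get(self, user_id: str) -> Optional[dict]:
        if not self.reachable:
            raise ConnectionError(f"{self.name} unreachable")
        return self.sessions.get(user_id)


def load_session(user_id: str, local: Center, remote: Center) -> dict:
    """Try the local store, then the remote center, then regenerate."""
    session = local.get(user_id)
    if session is not None:
        return session
    try:
        session = remote.get(user_id)  # secondary / back-source read
    except ConnectionError:
        session = None
    if session is None:
        # Last resort: create a fresh session instead of failing the request.
        session = {
            "user_id": user_id,
            "token": secrets.token_hex(16),
            "regenerated": True,
        }
    local.sessions[user_id] = session  # keep a local copy for later reads
    return session


center_a = Center("A")
center_b = Center("B", reachable=False)
print(load_session("u1001", center_a, center_b)["regenerated"])
```

The caller always gets a usable session. The cost of a total miss is a regenerated session rather than an error page.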

[Figure: overall synchronization architecture for the user subsystem]

5. 100% Availability Myth

Expecting 100% availability in geo‑active systems is unrealistic; physical limits and network failures inevitably cause some data loss or service interruption.

Instead of chasing perfection, accept a small amount of loss and design compensating measures such as announcements, user compensation, or supplemental notifications.
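To make "a small amount of loss" concrete, an availability target can be converted into a yearly downtime budget. The targets below are illustrative examples only and are not figures from the original design.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime allowed by an availability target given as a fraction."""
    return (1 - availability) * MINUTES_PER_YEAR


# Illustrative targets: three, four, and five nines.
for target in (0.999, 0.9999, 0.99999):
    budget = downtime_minutes_per_year(target)
    print(f"{target:.3%} availability allows about {budget:.1f} min/year")
```

Each additional nine shrinks the budget tenfold but never to zero, so the compensating measures above are needed at any realistic target.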

6. One‑Sentence Summary

Use multiple techniques to ensure that the vast majority of users experience core‑service geo‑active availability.
