Designing Geo‑Distributed Active‑Active Systems Without Breaking Consistency
This article explains the concepts, costs, architectural patterns, and practical design techniques for building multi‑active systems across different geographic locations while managing latency, data consistency, and business continuity.
Geographic distribution means different physical locations, and active‑active means each location can provide business services.
Judgment criteria:
Under normal conditions, users receive correct services regardless of which site they access.
If one site fails, users can still obtain correct services from another healthy site.
Costs of geo‑distributed active‑active:
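System complexity changes qualitatively: the design must now handle cross‑site data synchronization, latency, and failover.
Costs increase significantly, because every site needs enough capacity to serve users on its own.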
Architecture Patterns
1. Same‑city Different District
Deploy two data centers in different districts of the same city, connected by a dedicated network.
The distance is usually only tens of kilometers, so network latency between the two sites is close to that within a single data center. The two sites can largely be treated as one logical data center, which keeps complexity and cost low.
This pattern cannot handle extreme disasters such as city‑wide earthquakes or floods; it is intended for routine failures like fire, power outage, or air‑conditioning failure.
2. Cross‑city Different Region
Deploy data centers in different cities, e.g., Beijing and Guangzhou.
This pattern addresses extreme disasters such as city‑wide earthquakes or large‑scale power outages.
The main issue is network latency: a typical RTT between Beijing and Guangzhou is about 50 ms, which can rise to 500 ms or even 1 s under network instability, and packet loss may occur.
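Part of this latency is a hard floor set by distance. As a rough back‑of‑the‑envelope sketch, the lower bound on RTT can be estimated from the signal speed in optical fiber. The constants below are approximations chosen for illustration (about 200,000 km/s in fiber and about 1,900 km straight‑line distance between the two cities); real routes are longer and add switching and queuing delay, which is why observed RTT is well above this bound.

```python
# Rough lower bound on round-trip time imposed by distance alone.
# Both constants are approximations chosen for illustration.
SPEED_IN_FIBER_KM_PER_S = 200_000   # roughly two thirds of the speed of light in vacuum
BEIJING_GUANGZHOU_KM = 1_900        # approximate straight-line distance (assumption)


def min_rtt_ms(distance_km: float) -> float:
    """Return the theoretical minimum RTT in milliseconds for a fiber path."""
    one_way_s = distance_km / SPEED_IN_FIBER_KM_PER_S
    return 2 * one_way_s * 1000


print(f"{min_rtt_ms(BEIJING_GUANGZHOU_KM):.1f} ms")  # 19.0 ms
print(f"{min_rtt_ms(30):.2f} ms")                    # 0.30 ms for a 30 km same-city link
```

The same function shows why the same‑city pattern is so much simpler: at tens of kilometers the distance‑imposed delay is well under a millisecond.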
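Because replication over this distance cannot be instantaneous, the sites will hold different data for some window of time. For data that requires strong consistency (e.g., account balances), cross‑city active‑active is therefore impractical.
Example: a user transfers money in Guangzhou while the link to Beijing is broken. Beijing still shows the old balance, so the two sites diverge, and a second transfer submitted through Beijing could spend the same money again.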
Therefore, only data with weak consistency requirements should use cross‑city active‑active, while strongly consistent data should stay in a same‑city architecture.
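The divergence in the transfer example can be reproduced with a toy simulation. This is only an illustrative sketch: the `Site` class, the account names, and the amounts are hypothetical, and each site is modeled as an in‑memory dictionary that validates transfers against its local copy only.

```python
class Site:
    """A data center holding its own copy of the account balances."""

    def __init__(self, name, balances):
        self.name = name
        self.balances = dict(balances)  # independent local copy

    def transfer(self, src, dst, amount):
        # Validation uses only the local copy; the remote site is not consulted.
        if self.balances[src] < amount:
            raise ValueError("insufficient funds")
        self.balances[src] -= amount
        self.balances[dst] += amount


initial = {"alice": 100, "bob": 0, "carol": 0}
guangzhou = Site("guangzhou", initial)
beijing = Site("beijing", initial)

# The inter-site link is down: each site accepts a transfer the other never sees.
guangzhou.transfer("alice", "bob", 100)
beijing.transfer("alice", "carol", 100)

total_debited = (initial["alice"] - guangzhou.balances["alice"]) + (
    initial["alice"] - beijing.balances["alice"]
)
print(guangzhou.balances)  # {'alice': 0, 'bob': 100, 'carol': 0}
print(beijing.balances)    # {'alice': 0, 'bob': 0, 'carol': 100}
print(total_debited)       # 200, although the account only ever held 100
```

Both transfers pass local validation, yet together they debit twice the available balance, which is exactly the outcome strongly consistent data cannot tolerate.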
Typical scenarios with low consistency requirements:
User login – re‑login can resolve inconsistencies.
News sites – content changes infrequently, so a short delay before a new article appears at every site is acceptable.
Micro‑blog platforms – occasional loss of posts or comments is acceptable.
3. Cross‑country Different Region
Deploy data centers in different countries.
Latency is higher still and can reach several seconds under poor network conditions, making this pattern unsuitable for latency‑sensitive services.
Cross‑country active‑active is appropriate for:
Providing services to users in different regions (e.g., Amazon US vs. Amazon China).
Read‑only workloads (e.g., Google Search results are largely the same worldwide, and a few seconds of delay does not affect user experience).
Cross‑city Active‑Active Design Tips
1. Prioritize Core Business for Multi‑Active
Misconception: all services must be multi‑active.
Example: a user system offers registration, login, and profile services. Making registration multi‑active is hard: if a user registers in center A and A fails before the data has synced to center B, the user may register again in B, and the two accounts conflict once A recovers.
Thus, focus on making the truly core service (e.g., login, which handles millions of daily requests) multi‑active, while less critical services may tolerate downtime.
2. Ensure Eventual Consistency for Core Data
Misconception: all data must be synchronized in real time.
Physical limits make real‑time global sync impossible; therefore, synchronize only the data essential to core business.
For example, session data such as login tokens is produced in large volume and does not need to be synchronized; if it is lost, the user simply logs in again, which is acceptable.
3. Use Multiple Mechanisms to Sync Data
Misconception: rely solely on the storage system’s sync features.
Storage systems such as MySQL and Redis provide mature built‑in replication, but it is typically asynchronous and can lag or stall across cities, so some scenarios need additional techniques:
Message queues – propagate newly created accounts to other centers.
Secondary reads – if a local read fails, route the request to another center.
Origin fetch – if a session is missing, another center can request it from the original center using the session ID.
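The three mechanisms above can be sketched together in a few dozen lines. This is a minimal in‑memory illustration, not a production design: the `DataCenter` class and its method names are hypothetical, a `deque` stands in for a real message queue, and the session ID is assumed to embed the name of the center that issued it so that another center knows where to fetch it from.

```python
from collections import deque


class DataCenter:
    def __init__(self, name):
        self.name = name
        self.accounts = {}
        self.sessions = {}
        self.peers = {}        # peer name -> DataCenter
        self.outbox = deque()  # stand-in for a message queue

    def connect(self, other):
        self.peers[other.name] = other
        other.peers[self.name] = self

    # Message queue: publish newly created accounts for the other centers.
    def register(self, user_id, profile):
        self.accounts[user_id] = profile
        self.outbox.append((user_id, profile))

    def flush_outbox(self):
        while self.outbox:
            user_id, profile = self.outbox.popleft()
            for peer in self.peers.values():
                peer.accounts.setdefault(user_id, profile)

    # Secondary read: on a local miss, ask the other centers.
    def get_account(self, user_id):
        if user_id in self.accounts:
            return self.accounts[user_id]
        for peer in self.peers.values():
            if user_id in peer.accounts:
                return peer.accounts[user_id]
        return None

    # Origin fetch: the session ID names the center that created it.
    def login(self, user_id):
        session_id = f"{self.name}:{user_id}"
        self.sessions[session_id] = {"user": user_id}
        return session_id

    def get_session(self, session_id):
        if session_id in self.sessions:
            return self.sessions[session_id]
        origin = self.peers.get(session_id.split(":", 1)[0])
        return origin.sessions.get(session_id) if origin else None


a, b = DataCenter("A"), DataCenter("B")
a.connect(b)
a.register("u1", {"nick": "demo"})

print("u1" in b.accounts)   # False: the queue has not been drained yet
print(b.get_account("u1"))  # {'nick': 'demo'} via secondary read
a.flush_outbox()
print("u1" in b.accounts)   # True: the message was delivered

sid = a.login("u1")
print(b.get_session(sid))   # {'user': 'u1'} fetched from origin center A
```

Secondary reads and origin fetch only help while the other center is reachable; they cover replication lag, not the loss of a whole site.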
4. Aim for High Availability, Not 100% Uptime
Misconception: the system must be 100% available.
Physical laws prevent absolute availability. For example, real‑time cross‑city transfers can cause double‑spending if both centers process the same transaction during a network partition.
Solution: introduce a “transfer request” workflow where the operation is asynchronous and can be retried after the failed center recovers.
This introduces an extra step for users but preserves data integrity.
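One way to picture the transfer request workflow is as a queue of pending requests that only a single owning center may apply. The sketch below is a hypothetical illustration under that assumption: `TransferRequest`, `HomeCenter`, and `process_pending` are invented names, and the status strings are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class TransferRequest:
    src: str
    dst: str
    amount: int
    status: str = "PENDING"


class HomeCenter:
    """The single center that owns balance writes (an assumption of this sketch)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.available = True

    def apply(self, req):
        if not self.available:
            raise ConnectionError("home center unreachable")
        if self.balances[req.src] < req.amount:
            req.status = "REJECTED"
            return
        self.balances[req.src] -= req.amount
        self.balances[req.dst] += req.amount
        req.status = "DONE"


def process_pending(requests, home):
    """Apply pending requests in order; stop and keep them pending if home is down."""
    for req in requests:
        if req.status != "PENDING":
            continue  # already settled, so a retry never applies it twice
        try:
            home.apply(req)
        except ConnectionError:
            break


home = HomeCenter({"alice": 100, "bob": 0, "carol": 0})
requests = [
    TransferRequest("alice", "bob", 100),
    TransferRequest("alice", "carol", 100),
]

home.available = False           # the owning center is down
process_pending(requests, home)
print([r.status for r in requests])  # ['PENDING', 'PENDING']

home.available = True            # the center recovers; retry
process_pending(requests, home)
print([r.status for r in requests])  # ['DONE', 'REJECTED']
print(home.balances)                 # {'alice': 0, 'bob': 100, 'carol': 0}
```

Because requests are only recorded while the owning center is down and applied in order afterwards, the second transfer is rejected instead of double‑spending the balance.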
Compensation measures for user experience include:
Announcements explaining the issue.
Post‑incident compensation such as vouchers.
Additional notifications (e.g., SMS) after the operation completes.
Conclusion
Content compiled from "From Zero to Architecture".
