Designing Multi‑Active Distributed Systems: Overcoming Write Latency and Data Replication Challenges
This article analyzes the architectural impact of cross‑city multi‑active deployments, focusing on data‑layer design, write latency, sharding strategies, replication topologies, and routing considerations to achieve high availability, performance, and scalability in large‑scale distributed systems.
Background
Cross‑city multi‑active (异地多活) is a high‑level design goal for distributed systems whose scale and complexity demand robust high‑availability solutions. Typical three‑tier architectures—access, logic, and data layers—share a common foundation, with the data layer being the critical factor.
Key Challenges
High availability: mitigate node failures with redundancy and rapid recovery.
High performance: support massive concurrent requests with low response latency.
Scalability: adapt to evolving business requirements and traffic patterns.
Cost efficiency: balance resource investment against business value.
Security and functionality: protect data integrity while supporting diverse features.
Data‑Layer Focus
The logic layer is stateless and can switch seamlessly, but the data layer must ensure consistency and integrity during failures. Write operations are the bottleneck because they must synchronize data across replicas, whereas reads can be served from any replica.
Write Latency Across Cities
When moving from intra‑datacenter to cross‑city replication, write latency jumps from sub‑millisecond to tens of milliseconds. Typical round‑trip times measured with ping are:
Same data center: ≤0.5 ms.
Same city, different data center: ≤3 ms.
Different cities (e.g., Shenzhen‑Shanghai): ≤30 ms.
At 30 ms, a single write may incur ~60 ms total latency (request + acknowledgment). If a business cannot tolerate multiple such delays, synchronous cross‑city replication becomes impractical.
Mitigation Strategies
Short‑distance synchronous replication: Deploy replicas in nearby cities (e.g., Guangzhou‑Shenzhen, Shanghai‑Hangzhou) where latency is 5‑7 ms, allowing synchronous writes while still not achieving true multi‑active isolation.
Asynchronous replication with geographic sharding: Partition data by user location, write to the nearest shard, and replicate asynchronously. This reduces write latency but introduces potential data loss during disasters.
Sharding Approaches
Large Write Volume Sharding
When write throughput exceeds a single node’s capacity, split data into multiple shards, each with its own write endpoint. Near‑city sharding can also reduce latency.
Isolation Sharding
Separate shards to avoid single points of failure, applying isolation not only to the data layer but also to dependent services and logic layers (often called “unitization”, “SET”, or “striping”).
Replication Topologies
Simple primary‑secondary setups are insufficient for disaster recovery. Common architectures include:
Three‑Site Five‑Center
One primary and four secondaries spread across three cities and five IDC locations. Writes must be persisted to at least three instances (primary + two secondaries) to satisfy quorum.
Three‑Site Three‑Center
Minimal quorum but higher cost and complexity; often avoided.
Same‑City Three‑Center
Used when cross‑city latency is unacceptable; provides local redundancy.
Dual‑Primary Mutual Replication
Both primaries hold full data but only one processes writes at a time, preventing conflicts.
Unsynced Write List
Track writes that have not yet been replicated; reject operations on those keys until synchronization completes, reducing inconsistency at the cost of occasional data loss.
Routing Implications
Global data: Single write point with multiple read replicas; route reads based on proximity (IDC → city → global).
Near‑city sharded data: Route both reads and writes to the shard’s home city to avoid cross‑city write latency.
Cross‑city sharded data: Allow cross‑city synchronous writes but still route requests to the shard’s primary location; a third city may host only backup replicas for quorum.
Design Process
When evaluating architecture options, consider factors such as latency budgets, write volume, fault isolation, resource utilization, and cost. The final design often combines multiple patterns to meet specific business requirements.
Conclusion
Multi‑active architectures require careful balancing of write latency, replication strategy, sharding, and routing. By analyzing latency thresholds, choosing appropriate replication topologies, and applying geographic sharding, architects can achieve high availability and performance while controlling cost and complexity.
Illustrations
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
