Designing Multi‑Active Distributed Systems: Overcoming Write Latency and Data Replication Challenges

This article analyzes the architectural impact of cross‑city multi‑active deployments, focusing on data‑layer design, write latency, sharding strategies, replication topologies, and routing considerations to achieve high availability, performance, and scalability in large‑scale distributed systems.

Architect

Jan 3, 2025

Background

Cross‑city multi‑active deployment (异地多活) is a demanding design goal for distributed systems whose scale and complexity call for robust high‑availability solutions. Most such systems share a three‑tier architecture of access, logic, and data layers, and of the three the data layer is the critical one.

Key Challenges

High availability: mitigate node failures with redundancy and rapid recovery.

High performance: support massive concurrent requests with low response latency.

Scalability: adapt to evolving business requirements and traffic patterns.

Cost efficiency: balance resource investment against business value.

Security and functionality: protect data integrity while supporting diverse features.

Data‑Layer Focus

The logic layer is stateless and can switch seamlessly, but the data layer must ensure consistency and integrity during failures. Write operations are the bottleneck because they must synchronize data across replicas, whereas reads can be served from any replica.

Write Latency Across Cities

When moving from intra‑datacenter to cross‑city replication, write latency jumps from sub‑millisecond to tens of milliseconds. Typical round‑trip times measured with ping are:

Same data center: ≤0.5 ms.

Same city, different data center: ≤3 ms.

Different cities (e.g., Shenzhen‑Shanghai): ≤30 ms.

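Because ping already reports a full round trip, each synchronous cross‑city step adds up to about 30 ms to a write, and a write that needs two sequential cross‑city round trips (for example, a request that reaches a remote primary, which then replicates to another city) costs about 60 ms. If a business operation performs several such writes and cannot tolerate the accumulated delay, synchronous cross‑city replication becomes impractical.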
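A rough way to check a latency budget against these figures is sketched here. It assumes network time dominates and that a write waits for a fixed number of sequential round trips; the function names and the budgets in the usage note are illustrative and not part of the original article.

```python
# Upper-bound round-trip times quoted in the list above, in milliseconds.
RTT_MS = {
    "same_idc": 0.5,
    "same_city": 3.0,
    "cross_city": 30.0,
}


def write_latency_ms(link: str, sequential_round_trips: int) -> float:
    """Network time for a write that waits on N sequential round trips."""
    if sequential_round_trips < 0:
        raise ValueError("sequential_round_trips must be non-negative")
    return RTT_MS[link] * sequential_round_trips


def fits_budget(link: str, sequential_round_trips: int, budget_ms: float) -> bool:
    """True if the estimated network time stays within the latency budget."""
    return write_latency_ms(link, sequential_round_trips) <= budget_ms
```

For example, `fits_budget("same_city", 5, 20.0)` holds because five same‑city round trips take at most 15 ms, while `fits_budget("cross_city", 5, 100.0)` does not, since five cross‑city round trips take up to 150 ms.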

Mitigation Strategies

Short‑distance synchronous replication: Deploy replicas in nearby cities (e.g., Guangzhou‑Shenzhen, Shanghai‑Hangzhou) where latency is 5‑7 ms. This keeps synchronous writes affordable, but cities this close together do not give true multi‑active isolation.

Asynchronous replication with geographic sharding: Partition data by user location, write to the nearest shard, and replicate asynchronously. This reduces write latency, but writes that have not yet been replicated can be lost if the shard's site is hit by a disaster.
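The trade‑off in the second strategy can be made concrete with a small in‑memory model. This is a sketch under stated assumptions, not a real replication protocol: `GeoShardedStore`, its methods, and the city names in the usage note are hypothetical, and replication is simulated with a queue that is drained on demand.

```python
from collections import deque


class GeoShardedStore:
    """Toy model: each city acknowledges writes locally, then replicates later."""

    def __init__(self, cities):
        self.data = {city: {} for city in cities}
        # Writes acknowledged to the client but not yet shipped to other cities.
        self.pending = deque()

    def write(self, home_city, key, value):
        # Local write only, so the client never waits on a cross-city round trip.
        self.data[home_city][key] = value
        self.pending.append((home_city, key, value))

    def replicate_once(self):
        """Ship every queued write to all other cities; return how many shipped."""
        shipped = 0
        while self.pending:
            source, key, value = self.pending.popleft()
            for city, store in self.data.items():
                if city != source:
                    store[key] = value
            shipped += 1
        return shipped

    def lose_city(self, city):
        """Simulate a disaster; return the writes that were never replicated."""
        lost = [(k, v) for source, k, v in self.pending if source == city]
        self.pending = deque(p for p in self.pending if p[0] != city)
        del self.data[city]
        return lost
```

If a store for Shenzhen and Shanghai takes one write, replicates, takes a second write, and then loses Shenzhen, `lose_city` returns only the second write: that is the window of potential data loss.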

Sharding Approaches

Large Write Volume Sharding

When write throughput exceeds a single node’s capacity, split data into multiple shards, each with its own write endpoint. Near‑city sharding can also reduce latency.
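One common way to spread writes across shards is to hash the key, sketched here. The article does not say which partitioning function is used, so hash‑modulo selection and the `shard_for` name are assumptions made for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # sha256 is stable across processes, unlike Python's built-in hash() for str.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Each shard index would then be mapped to its own write endpoint. Plain modulo hashing remaps most keys when `shard_count` changes, so a system that expects to add shards often would need a different scheme.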

Isolation Sharding

Separate shards to avoid single points of failure, applying isolation not only to the data layer but also to dependent services and logic layers (often called “unitization”, “SET”, or “striping”).

Replication Topologies

Simple primary‑secondary setups are insufficient for disaster recovery. Common architectures include:

Three‑Site Five‑Center

One primary and four secondaries spread across three cities and five IDC locations. Writes must be persisted to at least three instances (primary + two secondaries) to satisfy quorum.
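The quorum arithmetic can be sketched as follows. The 2+2+1 replica layout and the acknowledgment latencies in the usage note are assumptions chosen for illustration; the article only states that three of the five instances must persist a write.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def commit_latency_ms(ack_latencies_ms: list[float]) -> float:
    """Time at which a majority has persisted the write.

    ack_latencies_ms holds one entry per replica, including the primary
    itself (which can be given 0.0 for its local persist).
    """
    ordered = sorted(ack_latencies_ms)
    return ordered[majority(len(ordered)) - 1]


def survives_loss_of_any_city(replicas_per_city: dict[str, int]) -> bool:
    """True if a majority remains after losing any single city."""
    total = sum(replicas_per_city.values())
    return all(total - lost >= majority(total) for lost in replicas_per_city.values())
```

With hypothetical acknowledgment times of `[0.0, 3.0, 7.0, 7.0, 30.0]` (primary, same‑city IDC, two IDCs in a nearby city, one IDC in a distant city), the write commits once the third acknowledgment arrives, at 7 ms, without waiting for the distant city. A 2+2+1 layout also keeps a majority after losing any one city, whereas a 3+1+1 layout does not.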

Three‑Site Three‑Center

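Three replicas, one per city, is the smallest layout that still forms a quorum. With a majority of two out of three, every write has to wait for an acknowledgment from another city. It carries higher cost and complexity and is often avoided.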

Same‑City Three‑Center

Used when cross‑city latency is unacceptable; provides local redundancy.

Dual‑Primary Mutual Replication

Both primaries hold full data but only one processes writes at a time, preventing conflicts.

Unsynced Write List

Track writes that have not yet been replicated; reject operations on those keys until synchronization completes, reducing inconsistency at the cost of occasional data loss.
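A minimal sketch of the bookkeeping follows. The class and exception names are hypothetical, and the sketch leaves out where the list is stored and which site consults it, since the article does not specify either.

```python
class KeyNotSyncedError(Exception):
    """Raised when an operation targets a key with unreplicated writes."""


class UnsyncedWriteList:
    """Tracks keys whose latest write has not been confirmed as replicated."""

    def __init__(self):
        self._unsynced = set()

    def record_write(self, key):
        # Called when a write is accepted locally but not yet replicated.
        self._unsynced.add(key)

    def mark_synced(self, key):
        # Called once replication of the key's writes is confirmed.
        self._unsynced.discard(key)

    def check(self, key):
        # Reject the operation while the key is still on the list.
        if key in self._unsynced:
            raise KeyNotSyncedError(key)
```

Callers would invoke `check(key)` before serving an operation, so only the keys on the list are affected while the rest of the data stays available.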

Routing Implications

Global data: Single write point with multiple read replicas; route reads based on proximity (IDC → city → global).

Near‑city sharded data: Route both reads and writes to the shard’s home city to avoid cross‑city write latency.

Cross‑city sharded data: Allow cross‑city synchronous writes but still route requests to the shard’s primary location; a third city may host only backup replicas for quorum.
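The proximity rule for reads (IDC, then city, then global) can be sketched as a simple fallback chain. The `Replica` type, its fields, and the choice of the first listed replica as the global fallback are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    city: str
    idc: str


def pick_read_replica(replicas, client_city, client_idc):
    """Prefer a replica in the client's IDC, then its city, then any replica."""
    if not replicas:
        raise ValueError("no replicas available")
    for replica in replicas:
        if replica.city == client_city and replica.idc == client_idc:
            return replica
    for replica in replicas:
        if replica.city == client_city:
            return replica
    # Global fallback: here simply the first replica in the list.
    return replicas[0]
```

Writes do not follow this chain: for global data they go to the single write point, and for sharded data they go to the shard's home or primary location.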

Design Process

When evaluating architecture options, consider factors such as latency budgets, write volume, fault isolation, resource utilization, and cost. The final design often combines multiple patterns to meet specific business requirements.

Conclusion

Multi‑active architectures require careful balancing of write latency, replication strategy, sharding, and routing. By analyzing latency thresholds, choosing appropriate replication topologies, and applying geographic sharding, architects can achieve high availability and performance while controlling cost and complexity.
