Designing Multi‑Active Distributed Systems: Key Factors and Replication Strategies
This article analyzes the architectural challenges of building large‑scale distributed systems with multi‑active (cross‑city) capabilities, focusing on data‑layer design, write latency, replication models, sharding techniques, and routing impacts to guide reliable, high‑performance infrastructure decisions.
Background
Large‑scale internet services must balance high availability, low latency, scalability, cost efficiency, security, and functional richness. In a typical three‑tier architecture (access‑logic‑data), the data layer is the bottleneck for multi‑active designs because write operations must stay consistent across replicas.
Write Latency Across Regions
Empirical ping measurements show:
Intra‑datacenter round‑trip < 0.5 ms.
Intra‑city (different IDC within the same city) ≈ 3 ms.
Cross‑city (e.g., Shenzhen‑Shanghai) ≈ 30 ms; longer paths such as Beijing‑Shanghai are slightly higher.
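With a cross‑city RTT of about 30 ms, a write that crosses cities to reach the primary and is then synchronously replicated across cities costs roughly 60 ms: about 30 ms for the client request and another 30 ms for replication. Systems must decide whether to accept this delay or adopt an alternative design.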
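The arithmetic above can be sketched as a small latency model. This is a simplified illustration, not a measurement tool: the function name is hypothetical, and it assumes processing and disk time are negligible next to network round trips.

```python
def sync_write_latency_ms(client_to_primary_rtt_ms: float,
                          replication_rtt_ms: float) -> float:
    """Client-visible latency of one synchronously replicated write.

    The client waits for its own round trip to the primary plus the
    primary's round trip to the replica it must hear back from.
    Processing and disk time are ignored (simplifying assumption).
    """
    return client_to_primary_rtt_ms + replication_rtt_ms


# Approximate RTT figures taken from the ping measurements above.
INTRA_DC_MS = 0.5
CROSS_CITY_MS = 30.0

# Client in one city, primary in another, replica across cities again.
remote_write = sync_write_latency_ms(CROSS_CITY_MS, CROSS_CITY_MS)

# Client in the same datacenter as the primary, replica in another city.
local_write = sync_write_latency_ms(INTRA_DC_MS, CROSS_CITY_MS)

print(remote_write, local_write)
```

Under this model, keeping the client next to its primary roughly halves the write delay, which is the motivation for the near‑sharding designs discussed next.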
Replication Strategies
Synchronous Replication Over Short Distances
Deploy replicas in nearby cities (e.g., Guangzhou‑Shenzhen, Shanghai‑Hangzhou) where the RTT is 5‑7 ms. Synchronous replication guarantees strong consistency but does not achieve true cross‑city multi‑active goals.
Asynchronous Near‑Sharding (Lossy)
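Shard data by user geography; each shard is written locally without waiting for cross‑city sync. Writes that have not yet been replicated can be lost if the local site fails, which is why the approach is lossy. It suits workloads that tolerate eventual consistency (social media, video streaming).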
Large‑Write Sharding
When a single node cannot sustain the write volume, split the dataset into multiple shards, each with its own primary write point. Near‑sharding can still be applied to reduce latency.
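A minimal sketch of splitting writes across shards follows. The article does not say how keys are assigned to shards; hash‑based assignment is an assumption here, and the shard names and helper functions are hypothetical.

```python
import hashlib

# Hypothetical primaries, one write point per shard.
SHARD_PRIMARIES = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a stable hash.

    hashlib is used instead of the built-in hash() because the latter
    is randomized per process for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def primary_for(key: str) -> str:
    """Return the primary that owns all writes for this key."""
    return SHARD_PRIMARIES[shard_for(key, len(SHARD_PRIMARIES))]


print(primary_for("user:42"))
```

Plain modulo hashing moves most keys when the shard count changes, so a real deployment would likely need a resharding plan; that is outside this sketch.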
Isolation Sharding
Separate shards to eliminate a single point of failure. Isolation is applied not only to the data layer but also to dependent services and the logic layer.
Layer‑Wide Isolation (Unitization)
Route requests at the access layer to a specific “unit”. The logic layer processes only data belonging to that unit, and the data layer can optionally use cross‑city synchronous replication for added resilience.
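The unit routing described above might look like the following sketch. The modulo mapping, the unit count, and the class and function names are all illustrative assumptions; the article only states that the access layer picks a unit and the logic layer stays within it.

```python
from dataclasses import dataclass

UNIT_COUNT = 4  # hypothetical number of units


def unit_of(user_id: int) -> int:
    """Assumed mapping from a user to a unit (simple modulo)."""
    return user_id % UNIT_COUNT


@dataclass
class LogicNode:
    """Logic-layer node that only serves data belonging to its unit."""
    unit: int

    def handle(self, user_id: int, action: str) -> str:
        if unit_of(user_id) != self.unit:
            # A misrouted request is refused rather than served
            # from another unit's data.
            raise ValueError(f"user {user_id} does not belong to unit {self.unit}")
        return f"unit {self.unit} handled {action} for user {user_id}"


def access_layer_route(user_id: int, nodes: dict[int, LogicNode], action: str) -> str:
    """Access layer: pick the unit first, then hand off to its logic node."""
    return nodes[unit_of(user_id)].handle(user_id, action)


nodes = {u: LogicNode(u) for u in range(UNIT_COUNT)}
print(access_layer_route(10, nodes, "pay"))
```

Having the logic node re‑check ownership is a defensive choice: a routing bug then surfaces as an error instead of a silent cross‑unit write.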
Data Replication Topologies
Three‑Site Five‑Center
One primary and four secondaries are spread across three cities and five IDC sites. Writes must be persisted to at least three replicas to satisfy a majority quorum.
Three‑Site Three‑Center
Requires a smaller majority (two of three replicas), but with one replica per city every commit waits on a cross‑city acknowledgment, so coordination cost is higher; generally less preferred.
Same‑City Three‑Center
All replicas reside within a single city, eliminating cross‑city latency at the expense of geographic redundancy.
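To compare the three majority‑based layouts above, the sketch below estimates commit latency as the time until a majority of replicas has acknowledged a write. The replica placements and RTT values are illustrative assumptions built from the ping figures earlier in the article, and the function names are hypothetical.

```python
def majority(replica_count: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def commit_latency_ms(secondary_rtts_ms: list[float]) -> float:
    """Time until a majority has persisted the write.

    The primary counts as one vote, so it only needs acknowledgments
    from the (majority - 1) fastest secondaries.
    """
    replica_count = len(secondary_rtts_ms) + 1  # secondaries plus the primary
    acks_needed = majority(replica_count) - 1
    if acks_needed == 0:
        return 0.0
    return sorted(secondary_rtts_ms)[acks_needed - 1]


# Assumed placements, RTTs measured from the primary (ms).
three_site_five_center = [3.0, 30.0, 30.0, 30.0]  # one same-city peer, three remote
three_site_three_center = [30.0, 30.0]            # both secondaries remote
same_city_three_center = [3.0, 3.0]               # both secondaries in the same city

for name, rtts in [
    ("three-site five-center", three_site_five_center),
    ("three-site three-center", three_site_three_center),
    ("same-city three-center", same_city_three_center),
]:
    print(name, majority(len(rtts) + 1), commit_latency_ms(rtts))
```

With these assumed numbers, both cross‑city layouts commit in about one cross‑city round trip, while the same‑city layout commits in about one intra‑city round trip.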
Dual‑Master Mutual Replication
Each instance holds a full copy of the data and replicates to the other. At any time each master accepts writes for only its own subset of the data, so no record has two writers and write conflicts are avoided.
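A minimal sketch of this ownership rule is shown below. The class, the shard‑based split of ownership, and the in‑process "replication" are assumptions made for illustration; a real system would replicate over the network.

```python
class DualMaster:
    """One of two masters; each holds all data but writes only its own shards."""

    def __init__(self, name: str, owned_shards: set[int]):
        self.name = name
        self.owned = set(owned_shards)
        self.data: dict[tuple[int, str], str] = {}
        self.peer: "DualMaster | None" = None

    def write(self, shard: int, key: str, value: str) -> None:
        if shard not in self.owned:
            # Refusing non-owned shards keeps a single writer per record.
            raise PermissionError(f"{self.name} does not own shard {shard}")
        self.data[(shard, key)] = value
        if self.peer is not None:
            # Stand-in for replication to the other master.
            self.peer.data[(shard, key)] = value

    def take_over(self, shards: set[int]) -> None:
        """Assume ownership of the peer's shards, e.g. after it fails."""
        self.owned |= set(shards)


a = DualMaster("a", {0})
b = DualMaster("b", {1})
a.peer, b.peer = b, a

a.write(0, "k", "v1")
b.write(1, "k", "v2")
print(sorted(a.data.items()) == sorted(b.data.items()))
```

Because both masters already hold a full copy, `take_over` only has to change ownership, not move data.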
Unsynced List (Lossy)
Maintain a list of records that have not yet been replicated to remote sites. Operations on records still in the list are rejected to preserve consistency; the design accepts that a small number of unreplicated writes may be lost.
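The bookkeeping for such a list can be sketched as follows. Where the list is stored and when it is consulted are not specified in the article, so the class and method names here are hypothetical and the sketch only shows the accept/reject rule.

```python
class UnsyncedTracker:
    """Tracks keys whose latest write has not reached remote sites yet."""

    def __init__(self) -> None:
        self._pending: set[str] = set()

    def record_local_write(self, key: str) -> None:
        """Called when a write is accepted locally but not yet replicated."""
        self._pending.add(key)

    def mark_replicated(self, key: str) -> None:
        """Called once remote sites have confirmed the write."""
        self._pending.discard(key)

    def check_operation(self, key: str) -> None:
        """Reject operations on records that may be stale elsewhere."""
        if key in self._pending:
            raise RuntimeError(f"record {key!r} is not yet replicated")


tracker = UnsyncedTracker()
tracker.record_local_write("order:1001")

try:
    tracker.check_operation("order:1001")
    allowed_before_sync = True
except RuntimeError:
    allowed_before_sync = False

tracker.mark_replicated("order:1001")
tracker.check_operation("order:1001")  # no longer rejected
print(allowed_before_sync)
```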
Routing Implications
Global (Non‑Sharded) Data
There is a single write point; reads are routed to the nearest replica based on a hierarchy: datacenter → city → global.
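The datacenter → city → global read hierarchy can be sketched as below. The `Replica` type, the site names, and the choice of the first listed replica as the global fallback are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    name: str
    city: str
    datacenter: str


def pick_read_replica(client_city: str, client_dc: str,
                      replicas: list[Replica]) -> Optional[Replica]:
    """Prefer a replica in the same datacenter, then the same city,
    then fall back to any replica (here: the first one listed)."""
    for r in replicas:
        if r.city == client_city and r.datacenter == client_dc:
            return r
    for r in replicas:
        if r.city == client_city:
            return r
    return replicas[0] if replicas else None


replicas = [
    Replica("primary", "shenzhen", "sz-idc-1"),
    Replica("replica-a", "shenzhen", "sz-idc-2"),
    Replica("replica-b", "shanghai", "sh-idc-1"),
]

print(pick_read_replica("shanghai", "sh-idc-1", replicas).name)
print(pick_read_replica("shenzhen", "sz-idc-3", replicas).name)
print(pick_read_replica("beijing", "bj-idc-1", replicas).name)
```

Writes are not covered by this function: with global data they always go to the single write point, wherever the client is.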
Near‑Sharding Data
Both reads and writes are directed to the city where the shard resides, ensuring all operations stay within the same city and avoid cross‑city latency.
Cross‑City Sharding Data
Shards are placed in different cities, and an additional city hosts a replica that completes the majority for disaster recovery. Writes may have to be acknowledged synchronously across multiple cities.
Architecture Selection Guidance
When designing a multi‑active system, evaluate:
Business scale and growth trajectory.
Latency tolerance (e.g., can the application accept 30 ms or 60 ms write delay?).
Write volume and whether sharding is required.
Consistency requirements (strong vs. eventual).
Resource and cost constraints.
Select a replication topology and routing strategy that balances performance, availability, and cost based on these factors.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
