Designing Multi‑Active Distributed Systems: Key Factors and Replication Strategies

This article analyzes the architectural challenges of building large‑scale distributed systems with multi‑active (cross‑city) capabilities, focusing on data‑layer design, write latency, replication models, sharding techniques, and routing impacts to guide reliable, high‑performance infrastructure decisions.

Architect

Oct 17, 2024

Background

Large‑scale internet services must balance high availability, low latency, scalability, cost efficiency, security, and functional richness. In a typical three‑tier architecture (access‑logic‑data), the data layer is the bottleneck for multi‑active designs because write operations must stay consistent across replicas.

Write Latency Across Regions

Empirical ping measurements show:

Intra‑datacenter round‑trip < 0.5 ms.

Intra‑city (different IDC within the same city) ≈ 3 ms.

Cross‑city (e.g., Shenzhen‑Shanghai) ≈ 30 ms; longer paths such as Beijing‑Shanghai are slightly higher.

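At a cross‑city RTT of ~30 ms, a write whose client and synchronous replica both sit in a different city from the primary takes roughly 60 ms: one round trip for the request to reach the primary, and another for the primary to replicate the write before acknowledging it. Systems must decide whether to accept this delay or adopt alternative designs.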
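The arithmetic above can be sketched as a small helper. This is a rough model only: it assumes write latency is dominated by network round trips and ignores processing and disk time. The function name and constants are illustrative, with the RTT values taken from the measurements listed earlier.

```python
def write_latency_ms(client_to_primary_rtt_ms, replication_rtt_ms=0.0):
    """Estimate write latency as the client's round trip to the primary plus
    one synchronous replication round trip (0 if replication is asynchronous)."""
    return client_to_primary_rtt_ms + replication_rtt_ms


# RTT figures from the measurements above (upper bound or approximation).
INTRA_DC_MS = 0.5
INTRA_CITY_MS = 3.0
CROSS_CITY_MS = 30.0

# Client next to the primary, synchronous replica in another IDC of the same city.
local_write = write_latency_ms(INTRA_DC_MS, INTRA_CITY_MS)

# Client in another city, synchronous replica in another city as well.
remote_sync_write = write_latency_ms(CROSS_CITY_MS, CROSS_CITY_MS)

print(local_write, remote_sync_write)  # 3.5 60.0
```

The gap between the two results is what the replication strategies below try to avoid or contain.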

Replication Strategies

Synchronous Replication Over Short Distances

Deploy replicas in nearby cities (e.g., Guangzhou‑Shenzhen, Shanghai‑Hangzhou) where the RTT is 5‑7 ms. At that distance synchronous replication adds little write latency and guarantees strong consistency, but the replicas remain geographically close, so this does not achieve true cross‑city multi‑active goals.

Asynchronous Near‑Sharding (Lossy)

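Shard data by user geography and write each shard locally, in the city nearest its users, without waiting for cross‑city sync. Because replication to other cities is asynchronous, writes that have not yet been replicated can be lost if the local site fails, which is why this approach is lossy. It suits workloads that tolerate eventual consistency, such as social media and video streaming.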
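A minimal in-memory sketch of this idea follows. The region-to-city mapping, the class name, and the explicit `replicate_pending` step are all assumptions made for illustration; a real system would replicate in the background through its database or a message queue.

```python
from collections import deque

# Hypothetical mapping from a user's home region to the city hosting its shard.
HOME_CITY = {"south": "Shenzhen", "east": "Shanghai", "north": "Beijing"}


class NearShardStore:
    """Writes land in the user's home city; other cities catch up later."""

    def __init__(self, cities):
        self.data = {city: {} for city in cities}
        self.pending = deque()  # local writes not yet copied to other cities

    def write(self, region, key, value):
        city = HOME_CITY[region]
        self.data[city][key] = value            # acknowledged after the local write
        self.pending.append((city, key, value))
        return city

    def replicate_pending(self):
        """Copy queued writes to every other city; returns how many were copied."""
        copied = 0
        while self.pending:
            origin, key, value = self.pending.popleft()
            for city, table in self.data.items():
                if city != origin:
                    table[key] = value
            copied += 1
        return copied


store = NearShardStore(HOME_CITY.values())
store.write("south", "user:1", "profile-v1")   # visible in Shenzhen only
store.replicate_pending()                      # now visible in every city
```

Anything still in `pending` when the home city fails is the data this design accepts losing.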

Large‑Write Sharding

When a single node cannot sustain the write volume, split the dataset into multiple shards, each with its own primary write point. Near‑sharding can still be applied to reduce latency.
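One common way to split a dataset is to hash a shard key and map the result to a primary. The sketch below assumes hash-based sharding, which the article does not prescribe, and the primary names are hypothetical.

```python
import hashlib

# Hypothetical primary write points, one per shard.
SHARD_PRIMARIES = ["db-primary-0", "db-primary-1", "db-primary-2", "db-primary-3"]


def shard_for(key, shard_count):
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def primary_for(key):
    """Return the primary write point responsible for this key."""
    return SHARD_PRIMARIES[shard_for(key, len(SHARD_PRIMARIES))]


print(primary_for("user:42"))
```

A plain modulo mapping reshuffles most keys when the shard count changes; consistent hashing or a lookup table are the usual alternatives when shards must be added online.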

Isolation Sharding

Separate shards to eliminate a single point of failure. Isolation is applied not only to the data layer but also to dependent services and the logic layer.

Layer‑Wide Isolation (Unitization)

Route requests at the access layer to a specific “unit”. The logic layer processes only data belonging to that unit, and the data layer can optionally use cross‑city synchronous replication for added resilience.
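The routing described above might look like the following sketch. The unit names, the CRC-based user-to-unit mapping, and the `LogicNode` class are assumptions for illustration; real deployments often keep the mapping in a routing table so users can be moved between units.

```python
import zlib
from dataclasses import dataclass

# Hypothetical units; each would have its own logic and data layers.
UNITS = ["unit-a", "unit-b", "unit-c"]


def unit_for_user(user_id):
    """Assumed mapping rule: a stable checksum of the user ID picks the unit."""
    return UNITS[zlib.crc32(user_id.encode("utf-8")) % len(UNITS)]


@dataclass
class LogicNode:
    unit: str

    def handle(self, user_id, payload):
        # The logic layer refuses data that belongs to another unit.
        if unit_for_user(user_id) != self.unit:
            raise ValueError(f"{user_id} does not belong to {self.unit}")
        return f"{self.unit} handled {payload} for {user_id}"


def access_layer_route(user_id, payload, nodes):
    """The access layer resolves the unit and forwards the request to it."""
    return nodes[unit_for_user(user_id)].handle(user_id, payload)


nodes = {unit: LogicNode(unit) for unit in UNITS}
print(access_layer_route("user-1001", "update-profile", nodes))
```

The ownership check in `handle` is what keeps a misrouted request from touching another unit's data.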

Data Replication Topologies

Three‑Site Five‑Center

One primary and four secondaries are spread across three cities and five IDC sites. Writes must be persisted to at least three replicas to satisfy a majority quorum.

Three‑Site Three‑Center

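With one replica in each of three cities, a majority is only two replicas, but every write must be acknowledged by a replica in another city, so each commit pays the cross‑city coordination cost. This layout is generally less preferred.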

Same‑City Three‑Center

All replicas reside within a single city, eliminating cross‑city latency at the expense of geographic redundancy.
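To compare the three majority-based layouts above, the sketch below estimates commit latency as the time until enough followers have acknowledged a write. It assumes the primary counts toward the quorum at no network cost and that latency equals follower RTT. The replica placements are hypothetical examples, not layouts given by the article.

```python
def majority(replica_count):
    """Smallest number of replicas that forms a majority."""
    return replica_count // 2 + 1


def quorum_commit_latency_ms(follower_rtts_ms):
    """Latency until a majority (primary included) has persisted the write."""
    total = len(follower_rtts_ms) + 1            # followers plus the primary
    needed_followers = majority(total) - 1       # the primary already has the write
    if needed_followers == 0:
        return 0.0
    return sorted(follower_rtts_ms)[needed_followers - 1]


# Three-site five-center, replicas placed 2-2-1 with one follower in the primary's city.
print(quorum_commit_latency_ms([3, 30, 30, 30]))   # 30
# Three-site five-center, replicas placed 3-1-1 with the primary in the largest city.
print(quorum_commit_latency_ms([3, 3, 30, 30]))    # 3
# Three-site three-center: one replica per city.
print(quorum_commit_latency_ms([30, 30]))          # 30
# Same-city three-center.
print(quorum_commit_latency_ms([3, 3]))            # 3
```

Under these assumptions, the 3-1-1 placement commits at intra-city speed, but losing the city that holds three replicas also loses the majority, so lower latency is traded against disaster tolerance.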

Dual‑Master Mutual Replication

Each instance holds a full copy of the data and replicates its changes to the other. Each master accepts writes for only its own subset of the data, so any given record has a single writer at any time, which avoids write conflicts.
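The single-writer rule can be sketched as an ownership check in front of each master. The ownership rule (checksum parity) and the class below are assumptions for illustration, and replication is shown inline for brevity, whereas a real deployment would ship changes to the peer separately.

```python
import zlib


class DualMaster:
    """Two masters, each with a full copy, each writable for its own key subset."""

    def __init__(self):
        self.copies = {"A": {}, "B": {}}

    @staticmethod
    def owner(key):
        # Assumed ownership rule: checksum parity decides which master may write.
        return "A" if zlib.crc32(key.encode("utf-8")) % 2 == 0 else "B"

    def write(self, master, key, value):
        if self.owner(key) != master:
            raise PermissionError(f"master {master} does not own {key!r}")
        self.copies[master][key] = value
        peer = "B" if master == "A" else "A"
        self.copies[peer][key] = value           # mutual replication to the peer


cluster = DualMaster()
writer = DualMaster.owner("order-1")
cluster.write(writer, "order-1", "paid")         # accepted by the owning master
```

On failover, ownership of the failed master's subset would move to the surviving master; that handover is outside this sketch.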

Unsynced List (Lossy)

Maintain a list of records that have not yet been replicated to remote sites. Operations on records in this list are rejected to preserve consistency, at the cost of a small amount of data loss.
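A minimal sketch of the bookkeeping follows. The class and method names are hypothetical, and how the list itself is shared with remote sites is not covered by the article, so it is left out here.

```python
class UnsyncedList:
    """Tracks keys whose latest write has not yet reached remote sites."""

    def __init__(self):
        self._unsynced = set()

    def record_local_write(self, key):
        self._unsynced.add(key)

    def mark_replicated(self, key):
        self._unsynced.discard(key)

    def is_allowed(self, key):
        """Operations on a record are allowed only when it is fully replicated."""
        return key not in self._unsynced


tracker = UnsyncedList()
tracker.record_local_write("account:7")
print(tracker.is_allowed("account:7"))   # False: the record is still unsynced
tracker.mark_replicated("account:7")
print(tracker.is_allowed("account:7"))   # True
```

Only the records in the list are affected, so the rest of the dataset stays available while those records are blocked.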

Routing Implications

Global (Non‑Sharded) Data

There is a single write point; reads are routed to the nearest replica based on a hierarchy: datacenter → city → global.
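The read-routing hierarchy can be sketched as a fallback search. The `Replica` fields and the sample names are assumptions, and the global fallback simply takes the first replica in the list, where a real router would likely rank by measured latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    city: str
    datacenter: str


def pick_read_replica(client_city, client_dc, replicas):
    """Prefer a replica in the same datacenter, then the same city, then any."""
    if not replicas:
        raise LookupError("no replicas available")
    for replica in replicas:
        if replica.city == client_city and replica.datacenter == client_dc:
            return replica
    for replica in replicas:
        if replica.city == client_city:
            return replica
    return replicas[0]                            # global fallback


replicas = [
    Replica("r1", "Shenzhen", "sz-idc-1"),
    Replica("r2", "Shenzhen", "sz-idc-2"),
    Replica("r3", "Shanghai", "sh-idc-1"),
]
print(pick_read_replica("Shenzhen", "sz-idc-2", replicas).name)  # r2
```

Writes are not routed this way: with global data they always go to the single write point.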

Near‑Sharding Data

Both reads and writes are directed to the city where the shard resides, ensuring all operations stay within the same city and avoid cross‑city latency.

Cross‑City Sharding Data

Shards are placed in different cities, and an additional city provides a majority replica for disaster recovery. Writes may traverse multiple cities synchronously.

Architecture Selection Guidance

When designing a multi‑active system, evaluate:

Business scale and growth trajectory.

Latency tolerance (e.g., can the application accept 30 ms or 60 ms write delay?).

Write volume and whether sharding is required.

Consistency requirements (strong vs. eventual).

Resource and cost constraints.

Select a replication topology and routing strategy that balances performance, availability, and cost based on these factors.
