Why Write Latency Drives Multi‑Active Distributed Architecture Design
This article analyzes how write latency, write volume, isolation, and data replication strategies influence the design of multi‑active distributed systems, offering practical guidance on sharding, synchronous and asynchronous replication, routing, and architecture selection for high availability and performance across regions.
Introduction
Cross‑region multi‑active deployment, in which sites in different regions all serve live traffic, is a critical milestone in distributed system architecture; by the time a business needs it, its scale and complexity have increased dramatically. The common three‑tier architecture consists of access, logic, and data layers, and the data layer is the key factor influencing design decisions.
Basic Infrastructure
Modern software systems must address high availability, high performance, scalability, cost efficiency, security, and multifunctionality. These goals translate into challenges such as node failures, massive concurrent requests, evolving business requirements, and the need to balance cost with value.
The architecture can be abstracted as a data‑centric processing pipeline: access → logic → data. The focus of this article is on the data layer from a backend perspective.
Multi‑Active Considerations
When a system requires multi‑active capabilities, high availability becomes the core objective. Disaster recovery (DR) must provide redundancy to avoid single points of failure, ensuring that when a component fails, another can take over with minimal impact.
Failure scenarios differ by business scale:
Single‑machine deployments face disk failures, OS failures, and data loss; backups and master‑slave replication mitigate them.
Single‑data‑center deployments face data‑center‑level failures; distributing machines across multiple data centers mitigates them.
Single‑city deployments require cross‑city redundancy because city‑wide disasters (e.g., typhoons, earthquakes) can affect all local data centers.
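Resolving city‑level single points of failure defines the multi‑active problem: services must be deployed in more than one city to achieve cross‑city DR. Neighboring city pairs (e.g., Guangzhou‑Shenzhen, Beijing‑Tianjin, Shanghai‑Hangzhou) keep latency low, while genuinely distant cities give stronger protection at the cost of higher latency.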
Write Latency Is Critical
In the data layer, write operations are the bottleneck because they must maintain consistency across replicas. Local writes can be synchronous with low latency (<0.5 ms). Across cities, latency grows dramatically:
Intra‑data‑center round‑trip < 0.5 ms.
Intra‑city across data centers ≈ 3 ms.
Cross‑city (e.g., Shenzhen‑Shanghai) ≈ 30 ms, with longer paths (Beijing‑Shanghai, Shenzhen‑Tianjin) even higher.
When a single write costs about 30 ms, user‑facing response times start to suffer. Synchronous cross‑city replication adds a further cross‑city round trip, roughly doubling the latency (≈ 60 ms), and the cost multiplies again if a request performs several writes in sequence. Designers must decide whether the business can tolerate such delays.
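The arithmetic behind these figures can be sketched as follows. This is a rough model, not a measurement: it assumes the round‑trip values quoted above, assumes that the 60 ms figure comes from one cross‑city hop to the write node plus one cross‑city hop for synchronous replication, and assumes that sequential writes simply add up. The names `RTT_MS` and `write_latency_ms` are illustrative.

```python
from typing import Optional

# Approximate round-trip times quoted in this section, in milliseconds.
RTT_MS = {
    "same_idc": 0.5,
    "same_city": 3.0,
    "cross_city": 30.0,
}


def write_latency_ms(
    client_to_master: str,
    replication: Optional[str],
    sequential_writes: int = 1,
) -> float:
    """Estimate the network cost of one request.

    client_to_master: link between the caller and the write node.
    replication: link used for synchronous replication, or None when
        replication is asynchronous and the caller does not wait for it.
    sequential_writes: number of writes issued one after another.
    """
    per_write = RTT_MS[client_to_master]
    if replication is not None:
        per_write += RTT_MS[replication]
    return per_write * sequential_writes


if __name__ == "__main__":
    print(write_latency_ms("same_idc", None))              # local write
    print(write_latency_ms("cross_city", "cross_city"))    # remote master + sync copy
    print(write_latency_ms("cross_city", "cross_city", 3)) # three writes in a row
```

Under these assumptions a request with three sequential, synchronously replicated cross‑city writes spends about 180 ms on the network alone, which is the kind of figure the tolerance decision has to be made against.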
Sharding for High Write Volume
If write traffic overwhelms a single write point, data must be split into shards, each with its own write node. Placing shards in nearby cities (near‑sharding, e.g., Guangzhou‑Shenzhen) keeps latency at 5‑7 ms, low enough for synchronous replication, but it still does not achieve true multi‑active goals.
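A minimal sketch of giving each shard its own write node is shown below. It assumes hash‑based placement on a string key, which is only one of several possible schemes (range or directory‑based placement are alternatives), and the node names are hypothetical.

```python
import hashlib

# Hypothetical layout: one write node per shard, spread over two nearby cities.
SHARD_WRITE_NODES = ["gz-write-0", "gz-write-1", "sz-write-0", "sz-write-1"]


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def write_node_for(key: str) -> str:
    """Return the single write node responsible for this key."""
    return SHARD_WRITE_NODES[shard_for(key, len(SHARD_WRITE_NODES))]


if __name__ == "__main__":
    for order_id in ("order-1", "order-2", "order-3"):
        print(order_id, "->", write_node_for(order_id))
```

Plain modulo hashing keeps the example short, but changing the shard count remaps most keys, so a production design would need a migration plan or a different placement scheme.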
For write‑heavy, append‑only data (e.g., e‑commerce orders), older low‑frequency data can be archived, reducing storage pressure.
Isolation Sharding
Isolation prevents a single failure from affecting the entire system. By partitioning data into independent shards, failures are contained. Isolation often extends beyond the data layer to include access and logic layers, forming unit‑based or SET‑based isolation schemes.
Access layer routes requests to the appropriate unit.
Logic layer processes only requests belonging to its unit.
Data layer replicates within each unit, possibly using cross‑city synchronous replication.
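The division of responsibility between the access and logic layers can be sketched as below. The sketch assumes units are assigned by user ID modulo a fixed unit count; `UNIT_COUNT`, `unit_of`, `LogicNode`, and `access_layer_route` are illustrative names, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Dict

UNIT_COUNT = 4  # hypothetical number of isolated units


def unit_of(user_id: int) -> int:
    """Assumed assignment rule: a user belongs to exactly one unit."""
    return user_id % UNIT_COUNT


@dataclass
class LogicNode:
    unit: int

    def handle(self, user_id: int, payload: str) -> str:
        # The logic layer refuses requests that belong to another unit,
        # so a routing mistake cannot leak work across units.
        if unit_of(user_id) != self.unit:
            raise ValueError(f"user {user_id} does not belong to unit {self.unit}")
        return f"unit {self.unit} processed {payload!r} for user {user_id}"


def access_layer_route(user_id: int, nodes: Dict[int, LogicNode]) -> LogicNode:
    """The access layer picks the logic node of the user's unit."""
    return nodes[unit_of(user_id)]


if __name__ == "__main__":
    nodes = {u: LogicNode(u) for u in range(UNIT_COUNT)}
    print(access_layer_route(6, nodes).handle(6, "pay"))
```

Because each unit only ever sees its own users, a failure inside one unit affects roughly one unit's share of traffic rather than the whole system.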
Data Replication Architectures
Simple master‑slave is insufficient for robust DR. Common architectures include:
Three‑Site Five‑Center
Deploy 1 master and 4 slaves across three cities and five IDC locations. Writes must be persisted to at least three instances (master + two slaves) to satisfy majority consensus, ensuring data availability even if an entire city fails.
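The majority rule and the city‑failure claim can be checked with a few lines. The sketch assumes a 2 + 2 + 1 placement of the five instances over the three cities, which the text does not state explicitly; the IDC and city names are placeholders.

```python
from typing import Dict, Set

# Assumed 2 + 2 + 1 placement of five instances over three cities.
PLACEMENT: Dict[str, str] = {
    "idc-a1": "city-a",
    "idc-a2": "city-a",
    "idc-b1": "city-b",
    "idc-b2": "city-b",
    "idc-c1": "city-c",
}


def majority(instance_count: int) -> int:
    """Smallest number of instances that forms a majority."""
    return instance_count // 2 + 1


def write_committed(acks: Set[str], placement: Dict[str, str]) -> bool:
    """A write counts as committed once a majority of instances persisted it."""
    return len(acks & set(placement)) >= majority(len(placement))


def survives_city_loss(city: str, placement: Dict[str, str]) -> bool:
    """True if the instances outside `city` can still form a majority."""
    remaining = sum(1 for c in placement.values() if c != city)
    return remaining >= majority(len(placement))


if __name__ == "__main__":
    print(majority(len(PLACEMENT)))
    print(write_committed({"idc-a1", "idc-a2", "idc-b1"}, PLACEMENT))
    print({c: survives_city_loss(c, PLACEMENT) for c in set(PLACEMENT.values())})
```

With five instances the majority is three, and no city in the assumed placement holds more than two, so losing any one city leaves at least three instances. A 3 + 1 + 1 placement would fail the same check for the city holding three.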
Three‑Site Three‑Center
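Deploy three instances across three cities, one per data center, so that a write needs only two acknowledgements, the smallest possible majority. With one instance per city, any failover moves the master to another city, so this layout incurs higher cross‑city switch costs and wastes resources, making it less popular.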
Same‑City Three‑Center
Deploy all three instances in different data centers of one city. This provides intra‑city DR without cross‑city latency and suits workloads for which cross‑city delays are unacceptable.
Dual‑Master Mutual Replication
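Each instance holds the full data set, but for any given subset of the data only one master accepts writes at a time, which avoids conflicts. Each write carries a write timestamp (Ti) and, once it has been replicated to the peer, a replication timestamp (Ts). When an instance fails, it is banned from serving at time Tb. A write with Ti < Tb < Ts was accepted before the ban but had not reached the peer when the ban took effect, so the peer takes over without it. The two instances are then inconsistent, and the same operation may be written again on the peer as a duplicate.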
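The timestamp condition can be expressed as a small predicate. This is a sketch under one added assumption: a write that never replicated is treated as if Ts were infinitely late, which the text does not spell out. `WriteRecord` and `at_risk` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WriteRecord:
    key: str
    t_write: float             # Ti: when the master accepted the write
    t_synced: Optional[float]  # Ts: when the peer received it, None if never


def at_risk(record: WriteRecord, t_ban: float) -> bool:
    """True when Ti < Tb < Ts, i.e. the write predates the ban but the
    peer had not received it when the ban took effect."""
    if record.t_write >= t_ban:
        return False  # the banned instance should not have accepted this at all
    # Assumption: "never replicated" behaves like an infinitely late Ts.
    return record.t_synced is None or record.t_synced > t_ban


def writes_to_reconcile(log: List[WriteRecord], t_ban: float) -> List[str]:
    """Keys whose latest state may be missing on the surviving instance."""
    return [r.key for r in log if at_risk(r, t_ban)]


if __name__ == "__main__":
    log = [
        WriteRecord("order-1", t_write=10.0, t_synced=15.0),
        WriteRecord("order-2", t_write=18.0, t_synced=25.0),
        WriteRecord("order-3", t_write=19.0, t_synced=None),
    ]
    print(writes_to_reconcile(log, t_ban=20.0))
```

With the ban at 20.0, the first write replicated in time, while the other two fall inside the risk window and would need reconciliation after the failed instance recovers.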
Unsynced List Mechanism
Before a write, the system records the data’s ownership in an “unsynced list.” After successful cross‑city sync, the entry is cleared. Subsequent writes check this list and reject operations on unsynced data, reducing data loss at the cost of occasional write rejections.
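The mechanism can be sketched as a small bookkeeping class. This is the simplest reading of the description: any write to a key that is still listed is rejected. A real system might apply the check only on the instance that takes over after a failure, and it would keep the list somewhere that survives the failure; here it is an in‑memory dictionary, and all names are hypothetical.

```python
from typing import Dict


class UnsyncedList:
    """Tracks keys whose latest write has not yet been synced across cities."""

    def __init__(self) -> None:
        self._owner_by_key: Dict[str, str] = {}

    def try_begin_write(self, key: str, owner: str) -> bool:
        """Record ownership before writing; refuse if the key is still unsynced."""
        if key in self._owner_by_key:
            return False
        self._owner_by_key[key] = owner
        return True

    def mark_synced(self, key: str) -> None:
        """Clear the entry once cross-city sync has succeeded."""
        self._owner_by_key.pop(key, None)

    def pending(self) -> Dict[str, str]:
        """Snapshot of keys that are still waiting for sync."""
        return dict(self._owner_by_key)


if __name__ == "__main__":
    unsynced = UnsyncedList()
    print(unsynced.try_begin_write("order-7", "shenzhen"))  # accepted
    print(unsynced.try_begin_write("order-7", "shanghai"))  # rejected, not synced yet
    unsynced.mark_synced("order-7")
    print(unsynced.try_begin_write("order-7", "shanghai"))  # accepted after sync
```

The trade‑off described above is visible in the second call: the write is refused rather than risking a conflicting update on data whose latest version may not have arrived.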
Routing Impact of Data Models
Based on data characteristics, three routing models emerge:
Global data (no sharding) – single write point, multiple read replicas; routing selects the nearest replica.
Near‑sharded data – writes are routed to the shard’s city; reads follow the same path, keeping traffic local.
Cross‑city sharded data – shards accept cross‑city synchronous writes; routing still prefers local access, with a third city providing DR without handling traffic.
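The three models above can be compared side by side in one routing function. This is one interpretation rather than a definitive rule set: it assumes writes always go to the data's home city, approximates "nearest replica" as "a replica in the caller's own city", and represents the DR‑only city simply by leaving it out of `serving_cities`. All names are illustrative.

```python
from enum import Enum
from typing import Set


class DataModel(Enum):
    GLOBAL = "global"
    NEAR_SHARDED = "near_sharded"
    CROSS_CITY_SHARDED = "cross_city_sharded"


def route(
    model: DataModel,
    op: str,
    client_city: str,
    home_city: str,
    serving_cities: Set[str],
) -> str:
    """Return the city that should serve the request.

    home_city: city of the write node for this data (or this shard).
    serving_cities: cities whose replicas handle traffic; a DR-only city
        is excluded from this set.
    """
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op}")
    if op == "write":
        return home_city  # assumed: a single write point per data item
    if model is DataModel.NEAR_SHARDED:
        return home_city  # reads follow the write path to stay local to the shard
    # GLOBAL and CROSS_CITY_SHARDED reads prefer a replica in the caller's city.
    return client_city if client_city in serving_cities else home_city


if __name__ == "__main__":
    serving = {"shenzhen", "shanghai"}  # "tianjin" would be DR-only
    print(route(DataModel.GLOBAL, "read", "shanghai", "shenzhen", serving))
    print(route(DataModel.NEAR_SHARDED, "read", "shanghai", "shenzhen", serving))
    print(route(DataModel.CROSS_CITY_SHARDED, "read", "tianjin", "shenzhen", serving))
```

A caller in the DR‑only city is sent to the home city, which matches the idea that the third city holds data for recovery but does not handle traffic.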
Architecture Selection Process
When designing the architecture, evaluate factors such as write latency tolerance, write volume, isolation needs, cost, and resource utilization. The decision flow typically involves:
Assess write latency requirements.
Determine if sharding is needed for write volume.
Choose between synchronous and asynchronous replication based on latency tolerance.
Select an appropriate replication topology (e.g., three‑site five‑center, dual‑master).
Plan routing and isolation strategies to align with business goals.
Each factor interacts; for example, high write volume may push toward near‑sharding, while strict latency constraints may favor synchronous replication within a city and asynchronous replication across cities.
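The decision flow can be written down as a small function to make the interactions explicit. The thresholds and the mapping from requirements to topologies are placeholders chosen for illustration (the 60 ms constant reuses the synchronous cross‑city figure from earlier); a real selection would weigh cost and resource utilization as well.

```python
from dataclasses import dataclass
from typing import Dict

CROSS_CITY_SYNC_MS = 60.0  # approximate cost of a synchronous cross-city write


@dataclass
class Requirements:
    max_write_latency_ms: float      # what the business can tolerate per write
    peak_writes_per_sec: int
    single_node_write_capacity: int  # writes/sec that one write node can absorb
    needs_isolation: bool


def recommend(req: Requirements) -> Dict[str, object]:
    """Walk the decision steps in order and return a rough recommendation."""
    needs_sharding = req.peak_writes_per_sec > req.single_node_write_capacity
    sync_across_cities = req.max_write_latency_ms >= CROSS_CITY_SYNC_MS
    if sync_across_cities:
        topology = "three-site five-center"
    else:
        topology = "same-city synchronous, cross-city asynchronous (e.g. dual-master)"
    return {
        "sharding": needs_sharding,
        "cross_city_replication": "synchronous" if sync_across_cities else "asynchronous",
        "topology": topology,
        "unit_isolation": req.needs_isolation,
    }


if __name__ == "__main__":
    strict = Requirements(10.0, 50_000, 20_000, True)
    relaxed = Requirements(200.0, 5_000, 20_000, False)
    print(recommend(strict))
    print(recommend(relaxed))
```

The strict profile ends up sharded with asynchronous cross‑city replication, while the relaxed one can afford a single write point with synchronous cross‑city replication.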
Conclusion
Designing multi‑active distributed systems requires a deep understanding of how write latency, write volume, isolation, and replication choices affect overall availability, performance, and cost. By carefully evaluating these dimensions and selecting suitable architectures, engineers can build resilient systems that meet both technical and business objectives.
Tencent Cloud Developer
