
How to Build True Multi‑Region Active‑Active Architecture with Bidirectional Sync

This article explains why true multi‑region active‑active requires data to be bidirectionally synchronized across three or more centers, and details a multi‑center disaster‑recovery architecture, distributed ID generation algorithms, CAP considerations, and techniques for achieving eventual consistency.

Alibaba Cloud Developer

Background

True multi-region active-active is achieved only when data can be bidirectionally synchronized among three or more centers; two-center setups fall short. This article introduces a three-center scenario that spans overseas regions and shares a multi-center disaster-recovery architecture, distributed ID generation algorithms, and the process used to implement eventual consistency.

System CAP

For a globally distributed system, partition tolerance is mandatory, while consistency and availability cannot both be fully satisfied. The design chooses eventual consistency as the best trade‑off for online services.

Design Principles

Data partitioning is performed by selecting a data dimension for slicing, allowing services to be deployed in different data centers. Primary keys are designed as distributed IDs to avoid conflicts during synchronization.

SnowFlake Algorithm

Bit layout:

+--------------------------------------------------------------------------+
| 1 Bit Unused | 41 Bit Timestamp | 10 Bit NodeId | 12 Bit Sequence Id |
+--------------------------------------------------------------------------+

Advantages: stateless, no network calls, high efficiency. Disadvantages: clock dependency, limited lifespan (a 41-bit millisecond timestamp lasts about 69 years), concurrency limit (4096 IDs per millisecond per node), only works with int64 IDs.

Suitable for non‑Web applications that use int64 IDs.

Web applications that handle IDs as JavaScript numbers cannot use it directly, because JavaScript represents integers exactly only up to 53 bits; larger values silently lose precision.
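The bit layout above can be sketched as a small generator. This is a minimal illustration, not the article's implementation: the custom epoch `EPOCH_MS`, the class name `SnowflakeGenerator`, and the choice to raise an error when the clock moves backwards are all assumptions made for the example.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch in ms (assumption)
NODE_BITS = 10                 # from the layout: 10-bit NodeId
SEQ_BITS = 12                  # from the layout: 12-bit sequence
MAX_NODE = (1 << NODE_BITS) - 1
MAX_SEQ = (1 << SEQ_BITS) - 1  # 4095, so 4096 IDs per ms per node


class SnowflakeGenerator:
    """Illustrative SnowFlake-style generator (1 | 41 | 10 | 12 bits)."""

    def __init__(self, node_id):
        if not 0 <= node_id <= MAX_NODE:
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    @staticmethod
    def _now_ms():
        return time.time_ns() // 1_000_000

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                # Clock dependency: this sketch refuses to issue IDs
                # if the wall clock goes backwards.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # Sequence exhausted for this ms: wait for the next one.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._seq = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (timestamp << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self._seq
```

With any epoch more than a few weeks in the past, the resulting IDs exceed 2^53 - 1, which is why they cannot round-trip through a JavaScript number.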

RainDrop Algorithm

Adapted from SnowFlake to avoid JavaScript precision loss, using a 53‑bit layout:

+--------------------------------------------------------------------------+
| 11 Bit Unused | 32 Bit Timestamp | 7 Bit NodeId | 14 Bit Sequence Id |
+--------------------------------------------------------------------------+

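Advantages: stateless, no network calls, high efficiency. Disadvantages: clock dependency, limited lifespan (about 136 years, which corresponds to a 32-bit timestamp counted in seconds), lower concurrency than SnowFlake, fewer nodes (a 7-bit NodeId allows 128), only works with int64 IDs.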

Suitable for Web applications that require int64 IDs.
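A comparable sketch for the 53-bit layout follows. It assumes the 32-bit timestamp counts seconds (which is what a 136-year lifespan implies); the epoch `EPOCH_S` and the class name `RaindropGenerator` are hypothetical and chosen only for illustration.

```python
import threading
import time

EPOCH_S = 1_600_000_000        # hypothetical custom epoch in seconds (assumption)
NODE_BITS = 7                  # from the layout: 7-bit NodeId
SEQ_BITS = 14                  # from the layout: 14-bit sequence
MAX_NODE = (1 << NODE_BITS) - 1
MAX_SEQ = (1 << SEQ_BITS) - 1
JS_MAX_SAFE_INTEGER = 2**53 - 1


class RaindropGenerator:
    """Illustrative RainDrop-style generator (32 | 7 | 14 = 53 used bits)."""

    def __init__(self, node_id):
        if not 0 <= node_id <= MAX_NODE:
            raise ValueError("node_id must fit in 7 bits")
        self.node_id = node_id
        self._lock = threading.Lock()
        self._last_s = -1
        self._seq = 0

    def next_id(self):
        with self._lock:
            now = int(time.time())
            if now < self._last_s:
                raise RuntimeError("clock moved backwards")
            if now == self._last_s:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # Sequence exhausted for this second: wait for the next.
                    while now <= self._last_s:
                        time.sleep(0.001)
                        now = int(time.time())
            else:
                self._seq = 0
            self._last_s = now
            timestamp = now - EPOCH_S
            return (timestamp << (NODE_BITS + SEQ_BITS)) | (self.node_id << SEQ_BITS) | self._seq
```

Because every ID stays at or below 2^53 - 1, converting it to a double-precision float and back returns the same integer, which is the property a JavaScript client needs.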

Partition‑Independent Allocation Algorithm

IDs are segmented and each segment is allocated to an independent unit, with coordination inside a unit handled by a shared Redis. Example: the int32 range is split into per-region segments, each supporting 100 million IDs.

Advantages: stateless across regions, reliable uniqueness. Disadvantages: limited capacity per partition, cannot infer generation order from the ID.
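The segmentation can be sketched as below. The 100-million segment size and the int32 range come from the example above; the class name `RegionIdAllocator`, the 1-based start of each segment, and the in-process lock (standing in for the shared Redis counter inside a unit) are assumptions for illustration.

```python
import threading

REGION_SIZE = 100_000_000   # from the example: 100 million IDs per region
INT32_MAX = 2**31 - 1


class RegionIdAllocator:
    """Hands out IDs from one region's fixed segment of the int32 range."""

    def __init__(self, region_index):
        start = region_index * REGION_SIZE + 1
        end = start + REGION_SIZE - 1
        if region_index < 0 or end > INT32_MAX:
            raise ValueError("region segment does not fit in int32")
        self.start = start
        self.end = end
        self._next = start
        # In the article, a shared Redis coordinates allocation within
        # the unit; a local lock plays that role in this sketch.
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next > self.end:
                raise OverflowError("region ID segment exhausted")
            value = self._next
            self._next += 1
            return value
```

Under these assumptions the int32 range holds 21 full segments (region indexes 0 to 20). Two regions never overlap, so no cross-region coordination is needed, but comparing two IDs says nothing about which was generated first.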

Centralized Allocation Algorithm

Uses a central service (Redis, ZooKeeper, or database auto‑increment) for global ID allocation.

Advantages: global monotonic increase, reliable uniqueness, no capacity or concurrency limits. Disadvantages: adds system complexity and a strong reliance on the central service.
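As a rough illustration of the database auto-increment variant, the sketch below uses an in-memory SQLite database as a stand-in for the central service. The table name `id_ticket` and the class name `CentralIdAllocator` are hypothetical; a real deployment would point every center at one shared, highly available service.

```python
import sqlite3


class CentralIdAllocator:
    """Issues globally increasing IDs from a single auto-increment table."""

    def __init__(self, conn):
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS id_ticket ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT)"
        )

    def next_id(self):
        # Each call inserts one row; the database assigns the next value.
        with self._conn:
            cursor = self._conn.execute("INSERT INTO id_ticket DEFAULT VALUES")
            return cursor.lastrowid


# Every center calls the same allocator, so IDs never collide.
central = CentralIdAllocator(sqlite3.connect(":memory:"))
id_from_center_a = central.next_id()
id_from_center_b = central.next_id()
```

Every ID request now depends on the central store being reachable, which is the reliance the disadvantage above refers to.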

Choosing an Allocation Algorithm

ID type: int64 vs. int32

Business capacity and concurrency requirements

Need for JavaScript interaction
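One possible reading of these criteria, based only on the suitability notes in the sections above, is sketched here. The function name, its parameters, and the precedence of the rules are assumptions; business capacity and concurrency requirements still need to be checked against each algorithm's limits separately.

```python
def choose_id_algorithm(id_type, javascript_clients, needs_global_order=False):
    """Map the selection criteria to one of the four algorithms (illustrative)."""
    if id_type not in ("int32", "int64"):
        raise ValueError("id_type must be 'int32' or 'int64'")
    if needs_global_order:
        # Only the centralized allocator offers global monotonic increase.
        return "centralized"
    if id_type == "int32":
        # SnowFlake and RainDrop both require int64 IDs.
        return "partition-independent"
    # int64: pick the 53-bit layout when IDs reach JavaScript clients.
    return "raindrop" if javascript_clients else "snowflake"
```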

Center Closure

Calls should stay within the local center to reduce latency and avoid concurrent writes. Routing strategies such as ADNS, Tengine, or sidecar can be employed.
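A routing layer that enforces this closure might look roughly like the following. The center names, the use of a CRC32 hash over the partition key, and the function names are all hypothetical; the article does not describe how ADNS, Tengine, or the sidecar compute the target center.

```python
import zlib

CENTERS = ["center-a", "center-b", "center-c"]  # hypothetical center names


def home_center(partition_key):
    """Deterministically map a partition key (e.g. a user ID) to one center."""
    digest = zlib.crc32(partition_key.encode("utf-8"))
    return CENTERS[digest % len(CENTERS)]


def route(partition_key, arrived_at):
    """Serve locally when the request is already home; otherwise forward it."""
    home = home_center(partition_key)
    if home == arrived_at:
        return ("local", home)
    return ("forward", home)
```

Since a given key always resolves to the same center, all writes for that key happen in one place, which is what avoids concurrent writes across centers.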

Eventual Consistency

Because DTS lacks full bidirectional sync, a custom component (DRC) is used. Message ordering is ensured per primary key while allowing concurrency across keys. Failover is handled via Raft leader election.
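The ordering rule (serial per primary key, concurrent across keys) can be illustrated by hashing each change to a fixed lane and replaying lanes in parallel. This is a sketch of the general technique, not DRC's actual design; the change format (`{"pk": ..., "value": ...}`), the lane count, and the function names are assumptions, and Raft-based failover is not modeled.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def lane_for(primary_key, lane_count):
    """All changes for one primary key always land in the same lane."""
    return zlib.crc32(str(primary_key).encode("utf-8")) % lane_count


def partition_changes(changes, lane_count):
    """Split an ordered change stream into lanes, preserving order per lane."""
    lanes = [[] for _ in range(lane_count)]
    for change in changes:
        lanes[lane_for(change["pk"], lane_count)].append(change)
    return lanes


def replay(changes, apply, lane_count=4):
    """Apply lanes concurrently; within a lane, changes stay sequential."""
    lanes = partition_changes(changes, lane_count)

    def run_lane(lane):
        for change in lane:
            apply(change)

    with ThreadPoolExecutor(max_workers=lane_count) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(run_lane, lanes))
```

Changes to different keys may interleave in any order, but two changes to the same key are never reordered relative to each other.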

Cross-unit messaging uses Alibaba Cloud MNS, with message coloring to preserve order. MNS does not guarantee strict ordering, however, so consumers are designed to tolerate reordering and rely on eventual consistency.

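CRDT principles (commutativity, associativity, idempotence) are applied so that replayed or duplicated changes converge to the same state. Insert operations are rewritten as INSERT IGNORE, falling back to UPDATE when the row already exists; update operations are idempotent and are converted to INSERT when the target row is missing; delete operations are safe to replay, because deleting a row that is already gone is a no-op.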
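These rewrites can be sketched against SQLite, whose `INSERT OR IGNORE` stands in here for MySQL's `INSERT IGNORE`. The table `item`, its columns, and the function names are hypothetical, and the sketch assumes the incoming change carries the full row.

```python
import sqlite3


def apply_insert(conn, pk, value):
    """Replay an insert; if the row already exists, fall back to UPDATE."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO item (id, value) VALUES (?, ?)", (pk, value)
    )
    if cursor.rowcount == 0:
        conn.execute("UPDATE item SET value = ? WHERE id = ?", (value, pk))


def apply_update(conn, pk, value):
    """Replay an update; if the row is missing, convert it to an insert."""
    cursor = conn.execute("UPDATE item SET value = ? WHERE id = ?", (value, pk))
    if cursor.rowcount == 0:
        conn.execute(
            "INSERT OR IGNORE INTO item (id, value) VALUES (?, ?)", (pk, value)
        )


def apply_delete(conn, pk):
    """Replay a delete; removing an absent row changes nothing."""
    conn.execute("DELETE FROM item WHERE id = ?", (pk,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, value TEXT)")
apply_insert(conn, 1, "first")
apply_insert(conn, 1, "first")      # duplicate delivery, no error
apply_update(conn, 2, "created")    # update arrives before its insert
apply_delete(conn, 3)               # delete for a row that never arrived
```

Applying any of these operations a second time leaves the table unchanged, so duplicated deliveries from the sync channel do no harm.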

Disaster Recovery Architecture

The architecture uses two-level scheduling to enforce center closure and a custom sync component for bidirectional synchronization. Fast recovery relies on quickly cutting off a failed center; writes are disabled during the transition to avoid double writes.
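The cut-off sequence can be pictured as a small controller that fences writes before it reassigns traffic. This is a simplified sketch under stated assumptions: the class name `FailoverController`, the partition-to-center map, and the explicit `resume` step are hypothetical and not taken from the article.

```python
class FailoverController:
    """Tracks which center owns each partition and which writes are fenced."""

    def __init__(self, assignment):
        self.assignment = dict(assignment)   # partition -> owning center
        self.write_blocked = set()

    def can_write(self, partition):
        return partition not in self.write_blocked

    def cut_off(self, failed_center, takeover_center):
        """Fence writes for the failed center's partitions, then reassign them."""
        affected = [p for p, c in self.assignment.items() if c == failed_center]
        # Step 1: disable writes so no partition is written in two centers.
        self.write_blocked.update(affected)
        # Step 2: move ownership to the surviving center.
        for partition in affected:
            self.assignment[partition] = takeover_center
        return affected

    def resume(self, partitions):
        """Re-enable writes once the takeover center has caught up."""
        self.write_blocked.difference_update(partitions)
```

The ordering matters: writes are blocked before ownership changes, so at no point can both the old and the new center accept a write for the same partition.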

Conclusion

The architecture must evolve continuously; the most suitable multi‑center solution depends on specific business needs. The presented multi‑center architecture and implementation are open for discussion.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
