How to Build a Multi‑Active Architecture: From Traffic Routing to Database Sharding
This article explains the design and implementation steps for a multi‑active (geo‑distributed) system, covering data‑center segmentation, traffic scheduling, RPC routing, database and Redis sharding, distributed IDs, message queue handling, job coordination, and cut‑over procedures.
The article outlines a practical approach to transforming a single‑data‑center e‑commerce platform into a multi‑active architecture, where services run in multiple geographically separated data centers (a center data center and a unit data center) to achieve high availability and disaster recovery.
1. Data‑Center Segmentation
Two data centers are defined: Center Data Center (A), the original production environment, and Unit Data Center (B), the new active‑active location. The unit data center is treated as a logical unit that can complete a full business flow (browse, order, pay, view order) independently.
2. Core Refactoring Points
2.1 Traffic Scheduling
Before multi‑active, DNS resolves to a single data center. Once both data centers serve traffic, DNS may send a user's requests to either one at random. This causes state‑inconsistency errors: for example, an order created in the center data center is not found when the user's next request lands in the unit data center. To solve this, a custom DLB gateway built on OpenResty routes each request to the data center the user is assigned to. It also returns the chosen IP in a response header so the client can cache it.
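The routing decision itself can be sketched in Java, although the production gateway runs on OpenResty. This minimal sketch rests on three assumptions. First, each buyerId is hashed into one of 100 slots. Second, contiguous slot ranges are assigned to data centers by a rule that the control console pushes. Third, the header carries a data‑center name rather than an IP, which keeps the example self‑contained. `TrafficRouter`, `SLOT_COUNT` and the header name `X-Route-DC` are hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// The two data centers from section 1.
enum DataCenter { CENTER, UNIT }

// Sketch of a DLB-style routing rule: buyerId -> slot -> owning data center.
// SLOT_COUNT and ROUTE_HEADER are assumptions made for this example.
final class TrafficRouter {
    static final int SLOT_COUNT = 100;
    static final String ROUTE_HEADER = "X-Route-DC";

    // Key = first slot of a range, value = data center that owns the range.
    private final NavigableMap<Integer, DataCenter> rangeStarts = new TreeMap<>();

    TrafficRouter(Map<Integer, DataCenter> rule) {
        rangeStarts.putAll(rule);
        if (!rangeStarts.containsKey(0)) {
            throw new IllegalArgumentException("rule must start at slot 0");
        }
    }

    // floorMod keeps the slot in [0, SLOT_COUNT) even for negative ids.
    static int slotOf(long buyerId) {
        return (int) Math.floorMod(buyerId, (long) SLOT_COUNT);
    }

    // The owning range is the one with the greatest start <= slot.
    DataCenter route(long buyerId) {
        return rangeStarts.floorEntry(slotOf(buyerId)).getValue();
    }

    // Header returned to the client so it can cache the routing decision.
    Map<String, String> responseHeaders(long buyerId) {
        return Map.of(ROUTE_HEADER, route(buyerId).name());
    }
}
```

The result depends only on the buyerId and the current rule. As a result, every request from one user lands in the same data center until the rule changes, which prevents the "order not found" case described above.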
2.2 RPC Framework
Services are categorized into three routing types:
Default Route: Prefer center services; fall back to unit services if they are unavailable.
Unit Route: All calls stay within the unit data center.
Center Route: Calls from the unit go to center services only.
Developers annotate Java interfaces with @HARoute to specify the route type. The metadata is stored in Nacos, and the Dubbo registration process includes the route type.
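The sketch below shows a hypothetical version of the annotation and the routing decision it drives. The source does not describe the real @HARoute, its Nacos metadata format or the Dubbo integration, because these are internal. `RouteType`, `RouteResolver` and `OrderQueryService` are therefore illustrative names. The sketch also reads "stay within the unit" as "never leave the caller's data center".

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Set;

enum RouteType { DEFAULT, UNIT, CENTER }

enum DataCenter { CENTER, UNIT }

// Hypothetical stand-in for the in-house @HARoute annotation; the real one may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface HARoute {
    RouteType value() default RouteType.DEFAULT;
}

final class RouteResolver {
    // What a registration hook could read before publishing the route type to Nacos.
    static RouteType routeTypeOf(Class<?> serviceInterface) {
        HARoute ann = serviceInterface.getAnnotation(HARoute.class);
        return ann == null ? RouteType.DEFAULT : ann.value();
    }

    // Picks the data center for one call, given the caller's own data center
    // and the data centers that currently have healthy providers.
    static DataCenter target(RouteType type, DataCenter local, Set<DataCenter> healthy) {
        switch (type) {
            case CENTER:
                return DataCenter.CENTER;   // always call the center
            case UNIT:
                return local;               // never leave the caller's data center
            default:                        // DEFAULT: prefer center, fall back to unit
                if (healthy.contains(DataCenter.CENTER)) return DataCenter.CENTER;
                if (healthy.contains(DataCenter.UNIT)) return DataCenter.UNIT;
                throw new IllegalStateException("no healthy provider");
        }
    }
}

// Example interface declared as unit-routed.
@HARoute(RouteType.UNIT)
interface OrderQueryService {
    String findOrder(long buyerId, long orderId);
}
```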
2.3 Database Layer
Three database categories are defined:
Unit DB: Deployed in both data centers with bidirectional sync.
Center DB: Exists only in the center data center.
Center‑Unit DB: Writes happen in the center; reads are allowed in both, with one‑way sync from the center to the unit.
A custom proxy middleware (Rainbow Bridge) built on ShardingSphere handles write‑blocking during cut‑over and enforces sharding keys (buyerId) for unit tables.
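The text attributes two checks to Rainbow Bridge: sharding‑key enforcement and write‑blocking during cut‑over. Both can be sketched as a plain guard that runs before a write executes. The sketch does not use ShardingSphere's extension APIs. `WriteGuard`, the 100‑slot buyerId scheme and the table names are assumptions made for illustration.

```java
import java.util.Set;

// Hypothetical sketch of the two checks the text attributes to Rainbow Bridge.
// Not ShardingSphere's API; SLOT_COUNT reuses an assumed buyerId -> slot scheme.
final class WriteGuard {
    static final int SLOT_COUNT = 100;

    private final Set<String> unitTables;
    // Replaced atomically whenever the control console pushes a new rule.
    private volatile Set<Integer> blockedSlots = Set.of();

    WriteGuard(Set<String> unitTables) {
        this.unitTables = Set.copyOf(unitTables);
    }

    void applyBlockRule(Set<Integer> slots) { blockedSlots = Set.copyOf(slots); }

    void clearBlockRule() { blockedSlots = Set.of(); }

    // buyerId is null when the SQL statement carries no sharding key.
    void checkWrite(String table, Long buyerId) {
        if (!unitTables.contains(table)) {
            return; // center tables are not sharded by buyer
        }
        if (buyerId == null) {
            throw new IllegalArgumentException("unit table " + table + " requires buyerId");
        }
        int slot = (int) Math.floorMod(buyerId.longValue(), (long) SLOT_COUNT);
        if (blockedSlots.contains(slot)) {
            throw new IllegalStateException("writes blocked for slot " + slot + " during cut-over");
        }
    }
}
```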
2.4 Distributed ID
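Both data centers insert into the same bidirectionally synced unit tables, so independent auto‑increment sequences would produce colliding primary keys. To avoid this, a globally unique ID service replaces per‑data‑center auto‑increment strategies.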
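The source does not say how the ID service generates IDs. One common technique, swapped in here purely as an illustration, is a Snowflake‑style layout that reserves bits for the data center. Two data centers then cannot mint the same ID. The bit widths, the epoch and `SnowflakeIdGenerator` are assumptions.

```java
// Snowflake-style 64-bit ID: 41 bits of milliseconds since EPOCH, 5 bits of
// data-center id, 5 bits of worker id, 12 bits of per-millisecond sequence.
final class SnowflakeIdGenerator {
    private static final long EPOCH = 1_600_000_000_000L; // assumed custom epoch (Sep 2020)
    private static final int DC_BITS = 5;
    private static final int WORKER_BITS = 5;
    private static final int SEQ_BITS = 12;
    private static final long MAX_SEQ = (1L << SEQ_BITS) - 1;

    private final long dataCenterId;
    private final long workerId;
    private long lastMillis = -1L;
    private long sequence = 0L;

    SnowflakeIdGenerator(long dataCenterId, long workerId) {
        if (dataCenterId < 0 || dataCenterId >= (1L << DC_BITS)) {
            throw new IllegalArgumentException("dataCenterId out of range");
        }
        if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
            throw new IllegalArgumentException("workerId out of range");
        }
        this.dataCenterId = dataCenterId;
        this.workerId = workerId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMillis) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastMillis) {
            sequence = (sequence + 1) & MAX_SEQ;
            if (sequence == 0) {
                // Sequence exhausted for this millisecond: wait for the next one.
                while ((now = System.currentTimeMillis()) <= lastMillis) {
                    Thread.onSpinWait();
                }
            }
        } else {
            sequence = 0;
        }
        lastMillis = now;
        return ((now - EPOCH) << (DC_BITS + WORKER_BITS + SEQ_BITS))
                | (dataCenterId << (WORKER_BITS + SEQ_BITS))
                | (workerId << SEQ_BITS)
                | sequence;
    }

    // Recovers the data-center bits from an ID.
    static long dataCenterOf(long id) {
        return (id >>> (WORKER_BITS + SEQ_BITS)) & ((1L << DC_BITS) - 1);
    }
}
```

The data‑center bits separate the two data centers' key spaces, so uniqueness does not depend on coordination at insert time. The trade‑off is reliance on reasonably monotonic clocks.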
2.5 Redis
Two Redis clusters are used: a center cluster (only in the center) and a unit cluster (deployed in both data centers). The application configures the data source mode (center or unit) and selects the matching RedisTemplate bean. Cache invalidation is driven by binlog subscription, which keeps caches consistent across data centers.
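A minimal sketch of both ideas follows. An in‑memory map stands in for Redis, so the example runs without Spring or a Redis server. In the sketch:

- `CacheClient` plays the role of a RedisTemplate bean.
- `CacheSelector` picks a cluster by the configured mode.
- `BinlogCacheInvalidator` deletes the cached entry when a row‑change event arrives.

The event shape and the `table:id` key convention are hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

enum RedisMode { CENTER, UNIT }

// Stand-in for a RedisTemplate; only the operations this sketch needs.
interface CacheClient {
    void set(String key, String value);
    Optional<String> get(String key);
    void delete(String key);
}

final class InMemoryCache implements CacheClient {
    private final Map<String, String> data = new ConcurrentHashMap<>();
    public void set(String key, String value) { data.put(key, value); }
    public Optional<String> get(String key) { return Optional.ofNullable(data.get(key)); }
    public void delete(String key) { data.remove(key); }
}

// Chooses a cluster by configured mode, like choosing one of two RedisTemplate beans.
final class CacheSelector {
    private final Map<RedisMode, CacheClient> clients;

    CacheSelector(CacheClient center, CacheClient unit) {
        clients = Map.of(RedisMode.CENTER, center, RedisMode.UNIT, unit);
    }

    CacheClient forMode(RedisMode mode) { return clients.get(mode); }
}

// Hypothetical binlog row-change event: table name plus primary key.
final class RowChange {
    final String table;
    final long id;

    RowChange(String table, long id) {
        this.table = table;
        this.id = id;
    }
}

// Runs in each data center and evicts keys from that data center's unit cluster
// whenever its local database (including synced rows) changes.
final class BinlogCacheInvalidator {
    private final CacheClient localUnitCache;

    BinlogCacheInvalidator(CacheClient localUnitCache) { this.localUnitCache = localUnitCache; }

    static String keyFor(String table, long id) { return table + ":" + id; } // assumed convention

    void onRowChange(RowChange change) { localUnitCache.delete(keyFor(change.table, change.id)); }
}
```

The sketch deletes the key instead of updating it, which keeps the subscriber simple. The next read repopulates the cache from the local database copy.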
2.6 RocketMQ
Four message subscription modes are introduced:
Center Subscription: All messages are consumed in the center.
Unit Subscription: Messages are filtered by sharding key and consumed only in the owning data center.
Full‑Unit Subscription: Every data center consumes all messages.
Normal Subscription: The default; messages are consumed in the nearest data center.
Producers must embed buyerId for unit‑specific messages, and consumers configure the desired subscription mode.
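The consumer‑side decision for each mode can be sketched as a single filter. The sketch makes three assumptions:

- The producer attaches buyerId to the message, for example as a message property.
- The consumer can look up the current buyerId‑to‑data‑center rule.
- Normal Subscription means "consume in the data center where the message was produced".

`SubscriptionFilter` and `ownerOf` are hypothetical.

```java
import java.util.function.LongFunction;

enum SubscriptionMode { CENTER, UNIT, FULL_UNIT, NORMAL }

enum DataCenter { CENTER, UNIT }

// Decides, on the consumer side, whether this data center should process a message.
final class SubscriptionFilter {
    private final DataCenter local;
    private final LongFunction<DataCenter> ownerOf; // current buyerId -> data center rule

    SubscriptionFilter(DataCenter local, LongFunction<DataCenter> ownerOf) {
        this.local = local;
        this.ownerOf = ownerOf;
    }

    // producedIn: data center that sent the message; buyerId: null if none was embedded.
    boolean shouldConsume(SubscriptionMode mode, DataCenter producedIn, Long buyerId) {
        switch (mode) {
            case CENTER:
                return local == DataCenter.CENTER;
            case UNIT:
                if (buyerId == null) {
                    throw new IllegalArgumentException("unit subscription requires buyerId on the message");
                }
                return ownerOf.apply(buyerId) == local;
            case FULL_UNIT:
                return true;
            default: // NORMAL, read here as "consume where the message was produced"
                return producedIn == local;
        }
    }
}
```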
2.7 Job & TOC (Timeout Center)
Batch jobs run mainly in the center; unit‑specific jobs can run in both data centers, with each data center processing only the buyers it owns. The TOC service schedules delayed actions (e.g., automatic order cancellation) and requires a buyerId to route each callback to the correct data center.
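The buyerId requirement for TOC callbacks can be illustrated with `java.util.concurrent.DelayQueue`, which releases only elements whose delay has expired. `TimeoutTask`, `TimeoutCenter` and the string‑formatted dispatch results are hypothetical. A real TOC would persist tasks and issue an RPC to the owning data center instead of returning a description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;
import java.util.function.LongFunction;

enum DataCenter { CENTER, UNIT }

// A delayed action such as "cancel unpaid order"; buyerId is mandatory for routing.
final class TimeoutTask implements Delayed {
    final String action;
    final long orderId;
    final long buyerId;
    private final long dueAtMillis;

    TimeoutTask(String action, long orderId, long buyerId, long dueAtMillis) {
        this.action = action;
        this.orderId = orderId;
        this.buyerId = buyerId;
        this.dueAtMillis = dueAtMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(dueAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        if (other instanceof TimeoutTask) {
            return Long.compare(dueAtMillis, ((TimeoutTask) other).dueAtMillis);
        }
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

final class TimeoutCenter {
    private final DelayQueue<TimeoutTask> queue = new DelayQueue<>();
    private final LongFunction<DataCenter> ownerOf; // buyerId -> owning data center

    TimeoutCenter(LongFunction<DataCenter> ownerOf) {
        this.ownerOf = ownerOf;
    }

    void schedule(TimeoutTask task) {
        queue.put(task);
    }

    // Removes every task whose delay has expired and routes its callback by buyerId.
    List<String> drainDue() {
        List<String> dispatched = new ArrayList<>();
        TimeoutTask task;
        while ((task = queue.poll()) != null) {
            DataCenter target = ownerOf.apply(task.buyerId);
            dispatched.add(task.action + ":" + task.orderId + "->" + target);
        }
        return dispatched;
    }
}
```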
3. Service Classification
Services are divided into:
Center Services: Deployed only in the center data center; use the center DB.
Unit Services: Deployed in both data centers; use the unit DB and require buyerId as the first method parameter.
Center‑Unit Services: Expose both center and unit APIs; their databases are split accordingly (center‑unit DB).
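Unit services must take buyerId as their first parameter. This convention can be checked mechanically, and the router can then read the routing key directly from the call arguments. The sketch below uses plain reflection. `UnitServiceChecker` and `CartService` are hypothetical. The sketch checks only the parameter type, because parameter names are not available at runtime unless classes are compiled with `-parameters`.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

final class UnitServiceChecker {
    // Lists methods whose first parameter is not a long/Long (the assumed buyerId type).
    static List<String> violations(Class<?> unitServiceInterface) {
        List<String> bad = new ArrayList<>();
        for (Method m : unitServiceInterface.getMethods()) {
            Class<?>[] params = m.getParameterTypes();
            boolean ok = params.length > 0
                    && (params[0] == long.class || params[0] == Long.class);
            if (!ok) {
                bad.add(m.getName());
            }
        }
        return bad;
    }

    // At call time, the routing key is simply the first argument.
    static long buyerIdOf(Object[] args) {
        if (args == null || args.length == 0 || !(args[0] instanceof Long)) {
            throw new IllegalArgumentException("first argument must be buyerId");
        }
        return (Long) args[0];
    }
}

// Hypothetical unit service; globalCount() breaks the convention.
interface CartService {
    String listItems(long buyerId);
    void clear(long buyerId, String reason);
    int globalCount();
}
```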
4. Cut‑Over Procedure
Step 1: Issue write‑block rules via the multi‑active control console.
Step 2: Rainbow Bridge detects the rule change and blocks writes whose sharding keys fall under the rules.
Step 3: Confirmation that the rules are enforced is sent back to the control console.
Step 4: The control console notifies Otter (the data‑sync tool) of the effective timestamp.
Step 5: Otter synchronizes data up to that timestamp.
Step 6: Otter reports completion; the control console then updates the traffic rules in DLB, RPC, and Rainbow Bridge.
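The most important part of the cut‑over is the ordering. Traffic must not move until writes are blocked and sync has caught up to the effective timestamp. The following minimal orchestration sketch uses hypothetical interfaces in place of Rainbow Bridge (`WriteBlocker`), Otter (`DataSync`) and the routing layers (`RouteTable`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical interfaces standing in for Rainbow Bridge, Otter and the routing layers.
interface WriteBlocker {
    // Applies the write-block rule and returns its effective timestamp once enforced.
    long blockWrites(Set<Integer> slots);
}

interface DataSync {
    boolean syncedUpTo(long timestamp);
}

interface RouteTable {
    void moveSlots(Set<Integer> slots, String targetDataCenter);
}

final class CutOverOrchestrator {
    private final WriteBlocker blocker;
    private final DataSync sync;
    private final List<RouteTable> routeTables; // e.g. DLB, RPC, Rainbow Bridge
    final List<String> log = new ArrayList<>();

    CutOverOrchestrator(WriteBlocker blocker, DataSync sync, List<RouteTable> routeTables) {
        this.blocker = blocker;
        this.sync = sync;
        this.routeTables = List.copyOf(routeTables);
    }

    void cutOver(Set<Integer> slots, String target, long pollMillis, int maxPolls) {
        // Block writes for the affected slots and wait for enforcement feedback.
        long effectiveAt = blocker.blockWrites(slots);
        log.add("blocked at " + effectiveAt);

        // Wait until data sync has caught up to the effective timestamp.
        int polls = 0;
        while (!sync.syncedUpTo(effectiveAt)) {
            if (++polls > maxPolls) {
                // Writes stay blocked; traffic is not moved onto a lagging copy.
                throw new IllegalStateException("sync did not reach " + effectiveAt);
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("cut-over interrupted", e);
            }
        }
        log.add("synced up to " + effectiveAt);

        // Only now switch the traffic rules everywhere.
        for (RouteTable table : routeTables) {
            table.moveSlots(slots, target);
        }
        log.add("routes moved to " + target);
    }
}
```

If sync never catches up, the sketch throws while writes are still blocked. This trades a longer write outage for never routing traffic to a data center whose copy is behind.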
5. Summary
The multi‑active transformation touches middleware (SLB, DLB, RPC, Nacos, ShardingSphere, Redis, RocketMQ), business logic (interface annotations, sharding keys, job routing), and operational processes (cut‑over, write‑blocking, data sync). While the approach greatly improves availability, it introduces significant complexity and requires careful service‑level design, thorough testing, and incremental rollout.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.