How Xianyu Achieved Multi‑Region High Availability: Architecture, Challenges, and Solutions

This article details Xianyu's transition from a single‑site to a multi‑region deployment, covering scalability limits, disaster‑recovery strategies, traffic routing, data consistency decisions, service and database architecture, and the operational principles that enable low‑cost, high‑availability scaling across regions.

dbaplus Community

Dec 8, 2021

Background

The Xianyu recommendation service for homepage and search faced scalability limits and disaster‑recovery challenges because a single IDC could not keep up with traffic growth. Physical constraints (servers, power) forced model updates to wait for old models to be retired, and a single‑site failure would break the main recommendation pipeline.

Common High‑Availability Architectures

Same‑city active‑active: Two data centers in the same city share traffic. It protects against power or network failures but cannot survive regional disasters and offers limited scalability.

Cross‑region disaster recovery: Resources are duplicated in another region (hot or cold) but do not serve traffic. When a region fails, traffic is switched to the backup. This approach wastes resources and introduces cross‑region latency.

Cross‑region active‑active: Multiple regions and data centers serve traffic simultaneously without a primary‑backup concept. This yields high resource utilization and better scalability, making it the natural choice for algorithmic workloads.

Impact of Moving to Multi‑Region

Transitioning from a single‑site to a multi‑region deployment is not a simple copy‑paste. The following factors increase system complexity:

Traffic scheduling – how incoming traffic is distributed to the newly added regions.

Traffic closed‑loop – ensuring that a request completes entirely within the region it entered, so that cross‑region physical latency is not paid on every call.

Data consistency – handling synchronization delays for latency‑sensitive data (e.g., transactions).

Disaster cut‑over – switching traffic without data loss when a region fails.

Multi‑Region Deployment Plan

Key Challenges

Network latency across regions is typically >20 ms (vs. <1 ms intra‑city). Reducing its impact on the recommendation chain is essential.

Maintaining a closed‑loop for traffic within a region while balancing cost.

Increased architectural complexity for traffic routing, service routing, and data synchronization.

Designing routing rules that identify request sources and control destinations.

Establishing deployment standards to avoid rapid architectural decay.

Implementing traffic‑control rules that converge quickly during cut‑over.
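The first challenge can be made concrete with a back‑of‑the‑envelope estimate. The sketch below is only illustrative: it assumes the downstream calls are sequential, takes 20 ms and 1 ms from the figures above as round‑trip times, and uses a hypothetical request with 10 downstream calls.

```python
def chain_latency_ms(local_calls: int, remote_calls: int,
                     local_rtt_ms: float = 1.0,
                     remote_rtt_ms: float = 20.0) -> float:
    """Network time added by a chain of sequential calls (round trips only)."""
    return local_calls * local_rtt_ms + remote_calls * remote_rtt_ms


# Hypothetical recommendation request with 10 sequential downstream calls.
all_local = chain_latency_ms(local_calls=10, remote_calls=0)      # 10.0 ms
three_remote = chain_latency_ms(local_calls=7, remote_calls=3)    # 67.0 ms

print(f"all local: {all_local} ms, three cross-region hops: {three_remote} ms")
```

Under these assumptions, moving just three of the ten calls across regions adds 57 ms, which is why keeping the chain inside one region matters more than optimizing any single call.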

Data‑Split Decision

After multi‑region deployment, data resides in multiple regions and synchronization introduces latency. Whether to split user data depends on consistency requirements:

If short‑term inconsistency is tolerable (e.g., most recommendation reads), data can remain unified.

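For write‑heavy scenarios in which a user reads back their own write right away, such as adding an item to a shopping cart, the write must be visible immediately; otherwise the user experience degrades. In such cases the data must be split so that each user's reads and writes are served by the same region.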

Because Xianyu’s recommendation flow can tolerate brief inconsistency and does not involve cross‑region writes, the team chose not to split user data.

Deployment Architecture

Physically, all regions are peers (there is no primary‑backup relationship). Logically, one region is designated as the central region because some long‑tail dependencies cannot be deployed everywhere.

Traffic Routing Scheme

Three routing principles were evaluated:

Fully random – simple but can cause data inconsistency across regions.

Proximity (region‑nearest) – lowest latency but leads to traffic imbalance.

User‑based split – the same user is always routed to a specific region, guaranteeing consistency.

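Although the recommendation flow tolerates short‑term inconsistency, the first two were rejected: fully random routing would bounce the same user between regions and repeatedly expose synchronization lag, and proximity routing would leave regions unevenly loaded. The user‑based approach was adopted.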
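A user‑based split can be sketched as a stable hash of the user ID mapped onto a fixed number of buckets, with bucket ranges assigned to regions. This is a minimal illustration, not Xianyu's actual rule: the region names, the bucket count, and the choice of SHA‑256 are all assumptions.

```python
import hashlib

REGIONS = ["region-a", "region-b"]  # hypothetical region names


def region_for_user(user_id: str, regions=REGIONS, buckets: int = 100) -> str:
    """Map a user to a region deterministically.

    The user ID is hashed into one of `buckets` slots, and contiguous
    slot ranges are assigned to regions in order.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % buckets
    return regions[bucket * len(regions) // buckets]


# The same user always lands in the same region.
print(region_for_user("user-42") == region_for_user("user-42"))
```

Routing by bucket rather than directly by region makes it possible to shift a share of users between regions by reassigning buckets, without changing the hash.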

Implementation options:

DNS‑level routing – high cost, requires separate domain, slow rule convergence.

Unified access‑layer routing – low cost, reuses existing logic, minor cross‑region traffic.

Edge gateway – independent of runtime, supports app/mini‑program/Web, fast convergence, but requires building from scratch.

Considering cost and maturity, the second option (unified access‑layer routing) was selected.
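One way to picture access‑layer routing is as a versioned rule table that every gateway node evaluates locally, so that a cut‑over amounts to publishing a new version of the table. The sketch below is an assumption‑laden illustration: `RoutingRule`, `make_rule`, `cut_over`, `route`, and the region names are hypothetical and do not describe Xianyu's access layer.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRule:
    version: int
    bucket_to_region: tuple  # index = bucket number, value = region name


def make_rule(version: int, regions: list, buckets: int = 100) -> RoutingRule:
    """Assign contiguous bucket ranges to regions in order."""
    mapping = tuple(regions[b * len(regions) // buckets] for b in range(buckets))
    return RoutingRule(version, mapping)


def cut_over(rule: RoutingRule, failed_region: str) -> RoutingRule:
    """Return a new rule that moves the failed region's buckets to survivors."""
    survivors = sorted(set(rule.bucket_to_region) - {failed_region})
    if not survivors:
        raise ValueError("no surviving region to take over traffic")
    new_map = list(rule.bucket_to_region)
    moved = 0
    for bucket, region in enumerate(new_map):
        if region == failed_region:
            new_map[bucket] = survivors[moved % len(survivors)]
            moved += 1
    return RoutingRule(rule.version + 1, tuple(new_map))


def route(rule: RoutingRule, user_id: str) -> str:
    """Pick the region for a request based on the user's bucket."""
    bucket = zlib.crc32(user_id.encode("utf-8")) % len(rule.bucket_to_region)
    return rule.bucket_to_region[bucket]


rule_v1 = make_rule(1, ["center", "unit-1"])
rule_v2 = cut_over(rule_v1, "unit-1")  # simulate losing "unit-1"
print(rule_v2.version, route(rule_v2, "user-42"))
```

Only users whose buckets belonged to the failed region are moved; everyone else keeps their region, which limits how many users see synchronization lag during the switch.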

Full‑Link Upgrade

The upgrade adapts the system from single‑site to multi‑region, covering:

Application code refactor: Identify dependencies that can be multi‑region and mitigate latency impact.

Service traffic routing policies: Ensure intra‑region traffic loops and automatic failover.

Traffic correction: Monitor anomalies and adjust routing in real time.

External traffic correction: Force non‑recommendation traffic back to the central region.

Strong consistency enforcement was omitted because short‑term inconsistency is acceptable.

Cache refactor: Writes go to the central master node; replication synchronizes caches to other regions.

Consistency handling: Client‑side caching for weakly consistent data; eventual consistency for persisted data.

Service Cluster Deployment

Micro‑service clusters are deployed equally across regions. Two discovery mechanisms are used:

HSF‑based services with configserver per region, isolating traffic.

HTTP‑based algorithm services using either region‑specific Zookeeper or a global vipserver load balancer.
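Whatever the discovery mechanism, the selection policy that keeps traffic inside a region while still allowing failover can be sketched generically: prefer healthy endpoints in the caller's region and fall back to other regions only when none are left. The code below is not the HSF, configserver, Zookeeper, or vipserver API; `Endpoint` and `pick_endpoint` are hypothetical names used only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Endpoint:
    address: str
    region: str
    healthy: bool = True


def pick_endpoint(endpoints: list, caller_region: str, rng=random) -> Endpoint:
    """Prefer a healthy endpoint in the caller's region; otherwise fail over."""
    healthy = [e for e in endpoints if e.healthy]
    local = [e for e in healthy if e.region == caller_region]
    candidates = local or healthy  # cross-region only when no local endpoint is left
    if not candidates:
        raise RuntimeError("no healthy endpoint available")
    return rng.choice(candidates)


endpoints = [
    Endpoint("10.0.0.1:8080", "center"),
    Endpoint("10.0.1.1:8080", "unit-1"),
]
print(pick_endpoint(endpoints, "unit-1").address)
```

With one registry per region, the local filter is effectively built into discovery itself; with a global load balancer, a policy like this has to be applied explicitly.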

Cache usage is categorized into three patterns:

Read‑through cache: Relieves DB pressure; strong consistency not required.

Persistent cache (e.g., distributed lock, counter): Requires strong consistency; writes go to the central master, reads may be local.

Region‑center synchronization: Ensures eventual consistency for persisted data.
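The write‑to‑center, read‑locally pattern described above can be modeled with a small in‑memory simulation. It is a sketch under stated assumptions: `ReplicatedCache` is a hypothetical class, replication is modeled as an explicit `replicate()` step rather than a real asynchronous channel, and the region names are made up.

```python
class ReplicatedCache:
    """Toy model: writes hit the central store, reads hit the local store."""

    def __init__(self, central_region: str, regions: list):
        self.central_region = central_region
        self.stores = {region: {} for region in regions}  # must include the center
        self.pending = []  # writes not yet copied to the other regions

    def write(self, key, value):
        # All writes go to the central master first.
        self.stores[self.central_region][key] = value
        self.pending.append((key, value))

    def read(self, region: str, key, default=None):
        # Reads are served locally and may be stale until replication runs.
        return self.stores[region].get(key, default)

    def replicate(self) -> int:
        """Apply pending writes to every non-central region, in write order."""
        applied = len(self.pending)
        for key, value in self.pending:
            for region, store in self.stores.items():
                if region != self.central_region:
                    store[key] = value
        self.pending.clear()
        return applied


cache = ReplicatedCache("center", ["center", "unit-1"])
cache.write("item:1", "v1")
stale = cache.read("unit-1", "item:1")   # not replicated yet
cache.replicate()
fresh = cache.read("unit-1", "item:1")   # now visible in the unit region
print(stale, fresh)
```

The window between `write` and `replicate` is the short‑term inconsistency that the recommendation flow is said to tolerate.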

Database Deployment

According to the CAP theorem, a distributed database cannot guarantee consistency, availability, and partition tolerance all at once; because network partitions between regions cannot be ruled out, the practical choice is between consistency and availability. Xianyu’s recommendation flow is read‑heavy and can tolerate brief inconsistency, and the existing storage is MySQL. Therefore a master‑slave replication mode was chosen, providing high availability and partition tolerance while sacrificing strong consistency.
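With master‑slave replication, the application side typically needs a rule for which statements go to the master and which can be served by a replica in the caller's region. The sketch below is one simple way to express such a rule; the connection strings, region names, and the `choose_dsn` helper are hypothetical, and the statement classification is deliberately naive.

```python
# Hypothetical connection strings; ".example" hosts are placeholders.
PRIMARY = "mysql://primary.center.example:3306/rec"
REPLICAS = {
    "center": "mysql://replica.center.example:3306/rec",
    "unit-1": "mysql://replica.unit-1.example:3306/rec",
}


def choose_dsn(sql: str, caller_region: str) -> str:
    """Send plain reads to the local replica and everything else to the master."""
    stripped = sql.lstrip()
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    # Locking reads must see the latest data, so they also go to the master.
    is_plain_read = first_word == "SELECT" and "FOR UPDATE" not in stripped.upper()
    if is_plain_read:
        # Fall back to the master if the region has no replica configured.
        return REPLICAS.get(caller_region, PRIMARY)
    return PRIMARY


print(choose_dsn("SELECT * FROM item WHERE id = 1", "unit-1"))
print(choose_dsn("UPDATE item SET title = 'x' WHERE id = 1", "unit-1"))
```

Reads served this way can lag behind the master by the replication delay, which is the consistency being traded for availability.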

Conclusion

Network latency is the primary challenge of multi‑region deployment. Two guiding principles were applied:

Keep traffic closed‑loop within each region to limit cross‑region latency impact.

Prioritize availability over strong consistency.

By enforcing these principles at the access layer, service layer, and storage layer, Xianyu now runs a two‑site three‑data‑center deployment that handles live traffic with low cost and good scalability. Strong‑consistency scenarios remain difficult, and maintaining the architecture as the business evolves is an ongoing concern.
