How Xianyu Achieved Multi‑Region High Availability: Architecture, Challenges, and Solutions
This article details Xianyu's transition from a single‑site to a multi‑region deployment, covering scalability limits, disaster‑recovery strategies, traffic routing, data consistency decisions, service and database architecture, and the operational principles that enable low‑cost, high‑availability scaling across regions.
Background
The Xianyu recommendation service for homepage and search faced scalability limits and disaster‑recovery challenges because a single IDC could not keep up with traffic growth. Physical constraints (servers, power) forced model updates to wait for old models to be retired, and a single‑site failure would break the main recommendation pipeline.
Common High‑Availability Architectures
Same‑city active‑active : Two data centers in the same city share traffic. It protects against power or network failures but cannot survive regional disasters and offers limited scalability.
Cross‑region disaster recovery : Resources are duplicated in another region (hot or cold) but do not serve traffic. When a region fails, traffic is switched to the backup. This approach wastes resources and introduces cross‑region latency.
Cross‑region active‑active : Multiple regions and data centers serve traffic simultaneously without a primary‑backup concept. This yields high resource utilization and better scalability, making it the natural choice for algorithmic workloads.
Impact of Moving to Multi‑Region
Transitioning from a single‑site to a multi‑region deployment is not a simple copy‑paste. The following factors increase system complexity:
Traffic scheduling – how traffic is distributed to the newly added regions as the system expands.
Traffic closed‑loop – ensuring that a request is served entirely within one region, so that cross‑region latency does not accumulate along the call chain.
Data consistency – handling synchronization delays for latency‑sensitive data (e.g., transactions).
Disaster cut‑over – switching traffic without data loss when a region fails.
Multi‑Region Deployment Plan
Key Challenges
Network latency across regions is typically >20 ms (vs. <1 ms intra‑city). Reducing its impact on the recommendation chain is essential.
Maintaining a closed‑loop for traffic within a region while balancing cost.
Increased architectural complexity for traffic routing, service routing, and data synchronization.
Designing routing rules that identify request sources and control destinations.
Establishing deployment standards to avoid rapid architectural decay.
Implementing traffic‑control rules that converge quickly during cut‑over.
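To make the first challenge concrete, the sketch below estimates how much latency a call chain accumulates when some of its sequential hops cross regions. The per‑hop figures (1 ms intra‑city, 20 ms cross‑region) are taken from the bounds quoted above and used as illustrative values; the five‑hop chain is a hypothetical example, not a measurement from Xianyu.

```python
def chain_latency_ms(total_calls: int, cross_region_calls: int,
                     intra_ms: float = 1.0, cross_ms: float = 20.0) -> float:
    """Network latency of a chain of sequential downstream calls.

    intra_ms and cross_ms are illustrative per-call round-trip times.
    """
    if not 0 <= cross_region_calls <= total_calls:
        raise ValueError("cross_region_calls must be between 0 and total_calls")
    local_calls = total_calls - cross_region_calls
    return local_calls * intra_ms + cross_region_calls * cross_ms


# Hypothetical recommendation chain with five sequential downstream calls.
closed_loop = chain_latency_ms(total_calls=5, cross_region_calls=0)  # all hops local
one_leak = chain_latency_ms(total_calls=5, cross_region_calls=1)     # one hop leaves the region
all_leak = chain_latency_ms(total_calls=5, cross_region_calls=5)     # every hop leaves the region
```

Even a single cross‑region hop dominates the network budget of the chain in this model, which is why keeping traffic closed‑loop within a region matters.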
Data‑Split Decision
After multi‑region deployment, data resides in multiple regions and synchronization introduces latency. Whether to split user data depends on consistency requirements:
If short‑term inconsistency is tolerable (e.g., most recommendation reads), data can remain unified.
For read‑after‑write scenarios such as adding an item to a shopping cart, the write must be visible to the same user immediately; otherwise the user experience degrades. In such cases the user's data must be split by user and co‑located with the region that serves that user.
Because Xianyu’s recommendation flow can tolerate brief inconsistency and does not involve cross‑region writes, the team chose not to split user data.
Deployment Architecture
Physically each region is equal (no primary‑backup), but logically a central region is distinguished because some long‑tail dependencies cannot be deployed everywhere.
Traffic Routing Scheme
Three routing principles were evaluated:
Fully random – simple but can cause data inconsistency across regions.
Proximity (region‑nearest) – lowest latency but leads to traffic imbalance.
User‑based split – the same user is always routed to a specific region, guaranteeing consistency.
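Random routing was rejected because consecutive requests from the same user could land in different regions and see data at different stages of synchronization, and proximity routing was rejected because of traffic imbalance; the user‑based approach was adopted.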
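A user‑based split can be sketched as a stable hash of the user ID into a fixed number of buckets, with a table that maps bucket ranges to regions. This is a minimal illustration under assumptions: the region names, the 100‑bucket granularity, the 50/50 split, and the use of MD5 are all hypothetical choices, not details from the Xianyu implementation.

```python
import hashlib
from typing import List, Tuple

BUCKETS = 100  # assumed granularity for traffic splitting

# Each entry is (exclusive upper bucket bound, region). Hypothetical 50/50 split.
RoutingTable = List[Tuple[int, str]]
DEFAULT_TABLE: RoutingTable = [(50, "region-a"), (100, "region-b")]


def bucket_for_user(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


def region_for_user(user_id: str, table: RoutingTable = DEFAULT_TABLE) -> str:
    """Return the region whose bucket range contains this user's bucket."""
    bucket = bucket_for_user(user_id)
    for upper_bound, region in table:
        if bucket < upper_bound:
            return region
    raise ValueError("routing table does not cover all buckets")


# The same user always maps to the same region under a given table.
first = region_for_user("user-42")
second = region_for_user("user-42")

# A cut-over can be expressed as a new table that assigns every bucket to one region.
CUTOVER_TABLE: RoutingTable = [(100, "region-b")]
after_cutover = region_for_user("user-42", CUTOVER_TABLE)
```

Because the hash is independent of the table, changing the bucket ranges moves whole groups of users between regions without changing how any user is hashed.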
Implementation options:
DNS‑level routing – high cost, requires separate domain, slow rule convergence.
Unified access‑layer routing – low cost, reuses existing logic, minor cross‑region traffic.
Edge gateway – independent of runtime, supports app/mini‑program/Web, fast convergence, but requires building from scratch.
Considering cost and maturity, the second option (unified access‑layer routing) was selected.
Full‑Link Upgrade
The upgrade adapts the system from single‑site to multi‑region, covering:
Application code refactor : Identify dependencies that can be multi‑region and mitigate latency impact.
Service traffic routing policies : Ensure intra‑region traffic loops and automatic failover.
Traffic correction : Monitor anomalies and adjust routing in real time.
External traffic correction : Force non‑recommendation traffic back to the central region.
Strong consistency enforcement : Omitted, because short‑term inconsistency is acceptable.
Cache refactor : Writes go to the central master node; replication synchronizes caches to other regions.
Consistency handling : Client‑side caching for weakly consistent data; eventual consistency for persisted data.
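The two traffic‑correction items above can be sketched as a single decision at the access layer: given where a request arrived, decide whether it must be forwarded elsewhere. The names `Request`, `forward_target`, and the region labels are hypothetical, and falling back to the central region when a user's region is unhealthy is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional, Set

CENTRAL = "central"  # hypothetical label for the logical central region


@dataclass(frozen=True)
class Request:
    user_region: str         # region the routing rule assigns to this user
    is_recommendation: bool  # only recommendation traffic is multi-region


def forward_target(req: Request, arrived_in: str, healthy: Set[str]) -> Optional[str]:
    """Return the region to forward to, or None if the request may stay put."""
    if not req.is_recommendation:
        wanted = CENTRAL              # external traffic correction
    elif req.user_region in healthy:
        wanted = req.user_region      # keep the user's traffic in its own region
    else:
        wanted = CENTRAL              # assumed failover target
    return None if wanted == arrived_in else wanted


all_up = {"central", "unit-1"}
stay = forward_target(Request("unit-1", True), "unit-1", all_up)
misrouted = forward_target(Request("unit-1", True), "central", all_up)
non_reco = forward_target(Request("unit-1", False), "unit-1", all_up)
```

Here `stay` is `None`, `misrouted` is `"unit-1"`, and `non_reco` is `"central"`.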
Service Cluster Deployment
Micro‑service clusters are deployed equally across regions. Two discovery mechanisms are used:
HSF‑based services register with a per‑region configserver, which isolates traffic within the region.
HTTP‑based algorithm services are discovered through either a region‑specific ZooKeeper or a global vipserver load balancer.
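Region‑scoped discovery can be illustrated with a small in‑memory registry that returns same‑region instances first. This is a generic sketch and does not reproduce the HSF, configserver, ZooKeeper, or vipserver APIs; the class name, addresses, and the fallback to a central region when no local instance exists are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class RegionRegistry:
    """Toy service registry keyed by (service, region)."""

    def __init__(self) -> None:
        self._instances: Dict[Tuple[str, str], List[str]] = defaultdict(list)

    def register(self, service: str, region: str, address: str) -> None:
        self._instances[(service, region)].append(address)

    def lookup(self, service: str, caller_region: str,
               fallback_region: str = "central") -> List[str]:
        """Prefer instances in the caller's region; fall back only if none exist."""
        local = self._instances.get((service, caller_region), [])
        if local:
            return list(local)
        return list(self._instances.get((service, fallback_region), []))


registry = RegionRegistry()
registry.register("ranker", "central", "10.0.0.1:8080")
registry.register("ranker", "unit-1", "10.1.0.1:8080")
registry.register("long-tail-dep", "central", "10.0.0.9:8080")

local_ranker = registry.lookup("ranker", "unit-1")       # stays in unit-1
long_tail = registry.lookup("long-tail-dep", "unit-1")   # only deployed centrally
```

The second lookup shows the long‑tail case: a dependency that exists only in the central region is still reachable, at the cost of a cross‑region call.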
Cache usage is categorized into three patterns:
Read‑through cache : Relieves DB pressure; strong consistency not required.
Persistent cache (e.g., distributed lock, counter): Requires strong consistency; writes go to the central master, reads may be local.
Region‑center synchronization : Ensures eventual consistency for persisted data.
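The "write to the central master, read locally" pattern can be simulated with plain dictionaries. This is a minimal sketch, assuming a hypothetical `RegionalCache` class and an explicit `replicate()` step that stands in for asynchronous replication; it is not the cache product or replication mechanism Xianyu uses.

```python
from typing import Any, Dict, List, Optional, Tuple


class RegionalCache:
    """Toy model: one central master store plus one replica store per region."""

    def __init__(self, regions: List[str], central: str) -> None:
        if central not in regions:
            raise ValueError("central must be one of the regions")
        self.central = central
        self.stores: Dict[str, Dict[str, Any]] = {r: {} for r in regions}
        self.pending: List[Tuple[str, Any]] = []  # writes not yet replicated

    def write(self, key: str, value: Any) -> None:
        """All writes go to the central master and are queued for replication."""
        self.stores[self.central][key] = value
        self.pending.append((key, value))

    def read(self, key: str, region: str) -> Optional[Any]:
        """Reads are served from the caller's own region."""
        return self.stores[region].get(key)

    def replicate(self) -> None:
        """Apply queued writes, in order, to every non-central region."""
        for key, value in self.pending:
            for region, store in self.stores.items():
                if region != self.central:
                    store[key] = value
        self.pending.clear()


cache = RegionalCache(["central", "unit-1"], central="central")
cache.write("counter:item-7", 3)
before = cache.read("counter:item-7", "unit-1")  # replica has not caught up yet
cache.replicate()
after = cache.read("counter:item-7", "unit-1")   # replica now sees the write
```

The gap between `before` and `after` is the window of inconsistency a local read can observe; callers that cannot tolerate it would have to read from the central master instead.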
Database Deployment
According to the CAP theorem, a distributed database cannot guarantee consistency, availability, and partition tolerance all at once. Because network partitions between regions cannot be ruled out, the practical choice is between consistency and availability. Xianyu’s recommendation flow is read‑heavy and can tolerate brief inconsistency, and the existing storage is MySQL. Therefore a master‑slave replication mode was chosen, providing high availability and partition tolerance while sacrificing strong consistency.
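One common way to use master‑slave replication in a read‑heavy flow is to send writes to the master and reads to a replica in the caller's region. The sketch below shows only that routing decision; the hostnames and the `pick_endpoint` helper are hypothetical, and whether Xianyu routes reads this way is an assumption. The statement check is deliberately simplistic (for example, `SELECT ... FOR UPDATE` would need the master).

```python
from typing import Dict

# Hypothetical topology: one master in the central region, one replica per region.
TOPOLOGY: Dict[str, object] = {
    "master": "mysql-master.central.example:3306",
    "replicas": {
        "central": "mysql-replica.central.example:3306",
        "unit-1": "mysql-replica.unit-1.example:3306",
    },
}


def pick_endpoint(sql: str, local_region: str, topology: Dict[str, object]) -> str:
    """Route plain SELECTs to the local replica and everything else to the master."""
    is_read = sql.lstrip().lower().startswith("select")
    if is_read:
        replicas = topology["replicas"]
        assert isinstance(replicas, dict)
        return replicas[local_region]
    master = topology["master"]
    assert isinstance(master, str)
    return master


read_target = pick_endpoint("SELECT id FROM items WHERE seller = 1", "unit-1", TOPOLOGY)
write_target = pick_endpoint("UPDATE items SET price = 9 WHERE id = 1", "unit-1", TOPOLOGY)
```

Reads served this way may lag the master by the replication delay, which is the consistency trade‑off described above.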
Conclusion
Network latency is the primary challenge of multi‑region deployment. Two guiding principles were applied:
Keep traffic closed‑loop within each region to limit cross‑region latency impact.
Prioritize availability over strong consistency.
By enforcing these principles at the access layer, service layer, and storage layer, Xianyu now runs a two‑site three‑data‑center deployment that handles live traffic with low cost and good scalability. Strong‑consistency scenarios remain difficult, and maintaining the architecture as the business evolves is an ongoing concern.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
