
Multi-Region High Availability Architecture for Xianyu Recommendation Service

The Xianyu recommendation service was re-architected into an active-active, multi-region high-availability system spanning two regions and three data centers. The redesign routes traffic through a unified access layer, centralizes long-tail dependencies in one region, keeps data unsharded, and refactors caches and MySQL replication, all under two guiding principles: traffic closed-loop within a region and availability first. The result is lower latency, better scalability, and low-cost disaster recovery.

Xianyu Technology

Background: Xianyu's homepage and search recommendations are critical services. The original IDC's resources were saturated, causing scalability and disaster-recovery problems, especially for algorithm models that could not be deployed across regions.

Common HA patterns: (1) same-city active-active deployment, (2) cross-region disaster recovery (cold/hot standby), (3) cross-region active-active deployment. The cross-region active-active option offers the best resource utilization and scalability.

Impact of multi‑region deployment: traffic scheduling, traffic closed‑loop within a region, data consistency across regions, and disaster‑cutover mechanisms become key concerns.

Key challenges: high inter‑region latency (20 ms+), traffic balancing, increased architectural complexity, precise traffic routing, and consistent deployment standards.

Data sharding decision: because the recommendation flow tolerates short‑term inconsistency and does not involve write operations, data is not split across regions.

Deployment architecture: all regions are physically equal, but a central region hosts long‑tail dependencies that cannot be multi‑region. This ensures a fallback zone while keeping most services distributed.

Traffic routing solutions evaluated: (1) DNS‑level routing (high cost, slow convergence), (2) Unified access‑layer routing (low cost, reuse existing logic), (3) Edge‑gateway routing (flexible but requires new infrastructure). The team selected option 2.
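The chosen access-layer routing can be sketched as pinning each user to a region with a stable hash, with cutover to the central region when a region is unhealthy. This is a minimal illustration, not Xianyu's actual implementation; the region names, the MD5 hash choice, and the `route_region` function are all hypothetical.

```python
# Hypothetical sketch of unified access-layer routing (option 2), assuming
# requests are pinned to a region by a stable hash of the user ID so that
# a user's traffic stays closed-loop within one region.
import hashlib

REGIONS = ["region-a", "region-b"]   # hypothetical region names
CENTRAL_REGION = "region-a"          # hosts long-tail dependencies

def route_region(user_id: str, healthy_regions=None) -> str:
    """Pick a region for this user; fall back to the central region
    if the hashed-to region is marked unhealthy (disaster cutover)."""
    healthy = healthy_regions if healthy_regions is not None else set(REGIONS)
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    region = REGIONS[int(digest, 16) % len(REGIONS)]
    return region if region in healthy else CENTRAL_REGION
```

Because the hash is stable, the same user always lands in the same region, which is what keeps downstream state (caches, session data) region-local.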

Full‑link upgrade includes: application code refactoring for multi‑region compatibility, service‑level routing policies, traffic correction at each hop, and external‑traffic isolation to the central region.
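Traffic correction at each hop can be sketched as every service comparing the region tag a request carries against its own region and forwarding mismatched requests home. A minimal sketch, assuming a hypothetical `region` field on requests; the function name and tagging scheme are illustrative, not from the source.

```python
# Hypothetical sketch of per-hop traffic correction: each service checks
# the region tag on a request against its own region and, on mismatch,
# marks the request for forwarding back to the correct region.
LOCAL_REGION = "region-a"   # hypothetical: the region this instance runs in

def correct_traffic(request: dict, local_region: str = LOCAL_REGION):
    """Return (request, forward_to). forward_to is None when the request
    already belongs to this region and can be served locally."""
    target = request.get("region")
    if target is None:
        # Untagged request: adopt the local region to keep the loop closed.
        return {**request, "region": local_region}, None
    if target == local_region:
        return request, None
    return request, target   # mis-routed: forward to its home region
```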

Service cluster deployment falls into three categories: (1) HSF services with configserver, which are region-isolated; (2) HTTP services registered in ZooKeeper, which are region-specific; and (3) HTTP services registered in vipserver, which are global. This split preserves traffic isolation while retaining fallback capability.

Cache refactor principles: differentiate between non‑persistent caches (no change) and persistent caches (force writes to the central master, read locally when possible) to avoid cross‑region inconsistency.
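The persistent-cache rule above can be sketched as a write-through-to-central, read-local policy. This is a toy in-memory stand-in under stated assumptions: the `PersistentCache` class and its dict-backed "central" and "local" stores are hypothetical, and `replicate()` stands in for asynchronous cross-region replication.

```python
# Hypothetical sketch of the persistent-cache rule: all writes are forced
# to the central region's master, while reads are served locally and fall
# back to the central copy on a miss.
class PersistentCache:
    def __init__(self):
        self.central = {}   # stands in for the central master cache
        self.local = {}     # stands in for this region's replica

    def put(self, key, value):
        # Writes go only to the central master, avoiding cross-region
        # write conflicts; local copies catch up asynchronously.
        self.central[key] = value

    def replicate(self):
        # Stand-in for asynchronous central -> local replication.
        self.local.update(self.central)

    def get(self, key):
        # Read locally when possible; fall back to central on a miss.
        if key in self.local:
            return self.local[key]
        return self.central.get(key)
```

Non-persistent caches need no such treatment because stale or missing entries are simply recomputed.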

Database deployment follows the CAP trade-off: the team chose MySQL master-slave replication, favoring availability and partition tolerance and accepting eventual consistency, over bidirectional replication or Paxos-based consensus solutions.
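This availability-first layout implies a simple routing rule: writes always go to the single master in the central region, reads stay on the local replica. A minimal sketch; the `RegionAwareDB` class and the endpoint naming are hypothetical, not the actual Xianyu data-access layer.

```python
# Hypothetical sketch of the availability-first MySQL layout: one writable
# master in the central region, read-only slaves in every region. Reads
# stay local (eventual consistency accepted); only writes cross regions.
class RegionAwareDB:
    def __init__(self, local_region: str, central_region: str):
        self.local_region = local_region
        self.central_region = central_region

    def endpoint_for(self, operation: str) -> str:
        # Writes are routed to the central master; reads stay local,
        # trading freshness for availability and low read latency.
        if operation == "write":
            return f"mysql-master.{self.central_region}"
        return f"mysql-slave.{self.local_region}"
```

Because the recommendation flow is read-heavy and tolerates short-term inconsistency, the 20 ms+ cross-region penalty is paid only on the rare write path.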

Summary: The primary obstacle of multi‑region deployment is network latency. The design adheres to two principles – traffic closed‑loop within a region and availability‑first – and applies them across access, service, and storage layers. Xianyu now runs a two‑region three‑datacenter setup with active‑active capability, low‑cost scalability, and robust disaster recovery.

Tags: System Architecture, High Availability, Traffic Routing, Database Replication, Multi-Region Deployment
Written by

Xianyu Technology

Official account of the Xianyu technology team
