Evolution of Ctrip's Database High‑Availability and Disaster‑Recovery Architecture (1999‑2018)
This article traces the evolution of Ctrip's database high‑availability and disaster‑recovery architecture: from simple SQL Server mirroring in the early years, through SAN‑based clustering and AlwaysOn, to the adoption of MySQL, Redis, MHA, and a one‑click DR automation tool. It highlights the architectural decisions, challenges, and operational lessons learned along the way.
Author Bio: Gao Deguang, senior database manager at Ctrip Technology Assurance Center, responsible for database operations, high‑availability (HA) and disaster‑recovery (DR) for SQL Server, MySQL, and Redis.
Website stability is critical; prolonged outages cause revenue loss and customer churn. Database HA/DR is a key component of Ctrip's overall high‑availability strategy.
1.0 Era (1999‑2008) – The company primarily used SQL Server. The architecture was simple: database mirroring, with multiple primary databases sharing a single secondary (mirror) server. Recovering from a failure meant manually restarting the primary or manually switching to the mirror. This kept costs low, but recovery was slow and the setup fell short of true HA.
2.0 Era (2008‑2012) – Rapid business growth led to the adoption of SAN shared storage, SQL Server replication (publisher to distributor to subscribers) for read/write separation, and Failover Cluster for HA. DR still relied on mirroring. This architecture introduced automatic failover (≈2 minutes) and a read‑only replica for BI and backup verification.
Over time the design grew complex: replication chains became tangled and the databases depended heavily on SAN. This prompted a move to Microsoft AlwaysOn Availability Groups (introduced in SQL Server 2012), with SSDs replacing SAN.
3.0 Era (2012‑2014) – AlwaysOn matured, supporting up to eight readable replicas with low latency. This eliminated the need for separate read‑only databases and moved backup load off the primary.
4.0 Era (2014‑2018) – Recognizing the closed‑source nature of AlwaysOn, Ctrip gradually introduced open‑source MySQL and Redis. MySQL HA/DR was built with MHA (Master High Availability), using domain/virtual IP failover and dynamic data‑source routing to mitigate split‑brain risks. Redis HA/DR leveraged the in‑house CRedis middleware and Sentinel, with multi‑group sharding and cross‑IDC replication via XPipe. A one‑click DR automation tool covers failures of a single cluster, a whole business line, or an entire IDC.
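The role of dynamic data‑source routing in limiting split‑brain can be illustrated with a minimal sketch. It is not MHA's or Ctrip's actual implementation: the `MasterRecord` and `DataSourceRouter` names, and the idea of an `epoch` counter that increases on every failover, are assumptions made for illustration. The point is only that clients accept a new write target when it is strictly newer than the one they already hold, so a recovered old master cannot reclaim writes.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MasterRecord:
    """Hypothetical routing entry announced after a failover."""
    cluster: str
    host: str
    epoch: int  # assumed to be incremented on every failover


class DataSourceRouter:
    """Keeps the newest master per cluster and ignores stale announcements."""

    def __init__(self) -> None:
        self._masters: Dict[str, MasterRecord] = {}

    def publish(self, record: MasterRecord) -> bool:
        """Apply an announcement; return False if it is stale or a duplicate."""
        current = self._masters.get(record.cluster)
        if current is not None and record.epoch <= current.epoch:
            return False  # an older master must not win back the write role
        self._masters[record.cluster] = record
        return True

    def write_target(self, cluster: str) -> str:
        """Return the host that writes for this cluster should go to."""
        return self._masters[cluster].host


router = DataSourceRouter()
router.publish(MasterRecord("orders", "db-a", epoch=1))
router.publish(MasterRecord("orders", "db-b", epoch=2))  # failover to db-b
router.publish(MasterRecord("orders", "db-a", epoch=1))  # late, stale: ignored
print(router.write_target("orders"))
```

In this sketch the stale announcement from `db-a` is rejected, so writes keep going to `db-b`.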
The DR tool builds switch plans from metadata, generates work orders for forced or rehearsal switches, and supports concurrent batch operations. The tool is itself highly available: it keeps working as long as one IDC is up.
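A rough sketch of how a metadata‑driven switch plan with concurrent batch execution might look is given below. Everything in it is hypothetical: the `Cluster` and `SwitchStep` fields, the `build_plan` and `execute` functions, and the sample cluster names are assumptions, not the real tool's schema or API. It shows only the general shape: filter cluster metadata by scope (single cluster, business line, or IDC), emit one step per cluster, then run the steps in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Cluster:
    """Hypothetical metadata row; the field names are assumptions."""
    name: str
    business_line: str
    primary_idc: str
    standby_idc: str


@dataclass(frozen=True)
class SwitchStep:
    cluster: str
    from_idc: str
    to_idc: str
    rehearsal: bool


def build_plan(clusters: List[Cluster], *, cluster: Optional[str] = None,
               business_line: Optional[str] = None, idc: Optional[str] = None,
               rehearsal: bool = False) -> List[SwitchStep]:
    """Select clusters matching every given scope filter and plan their switch."""
    plan = []
    for c in clusters:
        if cluster is not None and c.name != cluster:
            continue
        if business_line is not None and c.business_line != business_line:
            continue
        if idc is not None and c.primary_idc != idc:
            continue
        plan.append(SwitchStep(c.name, c.primary_idc, c.standby_idc, rehearsal))
    return plan


def execute(plan: List[SwitchStep], switch_one: Callable[[SwitchStep], str],
            max_workers: int = 4) -> List[str]:
    """Run the steps concurrently; results come back in plan order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(switch_one, plan))


clusters = [
    Cluster("orders", "hotel", "idc-a", "idc-b"),
    Cluster("pay", "flight", "idc-a", "idc-b"),
    Cluster("cache", "hotel", "idc-b", "idc-a"),
]
# Rehearse the loss of idc-a: every cluster whose primary lives there is switched.
plan = build_plan(clusters, idc="idc-a", rehearsal=True)
print(execute(plan, lambda s: f"{s.cluster}: {s.from_idc} -> {s.to_idc}"))
```

Here `switch_one` stands in for whatever actually performs a switch. Passing it as a callable keeps the planning logic testable without touching a database.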
Overall, Ctrip's database layer evolved from simple, manually managed mirroring to a sophisticated, automated, multi‑technology ecosystem that dramatically improved stability, availability, and operational efficiency.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.