CTrip's Redis Governance: Architecture, Scaling, and Hybrid Cloud Practices
This article traces how CTrip's Redis management evolved: from early manual deployments on physical servers and a custom CRedis client, through automated scaling, containerization on Kubernetes, secondary scheduling algorithms, and Intel Optane integration, to hybrid‑cloud deployment. It highlights the operational challenges, the solutions adopted, and the resulting performance and cost outcomes.
Background
CTrip began using Redis in 2013, initially alongside Memcached. By 2017 Memcached had been fully retired, and Redis usage had grown to tens of thousands of instances and hundreds of terabytes of data, creating significant operational challenges for DBAs in capacity scaling and high availability.
The company developed a proprietary CRedis client that abstracts the physical deployment, allowing applications to access Redis by logical name without needing to know the underlying topology.
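The idea of addressing Redis by logical name can be illustrated with a minimal sketch. This is not the CRedis API, which the article does not describe; the `LogicalClusterRegistry` class, the `order-cache` name, and the IP addresses are all hypothetical. It only shows why applications are unaffected when operators change the physical topology behind a name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


class LogicalClusterRegistry:
    """Hypothetical registry mapping logical cluster names to endpoints."""

    def __init__(self) -> None:
        self._topology: dict[str, list[Endpoint]] = {}

    def publish(self, name: str, endpoints: list[Endpoint]) -> None:
        # Operators replace the topology here, e.g. after a migration.
        self._topology[name] = list(endpoints)

    def resolve(self, name: str) -> list[Endpoint]:
        # Applications only ever pass the logical name.
        if name not in self._topology:
            raise KeyError(f"unknown logical cluster: {name}")
        return list(self._topology[name])


registry = LogicalClusterRegistry()
registry.publish("order-cache", [Endpoint("10.0.0.1", 6379)])
before = registry.resolve("order-cache")

# The instance moves to another host; application code is unchanged.
registry.publish("order-cache", [Endpoint("10.0.0.2", 6379)])
after = registry.resolve("order-cache")
```

The application calls `resolve("order-cache")` both before and after the move, so a migration only requires republishing the topology.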
Single‑Host Multi‑Instance Period
Before 2016, Redis instances were manually deployed on physical servers, with resource allocation based on memory, CPU, and network metrics, similar to MySQL single‑instance management. An automated governance tool, RAT (Redis Administration Tools), was introduced to handle scaling and deployment as instance counts surged.
Rapid growth led to inefficient server utilization (about 40% on average) and oversized instances (over 20 GB), which called for a more advanced deployment governance system.
Containerization Period
Starting in 2018, CTrip migrated Redis to Kubernetes, enabling automated scheduling, region‑based placement, and treating Redis as a PaaS service. This shift required new governance strategies due to the scale increase (from a few thousand to tens of thousands of instances).
3.1 Secondary Scheduling
To address mismatches between requested resources (Requests) and actual usage (UsedMemory), CTrip implemented a secondary scheduling process that rebalances pods based on memory utilization thresholds (e.g., 90% usage triggers auto‑expansion). Two algorithms were designed:
Reservation‑based algorithm: Moves instances from memory‑tight nodes to idle nodes to achieve a target memory availability (e.g., 50%).
Full‑balance algorithm: Minimizes variance of memory usage across nodes by redistributing instances to equalize utilization.
Both algorithms use a First‑Fit‑Decreasing (FFD) approach and incorporate black/white‑list policies, instance priorities, and exclusion rules for special instance types.
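As a rough illustration of the reservation‑based idea, the sketch below evicts the largest instances from nodes that exceed a usage limit and then places them with First‑Fit‑Decreasing. It is an assumption‑laden simplification, not CTrip's implementation: the function name `plan_reservation_moves`, the uniform node capacity, and the input layout are hypothetical, and the black/white‑list policies, priorities, and exclusion rules are omitted.

```python
def plan_reservation_moves(nodes, capacity_gb, target_free=0.5):
    """Plan moves so every node keeps at least target_free of its memory free.

    nodes: dict of node name -> dict of instance name -> used memory in GB.
    Returns a list of (instance, source_node, destination_node) tuples.
    """
    limit = capacity_gb * (1 - target_free)
    load = {node: sum(insts.values()) for node, insts in nodes.items()}

    # Step 1: on each memory-tight node, mark the largest instances for
    # eviction until the node is back under the limit.
    pending = []
    for node, insts in nodes.items():
        by_size = sorted(insts.items(), key=lambda kv: kv[1], reverse=True)
        for name, size in by_size:
            if load[node] <= limit:
                break
            pending.append((size, name, node))
            load[node] -= size

    # Step 2: First-Fit-Decreasing. Take evicted instances largest first
    # and put each on the first other node that stays under the limit.
    pending.sort(reverse=True)
    moves = []
    for size, name, src in pending:
        for dst in nodes:
            if dst != src and load[dst] + size <= limit:
                load[dst] += size
                moves.append((name, src, dst))
                break
        else:
            # No node has room; the instance stays where it is.
            load[src] += size
    return moves


cluster = {
    "node-a": {"r1": 40, "r2": 30, "r3": 10},
    "node-b": {"r4": 10},
    "node-c": {},
}
moves = plan_reservation_moves(cluster, capacity_gb=100, target_free=0.5)
```

With a 100 GB capacity and a 50% target, only `node-a` (80 GB used) is over the 50 GB limit, so its largest instance is the one planned for relocation.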
Images illustrating the scheduling outcomes are included (Fig 2‑5).
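The full‑balance objective, minimizing the variance of memory usage across nodes, can be made concrete with a small sketch. Note that this uses a simple greedy hill‑climb (move one instance from the most loaded node to the least loaded node while variance keeps dropping) rather than the FFD‑based procedure the article mentions; the function name and data layout are hypothetical.

```python
from statistics import pvariance


def plan_balance_moves(nodes, max_moves=100):
    """Greedily reduce the variance of per-node memory usage.

    nodes: dict of node name -> dict of instance name -> used memory in GB.
    Returns (moves, final_layout); the input is not modified.
    """
    layout = {node: dict(insts) for node, insts in nodes.items()}
    moves = []
    for _ in range(max_moves):
        load = {node: sum(insts.values()) for node, insts in layout.items()}
        src = max(load, key=load.get)  # most loaded node
        dst = min(load, key=load.get)  # least loaded node
        current = pvariance(list(load.values()))

        # Try each instance on the busiest node; keep the move that
        # lowers the variance the most.
        best = None
        for name, size in layout[src].items():
            trial = dict(load)
            trial[src] -= size
            trial[dst] += size
            variance = pvariance(list(trial.values()))
            if variance < current and (best is None or variance < best[0]):
                best = (variance, name)

        if best is None:
            break  # no single move improves the balance any further
        name = best[1]
        layout[dst][name] = layout[src].pop(name)
        moves.append((name, src, dst))
    return moves, layout


cluster = {
    "node-a": {"r1": 40, "r2": 30, "r3": 10},
    "node-b": {"r4": 10},
    "node-c": {},
}
balance_moves, balanced = plan_balance_moves(cluster)
final_loads = sorted(sum(insts.values()) for insts in balanced.values())
```

A production scheduler would also weigh the cost of each migration against the balance gained, which this sketch ignores.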
3.2 Automated Migration
A container‑level automated migration system was built to execute planned moves, supporting scenarios such as targeted instance migration, host failure recovery, and scheduled secondary‑scheduling actions. The system updates CRedis access policies, synchronizes data, registers with Xpipe DR, and manages Sentinel configurations.
3.3 Cilium Integration
Cilium was adopted as the cloud‑native networking layer. To keep IP addresses stable during migrations, OVS was used, but large‑scale IP mobility required careful state‑machine design to ensure idempotent, retryable operations.
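The requirement for idempotent, retryable operations can be sketched as a small state machine in which each step may be repeated safely and a failed run resumes from the last completed state. The state names, the `Migration` class, and the placeholder actions below are hypothetical; the article does not list the actual states CTrip uses.

```python
from enum import Enum


class Step(Enum):
    PLANNED = 0
    DATA_SYNCED = 1
    ROUTE_UPDATED = 2
    DONE = 3


class Migration:
    """Hypothetical migration driver: one action per state transition."""

    def __init__(self, instance, actions):
        self.instance = instance
        self.state = Step.PLANNED
        self._actions = actions  # maps a state to the action that leaves it

    def run(self, max_attempts=3):
        while self.state is not Step.DONE:
            action = self._actions[self.state]
            for _ in range(max_attempts):
                try:
                    action(self.instance)  # must be safe to repeat
                    break
                except RuntimeError:
                    continue  # transient failure: retry the same step
            else:
                return False  # give up; state records where to resume
            self.state = Step(self.state.value + 1)
        return True


applied = set()            # set.add is idempotent, so repeats are harmless
failures = {"sync": 1}     # simulate one transient failure


def sync_data(instance):
    if failures["sync"] > 0:
        failures["sync"] -= 1
        raise RuntimeError("transient sync failure")
    applied.add(("sync", instance))


def update_route(instance):
    applied.add(("route", instance))


def finish(instance):
    applied.add(("finish", instance))


migration = Migration("redis-42", {
    Step.PLANNED: sync_data,
    Step.DATA_SYNCED: update_route,
    Step.ROUTE_UPDATED: finish,
})
ok = migration.run()
```

Because the current state is stored on the object, calling `run()` again after a failure continues from the failed step instead of restarting the whole migration.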
3.4 Intel Optane Deployment
To reduce costs while meeting memory demands, CTrip evaluated Intel Optane SSD and AEP solutions. Optane SSD offered ~0.9 ms latency (vs 0.3 ms for RAM) with 60% cost savings, while Optane AEP delivered performance comparable to pure memory with ~50% cost reduction. This led CTrip to settle on a configuration of a 32C CPU plus 4 × 128 GB of AEP.
Hybrid Cloud Period
From 2019, Redis was deployed across public clouds to serve overseas traffic. Instances were made zone‑aware for disaster tolerance. As a cost‑effective alternative, CTrip adopted the open‑source KVROCKS, which is compatible with Redis SYNC/PSYNC, achieving 60‑80% cost savings with comparable latency.
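One common reading of zone awareness is that a master and its replica never share an availability zone, so the loss of one zone cannot take out both copies. The sketch below illustrates that rule only; the article does not describe CTrip's actual placement policy, and the function name, group names, and zone names are hypothetical.

```python
def place_pairs(groups, zones):
    """Assign each master/replica pair to two different zones.

    groups: list of replication group names.
    zones: list of availability zone names (at least two).
    """
    if len(zones) < 2:
        raise ValueError("zone-aware placement needs at least two zones")
    placement = {}
    for index, group in enumerate(groups):
        # Rotate through the zones so masters are spread evenly, and put
        # each replica in the next zone along.
        master_zone = zones[index % len(zones)]
        replica_zone = zones[(index + 1) % len(zones)]
        placement[group] = {"master": master_zone, "replica": replica_zone}
    return placement


placement = place_pairs(
    ["group-1", "group-2", "group-3", "group-4"],
    ["zone-a", "zone-b", "zone-c"],
)
```

Rotating the starting zone also spreads masters across zones, which limits how many failovers a single zone outage triggers.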
Conclusion
CTrip's Redis governance demonstrates a pragmatic, non‑vendor‑locked approach that combines custom client abstractions, automated scaling, containerization, secondary scheduling, hardware acceleration, and hybrid‑cloud strategies to maintain high availability and performance at massive scale.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
