
CTrip's Redis Governance: Architecture, Scaling, and Hybrid Cloud Practices

This article traces how CTrip's Redis management evolved: from early manual deployments on physical servers and the custom CRedis client, through automated scaling, containerization on Kubernetes, secondary scheduling algorithms, and Intel Optane integration, to hybrid-cloud deployment. It highlights the operational challenges, the solutions adopted, and the resulting performance and cost outcomes.

Ctrip Technology

Background

CTrip began using Redis in 2013, initially alongside Memcached. By 2017 Memcached had been fully retired, and Redis usage had grown to tens of thousands of instances holding hundreds of terabytes of data. This created significant operational challenges for DBAs around scaling, capacity expansion, and high availability.

The company developed a proprietary CRedis client that abstracts the physical deployment, allowing applications to access Redis by logical name without needing to know the underlying topology.
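The idea of addressing Redis by logical name can be illustrated with a minimal sketch. This is not the CRedis API, which the article does not describe; `TopologyRegistry`, `LogicalClient`, `Endpoint`, and the CRC-based key routing are all hypothetical names and choices made only for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


class TopologyRegistry:
    """Hypothetical stand-in for the service that maps logical names to instances."""

    def __init__(self) -> None:
        self._clusters: dict[str, list[Endpoint]] = {}

    def publish(self, logical_name: str, endpoints: list[Endpoint]) -> None:
        # Operators or automation replace the endpoint list when instances move.
        self._clusters[logical_name] = list(endpoints)

    def resolve(self, logical_name: str) -> list[Endpoint]:
        return list(self._clusters[logical_name])


class LogicalClient:
    """The application holds only a logical name, never a host or port."""

    def __init__(self, registry: TopologyRegistry, logical_name: str) -> None:
        self._registry = registry
        self._name = logical_name

    def endpoint_for(self, key: str) -> Endpoint:
        # Resolve on every call so a re-published topology is picked up
        # without any change to application code.
        endpoints = self._registry.resolve(self._name)
        # Illustrative routing only: crc32 is stable across processes.
        index = zlib.crc32(key.encode("utf-8")) % len(endpoints)
        return endpoints[index]
```

Because the application never sees physical addresses, a migration only has to re-publish the topology behind the logical name, which is what makes the later automated scheduling and migration steps transparent to callers.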

Single-Host Multi-Instance Period

Before 2016, Redis instances were manually deployed on physical servers, with resource allocation based on memory, CPU, and network metrics, similar to MySQL single‑instance management. An automated governance tool, RAT (Redis Administration Tools), was introduced to handle scaling and deployment as instance counts surged.

Rapid growth led to inefficient server utilization (~40% average) and oversized instances (>20 GB), prompting the need for a more advanced deployment governance system.

Containerization Period

Starting in 2018, CTrip migrated Redis to Kubernetes, enabling automated scheduling, region‑based placement, and treating Redis as a PaaS service. This shift required new governance strategies due to the scale increase (from a few thousand to tens of thousands of instances).

3.1 Secondary Scheduling

To address mismatches between requested resources (Requests) and actual usage (UsedMemory), CTrip implemented a secondary scheduling process that rebalances pods based on memory utilization thresholds (e.g., 90% usage triggers auto‑expansion). Two algorithms were designed:

Reservation-based algorithm: Moves instances from memory-tight nodes to idle nodes to achieve a target memory availability (e.g., 50%).

Full-balance algorithm: Minimizes the variance of memory usage across nodes by redistributing instances to equalize utilization.

Both algorithms use a First‑Fit‑Decreasing (FFD) approach and incorporate black/white‑list policies, instance priorities, and exclusion rules for special instance types.
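A rough sketch of how a reservation-style planner might combine a free-memory target with First-Fit-Decreasing placement is shown below. It is an illustration under stated assumptions, not CTrip's implementation: the `Instance` and `Node` types, the single `movable` flag (standing in for black/white lists, priorities, and exclusion rules), and the choice to evict the largest movable instances first are all hypothetical. The `utilization_variance` helper shows the quantity a full-balance pass would try to minimize.

```python
from dataclasses import dataclass, field
from statistics import pvariance


@dataclass
class Instance:
    name: str
    used_gb: float
    movable: bool = True  # assumption: stands in for list policies and exclusions


@dataclass
class Node:
    name: str
    capacity_gb: float
    instances: list[Instance] = field(default_factory=list)

    @property
    def used_gb(self) -> float:
        return sum(i.used_gb for i in self.instances)


def plan_reservation_moves(nodes: list[Node], target_free: float = 0.5) -> list[tuple[str, str, str]]:
    """Plan (instance, source, destination) moves so nodes keep `target_free` memory free."""
    limit = {n.name: n.capacity_gb * (1.0 - target_free) for n in nodes}
    used = {n.name: n.used_gb for n in nodes}  # projected usage while planning

    # Step 1: on memory-tight nodes, pick movable instances until under the limit.
    evicted: list[tuple[Instance, Node]] = []
    for node in nodes:
        for inst in sorted(node.instances, key=lambda i: i.used_gb, reverse=True):
            if used[node.name] <= limit[node.name]:
                break
            if inst.movable:
                evicted.append((inst, node))
                used[node.name] -= inst.used_gb

    # Step 2: First-Fit-Decreasing. Largest instance first, first node that fits.
    evicted.sort(key=lambda pair: pair[0].used_gb, reverse=True)
    moves: list[tuple[str, str, str]] = []
    for inst, source in evicted:
        for dest in nodes:
            if dest is not source and used[dest.name] + inst.used_gb <= limit[dest.name]:
                used[dest.name] += inst.used_gb
                moves.append((inst.name, source.name, dest.name))
                break
        else:
            used[source.name] += inst.used_gb  # nowhere to go: leave it in place
    return moves


def apply_moves(nodes: list[Node], moves: list[tuple[str, str, str]]) -> None:
    """Apply planned moves to the in-memory model (no real migration happens here)."""
    by_name = {n.name: n for n in nodes}
    for inst_name, src, dst in moves:
        inst = next(i for i in by_name[src].instances if i.name == inst_name)
        by_name[src].instances.remove(inst)
        by_name[dst].instances.append(inst)


def utilization_variance(nodes: list[Node]) -> float:
    """Population variance of per-node memory utilization (the full-balance objective)."""
    return pvariance([n.used_gb / n.capacity_gb for n in nodes])
```

The planner only produces a list of moves; executing them is left to a separate migration step, which keeps the scheduling decision easy to review or dry-run before anything is touched.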

Figures 2 to 5 of the original article illustrate the scheduling outcomes.

3.2 Automated Migration

A container‑level automated migration system was built to execute planned moves, supporting scenarios such as targeted instance migration, host failure recovery, and scheduled secondary‑scheduling actions. The system updates CRedis access policies, synchronizes data, registers with Xpipe DR, and manages Sentinel configurations.

3.3 Cilium Integration

Cilium was adopted as the cloud-native networking layer. OVS was used to keep IP addresses stable during migrations, but moving IPs at large scale required careful state-machine design so that every operation is idempotent and retryable.
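One way to picture an idempotent, retryable state machine for moving an IP is sketched below. The states (`PENDING`, `DETACHED`, `ATTACHED`, `VERIFIED`), the `IpMove` class, and `TransientError` are hypothetical and are not taken from Cilium, OVS, or CTrip's system. The sketch assumes each action is itself safe to repeat (for example "ensure the IP is absent on the source") and records progress only after an action succeeds.

```python
from enum import Enum
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


class MoveState(Enum):
    PENDING = "pending"
    DETACHED = "detached"   # IP released on the source host
    ATTACHED = "attached"   # IP bound on the destination host
    VERIFIED = "verified"   # reachability confirmed


# Each state has exactly one successor, so progress is strictly forward.
NEXT_STATE = {
    MoveState.PENDING: MoveState.DETACHED,
    MoveState.DETACHED: MoveState.ATTACHED,
    MoveState.ATTACHED: MoveState.VERIFIED,
}


class IpMove:
    def __init__(self, ip: str, source: str, destination: str,
                 actions: dict[MoveState, Callable[["IpMove"], None]]) -> None:
        self.ip = ip
        self.source = source
        self.destination = destination
        self.actions = actions  # keyed by the state each action leads to
        self.state = MoveState.PENDING

    def advance(self, max_attempts: int = 3) -> MoveState:
        """Drive the move forward; safe to call again after a partial failure."""
        while self.state is not MoveState.VERIFIED:
            target = NEXT_STATE[self.state]
            for _ in range(max_attempts):
                try:
                    self.actions[target](self)
                    break
                except TransientError:
                    continue
            else:
                # Retries exhausted: keep the last confirmed state so that a
                # later advance() resumes here instead of starting over.
                return self.state
            self.state = target  # record progress only after success
        return self.state
```

Calling `advance()` a second time resumes from the recorded state, so steps that already completed are not executed again and a failed step can simply be retried later.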

3.4 Intel Optane Deployment

To reduce costs while meeting memory demands, CTrip evaluated two Intel Optane options: Optane SSD and Optane AEP (Apache Pass, the code name for Intel's Optane persistent memory modules). Optane SSD offered ~0.9 ms latency (vs. 0.3 ms for RAM) with 60% cost savings, while Optane AEP performed comparably to pure memory at ~50% lower cost. This led to a server configuration of a 32-core CPU with 4 × 128 GB AEP modules.

Hybrid Cloud Period

From 2019, Redis was deployed across public clouds to serve overseas traffic. Instances were zone‑aware for disaster tolerance, and a cost‑effective alternative using the open‑source KVROCKS (compatible with Redis SYNC/PSYNC) was adopted, achieving 60‑80 % cost savings with comparable latency.
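Zone awareness can be sketched as a placement rule that never puts two members of the same replication group in one availability zone. The article does not describe CTrip's actual placement logic, so the `Host` type, the `place_across_zones` function, and the preference for hosts with the most free memory are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    zone: str
    free_gb: float


def place_across_zones(hosts: list[Host], members: int, size_gb: float) -> list[Host]:
    """Pick `members` hosts in distinct zones, preferring the most free memory."""
    chosen: list[Host] = []
    used_zones: set[str] = set()
    for host in sorted(hosts, key=lambda h: h.free_gb, reverse=True):
        # Skip zones that already hold a member and hosts that cannot fit one.
        if host.zone in used_zones or host.free_gb < size_gb:
            continue
        chosen.append(host)
        used_zones.add(host.zone)
        if len(chosen) == members:
            return chosen
    raise ValueError("not enough zones with spare capacity for this group")
```

With a master and one replica (`members=2`), losing a single zone leaves at least one copy of the data reachable, which is the disaster-tolerance property the zone-aware deployment is after.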

Conclusion

CTrip's Redis governance demonstrates a pragmatic, non‑vendor‑locked approach that combines custom client abstractions, automated scaling, containerization, secondary scheduling, hardware acceleration, and hybrid‑cloud strategies to maintain high availability and performance at massive scale.
