CTrip’s Large‑Scale Redis Containerization: Architecture, Practices, and Lessons Learned
This article describes how CTrip containerized a Redis deployment that holds more than 200 TB of data and serves millions of queries per second. It covers the motivation, architecture, Kubernetes strategies, performance testing, operational problems, and the solutions the team devised to reach high scalability and resource efficiency.
At that scale, containerizing Redis is a significant challenge; the practical experience summarized here was shared at the 2018 CTrip Technology Summit.
Background : CTrip accesses Redis through CRedis client clusters, where each cluster maps to a pool, group, and one or more instances, using consistent hashing for key distribution.
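The key distribution mentioned above can be illustrated with a small consistent-hash ring. This is a minimal sketch, not CRedis's actual implementation: the `HashRing` class, the group names, the MD5-based hash, and the virtual-node count are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring that maps keys to group names (hypothetical)."""

    def __init__(self, groups, vnodes=100):
        # Each group is placed on the ring many times (virtual nodes)
        # so that keys spread more evenly across groups.
        self._ring = sorted(
            (self._hash(f"{group}#{i}"), group)
            for group in groups
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value):
        # Use the first 8 bytes of MD5 as a stable 64-bit ring position.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def group_for(self, key):
        # Walk clockwise to the first ring position after the key's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The property that matters for scaling is that adding a group only moves the keys that the new group takes over; keys owned by the other groups stay where they were.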
Why Containerize : Standardizing and automating deployment with containers cuts deployment effort by a factor of 59 compared with manual installs on physical machines. Scaling the large number of instances created by splitting becomes manageable, and Kubernetes orchestration improves resource utilization.
Feasibility : A/B tests comparing containerized Redis to physical machines show a modest 5‑10% performance overhead due to virtual network interfaces, confirming that Redis can be safely containerized.
Architecture and Details : The solution uses Kubernetes StatefulSet for Redis, with nodeAffinity, podAntiAffinity, and tolerations to isolate Redis pods on dedicated hosts. Custom schedulers (sticky‑scheduler) enforce host affinity, while custom volume plugins (chostpath, cemptydir) provide XFS quota‑based disk limits. Each pod runs two containers: Redis and a Telegraf monitoring sidecar that reports metrics every 60 seconds.
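To make the pod layout concrete, the sketch below builds a StatefulSet manifest as a Python dictionary using the standard Kubernetes field names for nodeAffinity, podAntiAffinity, and tolerations. It is an illustration under stated assumptions, not CTrip's real manifest: the `dedicated=redis` node label and taint, the image names, and the `redis_statefulset` helper are hypothetical, and the custom scheduler and volume plugins are left out because the source does not describe their configuration.

```python
def redis_statefulset(name, replicas=2):
    """Build a StatefulSet manifest for a Redis pod with a Telegraf sidecar."""
    labels = {"app": name}
    pod_spec = {
        "affinity": {
            # Only schedule onto hosts labeled as dedicated to Redis
            # (the label key and value are assumptions).
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {"matchExpressions": [
                            {"key": "dedicated", "operator": "In", "values": ["redis"]}
                        ]}
                    ]
                }
            },
            # Keep pods of the same StatefulSet on different hosts.
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [
                    {"labelSelector": {"matchLabels": labels},
                     "topologyKey": "kubernetes.io/hostname"}
                ]
            },
        },
        # Tolerate the taint that keeps other workloads off the Redis hosts.
        "tolerations": [
            {"key": "dedicated", "operator": "Equal",
             "value": "redis", "effect": "NoSchedule"}
        ],
        # Two containers per pod: Redis itself and the monitoring sidecar.
        "containers": [
            {"name": "redis", "image": "redis"},
            {"name": "telegraf", "image": "telegraf"},
        ],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }
```

Passing the result to `json.dumps` yields a document that `kubectl apply -f` accepts, since Kubernetes manifests may be written in JSON as well as YAML.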
Problems Encountered :
Fixed IP and host requirements for master/slave topology.
Preserving Redis configuration files across pod restarts.
Ensuring Sentinel, not Kubernetes, handles master failover.
Managing memory limits to avoid OOM while respecting existing operational practices.
CPU over‑subscription due to Redis’s single‑threaded nature.
System load spikes caused by Telegraf collection jobs firing at the same moment across pods, resolved by adding jitter.
Slowlog anomalies caused by kernel bugs, fixed by upgrading to kernel 4.14.
XFS bugs (header misalignment and xfsaild D‑state) mitigated by kernel patches.
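The Telegraf load spike in the list above is easy to reproduce in a small simulation. The sketch below assumes that every sidecar on a host starts its collection on the same interval boundary, then shows how a random start offset spreads the work out. The function names, agent count, and jitter window are hypothetical. Telegraf's agent settings do include a `collection_jitter` option for this purpose, though the source does not say which mechanism CTrip used.

```python
import random
from collections import Counter


def collection_offsets(num_agents, jitter_seconds, seed=0):
    """Return each agent's start offset (in seconds) within one interval."""
    rng = random.Random(seed)
    if jitter_seconds <= 0:
        # No jitter: every agent fires exactly on the interval boundary.
        return [0.0] * num_agents
    # With jitter: each agent waits a random time in [0, jitter_seconds].
    return [rng.uniform(0, jitter_seconds) for _ in range(num_agents)]


def peak_concurrency(offsets):
    """Largest number of agents that start within the same one-second bucket."""
    buckets = Counter(int(offset) for offset in offsets)
    return max(buckets.values())
```

Without jitter, `peak_concurrency(collection_offsets(200, 0))` equals the full agent count, because all 200 collections land in the same second. With a 10-second window the same agents are spread over roughly ten buckets, so the peak drops well below that.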
Scheduling Strategies : Pods are classified by business importance and scheduled across multiple regions (Kubernetes clusters). Memory requests equal maxmemory, with a 10% safety margin, and a custom scoring system prefers nodes with the most free memory. Periodic jobs rebalance pods to minimize memory variance and avoid OOM.
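The scoring rule can be sketched in a few lines. This is one possible reading of the description, not CTrip's scheduler: it assumes the 10% safety margin means a node may only be filled to 90% of its memory capacity, and the `Node` class, `pick_node`, and `free_memory_variance` names are hypothetical.

```python
from dataclasses import dataclass
from statistics import pvariance

SAFETY_MARGIN = 0.10  # assumed: keep 10% of each node's memory unallocated


@dataclass
class Node:
    name: str
    capacity_mb: int
    requested_mb: int = 0

    @property
    def free_mb(self):
        return self.capacity_mb - self.requested_mb


def fits(node, maxmemory_mb):
    """A pod requests exactly its Redis maxmemory; respect the safety margin."""
    usable_mb = node.capacity_mb * (1 - SAFETY_MARGIN)
    return node.requested_mb + maxmemory_mb <= usable_mb


def pick_node(nodes, maxmemory_mb):
    """Place the pod on the eligible node with the most free memory."""
    candidates = [n for n in nodes if fits(n, maxmemory_mb)]
    if not candidates:
        return None  # no node can take the pod without breaking the margin
    best = max(candidates, key=lambda n: n.free_mb)
    best.requested_mb += maxmemory_mb
    return best


def free_memory_variance(nodes):
    """Metric a periodic rebalancing job could try to minimize."""
    return pvariance([n.free_mb for n in nodes])
```

Preferring the node with the most free memory tends to even out load as pods are placed, which is why the variance of free memory is a natural target for the periodic rebalancing job.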
Key Takeaways : Containerizing Redis at massive scale requires coordinated work across Kubernetes, custom scheduling, monitoring, and resource management. Carefully controlled over‑subscription and dynamic adjustments (e.g., HZ tuning, fragmentation cleanup) enable high resource utilization while maintaining stability.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.
