How Ctrip Achieved Seconds‑Level Scaling with a Container Cloud
Ctrip built a private container cloud to absorb large seasonal traffic spikes. The platform scales resources out and in quickly and automatically, and it has improved deployment speed, resource utilization, and operational intelligence across more than 20 business units.
Online Travel and Elastic Demand
Ctrip's travel business has grown explosively, with revenue increasing 76% year‑over‑year in 2016 and GMV projected to exceed 1 trillion RMB in 2021. Seasonal traffic spikes require rapid, large‑scale capacity expansion, but traditional VM provisioning takes ten minutes per instance, limiting flexibility.
Ctrip Container Cloud Positioning
The platform focuses on four goals: continuous delivery within seconds for 20+ business units (BUs), better resource utilization through billing and monitoring, offering components such as MySQL, Redis, and RabbitMQ as services, and advancing automation toward intelligent, self‑healing infrastructure.
Basic Container Deployment Principles
One container per application
One container per routable IP
Immutable container images
All agents run on the host, not inside containers
These principles stem from extensive operational experience.
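The four principles above can be expressed as checks on a deployment description. The following is a minimal sketch, not Ctrip's tooling: the `ContainerSpec` type and all of its field names are hypothetical, "one container per routable IP" is read here as "each container has exactly one routable IP", and treating a digest-pinned image reference as "immutable" is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContainerSpec:
    # Hypothetical description of one container; field names are illustrative only.
    name: str
    apps: List[str]
    routable_ips: List[str]
    image: str
    in_container_agents: List[str] = field(default_factory=list)


def check_principles(spec: ContainerSpec) -> List[str]:
    """Return the names of the principles that the spec violates."""
    violations = []
    if len(spec.apps) != 1:
        violations.append("one container per application")
    if len(spec.routable_ips) != 1:
        violations.append("one container per routable IP")
    # Assumption: an image pinned by digest counts as immutable.
    if "@sha256:" not in spec.image:
        violations.append("immutable container images")
    if spec.in_container_agents:
        violations.append("agents run on the host")
    return violations
```

A spec that bundles two applications or ships a monitoring agent inside the image would be reported with the corresponding principle names, which makes the rules easy to enforce in a pre-deployment step.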
Orchestration Selection & Trade‑offs
Ctrip initially used OpenStack with the Nova‑Docker module, but its complexity and slow scheduling (API latency of 10 s or more) prompted a shift to Mesos with custom frameworks. Kubernetes (K8s) offered more advanced features but conflicted with Ctrip's existing deployment pipelines, whereas Mesos integrated more easily with Ctrip's services.
Container Network Selection
Ctrip adopted Neutron + OVS + VLAN for stable, transparent networking, evaluating DPDK and hardware acceleration. VXLAN + BGP EVPN is being prototyped for a lightweight SDN controller, while flannel‑style networks were rejected due to IP migration issues.
Docker‑Related Issues
Frequent container creation/destruction caused kernel soft lockups and performance bottlenecks. Ctrip built a custom Mesos framework and a Go‑based CExecutor to batch jobs, reducing container churn, CPU load spikes, and related bugs.
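The effect of batching on container churn can be illustrated with a small sketch. This is not the CExecutor (which the article describes as Go‑based); it only shows the general idea of grouping short jobs so that one container runs several of them in sequence. The function names and the fixed `batch_size` policy are assumptions.

```python
from itertools import islice


def batch_jobs(jobs, batch_size):
    """Yield lists of up to batch_size jobs; each list would share one container."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(jobs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def containers_needed(job_count, batch_size):
    """Number of container launches when jobs are batched (ceiling division)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return -(-job_count // batch_size)
```

With 1,000 short jobs and a batch size of 50, this scheme launches 20 containers instead of 1,000, which is the kind of reduction in create/destroy cycles the paragraph above describes.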
Container Monitoring Solution
The in‑house "hickwall" system collects CPU, memory, and disk metrics through the Docker client and cgroups, automatically discovers new containers, and aggregates the data per application cluster. It also monitors business‑level metrics such as order volume, which enables predictive scaling with more than 95% accuracy.
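As a rough illustration of the collection and aggregation steps, the sketch below derives CPU utilization from two cumulative CPU‑time readings and averages it per application cluster. It is not hickwall's code. It assumes readings in nanoseconds, as reported by `cpuacct.usage` under cgroup v1, and the sample tuple layout `(cluster, container_id, cpu_pct)` is hypothetical.

```python
from collections import defaultdict


def cpu_percent(prev_usage_ns, curr_usage_ns, interval_s, cpus=1):
    """CPU utilization (%) between two cumulative CPU-time readings in nanoseconds."""
    if interval_s <= 0 or cpus <= 0:
        raise ValueError("interval_s and cpus must be positive")
    used_s = (curr_usage_ns - prev_usage_ns) / 1e9
    return 100.0 * used_s / (interval_s * cpus)


def aggregate_by_cluster(samples):
    """Average CPU percent per application cluster.

    samples: iterable of (cluster, container_id, cpu_pct) tuples (hypothetical layout).
    """
    by_cluster = defaultdict(list)
    for cluster, _container_id, pct in samples:
        by_cluster[cluster].append(pct)
    return {cluster: sum(v) / len(v) for cluster, v in by_cluster.items()}
```

Because the aggregation is keyed by cluster rather than by container, a newly discovered container simply contributes another sample to its cluster's average.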
CDOS Overview
CDOS (Ctrip Data Center Operating System) orchestrates resources across multiple data centers, scheduling compute, network, and storage for both long‑running services and cron jobs. It relies on Mesos for low‑level resource allocation, supports Windows containers, and uses Ceph for fast image distribution, with the aim of delivering within seconds.
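In Mesos, agents' free resources are offered to frameworks, which decide where to place tasks. The sketch below shows one simple placement policy, first‑fit, over a set of offers. It is an illustrative assumption, not the CDOS scheduler: the article does not state which policy CDOS uses, and the tuple layouts for tasks and offers are hypothetical.

```python
def first_fit(tasks, offers):
    """Place each task on the first offer with enough free CPU and memory.

    tasks:  list of (name, cpus, mem_mb) tuples
    offers: dict mapping agent name -> (cpus, mem_mb) currently offered
    Returns a dict mapping task name -> agent; tasks that fit nowhere are omitted.
    """
    remaining = {agent: list(res) for agent, res in offers.items()}
    placement = {}
    for name, cpus, mem_mb in tasks:
        for agent, free in remaining.items():
            if free[0] >= cpus and free[1] >= mem_mb:
                # Reserve the resources so later tasks see the reduced offer.
                free[0] -= cpus
                free[1] -= mem_mb
                placement[name] = agent
                break
    return placement
```

Tasks left out of the returned mapping would wait for a later round of offers; a production scheduler would also weigh network and storage, which this sketch ignores.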
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
