
Ctrip Container Cloud: Architecture, Scaling, and Operational Practices

This article describes how Ctrip's rapid business growth created a need for elastic scaling, how container technology brought provisioning time down to seconds, and how the container cloud platform was designed. It covers deployment principles, network selection, the evaluation of orchestration options, monitoring, and an overview of CDOS, with practical lessons for large-scale cloud-native operations.

Qunar Tech Salon

Author Background

Wu Yiting is a senior director in Ctrip's System R&D Department. Wu joined Ctrip in 2012, built the Ctrip cloud platform from scratch, and is now responsible for private cloud, virtual desktop cloud, and continuous delivery.

Business Growth and Elastic Demand

With travel consumption booming, Ctrip's revenue grew 76% year over year in 2016, and GMV is projected to exceed 10 trillion RMB in 2018 and 20 trillion RMB in 2021. Holiday traffic spikes create large, sudden capacity demands, which call for faster and more flexible elasticity.

Why Containers?

Traditional VM provisioning takes ten minutes per instance, which is too slow when thousands of machines must be scaled out ahead of a peak. Containers cut provisioning time and support API-driven, asynchronous scaling. As an example, Ctrip expanded a deep-learning poetry service by 1,000 cores within five minutes.
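To make "API-driven, asynchronous scaling" concrete, here is a minimal sketch of a scale-out request that provisions many containers concurrently while capping the number of in-flight calls. It is not Ctrip's implementation: `provision_container`, `scale_out`, and the `max_in_flight` parameter are hypothetical names, and the provisioning call is simulated with a short sleep rather than a real platform API.

```python
import asyncio


async def provision_container(app: str, index: int, delay: float = 0.01) -> str:
    """Hypothetical stand-in for a platform API call; it only sleeps."""
    await asyncio.sleep(delay)
    return f"{app}-{index}"


async def scale_out(app: str, count: int, max_in_flight: int = 50) -> list:
    """Request `count` containers concurrently, at most `max_in_flight` at once."""
    limit = asyncio.Semaphore(max_in_flight)

    async def one(index: int) -> str:
        async with limit:
            return await provision_container(app, index)

    # gather() returns results in the order the requests were submitted.
    return await asyncio.gather(*(one(i) for i in range(count)))


if __name__ == "__main__":
    ids = asyncio.run(scale_out("poetry", count=20, max_in_flight=5))
    print(len(ids), "containers requested")
```

Because the caller awaits a batch instead of blocking on each instance, total time is governed by the concurrency cap and per-container latency rather than by the instance count alone.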

Ctrip Container Cloud Positioning

The platform has four key goals:
1. Deliver a best-in-class continuous delivery experience to more than 20 business units (BUs).
2. Improve resource utilization through billing and monitoring.
3. Offer components such as MySQL, Redis, and MQ as services, enabling PaaS-style usage.
4. Move from automation toward intelligent, self-healing infrastructure.

Container Deployment Principles

1. One application per container.
2. One routable IP per container.
3. Container images are immutable.
4. Only the application runs inside the container; all agents (monitoring, logging, etc.) run on the host.
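The four principles above can be expressed as a simple pre-deployment check. The sketch below is illustrative only: `ContainerSpec`, `check_principles`, and the agent names are hypothetical, and treating "immutable" as "pinned by digest" is an assumption, not something the article states.

```python
from dataclasses import dataclass, field

# Hypothetical agent process names; the article only says agents run on the host.
HOST_ONLY_AGENTS = {"monitoring-agent", "logging-agent"}


@dataclass
class ContainerSpec:
    image: str
    apps: list
    routable_ips: list
    processes: list = field(default_factory=list)


def check_principles(spec: ContainerSpec) -> list:
    """Return a list of human-readable violations; an empty list means the spec passes."""
    problems = []
    if len(spec.apps) != 1:
        problems.append("expected exactly one application per container")
    if len(spec.routable_ips) != 1:
        problems.append("expected exactly one routable IP per container")
    # Assumption: an image reference pinned by digest stands in for immutability.
    if "@sha256:" not in spec.image:
        problems.append("image is not pinned by digest")
    agents = sorted(HOST_ONLY_AGENTS.intersection(spec.processes))
    if agents:
        problems.append("agents should run on the host: " + ", ".join(agents))
    return problems
```

A spec with two applications, a mutable tag, and a logging agent in its process list would return three violations, while a single-app, single-IP, digest-pinned spec returns none.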

Network Selection

Ctrip uses Neutron + OVS + VLAN as a stable, transparent solution and is testing DPDK and hardware acceleration. The team evaluated flannel but discarded it because a container's IP changes when it migrates. The plan is to adopt VXLAN + BGP EVPN with a custom lightweight SDN controller.
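The property that ruled out flannel, a stable IP across migration, can be illustrated with a toy address allocator keyed by container identity rather than by host. This is a sketch under stated assumptions, not a model of Neutron or of Ctrip's SDN controller; `StickyIpam` and its methods are hypothetical names.

```python
import ipaddress


class StickyIpam:
    """Toy allocator: an address belongs to a container, not to the host it runs on."""

    def __init__(self, cidr: str):
        self._free = iter(ipaddress.ip_network(cidr).hosts())
        self._ip_of = {}
        self._host_of = {}

    def place(self, container_id: str, host: str):
        """Record that the container runs on `host`, allocating an IP only on first placement."""
        if container_id not in self._ip_of:
            address = next(self._free, None)
            if address is None:
                raise RuntimeError("address pool exhausted")
            self._ip_of[container_id] = address
        self._host_of[container_id] = host
        return self._ip_of[container_id]

    def host_of(self, container_id: str) -> str:
        return self._host_of[container_id]


if __name__ == "__main__":
    ipam = StickyIpam("10.0.0.0/29")
    before = ipam.place("order-service-1", "host-a")
    after = ipam.place("order-service-1", "host-b")  # migration to another host
    print(before, after, before == after)
```

In a scheme that hands each host its own subnet, the second placement would draw from a different range, which is the behavior the one-routable-IP-per-container principle is meant to avoid.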

Docker Issues and Mesos/K8s Evaluation

Frequent container creation and destruction caused kernel soft lockups. A custom CExecutor, written in Go, reduced churn and CPU load. For orchestration, Ctrip chose Mesos for its flexibility and its integration with Ctrip's services, over OpenStack (complex, with slow scheduling) and K8s (which conflicted with existing deployment pipelines).
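Part of the flexibility of Mesos comes from its offer model: agents' spare resources are offered to a framework, and the framework decides which tasks to launch on which offer. The following is a minimal first-fit sketch of that decision, not Ctrip's scheduler and not the Mesos API; `Offer`, `Task`, and `first_fit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


def first_fit(tasks: list, offers: list) -> dict:
    """Map each task name to the first agent whose remaining resources fit, or None."""
    remaining = {o.agent: [o.cpus, o.mem_mb] for o in offers}
    placement = {}
    for task in tasks:
        placement[task.name] = None  # stays None if no offer fits
        for offer in offers:
            cpus, mem_mb = remaining[offer.agent]
            if task.cpus <= cpus and task.mem_mb <= mem_mb:
                remaining[offer.agent] = [cpus - task.cpus, mem_mb - task.mem_mb]
                placement[task.name] = offer.agent
                break
    return placement
```

A real framework would also weigh constraints such as data center, rack, or affinity; first-fit is used here only because it is the simplest policy that shows the accept-or-decline decision.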

Monitoring Solutions

Ctrip built a monitoring stack on Telegraf, InfluxDB, and Grafana, and extended it to cover Mesos clusters, container metrics, and business-level KPIs such as order volume and capacity planning. The team also developed the Hickwall system for automatic container discovery and health monitoring.
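Metrics in a Telegraf and InfluxDB stack are typically exchanged as InfluxDB line protocol: a measurement name, optional tags, one or more fields, and a timestamp. The sketch below formats one container metric in that shape. The measurement, tag, and field names are made up for illustration, and the helper covers only basic escaping and field types.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` where it occurs in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record; tags are emitted sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(str(v), ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={_format_field(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line(
        "container_cpu",                       # hypothetical measurement name
        {"host": "node1", "app": "poetry"},    # hypothetical tags
        {"usage_percent": 12.5, "throttled": 3},
        1500000000000000000,
    ))
```

Tagging each point with the application and host is what lets a dashboard group container metrics by service, which is the kind of view the business-level KPIs depend on.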

CDOS Overview

The Ctrip Data Center Operating System (CDOS) orchestrates resources across multiple data centers. It handles both long-running services and cron jobs, supports Docker and Windows containers, and achieves delivery within seconds through fast image distribution via Ceph.

Conclusion

The article gives a comprehensive view of Ctrip's container cloud architecture, from business drivers to technical implementation, and offers useful lessons for large-scale cloud-native deployments.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
