
Ctrip Container Cloud: Architecture, Elastic Scaling, and Monitoring Practices

This article details Ctrip's journey in building a private container cloud to support rapid business growth, covering elasticity challenges, container deployment principles, orchestration platform choices, network design, operational issues, custom executors, monitoring solutions, and the overarching CDOS system.

Ctrip Technology

Wu Yiting, senior director of Ctrip's system R&D department, describes how Ctrip's rapidly growing travel business required a more flexible and faster elastic scaling solution, leading to the development of a private container cloud and continuous delivery platform serving over 20 business units.

The seasonal nature of travel traffic creates sudden spikes that demand rapid expansion and slower contraction, prompting the shift from VM-based scaling (taking minutes per instance) to container-based scaling capable of adding thousands of cores within minutes.
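The "expand fast, contract slowly" behavior can be illustrated with a minimal sketch. The article does not give Ctrip's actual scaling algorithm or thresholds, so `ScalingPolicy`, `desired_instances`, and every number below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical thresholds; none of these values come from the article."""
    target_utilization_pct: int = 60  # desired average CPU utilization
    scale_down_step_pct: int = 10     # release at most 10% of the fleet per round
    min_instances: int = 2


def desired_instances(current: int, utilization_pct: int, policy: ScalingPolicy) -> int:
    """Return the instance count for the next scaling round.

    Scale-out jumps straight to the size needed to reach the target,
    while scale-in is capped at a small fraction of the current fleet.
    """
    # Ceiling division: fleet size at which the same load runs at the target.
    needed = -(-current * utilization_pct // policy.target_utilization_pct)
    if needed > current:
        return needed  # expand immediately to absorb the spike
    # Contract gradually so a short dip does not release capacity too early.
    max_release = max(1, current * policy.scale_down_step_pct // 100)
    return max(policy.min_instances, needed, current - max_release)


policy = ScalingPolicy()
spike = desired_instances(100, 90, policy)   # traffic spike
lull = desired_instances(100, 30, policy)    # traffic drop
```

With these assumed values a spike grows the fleet in one step, while a lull shrinks it by at most ten percent per round.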

Ctrip positions its container cloud around four goals: delivering the best possible continuous delivery experience; improving resource utilization through billing and optimization; offering components such as MySQL, Redis, and RabbitMQ as services; and advancing automation toward intelligent, self-healing infrastructure.

Key deployment principles are one application per container, one routable IP per container, immutable container images, and containers that hold only the application, while agents on the host handle monitoring, logging, and configuration.

For orchestration, Ctrip evaluated OpenStack (rejected due to complexity and slow scheduling), Kubernetes (rejected due to integration challenges and immature networking), and ultimately adopted a Mesos‑based solution that allowed custom scheduling frameworks, efficient cron‑job handling, and seamless integration with existing L4/L7 load balancers.
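Mesos uses a two-level model in which the master offers agent resources to a framework, and the framework decides which tasks to place on which offer. The sketch below shows that decision step with a simple first-fit rule. It does not call the Mesos API, and `Offer`, `Task`, and `match_offers` are hypothetical names; Ctrip's actual scheduling logic is not described in the article.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    cpus: int
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: int
    mem_mb: int


def match_offers(offers, pending):
    """First-fit placement.

    Returns (placements, remaining), where placements maps task name to
    agent and remaining lists the tasks no current offer can hold.
    """
    placements = {}
    remaining = []
    # Track what is left of each offer as tasks are placed on it.
    free = {o.agent: [o.cpus, o.mem_mb] for o in offers}
    for task in pending:
        for agent, (cpus, mem) in free.items():
            if task.cpus <= cpus and task.mem_mb <= mem:
                free[agent][0] -= task.cpus
                free[agent][1] -= task.mem_mb
                placements[task.name] = agent
                break
        else:
            # No offer fits; the task waits for the next offer cycle.
            remaining.append(task)
    return placements, remaining


offers = [Offer("agent-1", 4, 8192), Offer("agent-2", 2, 4096)]
tasks = [Task("web", 3, 4096), Task("cron", 2, 2048), Task("big", 8, 1024)]
placements, remaining = match_offers(offers, tasks)
```

A real framework would also decline unused offers and handle task status updates; those parts are left out here.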

Network design settled on Neutron + OVS + VLAN for stability, with ongoing experiments on DPDK, hardware acceleration, and a lightweight SDN controller based on BGP EVPN and VXLAN offload.

Operational challenges with Docker and Mesos (high container churn, kernel lockups, API performance) led to the creation of a custom Go‑implemented CExecutor that batches container launches, reducing CPU load and jitter.
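The batching idea can be sketched as follows. CExecutor itself is written in Go and its internals are not described in the article, so this Python version is only an illustration: `LaunchBatcher`, `launch_batch`, and `max_batch` are hypothetical names, and the size-based flush rule is an assumption.

```python
from typing import Callable, List


class LaunchBatcher:
    """Collects launch requests and hands them to the runtime in groups."""

    def __init__(self, launch_batch: Callable[[List[str]], None], max_batch: int = 10):
        self._launch_batch = launch_batch  # hypothetical callback that starts containers
        self._max_batch = max_batch
        self._pending: List[str] = []

    def submit(self, container_id: str) -> None:
        """Queue one launch request; flush once the batch is full."""
        self._pending.append(container_id)
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Launch everything queued so far in a single call."""
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._launch_batch(batch)


calls: List[List[str]] = []
batcher = LaunchBatcher(calls.append, max_batch=3)
for i in range(7):
    batcher.submit(f"c{i}")
batcher.flush()  # drain the partial batch at the end
```

Seven requests reach the runtime as three calls instead of seven, which is the effect that reduces per-launch overhead. A production version would likely also flush on a timer so a partial batch does not wait indefinitely.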

Monitoring combines open‑source tools (Telegraf, InfluxDB, Grafana) with Ctrip's proprietary Hickwall system, providing per‑container metrics, automatic discovery, and aggregated business‑level health dashboards that predict order volume and trigger alerts.
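As a rough illustration of rolling per-container metrics up to the application level and alerting against a predicted order volume, consider the sketch below. The article does not describe Hickwall's data model, prediction method, or thresholds, so `aggregate_by_app`, `order_alert`, and the 20% tolerance are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_app(samples):
    """Average a metric across all containers of each application.

    samples: iterable of (app, container_id, cpu_percent) tuples.
    """
    per_app = defaultdict(list)
    for app, _container_id, cpu_percent in samples:
        per_app[app].append(cpu_percent)
    return {app: mean(values) for app, values in per_app.items()}


def order_alert(predicted: int, actual: int, tolerance_pct: int = 20) -> bool:
    """Return True when actual orders fall short of the prediction
    by more than tolerance_pct percent (hypothetical rule)."""
    return actual * 100 < predicted * (100 - tolerance_pct)


samples = [("hotel", "c1", 40), ("hotel", "c2", 60), ("flight", "c3", 10)]
app_cpu = aggregate_by_app(samples)
```

Here `order_alert(1000, 700)` fires because orders are 30% below the prediction, while `order_alert(1000, 900)` does not.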

The CDOS (Ctrip Data Center Operating System) layer orchestrates resources across multiple data centers. It handles both cron jobs and long-running services, supports both Docker and Windows containers, and distributes images within seconds via Ceph.

Overall, the article presents a comprehensive view of Ctrip's container cloud architecture, scaling strategies, orchestration choices, network configuration, monitoring infrastructure, and future directions.
