
How Ctrip Achieved Seconds‑Level Scaling with a Container Cloud

Ctrip built a private container cloud to handle massive seasonal traffic spikes. The platform scales resources up and down quickly and automatically, and it improves deployment speed, resource utilization, and operational intelligence across more than 20 business units.

Efficient Ops

Online Travel and Elastic Demand

Ctrip's travel business has grown explosively, with revenue increasing 76% year‑over‑year in 2016 and GMV projected to exceed 1 trillion RMB in 2021. Seasonal traffic spikes require rapid, large‑scale capacity expansion, but traditional VM provisioning takes ten minutes per instance, which is too slow to absorb those spikes.

Ctrip Container Cloud Positioning

The platform focuses on four goals: continuous delivery within seconds for 20+ business units (BUs), better resource utilization through billing and monitoring, offering components such as MySQL, Redis, and RabbitMQ as services, and advancing automation toward intelligent, self‑healing infrastructure.

Basic Container Deployment Principles

One container per application

One routable IP per container

Immutable container images

All agents run on the host, not inside containers

These principles stem from extensive operational experience.

Orchestration Selection & Trade‑offs

Ctrip initially used OpenStack with a Nova‑Docker module, but its complexity and slow scheduling (API latency of 10 s or more) led to a shift toward Mesos and custom frameworks. Kubernetes (K8s) offered advanced features but conflicted with the existing deployment pipelines, while Mesos was easier to integrate with Ctrip's services.

Container Network Selection

Ctrip adopted Neutron + OVS + VLAN for stable, transparent networking, while also evaluating DPDK and hardware acceleration. VXLAN + BGP EVPN is being prototyped with a lightweight SDN controller, and flannel‑style networks were rejected because of IP migration issues.

Docker‑Related Issues

Frequent container creation/destruction caused kernel soft lockups and performance bottlenecks. Ctrip built a custom Mesos framework and a Go‑based CExecutor to batch jobs, reducing container churn, CPU load spikes, and related bugs.
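The source does not describe how CExecutor batches jobs, so the following is only a minimal sketch of the general idea, assuming that jobs sharing an image can run in one container. The `Job` type, the `batch_jobs_by_image` function, and the `max_batch_size` parameter are hypothetical names introduced for illustration; they are not part of Mesos or of Ctrip's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical short-lived job; not a Mesos or CExecutor type."""
    job_id: str
    image: str    # container image the job needs
    command: str


def batch_jobs_by_image(jobs, max_batch_size=10):
    """Group jobs that share an image so one container can serve a whole batch.

    Returns a list of (image, [jobs]) tuples. Each tuple stands for a single
    container start instead of one start per job.
    """
    by_image = defaultdict(list)
    for job in jobs:
        by_image[job.image].append(job)

    batches = []
    for image, group in by_image.items():
        # Cap the batch size so one container does not run an unbounded queue.
        for start in range(0, len(group), max_batch_size):
            batches.append((image, group[start:start + max_batch_size]))
    return batches


jobs = [
    Job("j1", "img-a", "run report"),
    Job("j2", "img-a", "run cleanup"),
    Job("j3", "img-b", "run sync"),
    Job("j4", "img-a", "run export"),
    Job("j5", "img-b", "run backup"),
]
batches = batch_jobs_by_image(jobs, max_batch_size=2)
print(f"container starts without batching: {len(jobs)}")
print(f"container starts with batching:    {len(batches)}")
```

In this sketch, fewer container starts is the whole point: each avoided create/destroy cycle is one less chance to hit the kernel and load problems described above. A real executor would also need to isolate failures between jobs in the same batch.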

Container Monitoring Solution

The in‑house "hickwall" system collects CPU, memory, and disk metrics through the Docker client and cgroups, automatically discovers new containers, and aggregates data per application cluster. It also monitors business‑level metrics such as order volume, enabling predictive scaling with >95% accuracy.
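As a rough illustration of per-cluster aggregation, the sketch below rolls individual container samples up into one summary per application. It is not hickwall's implementation: the sample fields (`app`, `cpu_pct`, `mem_mb`) and the `aggregate_by_app` function are assumptions made for this example, and the samples are hard-coded rather than read from Docker or cgroups.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_app(samples):
    """Roll per-container samples up into one summary per application.

    Each sample is a dict with hypothetical keys:
    'container', 'app', 'cpu_pct', and 'mem_mb'.
    """
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["app"]].append(sample)

    summary = {}
    for app, rows in grouped.items():
        summary[app] = {
            "containers": len(rows),
            "avg_cpu_pct": mean(row["cpu_pct"] for row in rows),
            "total_mem_mb": sum(row["mem_mb"] for row in rows),
        }
    return summary


samples = [
    {"container": "c1", "app": "booking", "cpu_pct": 40, "mem_mb": 512},
    {"container": "c2", "app": "booking", "cpu_pct": 60, "mem_mb": 256},
    {"container": "c3", "app": "search", "cpu_pct": 10, "mem_mb": 128},
]
for app, stats in aggregate_by_app(samples).items():
    print(app, stats)
```

Grouping by application rather than by host means a newly discovered container only has to carry its application label to show up in the right cluster view.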

CDOS Overview

CDOS (Ctrip Data Center Operating System) orchestrates resources across multiple data centers, scheduling compute, network, and storage for both long‑running services and cron jobs. It relies on Mesos for low‑level resource allocation, supports Windows containers, and uses Ceph for fast image distribution, with the aim of delivering resources within seconds.
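Mesos hands resource offers to frameworks, and each framework decides which tasks to launch on them. The source does not say how CDOS makes that decision, so the sketch below uses a plain first-fit policy purely as an illustration. `Offer`, `Task`, and `place_tasks` are hypothetical names and do not correspond to the Mesos API.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Hypothetical resource offer from one host (not the Mesos API type)."""
    host: str
    cpus: float
    mem_mb: int


@dataclass
class Task:
    """Hypothetical task: a long-running service or a cron job."""
    name: str
    cpus: float
    mem_mb: int


def place_tasks(tasks, offers):
    """First-fit placement: put each task on the first offer with enough room.

    Returns (placement, unplaced), where placement maps task name to host and
    unplaced lists the tasks that no offer could hold.
    """
    remaining = {o.host: [o.cpus, o.mem_mb] for o in offers}
    placement = {}
    unplaced = []
    for task in tasks:
        for host, free in remaining.items():
            if task.cpus <= free[0] and task.mem_mb <= free[1]:
                # Reserve the resources so later tasks see the reduced offer.
                free[0] -= task.cpus
                free[1] -= task.mem_mb
                placement[task.name] = host
                break
        else:
            unplaced.append(task.name)
    return placement, unplaced


offers = [Offer("host-a", 4, 8192), Offer("host-b", 2, 4096)]
tasks = [Task("web", 2, 4096), Task("cron", 3, 2048), Task("api", 2, 2048)]
placement, unplaced = place_tasks(tasks, offers)
print("placed:", placement)
print("waiting for a later offer:", unplaced)
```

First-fit is the simplest possible policy. A scheduler spanning several data centers would presumably also weigh network and storage locality, which this sketch ignores.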
