Scaling a 10,000‑Node Container Cloud: Ctrip’s Ops Practices and Lessons
This article details how Ctrip built and operates a massive container cloud platform, covering its architectural evolution, operational challenges, tooling, and capacity management, and offers practical insights for running large-scale cloud-native environments.
Introduction
The speaker shares Ctrip's experience operating a ten-thousand-node container cloud platform, describing the pitfalls encountered and the solutions adopted during the evolution from OpenStack to Kubernetes.
1. Overview of Ctrip Container Cloud
Ctrip runs three self‑built data centers plus public‑cloud resources (Alibaba Cloud, Tencent Cloud, AWS) supporting over 6,000 applications across hotel, flight, and travel services. The container cloud now manages about 2,000 nodes with 7,000 to 8,000 releases per day.
Initially, container scheduling was done with Python scripts, later migrated to a Kubernetes‑based platform. The infrastructure is primarily private‑cloud, with public‑cloud bursting during peak travel seasons.
Applications are classified as standard (Java, Python services) or specialized (e.g., cache components). The technology roadmap progressed through three stages: OpenStack (2015), a self‑built Mesos framework (2016), and full migration to Kubernetes (2017‑2018), delivering a PaaS model for containers.
2. Ctrip Container Cloud Operations Practice
2.1 Challenges
Key challenges include managing compute and storage (backed by Ceph), handling the massive increase in IP addresses caused by the one‑container‑one‑application model, which requires SDN‑based VPC isolation, and ensuring CPU and network resource isolation during traffic spikes.
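To see why the one‑container‑one‑application model strains IP address space, a back‑of‑the‑envelope calculation helps. The sketch below is illustrative (the replica count, prefix length, and reserved‑address count are assumptions, not Ctrip's actual network design):

```python
import math

def subnets_needed(containers: int, prefix: int = 24, reserved: int = 3) -> int:
    """Estimate how many IPv4 subnets of the given prefix length are needed
    when every container gets its own IP address.

    `reserved` accounts for addresses unusable in each subnet
    (network, broadcast, gateway)."""
    usable = 2 ** (32 - prefix) - reserved  # usable IPs per subnet
    return math.ceil(containers / usable)

# If 6,000 applications each run, say, 8 container replicas,
# the platform must route 48,000 individual container IPs:
print(subnets_needed(48_000))  # -> 190 (/24 subnets required)
```

At this scale, static per‑host address planning breaks down, which is why the article points to SDN and VPC isolation.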
2.2 Basic Operations
Hosts run CentOS 7.1, Docker 1.13, and kernel 4.14. Image storage uses Harbor backed by distributed storage. CPU and network isolation are enforced, and the scheduling platform is gradually being migrated to Kubernetes.
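Enforcing a host baseline like this usually involves a fleet‑wide compliance check. A minimal sketch, using the versions named in the article as the baseline (the function names and check logic here are illustrative, not Ctrip's actual tooling):

```python
# Baseline mirrors the article's stack: kernel 4.14, Docker 1.13.
BASELINE = {"kernel": (4, 14), "docker": (1, 13)}

def parse_version(text: str) -> tuple:
    """Reduce a version string like '4.14.0' to its (major, minor) pair."""
    return tuple(int(p) for p in text.split(".")[:2])

def is_compliant(kernel: str, docker: str) -> bool:
    """A host passes if both components meet or exceed the baseline."""
    return (parse_version(kernel) >= BASELINE["kernel"]
            and parse_version(docker) >= BASELINE["docker"])

print(is_compliant("4.14.0", "1.13.1"))  # True
print(is_compliant("3.10.0", "1.13.1"))  # False: host needs a kernel upgrade
```

In practice a check like this would run as a SaltStack state or scheduled job across all hosts, flagging stragglers for remediation.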
2.3 Operational Tools
Configuration management relies on SaltStack and Rundeck. Monitoring and alerting use Ctrip's in‑house Hickwall system alongside Prometheus. Logging stacks include ELK, TIGK, and Elastic Beats, feeding data into AIOps pipelines. StackStorm powers ChatOps and automated incident handling.
2.4 Operational Process
All operational standards are applied via SaltStack, including kernel upgrades. Collaboration with other leading container platforms (Alibaba, JD, NetEase) informs continuous improvement across efficiency, cost, and innovation.
2.5 Change Awareness
Tracking platform changes provides early warning of potential failures. Engineers are encouraged to investigate anomalies deeply, conduct gray‑release testing, and use dashboards for host, image, and network metrics.
2.6 Trend Monitoring
Long‑term trend analysis guides capacity planning, preventing reactive firefighting. Visual dashboards display cluster CPU/memory utilization, application container counts, and resource usage spikes.
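The shift from firefighting to planning comes down to simple extrapolation: fit a trend to utilization history and act before the line crosses a threshold. A minimal sketch of that idea (the threshold and the least‑squares approach are assumptions, not the article's actual method):

```python
def days_until_threshold(samples, threshold=80.0):
    """Fit y = a + b*x by least squares over daily utilization samples
    and return the estimated days until utilization reaches `threshold`.
    Returns None if utilization is flat or falling."""
    n = len(samples)
    xs = list(range(n))
    mx = sum(xs) / n
    my = sum(samples) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, samples))
         / sum((x - mx) ** 2 for x in xs))
    if b <= 0:
        return None  # no upward trend: no capacity deadline
    a = my - b * mx
    current = a + b * (n - 1)  # fitted value for the latest day
    return max(0.0, (threshold - current) / b)

# Cluster CPU climbing about 2 points per day from 60%:
print(days_until_threshold([60, 62, 64, 66, 68]))  # -> 6.0 days to 80%
```

With a lead time like this on the dashboard, capacity can be ordered or rebalanced ahead of the spike instead of in reaction to it.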
2.7 Capacity Management
Capacity decisions are driven by monitoring data, Hadoop‑based analytics, and PaaS scheduling to achieve elastic computing while controlling costs.
3. Summary and Outlook
The team plans to integrate hybrid‑cloud management across private and public clouds, continue advancing Kubernetes adoption, and further develop elastic computing capabilities. Emphasis remains on user‑centric platform stability, continuous delivery, and leveraging ChatOps for knowledge sharing.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, and regularly publishes widely read original technical articles. We focus on the transformation of operations and aim to accompany you throughout your operations career, growing together.