Scaling a 10,000‑Node Container Cloud: Ctrip’s Ops Practices and Lessons
This article details how Ctrip built and operates a massive container cloud platform, covering its architectural evolution, operational challenges, tooling, and capacity management, and offers practical insights for running large-scale cloud-native environments.
Introduction
The speaker shares Ctrip's experience operating a ten-thousand-node container cloud platform, describing the pitfalls encountered and the solutions adopted during the evolution from OpenStack to Kubernetes.
1. Overview of Ctrip Container Cloud
Ctrip runs three self‑built data centers plus public‑cloud resources (Alibaba Cloud, Tencent Cloud, AWS) supporting over 6,000 applications across hotel, flight, and travel services. The container cloud now manages about 2,000 nodes with 7,000 to 8,000 releases per day.
Initially, container scheduling was done with Python scripts, later migrated to a Kubernetes‑based platform. The infrastructure is primarily private‑cloud, with public‑cloud bursting during peak travel seasons.
Applications are classified as standard (Java, Python services) or specialized (e.g., cache components). The technology roadmap progressed through three stages: OpenStack (2015), a self‑built Mesos framework (2016), and full migration to Kubernetes (2017‑2018), delivering a PaaS model for containers.
2. Ctrip Container Cloud Operations Practice
2.1 Challenges
Key challenges include managing compute and storage (backed by Ceph), handling the massive increase in IP addresses caused by the one‑container‑one‑application model, which requires SDN‑based VPC isolation, and ensuring CPU and network resource isolation during traffic spikes.
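To see why the one‑container‑one‑application model strains IP address space, a back‑of‑the‑envelope calculation helps. The sketch below is illustrative (the replica count, prefix length, and reserved‑address count are assumptions, not Ctrip's actual network design):

```python
import math

def subnets_needed(containers: int, prefix: int = 24, reserved: int = 3) -> int:
    """Estimate how many IPv4 subnets of the given prefix length are needed
    when every container gets its own IP address.

    `reserved` accounts for addresses unusable in each subnet
    (network, broadcast, gateway)."""
    usable = 2 ** (32 - prefix) - reserved  # usable IPs per subnet
    return math.ceil(containers / usable)

# If 6,000 applications each run, say, 8 container replicas,
# the platform must route 48,000 individual container IPs:
print(subnets_needed(48_000))  # -> 190 (/24 subnets required)
```

At this scale, static per‑host address planning breaks down, which is why the article points to SDN and VPC isolation.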
2.2 Basic Operations
Hosts run CentOS 7.1, Docker 1.13, and kernel 4.14. Image storage uses Harbor backed by distributed storage. CPU and network isolation are enforced, and the scheduling platform is gradually being migrated to Kubernetes.
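Enforcing a host baseline like this usually involves a fleet‑wide compliance check. A minimal sketch, using the versions named in the article as the baseline (the function names and check logic here are illustrative, not Ctrip's actual tooling):

```python
# Baseline mirrors the article's stack: kernel 4.14, Docker 1.13.
BASELINE = {"kernel": (4, 14), "docker": (1, 13)}

def parse_version(text: str) -> tuple:
    """Reduce a version string like '4.14.0' to its (major, minor) pair."""
    return tuple(int(p) for p in text.split(".")[:2])

def is_compliant(kernel: str, docker: str) -> bool:
    """A host passes if both components meet or exceed the baseline."""
    return (parse_version(kernel) >= BASELINE["kernel"]
            and parse_version(docker) >= BASELINE["docker"])

print(is_compliant("4.14.0", "1.13.1"))  # True
print(is_compliant("3.10.0", "1.13.1"))  # False: host needs a kernel upgrade
```

In practice a check like this would run as a SaltStack state or scheduled job across all hosts, flagging stragglers for remediation.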
2.3 Operational Tools
Configuration management relies on SaltStack and Rundeck. Monitoring and alerting use Ctrip's in‑house Hickwall system alongside Prometheus. Logging stacks include ELK, TIGK, and Elastic Beats, feeding data into AIOps pipelines. StackStorm powers ChatOps and automated incident handling.
2.4 Operational Process
All operational standards are applied via SaltStack, including kernel upgrades. Collaboration with other leading container platforms (Alibaba, JD, NetEase) informs continuous improvement across efficiency, cost, and innovation.
2.5 Change Awareness
Tracking platform changes provides early warning of potential failures. Engineers are encouraged to investigate anomalies deeply, conduct gray‑release testing, and use dashboards for host, image, and network metrics.
2.6 Trend Monitoring
Long‑term trend analysis guides capacity planning, preventing reactive firefighting. Visual dashboards display cluster CPU/memory utilization, application container counts, and resource usage spikes.
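The shift from firefighting to planning comes down to simple extrapolation: fit a trend to utilization history and act before the line crosses a threshold. A minimal sketch of that idea (the threshold and the least‑squares approach are assumptions, not the article's actual method):

```python
def days_until_threshold(samples, threshold=80.0):
    """Fit y = a + b*x by least squares over daily utilization samples
    and return the estimated days until utilization reaches `threshold`.
    Returns None if utilization is flat or falling."""
    n = len(samples)
    xs = list(range(n))
    mx = sum(xs) / n
    my = sum(samples) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, samples))
         / sum((x - mx) ** 2 for x in xs))
    if b <= 0:
        return None  # no upward trend: no capacity deadline
    a = my - b * mx
    current = a + b * (n - 1)  # fitted value for the latest day
    return max(0.0, (threshold - current) / b)

# Cluster CPU climbing about 2 points per day from 60%:
print(days_until_threshold([60, 62, 64, 66, 68]))  # -> 6.0 days to 80%
```

With a lead time like this on the dashboard, capacity can be ordered or rebalanced ahead of the spike instead of in reaction to it.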
2.7 Capacity Management
Capacity decisions are driven by monitoring data, Hadoop‑based analytics, and PaaS scheduling to achieve elastic computing while controlling costs.
3. Summary and Outlook
The team plans to integrate hybrid‑cloud management across private and public clouds, continue advancing Kubernetes adoption, and further develop elastic computing capabilities. Emphasis remains on user‑centric platform stability, continuous delivery, and leveraging ChatOps for knowledge sharing.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, and regularly publishes widely read original technical articles. We focus on the transformation of operations and aim to accompany you throughout your operations career, growing together.