Ctrip's Container Cloud: Architecture, Self‑Developed Framework, and Engineering Practices
This article describes how Ctrip built its container‑cloud platform. It divides the platform into "under‑water" (foundations) and "over‑water" (developer‑facing practices) layers, traces three architectural stages, and covers networking, monitoring, resource‑offer fragmentation, JVM tuning, Dockerfile management, and a plugin‑based extension model that supports DevOps and micro‑service deployments.
Author Introduction Wang Xiaojun has been focusing on cloud platforms and continuous delivery since 2015, leading Ctrip's migration to a micro‑service‑friendly release system and a private‑cloud‑compatible container platform.
Motivation With the rise of micro‑services, container technologies such as Docker and Kubernetes have become central, prompting Ctrip to create its own container cloud platform.
Two‑Level Architecture Ctrip splits the container cloud into "under‑water" (the underlying infrastructure) and "over‑water" (the engineering practices that directly affect developers). Both layers must be well‑designed for the platform to truly realize DevOps goals.
Stage 1 – VM Simulation via OpenStack The initial phase used OpenStack Nova to manage Docker containers as virtual machines, allowing existing applications to run unchanged while testing compatibility.
Stage 2 – Image Release with Chronos The second phase introduced immutable delivery by publishing applications as images and using Mesos + Chronos to schedule job‑type workloads, dropping support for long‑running services to simplify testing.
This stage showed that scheduling massive numbers of concurrent jobs through Mesos and Chronos incurred high overhead, which led Ctrip to develop its own framework.
Stage 3 – Self‑Developed Framework The third stage addressed four key requirements: supporting both Job and Service workloads, assigning independent IPs to each container, handling stateful applications, and providing a complete monitoring system.
Network Ctrip requires a single‑IP‑per‑container model, using Neutron + OVS + VLAN. A custom initialization hook fetches network information (e.g., subnet, Neutron port) after container start, persisting the configuration.
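The hook's two steps (turn the fetched port details into a configuration, then persist it) can be sketched roughly as follows. This is only an illustration: the field names (`port_id`, `ip`, `prefix_len`, `gateway`, `vlan_id`) and the JSON file layout are hypothetical, not Ctrip's or Neutron's actual schema, and the call that fetches the port from Neutron is left out.

```python
import ipaddress
import json
import os
import tempfile


def build_network_config(port):
    """Validate hypothetical port details and return the config to persist."""
    iface = ipaddress.ip_interface(f"{port['ip']}/{port['prefix_len']}")
    gateway = ipaddress.ip_address(port["gateway"])
    # A gateway outside the subnet usually means the wrong port was fetched.
    if gateway not in iface.network:
        raise ValueError(f"gateway {gateway} is outside subnet {iface.network}")
    return {
        "port_id": port["port_id"],
        "address": str(iface),
        "subnet": str(iface.network),
        "gateway": str(gateway),
        "vlan_id": port["vlan_id"],
    }


def persist_config(config, path):
    """Write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(config, handle, indent=2)
    os.replace(tmp_path, path)
```

Persisting the result means the container can come back with the same IP after a restart without querying the network service again, which is presumably why the hook stores it.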
Monitoring Monitoring consists of two parts: Mesos cluster health (using Telegraf, InfluxDB, Grafana) and container‑level metrics (via the in‑house Hickwall agent, which auto‑discovers containers and aggregates metrics per application cluster).
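Aggregating container metrics per application cluster might look like the sketch below. It is not Hickwall's implementation; the sample fields (`app`, `cpu_pct`, `mem_mb`) are hypothetical and stand in for whatever the agent collects from each discovered container.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_app(samples):
    """Group per-container samples by application and summarize each group."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample["app"]].append(sample)

    summary = {}
    for app, rows in grouped.items():
        summary[app] = {
            "containers": len(rows),
            "cpu_pct_avg": mean(row["cpu_pct"] for row in rows),
            "mem_mb_total": sum(row["mem_mb"] for row in rows),
        }
    return summary


samples = [
    {"container": "c1", "app": "booking", "cpu_pct": 40, "mem_mb": 900},
    {"container": "c2", "app": "booking", "cpu_pct": 60, "mem_mb": 1100},
    {"container": "c3", "app": "search", "cpu_pct": 10, "mem_mb": 500},
]
summary = aggregate_by_app(samples)
```

Keying the rollup on the application rather than the container keeps dashboards stable even though individual containers come and go.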
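Resource Offer Fragmentation When Mesos hands out resources in small offers (for example, 2‑core chunks), a larger request cannot be satisfied by any single offer, a problem Ctrip calls "offer fragmentation". Ctrip mitigates this by shortening the offer timeout, so unused offers are reclaimed sooner and their resources are re‑offered.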
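A toy model makes the effect concrete. It is not Mesos's allocator: the offer dictionaries, the agent names, and the assumption that reclaimed resources come back as one combined offer per agent are all simplifications for illustration.

```python
def first_fit(offers, cpus_needed):
    """Return the index of the first single offer that fits, or None."""
    for index, offer in enumerate(offers):
        if offer["cpus"] >= cpus_needed:
            return index
    return None


def reoffer_after_timeout(offers):
    """Model timed-out offers being reclaimed and re-offered per agent.

    Assumption: once reclaimed, an agent's free CPUs are offered together.
    """
    per_agent = {}
    for offer in offers:
        per_agent[offer["agent"]] = per_agent.get(offer["agent"], 0) + offer["cpus"]
    return [{"agent": agent, "cpus": cpus} for agent, cpus in per_agent.items()]


# Two 2-core fragments on the same agent cannot host a 4-core task...
fragmented = [{"agent": "a1", "cpus": 2}, {"agent": "a1", "cpus": 2}]
before = first_fit(fragmented, 4)

# ...but after the offers time out and are re-offered together, it fits.
after = first_fit(reoffer_after_timeout(fragmented), 4)
```

The shorter the timeout, the less time fragments sit idle with a framework that cannot use them, at the cost of more offer churn.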
Engineering Practices
* CI/CD supports both image‑based and code‑package releases, so applications can migrate gradually.
* SSH is replaced by a web console that execs into containers.
* Tomcat runs under Supervisord rather than as the container's main process; otherwise a Tomcat failure or restart would cause the container itself to exit.
* JVM OOM issues were traced to the CPU quota being mis‑reported inside the container; Ctrip now derives JVM settings from the container flavor (e.g., Xmx = 80% of flavor memory).
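The flavor‑based JVM sizing can be sketched as a small helper. The 80% figure comes from the text above; the flavor names and their memory sizes are hypothetical, and a real setup would likely derive other settings (such as GC thread counts) from the flavor as well.

```python
# Hypothetical flavor catalog: name -> memory in MB.
FLAVORS_MB = {"2C4G": 4096, "4C8G": 8192}


def jvm_heap_flag(flavor_memory_mb, heap_percent=80):
    """Derive -Xmx from the container flavor instead of what the JVM detects."""
    if flavor_memory_mb <= 0:
        raise ValueError("flavor memory must be positive")
    if not 0 < heap_percent <= 100:
        raise ValueError("heap_percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point rounding surprises.
    heap_mb = flavor_memory_mb * heap_percent // 100
    return f"-Xmx{heap_mb}m"


flag = jvm_heap_flag(FLAVORS_MB["2C4G"])  # "-Xmx3276m"
```

Leaving roughly a fifth of the flavor's memory outside the heap gives room for metaspace, thread stacks, and other off‑heap usage within the container's limit.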
Dockerfile Management Unrestricted Dockerfile customization can break standards; Ctrip introduces a "plugin" model where common functionalities (e.g., installing FTP, Jacoco) are packaged as plugins that can be selected during image build, preserving consistency while allowing extensibility.
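One way to picture the plugin model is a build step that assembles a Dockerfile from a fixed base image plus the selected plugins. The catalog below is purely illustrative: the plugin names mirror the examples in the text, but the package names, file paths, and base image are assumptions rather than Ctrip's actual definitions.

```python
# Hypothetical plugin catalog; each plugin contributes vetted Dockerfile lines.
PLUGINS = {
    "ftp": ["RUN yum install -y vsftpd"],
    "jacoco": ["COPY jacocoagent.jar /opt/jacoco/jacocoagent.jar"],
}


def render_dockerfile(base_image, selected):
    """Build a Dockerfile from a standard base image and chosen plugins."""
    unknown = [name for name in selected if name not in PLUGINS]
    if unknown:
        raise ValueError(f"unknown plugins: {unknown}")

    lines = [f"FROM {base_image}"]
    # Sort and de-duplicate so the same selection always yields the same file.
    for name in sorted(set(selected)):
        lines.append(f"# plugin: {name}")
        lines.extend(PLUGINS[name])
    return "\n".join(lines) + "\n"


dockerfile = render_dockerfile("example/tomcat-base:1.0", ["jacoco", "ftp"])
```

Because users pick from a catalog instead of editing the Dockerfile directly, every image still starts from the standard base and only contains reviewed additions.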
Jacoco Plugin Example The Jacoco plugin installs the agent via Dockerfile, registers a Supervisord hook to notify the Jacoco service after Tomcat starts, and provides an API to control the agent at runtime.
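The article does not say how the Supervisord hook is wired up. One plausible approach is a Supervisor event listener, which receives a header line and a payload line made of space‑separated `key:value` tokens. The sketch below only covers the decision of when to notify; the process name `tomcat` is an assumption, and the call to the Jacoco service is omitted because its API is not described.

```python
def parse_tokens(line):
    """Parse Supervisor's space-separated 'key:value' token format."""
    return dict(token.split(":", 1) for token in line.split())


def should_notify(header_line, payload_line, process_name="tomcat"):
    """True when the event says the watched process has reached RUNNING."""
    header = parse_tokens(header_line)
    payload = parse_tokens(payload_line)
    return (
        header.get("eventname") == "PROCESS_STATE_RUNNING"
        and payload.get("processname") == process_name
    )


header = ("ver:3.0 server:supervisor serial:21 pool:jacoco_hook "
          "poolserial:10 eventname:PROCESS_STATE_RUNNING len:64")
payload = "processname:tomcat groupname:tomcat from_state:STARTING pid:2766"
notify = should_notify(header, payload)
```

Waiting for the RUNNING state rather than the start request avoids telling the Jacoco service about an agent whose host process never came up.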
Conclusion DevOps and containerization require practical, ground‑up solutions; Ctrip's container cloud demonstrates that a well‑designed infrastructure, engineering practices, and extensible services must evolve together to achieve reliable, scalable micro‑service deployments.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.