How Ctrip Scales Continuous Delivery: 6000 Daily Deploys with Jenkins & K8s
This article describes Ctrip's large‑scale continuous delivery practice: the deployment pipeline, the unified build platform and its Jenkins integration, Kubernetes‑based elastic agents, and monitoring. It also covers the challenges and improvements that came with handling thousands of daily deployments across multiple environments.
1. Ctrip Continuous Delivery Overview
Ctrip manages over 8,000 applications with more than 3,000 developers, performing over 6,000 deployments daily, making continuous delivery a critical capability.
Continuous delivery pays off in five ways: efficiency, because multi‑environment deployments are automated; quality, because code scans and tests are built into the flow; safety, because fewer manual steps mean fewer manual errors; collaboration, because small, fast iterations keep teams in step; and transparency, because everyone works with the same standards and tooling.
The simplified delivery flow starts with developers pushing code, followed by scanning, unit and integration testing, version creation, packaging, deployment to test environments, automated testing, QA approval, and progressive promotion to further test stages and production.
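The flow above is essentially an ordered list of gates where a failure stops everything downstream. A minimal sketch of that idea follows; the stage names mirror the text, while `run_pipeline` and the placeholder checks are hypothetical and do not represent Ctrip's tooling.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a check that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure (fail fast)."""
    completed: List[str] = []
    for name, check in stages:
        if not check():
            return False, completed
        completed.append(name)
    return True, completed


def always_pass() -> bool:
    # Placeholder check; a real stage would call a scanner, test runner, etc.
    return True


def always_fail() -> bool:
    # Placeholder for a failing gate, used to show the fail-fast behavior.
    return False


STAGES: List[Stage] = [
    ("code_scan", always_pass),
    ("unit_test", always_pass),
    ("integration_test", always_pass),
    ("create_version", always_pass),
    ("package", always_pass),
    ("deploy_test_env", always_pass),
    ("automated_test", always_pass),
    ("qa_approval", always_pass),
    ("promote", always_pass),
]
```

Replacing any check with `always_fail` makes `run_pipeline` return `False` together with only the stages that completed before the failing gate.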
Branch management uses a Master branch and Feature branches, visualized to expose conflicts early and allow hot‑fixes to be merged without disrupting ongoing work.
Versioning goes beyond Git tags; source code is packaged into explicit versions to capture both code and environment dependencies, especially after container adoption.
Deployment evolution progressed from VM‑based single‑machine multi‑application deployments (pre‑2015) to a single‑application model, then to “fat containers” in 2016, and finally to a container‑orchestrated platform supporting VMs, containers, and bare metal across private and public clouds.
Key concepts include:
Group – a set of services exposed together, the basic deployment unit.
Pull‑in/Pull‑out – controlling traffic to a group member.
Bastion – the first verified machine in production, similar to a canary.
Ignition – successful initialization, pre‑heat, and data loading.
Batch – rolling deployment in multiple batches.
Downgrade – bypassing pull‑in/out to keep the service running during failures.
Brake – halting deployment when failure rates exceed a threshold.
Rollback – reverting to a stable version.
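The concepts above combine into a simple control loop: deploy to the bastion first, continue in batches, and brake when the failure rate crosses a threshold so the touched members can be rolled back. The sketch below illustrates that loop under stated assumptions: the function names, the `deploy_one` callback (standing in for pull‑out, install, ignition, and pull‑in), and the default threshold are all hypothetical.

```python
from typing import Callable, Dict, List


def split_batches(members: List[str], batch_size: int) -> List[List[str]]:
    """Put the bastion (first member) in its own batch, then chunk the rest."""
    if not members:
        return []
    batches = [[members[0]]]
    rest = members[1:]
    for i in range(0, len(rest), batch_size):
        batches.append(rest[i:i + batch_size])
    return batches


def rolling_deploy(
    members: List[str],
    batch_size: int,
    deploy_one: Callable[[str], bool],
    brake_threshold: float = 0.2,
) -> Dict[str, object]:
    """Deploy batch by batch; brake when failed / attempted exceeds the threshold."""
    touched: List[str] = []
    failed: List[str] = []
    for batch in split_batches(members, batch_size):
        for member in batch:
            touched.append(member)
            if not deploy_one(member):
                failed.append(member)
        if len(failed) / len(touched) > brake_threshold:
            # Brake: stop here; everything in `touched` is a rollback candidate.
            return {"status": "braked", "touched": touched, "failed": failed}
    return {"status": "done", "touched": touched, "failed": failed}
```

Because the bastion is a batch of one, a failed bastion yields a failure rate of 100% and brakes the rollout before any other member is touched.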
2. Unified Build Platform Design
The build platform, heavily based on Jenkins, provides a unified API layer for various build requests and a worker layer that schedules jobs to appropriate Jenkins masters.
Jenkins offers a rich plugin ecosystem and pipeline‑as‑code, but each master is a single point of failure and has limited scalability. Mitigations include active‑passive master pairs, Keepalived virtual IPs for failover, running Jenkins masters on Mesos or Kubernetes, or using CloudBees.
Master scaling is addressed by splitting masters across environments, product lines, organizations, plugin sets, and access controls. Capacity estimation uses a formula based on developer count to predict the number of jobs, masters, and executors; Ctrip runs ~12,000 builds daily across >20,000 jobs.
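The article does not give the estimation formula itself, so the following is only one plausible shape for it, with every ratio an explicit assumption. It scales jobs and daily builds linearly with developer count, sizes masters by a per‑master job limit, and sizes executors from peak build arrival rate times average build duration. The parameter names and all sample values other than the developer count are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class CapacityEstimate:
    jobs: int
    masters: int
    executors: int


def estimate_capacity(
    developers: int,
    jobs_per_developer: float,
    builds_per_developer_per_day: float,
    peak_hour_share: float,      # fraction of daily builds in the busiest hour
    avg_build_seconds: float,
    max_jobs_per_master: int,
) -> CapacityEstimate:
    jobs = math.ceil(developers * jobs_per_developer)
    masters = math.ceil(jobs / max_jobs_per_master)
    daily_builds = developers * builds_per_developer_per_day
    # Concurrent builds at peak = arrival rate (builds/s) * average duration (s).
    executors = math.ceil(daily_builds * peak_hour_share * avg_build_seconds / 3600)
    return CapacityEstimate(jobs=jobs, masters=masters, executors=executors)


# Illustrative inputs only: 3,000 developers comes from the article,
# the remaining ratios are assumptions chosen for the example.
example = estimate_capacity(
    developers=3000,
    jobs_per_developer=7,
    builds_per_developer_per_day=4,
    peak_hour_share=0.15,
    avg_build_seconds=300,
    max_jobs_per_master=2000,
)
```

The figures quoted above (roughly 12,000 builds a day and more than 3,000 developers) work out to about four builds per developer per day, which is why that ratio is used in the example.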
Requests enter through the API layer, which dispatches jobs to workers; each worker then selects a suitable Jenkins master based on labels and remaining capacity. Monitoring covers three levels: OS metrics, application availability (API, worker, Jenkins), and business‑logic metrics such as queue length and capacity thresholds.
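Selecting a master by labels and capacity can be sketched as a filter followed by a ranking. This is an illustration rather than Ctrip's implementation: the `Master` fields, the use of queue length as the capacity signal, and the "most free slots wins" rule are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Master:
    name: str
    labels: FrozenSet[str]
    max_queue: int          # hypothetical capacity threshold for this master
    queue_length: int = 0
    healthy: bool = True

    def free_slots(self) -> int:
        return self.max_queue - self.queue_length


def select_master(masters: List[Master], required: FrozenSet[str]) -> Optional[Master]:
    """Return the healthy master that carries all required labels and has the most room."""
    candidates = [
        m for m in masters
        if m.healthy and required <= m.labels and m.free_slots() > 0
    ]
    if not candidates:
        return None  # caller can queue the request or raise an alert
    return max(candidates, key=lambda m: m.free_slots())
```

Returning `None` instead of picking an overloaded master keeps the queue‑length and capacity metrics meaningful, since saturation shows up as unassigned requests rather than as silently degraded builds.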
3. Jenkins on Kubernetes Practice
Jenkins agents (slaves) are elastically scheduled on Kubernetes. Parameters like initialDelay, decay, and a threshold m control when new agents are created, balancing responsiveness and resource waste.
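To show how these three parameters could interact, here is a simplified sketch loosely modeled on exponentially smoothed load statistics. It assumes that `decay` smooths the observed queue length, that `initialDelay` suppresses decisions for the first few sampling ticks, and that `m` is a margin that lowers the bar for starting an agent. The class and method names are hypothetical, and the real provisioning logic is more involved.

```python
import math


class AgentProvisioner:
    def __init__(self, initial_delay_ticks: int, decay: float, m: float):
        self.initial_delay_ticks = initial_delay_ticks
        self.decay = decay          # closer to 1.0 = smoother, slower to react
        self.m = m                  # margin: higher = more eager to provision
        self.ticks = 0
        self.smoothed_queue = 0.0
        self.pending_agents = 0     # agents requested but not yet online

    def tick(self, queue_length: int, idle_executors: int) -> int:
        """Record one load sample and return how many agents to start now."""
        self.ticks += 1
        self.smoothed_queue = (
            self.decay * self.smoothed_queue + (1 - self.decay) * queue_length
        )
        if self.ticks <= self.initial_delay_ticks:
            return 0
        excess = self.smoothed_queue - idle_executors - self.pending_agents
        if excess <= 1 - self.m:
            return 0
        to_start = math.floor(excess + self.m)
        self.pending_agents += to_start
        return to_start

    def agent_online(self, count: int = 1) -> None:
        """Call when requested agents have connected and count as real capacity."""
        self.pending_agents = max(0, self.pending_agents - count)
```

In this model a lower `decay` or a higher `m` starts agents sooner at the cost of occasionally starting ones that end up idle, which is the responsiveness versus waste trade‑off described above.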
Workspace retention is handled by sharing a common volume between master and agents, avoiding data loss when agents are terminated. Job affinity ensures a job prefers the master where its workspace already resides.
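Job affinity can be pictured as a lookup that remembers where each job's workspace lives and falls back to the least‑loaded master only when the preferred one is full. The sketch below assumes a simple in‑memory table and a per‑master count of free build slots; `AffinityScheduler` and its methods are hypothetical names.

```python
from typing import Dict, Optional


class AffinityScheduler:
    def __init__(self, free_slots: Dict[str, int]):
        self.free_slots = dict(free_slots)          # free build slots per master
        self.workspace_home: Dict[str, str] = {}    # job name -> master holding its workspace

    def assign(self, job: str) -> Optional[str]:
        """Prefer the master that already holds the job's workspace."""
        home = self.workspace_home.get(job)
        if home is not None and self.free_slots.get(home, 0) > 0:
            chosen = home
        else:
            available = [m for m, free in self.free_slots.items() if free > 0]
            if not available:
                return None
            # Fallback: least-loaded master; the workspace will be rebuilt there.
            chosen = max(available, key=lambda m: self.free_slots[m])
            self.workspace_home[job] = chosen
        self.free_slots[chosen] -= 1
        return chosen

    def release(self, master: str) -> None:
        """Free a slot when a build finishes."""
        self.free_slots[master] += 1
```

Reusing the existing workspace lets incremental checkouts and cached dependencies pay off, while the fallback keeps a busy master from blocking the job indefinitely.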
StatefulSets manage Jenkins masters, providing stable pod identities and node affinity, while a custom CHostpath volume driver mounts shared directories to both masters and agents, improving build performance.
4. Issues and Improvements
Challenges include managing per‑environment Docker images, handling non‑standard containerized applications, and mixed‑resource utilization. Solutions involve moving configuration out of code, standardizing image builds, and enabling resource sharing across workloads.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
