
How Ctrip Scales Application Operations: Practices, Automation, and Reliability

This talk details Ctrip's application operations framework, covering data‑center scale, multi‑application deployment on Windows, high availability goals, capacity‑prediction models, disaster‑recovery design, incident response, and the evolution from manual tooling to automated, intelligent operations.

Efficient Ops

Oct 6, 2016


Editor: Mouse

Hello, I am Chen Jie from Ctrip's Technology Assurance Center, where I am responsible for application operations.

Today's talk has three parts: an overview of Ctrip's application operations system and best practices, the journey of operations automation, and recent explorations in intelligent operations.

Ctrip operates four data centers with tens of thousands of servers, supporting more than 30 business units and over 5,000 active services. Daily request volume reaches 3 billion. Applications run primarily on .NET and Java, and most servers run Windows.

Key challenges include supporting ten‑fold business growth (traffic doubles each year, with more than 4,000 releases per week) and a deployment model that places multiple applications on each Windows host, sometimes 40 to 50 apps on a single machine. Dependency chains are deep, reaching up to 15 levels. Against this background the availability target is 99.9%, and Q1 actually reached 99.99%.
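To put the availability figures in perspective, the arithmetic below converts an availability percentage into a yearly downtime budget and shows how availability compounds along a serial call chain. It is a minimal sketch under two assumptions that are not from the talk: a 365-day year, and a strictly serial chain whose layers fail independently. The function names are illustrative.

```python
def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime, in minutes, over a 365-day year (assumption)."""
    minutes_per_year = 365 * 24 * 60
    return (1.0 - availability) * minutes_per_year


def chain_availability(per_layer: float, layers: int) -> float:
    """Availability of a serial call chain, assuming independent layers."""
    return per_layer ** layers


print(round(downtime_minutes_per_year(0.999), 1))   # 525.6
print(round(downtime_minutes_per_year(0.9999), 1))  # 52.6
print(round(chain_availability(0.9999, 15), 4))     # 0.9985
```

Under these assumptions, a 15-level chain in which every layer individually achieves 99.99% would deliver only about 99.85% end to end, which is one reason deep dependency chains make a 99.9% target demanding.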

An application operations engineer bridges system operations and product development, building configuration and delivery pipelines and ensuring a stable, reliable user experience. These engineers act as both drivers and maintainers of services, embodying Site Reliability Engineering (SRE) principles.

Core goals are stability, performance, delivery efficiency, and cost control (hardware & personnel). Automation reduces hardware waste and human effort, keeping operations headcount growth below business growth.

Security is also critical: proper authorization, rate limiting, and protection against misuse are mandatory.

Key practices include an application review process before release, assessing non‑functional requirements (statelessness, fault tolerance, circuit‑breaker, degradation), and capacity management through predictive modeling that links business metrics (request count, response time) with system metrics (CPU, bandwidth). Models are validated via online load testing and used for automatic scaling.
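One simple way to link a business metric to a system metric is a straight-line fit of CPU utilization against request rate, followed by a server-count estimate at a chosen CPU ceiling. The sketch below illustrates that idea only; the talk does not say which model Ctrip uses, and the sample data, the 60% ceiling, and the function names are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def servers_needed(total_rps, slope, intercept, cpu_ceiling=60.0):
    """Servers required so that each one stays at or below cpu_ceiling percent."""
    max_rps_per_server = (cpu_ceiling - intercept) / slope
    return math.ceil(total_rps / max_rps_per_server)


# Hypothetical load-test samples for one server: (requests per second, CPU %).
samples = [(100, 15.0), (200, 25.0), (300, 35.0), (400, 45.0)]
rps, cpu = zip(*samples)
slope, intercept = fit_line(rps, cpu)

# Estimated fleet size for a forecast peak of 10,000 requests per second.
print(servers_needed(10_000, slope, intercept))  # 19
```

In practice the fitted line would be checked against online load-test results before it is trusted for scaling decisions, as the talk describes; a linear model is only reasonable within the load range that was actually measured.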

Disaster‑recovery sites are built across multiple data centers to ensure independent operation and seamless traffic shifting, supporting blue‑green deployments and regular failover drills.
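Traffic shifting between sites can be pictured as adjusting routing weights. The following sketch redistributes a failed site's share across the surviving sites in proportion to their existing weights. It is an illustration under assumed names (`dc-a`, `dc-b`, `dc-c`, `shift_traffic`), not a description of Ctrip's actual routing layer.

```python
def shift_traffic(weights, failed_site):
    """Return new routing weights with failed_site drained to zero.

    The failed site's share is spread over the remaining sites in
    proportion to the weights they already carry.
    """
    survivors = {
        site: w for site, w in weights.items()
        if site != failed_site and w > 0
    }
    total = sum(survivors.values())
    if total <= 0:
        raise RuntimeError("no healthy site left to absorb traffic")
    return {
        site: (survivors[site] / total if site in survivors else 0.0)
        for site in weights
    }


# Hypothetical three-site layout; weights are fractions of total traffic.
current = {"dc-a": 0.5, "dc-b": 0.3, "dc-c": 0.2}
after_failover = shift_traffic(current, "dc-a")
for site, weight in after_failover.items():
    print(site, round(weight, 2))
```

With these example weights, draining `dc-a` leaves `dc-b` carrying roughly 60% and `dc-c` roughly 40%. A real failover would also need to confirm that each surviving site has the capacity to absorb its new share, which is what regular drills are meant to verify.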

The NOC watches real‑time metrics on 32 large screens. Severe incidents trigger high‑priority escalation and conference calls, available around the clock, that bring in operations and development engineers. Incident analysis (COE) traces root causes down to specific lines of code and drives continuous improvement.

The automation journey progressed through three stages: tooling (replacing manual steps with remote‑execution tools and batch handlers), data‑driven management (building an application‑centric configuration database with automated updates), and platformization (fully automated, unattended pipelines built on a ticket‑service workflow and message queues).
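The platformization stage, in which tickets flow through a queue and are executed without a human in the loop, can be sketched roughly as follows. This is a toy model: Python's in-process `queue.Queue` stands in for a real message broker, and the `Ticket` fields, the `restart` action, and the handler table are hypothetical.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Ticket:
    ticket_id: int
    action: str
    target: str
    status: str = "pending"
    log: list = field(default_factory=list)


def restart_app(ticket):
    # Placeholder for a real remote-execution call.
    ticket.log.append(f"restarted {ticket.target}")


# Only actions registered here may run unattended.
HANDLERS = {"restart": restart_app}


def drain(tickets):
    """Process every queued ticket and return the tickets in handled order."""
    handled = []
    while True:
        try:
            ticket = tickets.get_nowait()
        except queue.Empty:
            break
        handler = HANDLERS.get(ticket.action)
        if handler is None:
            ticket.status = "rejected"
        else:
            try:
                handler(ticket)
                ticket.status = "done"
            except Exception as exc:
                ticket.status = "failed"
                ticket.log.append(str(exc))
        handled.append(ticket)
    return handled


pending = queue.Queue()
pending.put(Ticket(1, "restart", "order-service"))
pending.put(Ticket(2, "format-disk", "order-service"))
results = drain(pending)
print([(t.ticket_id, t.status) for t in results])
```

Keeping an explicit table of allowed actions means an unknown request is rejected rather than executed, and the status and log on each ticket give an audit trail for unattended runs.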

Current tools provide a unified view of clusters, showing server status, CPU types, and alert counts, and allow authorized users to perform actions such as container restarts directly from the interface.

Intelligent operations experiments include automatic fault repair using sensor data and rule‑engine decisions, and root‑cause analysis based on dependency graphs that map application call chains across up to 15 layers, applying correlation algorithms to pinpoint likely failure sources.
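As a rough illustration of root-cause analysis on a dependency graph, the sketch below uses a simple heuristic: among the applications currently alerting, prefer those whose own dependencies are all healthy, and rank them by how many alerting applications sit upstream of them. This heuristic is a stand-in chosen for clarity, not the correlation algorithm from the talk, and the graph and application names are invented.

```python
def dependencies(graph, app, seen=None):
    """All apps that `app` calls, directly or transitively."""
    seen = set() if seen is None else seen
    for dep in graph.get(app, ()):
        if dep not in seen:
            seen.add(dep)
            dependencies(graph, dep, seen)
    return seen


def likely_root_causes(graph, alerting):
    """Alerting apps with no alerting dependency, ranked by upstream impact."""
    deps_of = {app: dependencies(graph, app) for app in alerting}
    ranked = []
    for app in alerting:
        if deps_of[app] & (alerting - {app}):
            continue  # something deeper in the chain is also failing
        impact = sum(
            1 for other in alerting
            if other != app and app in deps_of[other]
        )
        ranked.append((app, impact))
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical call graph: caller -> list of callees.
call_graph = {
    "web": ["order", "search"],
    "order": ["payment", "inventory"],
    "payment": ["db"],
    "search": ["cache"],
}
alerting_apps = {"web", "order", "payment", "db"}
print(likely_root_causes(call_graph, alerting_apps))  # [('db', 3)]
```

Here `web`, `order`, and `payment` are all alerting, but each depends on another alerting application, so only `db` remains as a candidate. A production system would combine this structural view with timing and metric correlation before acting on it.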

During the Q&A, the speakers discussed career longevity in operations, the importance of tool‑developer feedback loops, standards for tool creation, and the balance between tooling and system complexity.

Q&A Session

Host: Thank you, Mr. Chen!

Gao Jun: Discussing age limits shows progress; the focus should be on capability.

Chen Jie: Operations can be a stepping stone to any role.

... (additional Q&A omitted for brevity) ...


