
How Ctrip Boosted Efficiency with AIOps: Real-World AI Operations Cases

This article explores Ctrip's adoption of AIOps (AI‑driven IT operations). It covers the core concepts, typical use cases such as anomaly detection, intelligent fault diagnosis, and resource‑utilization improvement, and shows how techniques such as ARMA, FFT, and SVM have improved operational efficiency, availability, and cost.

Efficient Ops

AIOps Overview

With the rise of artificial intelligence, Ctrip's production environment entered a new AIOps era, achieving notable gains in efficiency, availability, and cost after more than two years of investment and practice.

AIOps concept diagram

Artificial intelligence is commonly divided into weak AI, which excels at specific tasks, and strong AI, which would possess human‑level perception and reasoning; current operational applications rely on weak AI. Successful AI deployment requires three essentials: algorithms, compute power, and data.

Algorithm, compute, data

Operations generate massive monitoring and alarm data, providing ideal conditions for AI. Gartner introduced the term AIOps (Algorithmic IT Operations) in 2016, emphasizing big‑data and machine‑learning‑driven automation in monitoring and service‑desk scenarios.

AIOps personnel composition

Typical AIOps Application Scenarios at Ctrip

The most mature scenarios fall into two categories: availability assurance (anomaly detection, intelligent fault diagnosis, fault prediction, automated remediation) and cost optimization (capacity planning, resource‑utilization improvement, performance tuning).

1. Application Anomaly Detection

Traditional fixed‑threshold alerts suffer from high false‑positive/negative rates, inability to capture gradual trends, and prohibitive maintenance costs across thousands of services.

Anomalous time‑series segment

By building an ARMA model and applying a 3‑sigma rule, Ctrip derives dynamic thresholds that adapt to statistical characteristics of each metric, dramatically improving precision (≈90%) and recall (≈100%).

Normal distribution and dynamic threshold

The dynamic‑threshold view plots the original series (yellow), the ARMA‑smoothed series (green), and the 3‑sigma upper bound (red), so outliers stand out clearly.
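
The dynamic-threshold idea can be sketched in a few lines of Python. This is a simplified stand-in, not Ctrip's implementation: it fits a first-order autoregressive model, AR(1), by least squares instead of a full ARMA model, and the function names, the sample metric, and the default of k = 3 are assumptions made for illustration.

```python
import math
import statistics


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    mean_prev = statistics.fmean(prev)
    mean_curr = statistics.fmean(curr)
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var if var else 0.0
    return mean_curr - phi * mean_prev, phi


def detect_anomalies(history, live, k=3.0):
    """Return indices in `live` whose one-step forecast error exceeds k sigma."""
    c, phi = fit_ar1(history)
    # Sigma is estimated from the model's own errors on the training window.
    residuals = [q - (c + phi * p) for p, q in zip(history[:-1], history[1:])]
    sigma = statistics.pstdev(residuals)
    flagged = []
    last = history[-1]
    for i, value in enumerate(live):
        forecast = c + phi * last
        if abs(value - forecast) > k * sigma:
            flagged.append(i)
            # Carry the forecast forward so a spike does not distort
            # the prediction for the next point.
            last = forecast
        else:
            last = value
    return flagged


# Hypothetical metric: a smooth wave with one injected spike at index 10.
history = [100 + 5 * math.sin(0.2 * t) for t in range(200)]
live = [100 + 5 * math.sin(0.2 * t) for t in range(200, 230)]
live[10] += 50
print(detect_anomalies(history, live))
```

Because sigma comes from each metric's own forecast errors, a noisy metric gets a wide band and a stable metric gets a tight one, which is what removes the need for hand-tuned thresholds per service.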

Detected anomalies in time series

Periodic pattern detection uses autocorrelation, filtering, and Fast Fourier Transform (FFT) to uncover seasonal cycles, enabling proactive capacity planning.
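
As a rough illustration of the frequency-domain step, the sketch below finds the dominant cycle length of a metric. To stay dependency-free it evaluates the discrete Fourier transform directly, which yields the same spectrum an FFT computes, only more slowly; a production job would more likely call an FFT routine such as `numpy.fft.rfft`. The function name and the sample series are hypothetical.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest frequency component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the zero-frequency term
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        # k-th DFT coefficient: correlation with a wave of k cycles per window.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Hypothetical hourly metric: a strong 24-sample cycle plus a weaker 6-sample one.
series = [10 + 3 * math.sin(2 * math.pi * t / 24)
          + 0.5 * math.sin(2 * math.pi * t / 6) for t in range(240)]
print(dominant_period(series))
```

Once the cycle length is known, thresholds and capacity forecasts can be compared against the same phase of earlier cycles rather than against a flat baseline.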

FFT analysis of time series

2. Intelligent Fault Diagnosis

As Ctrip's architecture grew more complex, manual fault isolation became impractical. Faults can arise in network, servers, load balancers, databases, caches, applications, or third‑party services, often propagating through call chains.

Complex website topology

Ctrip aggregates alerts, deployment, change, and configuration data, then applies factor analysis, correlation analysis, decision‑tree, Markov‑chain, call‑chain (area algorithm), and expert‑knowledge methods. Scores derived from correlation coefficients or Bayesian formulas identify the most probable root cause.

Such intelligent diagnosis reduces mean time to resolution from tens of minutes or hours to a few minutes.
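
One of the scoring ideas mentioned above, ranking candidates by correlation coefficient, can be sketched as follows. This covers only the correlation step, under the assumption that each candidate cause is available as a time series aligned with the symptom; the metric names and values are invented for the example, and the article does not specify how Ctrip combines this score with the other methods.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_root_causes(symptom, candidates):
    """Sort candidate metrics by |correlation| with the symptom, strongest first."""
    scores = {name: abs(pearson(symptom, series))
              for name, series in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: the application error rate and two candidate causes.
error_rate = [1, 1, 2, 1, 9, 12, 11, 2]
candidates = {
    "db_latency_ms": [20, 21, 22, 20, 95, 120, 110, 25],
    "cache_miss_pct": [5, 6, 5, 6, 5, 6, 5, 6],
}
for name, score in rank_root_causes(error_rate, candidates):
    print(f"{name}: {score:.2f}")
```

Correlation alone does not prove causation, which is presumably why the ranking is combined with change records, call-chain data, and expert rules before a root cause is reported.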

Alarm correlation analysis

3. Resource Utilization Improvement

Optimizing resource usage lowers operational cost while meeting business and security requirements.

Application Profiling

Using K‑means, EM and other clustering algorithms on monitoring metrics, release history, and usage patterns, Ctrip classifies applications into CPU‑intensive, memory‑intensive, I/O‑intensive, latency‑sensitive, and frequently‑deployed groups, assigning confidence scores.
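
A minimal K-means sketch shows how applications might be grouped by resource profile. The two features (average CPU and memory utilization), the sample values, and the deterministic farthest-point initialization are assumptions chosen to keep the example short and reproducible; the article does not describe Ctrip's feature set or initialization in detail.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means on tuples of floats; returns (labels, centers)."""
    # Deterministic init: start from the first point, then repeatedly add
    # the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members)
                                   for axis in zip(*members))
    return labels, centers


# Hypothetical (cpu_util, mem_util) profiles for six applications.
profiles = [(0.90, 0.20), (0.85, 0.25), (0.80, 0.30),
            (0.20, 0.90), (0.25, 0.85), (0.30, 0.80)]
labels, centers = kmeans(profiles, k=2)
print(labels, centers)
```

A confidence score could then be derived from how much closer a point lies to its own center than to the next-nearest one, although the article does not say how Ctrip computes it.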

Application profiling tags

Online/Offline Co‑location

During off‑peak hours, idle online resources are allocated to batch Hadoop/Spark jobs, achieving multi‑fold resource utilization gains and reducing expensive offline resource procurement.
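
The arithmetic behind lending idle capacity can be sketched as below. The headroom policy, the function name, and every number are hypothetical; the sketch only illustrates that the lendable amount is whatever remains after the online load and a safety reserve are subtracted.

```python
def lendable_cores(total_cores, hourly_util_pct, headroom_pct=30):
    """Cores that could be lent to batch jobs per hour, keeping a safety reserve.

    hourly_util_pct maps hour of day to the online cluster's CPU utilization
    in whole percent; headroom_pct is reserved for unexpected online bursts.
    """
    return {
        hour: total_cores * max(0, 100 - util - headroom_pct) // 100
        for hour, util in hourly_util_pct.items()
    }


# Hypothetical cluster of 1000 cores: quiet at 03:00, busy at 14:00.
print(lendable_cores(1000, {3: 15, 14: 75}))
```

During the busy hour the result drops to zero, which reflects the priority rule implied above: offline jobs only use capacity that online services are not expected to need.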

Online/Offline co‑location

Intelligent Elastic Scaling

Models built with SVM on CPU, memory, and network metrics predict capacity surplus or shortage. The Spring elastic scaling platform automatically provisions or de‑provisions resources without human intervention, cutting labor costs and improving utilization.
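
The classification step can be illustrated with a small linear SVM trained by sub-gradient descent on the hinge loss. This is a dependency-free sketch rather than the model described above: a real system would more likely use a library implementation such as scikit-learn's `sklearn.svm.SVC`, and the features, labels, hyperparameters, and decision names here are all assumptions.

```python
def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=2000):
    """Sub-gradient descent on the L2-regularized hinge loss; returns (w, b)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: move toward the sample.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: apply only the regularization shrink.
                w = [wi * (1 - lr * lam) for wi in w]
    return w, b


def scaling_decision(w, b, x):
    """Positive score means predicted shortage, otherwise predicted surplus."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "scale_out" if score > 0 else "scale_in"


# Hypothetical training data: (cpu, memory, network) utilization in [0, 1];
# +1 marks a capacity shortage, -1 marks a surplus.
samples = [(0.90, 0.80, 0.70), (0.85, 0.90, 0.80), (0.95, 0.70, 0.90),
           (0.80, 0.85, 0.75), (0.10, 0.20, 0.10), (0.20, 0.10, 0.20),
           (0.15, 0.30, 0.10), (0.30, 0.20, 0.25)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(samples, labels)
print(scaling_decision(w, b, (0.95, 0.95, 0.90)))
print(scaling_decision(w, b, (0.05, 0.05, 0.05)))
```

In a setup like this, the scaling platform would act on the decision automatically, so the quality of the labeled history largely determines how safe it is to remove the human from the loop.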

Conclusion

The convergence of Dev and Ops created DevOps; the fusion of AI and Ops gave rise to AIOps, which replaces rule‑based operations with machine‑learning‑driven decisions. Although still evolving, AIOps is poised to become indispensable for handling increasingly complex operational scenarios.

AIOps future outlook