
How Ctrip Boosted Efficiency with AIOps: Real-World AI Operations Cases

This article explores Ctrip's adoption of AIOps (AI-driven IT operations). It covers the core concepts and typical use cases such as anomaly detection, intelligent fault diagnosis, and resource-utilization improvement, and shows how techniques such as ARMA, FFT, and SVM have improved operational efficiency, availability, and cost.

Efficient Ops

Jul 31, 2018


AIOps Overview

With the rise of artificial intelligence, Ctrip's production environment entered a new AIOps era, achieving notable gains in efficiency, availability, and cost after more than two years of investment and practice.

Artificial intelligence is divided into weak AI, which excels at specific tasks, and strong AI, which possesses human‑level perception and reasoning; current operational applications rely on weak AI. Successful AI deployment requires three essentials: algorithms, compute power, and data.

Operations generate massive monitoring and alarm data, providing ideal conditions for AI. Gartner introduced the term AIOps (Algorithmic IT Operations) in 2016, emphasizing big‑data and machine‑learning‑driven automation in monitoring and service‑desk scenarios.

Typical AIOps Application Scenarios at Ctrip

The most mature scenarios fall into two categories: availability assurance (anomaly detection, intelligent fault diagnosis, fault prediction, automated remediation) and cost optimization (capacity planning, resource‑utilization improvement, performance tuning).

1. Application Anomaly Detection

Traditional fixed‑threshold alerts suffer from high false‑positive/negative rates, inability to capture gradual trends, and prohibitive maintenance costs across thousands of services.

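Ctrip instead fits an ARMA model to each metric and applies the 3-sigma rule to derive dynamic thresholds. The model supplies the expected value at each point, and a sample is flagged when it falls outside a band three standard deviations wide around that value; under a normal distribution, about 99.7% of ordinary samples fall inside such a band. Because the thresholds adapt to the statistical characteristics of each metric, precision reaches roughly 90% and recall roughly 100%.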

Figure: normal distribution and dynamic threshold

The dynamic-threshold chart plots the original series (yellow), the ARMA-smoothed series (green), and the 3-sigma upper bound (red), so points that exceed the bound stand out as outliers.
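The dynamic-threshold idea can be sketched in a few lines of Python. This is a simplified illustration, not Ctrip's implementation: it fits only the autoregressive part of an ARMA model (an AR(p) model estimated by least squares with NumPy) and then applies the 3-sigma rule to the prediction residuals. The function names, the lag order p = 3, and the synthetic traffic series are all assumptions made for the example.

```python
import numpy as np


def lag_matrix(series: np.ndarray, p: int) -> np.ndarray:
    """Build rows [1, x[t-1], ..., x[t-p]] for t = p .. len(series) - 1."""
    rows = [np.concatenate(([1.0], series[t - p:t][::-1]))
            for t in range(p, len(series))]
    return np.array(rows)


def dynamic_threshold(series, p: int = 3, n_sigma: float = 3.0):
    """Return (anomaly indices, lower bound, upper bound) for a metric series.

    Assumption: an AR(p) fit stands in for the full ARMA model in the article.
    """
    x = np.asarray(series, dtype=float)
    X = lag_matrix(x, p)
    # Least-squares estimate of the intercept and the p lag coefficients.
    coef, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    predicted = X @ coef
    # 3-sigma rule: the band width comes from the spread of the residuals.
    sigma = np.std(x[p:] - predicted)
    upper = predicted + n_sigma * sigma
    lower = predicted - n_sigma * sigma
    outside = (x[p:] > upper) | (x[p:] < lower)
    # Shift by p because the first p points have no prediction.
    return np.nonzero(outside)[0] + p, lower, upper


# Synthetic hourly metric with a daily cycle and one injected spike.
rng = np.random.default_rng(0)
t = np.arange(200)
traffic = 100 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(0, 0.5, t.size)
traffic[150] += 40
anomalies, lower, upper = dynamic_threshold(traffic)
print("anomalous indices:", anomalies)
```

Because the band follows the predicted value rather than a fixed number, the same code can be applied to metrics with very different scales. Note that a large spike also disturbs the next few predictions, so the points right after it may be flagged as well.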

Periodic pattern detection uses autocorrelation, filtering, and Fast Fourier Transform (FFT) to uncover seasonal cycles, enabling proactive capacity planning.
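As a rough illustration of the periodicity check, the sketch below uses NumPy's FFT to find the dominant cycle length of a metric and then confirms it with the autocorrelation at that lag. The function names and the synthetic two-week hourly series are assumptions for the example; the filtering step mentioned above is reduced here to removing the mean.

```python
import numpy as np


def dominant_period(series, sample_spacing: float = 1.0) -> float:
    """Estimate the strongest cycle length from the power spectrum."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()  # drop the constant component so it cannot dominate
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=sample_spacing)
    k = int(np.argmax(power[1:])) + 1  # skip the zero-frequency bin
    return 1.0 / freqs[k]


def autocorrelation(series, lag: int) -> float:
    """Normalized autocorrelation of the mean-centered series at one lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    if lag <= 0 or lag >= len(x):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))


# Two weeks of hourly samples with a clean 24-hour cycle.
hours = np.arange(24 * 14)
requests = 100 + 20 * np.sin(2 * np.pi * hours / 24)
period = dominant_period(requests)
strength = autocorrelation(requests, int(round(period)))
print("period (hours):", round(period, 2), "autocorrelation:", round(strength, 2))
```

A high autocorrelation at the FFT-derived lag is a simple cross-check that the cycle is real rather than a spectral artifact.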

2. Intelligent Fault Diagnosis

As Ctrip's architecture grew more complex, manual fault isolation became impractical. Faults can arise in network, servers, load balancers, databases, caches, applications, or third‑party services, often propagating through call chains.

Ctrip aggregates alert, deployment, change, and configuration data, then applies factor analysis, correlation analysis, decision trees, Markov chains, call-chain analysis (the area algorithm), and expert knowledge. Each candidate cause receives a score derived from correlation coefficients or Bayesian formulas, and the highest-scoring candidate is reported as the most probable root cause.
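The correlation-based scoring step might look like the following minimal sketch. It ranks candidate metrics by the absolute Pearson correlation between each one and the symptom metric over the same time window. The metric names and sample values are hypothetical, and the article does not describe the exact scoring formula Ctrip uses.

```python
import numpy as np


def rank_root_causes(symptom, candidates):
    """Rank candidate metrics by |Pearson correlation| with the symptom.

    `candidates` maps a metric name to a series aligned with `symptom`.
    """
    symptom = np.asarray(symptom, dtype=float)
    scores = {}
    for name, values in candidates.items():
        values = np.asarray(values, dtype=float)
        if symptom.std() == 0 or values.std() == 0:
            scores[name] = 0.0  # a flat series carries no correlation signal
            continue
        scores[name] = float(abs(np.corrcoef(symptom, values)[0, 1]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical window around an incident: the error rate spikes mid-window.
error_rate = [1, 1, 2, 1, 9, 10, 9, 2, 1, 1]
candidates = {
    "db_latency_ms": [5, 5, 6, 5, 50, 55, 52, 8, 5, 5],
    "cache_hit_pct": [90, 91, 90, 89, 90, 91, 90, 89, 90, 91],
}
ranking = rank_root_causes(error_rate, candidates)
print(ranking)
```

Correlation alone only suggests where to look. That is presumably why the approach above combines it with change records, call-chain data, and expert knowledge before naming a root cause.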

Such intelligent diagnosis reduces mean time to resolution from tens of minutes or hours to a few minutes.

3. Resource Utilization Improvement

Optimizing resource usage lowers operational cost while meeting business and security requirements.

Application Profiling

Using K‑means, EM and other clustering algorithms on monitoring metrics, release history, and usage patterns, Ctrip classifies applications into CPU‑intensive, memory‑intensive, I/O‑intensive, latency‑sensitive, and frequently‑deployed groups, assigning confidence scores.
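To make the profiling step concrete, here is a small K-means sketch written with NumPy only. The feature choice (average CPU and memory utilization), the sample values, and the confidence formula (one minus the ratio of the nearest to the second-nearest centroid distance) are assumptions for illustration; the article does not say how Ctrip computes its confidence scores.

```python
import numpy as np


def kmeans(points, k: int, n_iter: int = 50, seed: int = 0):
    """Plain K-means; returns (labels, centroids, confidence per point)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    # Start from k distinct points chosen at random.
    centroids = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if a cluster is empty.
        updated = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    nearest_two = np.sort(dists, axis=1)[:, :2]
    # Hypothetical confidence: near 1 when a point sits close to its own
    # centroid, near 0 when it lies between two clusters.
    confidence = 1.0 - nearest_two[:, 0] / (nearest_two[:, 1] + 1e-12)
    return labels, centroids, confidence


# Each row is one application: [avg CPU utilization, avg memory utilization].
apps = [
    [0.90, 0.20], [0.85, 0.25], [0.95, 0.15],  # CPU-heavy
    [0.20, 0.90], [0.25, 0.85], [0.15, 0.95],  # memory-heavy
]
labels, centroids, confidence = kmeans(apps, k=2)
print(labels, np.round(confidence, 2))
```

In practice the features would need to be scaled to comparable ranges first, and the cluster numbers would still have to be mapped to names such as CPU-intensive by inspecting the centroids.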

Online/Offline Co‑location

During off‑peak hours, idle online resources are allocated to batch Hadoop/Spark jobs, achieving multi‑fold resource utilization gains and reducing expensive offline resource procurement.

Intelligent Elastic Scaling

Models built with SVM on CPU, memory, and network metrics predict capacity surplus or shortage. The Spring elastic scaling platform automatically provisions or de‑provisions resources without human intervention, cutting labor costs and improving utilization.
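A capacity classifier of this kind can be sketched as a linear soft-margin SVM trained by sub-gradient descent on the hinge loss. This NumPy version is only an illustration under stated assumptions: the labels (+1 for shortage, -1 for surplus), the training samples, and the hyperparameters are made up, and a production system would more likely use a library implementation such as scikit-learn's `SVC`.

```python
import numpy as np


def train_linear_svm(X, y, lam: float = 0.01, lr: float = 0.1, epochs: int = 200):
    """Minimize lam/2 * |w|^2 + hinge loss with per-sample sub-gradient steps."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:
                # Inside the margin or misclassified: hinge term is active.
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:
                # Correct with margin: only the regularizer contributes.
                w -= lr * lam * w
    return w, b


def predict(w, b, X):
    """Return +1 (capacity shortage) or -1 (capacity surplus) per row."""
    return np.where(np.asarray(X, dtype=float) @ w + b >= 0, 1, -1)


# Hypothetical samples: [CPU, memory, network] utilization in the range 0..1.
X_train = np.array([
    [0.90, 0.80, 0.70], [0.85, 0.90, 0.60], [0.95, 0.70, 0.80], [0.80, 0.85, 0.90],
    [0.10, 0.20, 0.10], [0.20, 0.10, 0.15], [0.15, 0.25, 0.20], [0.30, 0.20, 0.10],
])
y_train = np.array([1, 1, 1, 1, -1, -1, -1, -1])
w, b = train_linear_svm(X_train, y_train)
decisions = predict(w, b, [[0.95, 0.90, 0.90], [0.10, 0.10, 0.10]])
print(decisions)
```

The predicted label would then be handed to the scaling platform, which decides how many instances to add or remove; that step is outside the scope of this sketch.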

Conclusion

The convergence of Dev and Ops created DevOps; the fusion of AI and Ops gave rise to AIOps, which replaces rule‑based operations with machine‑learning‑driven decisions. Although still evolving, AIOps is poised to become indispensable for handling increasingly complex operational scenarios.
