
AIOps at Ctrip: Concepts, Typical Application Scenarios, and Algorithmic Practices

This article introduces Ctrip's AIOps journey, explaining the AI‑driven operations concept, showcasing typical use cases such as anomaly detection, intelligent fault diagnosis, and resource utilization improvement, and detailing the underlying statistical and machine‑learning algorithms that enable these capabilities.

Ctrip Technology

Jun 19, 2018

Author bio: Xu Xinlong, a senior engineer in Ctrip's Technical Assurance Center, leads multiple AIOps projects and holds a master's degree in signal processing, with strong interests in AI, machine learning, neural networks, and their application to operations.

With the rise of AI, operations in Ctrip's production environment have entered the AIOps era. After more than two years of investment and practice, AIOps has delivered significant gains in efficiency, availability, and cost optimization.

1. Concept of AIOps

In March 2016, AlphaGo's victory over Lee Sedol brought AI back into public focus. AI is usually divided into weak AI, which outperforms humans only on specific tasks, and strong AI, which has human-level perception and reasoning; all current applications are weak AI. Successful AI depends on three essentials: algorithms, compute power, and data.

Operations work generates massive volumes of monitoring and alarm data, which provides ideal conditions for AI. Gartner defined AIOps (Algorithmic IT Operations) as using big data and machine learning to drive automation in service desks and monitoring. It can also be viewed as data-driven operations.

The AIOps team consists of operations engineers, development engineers, and AI engineers who collaborate to automate and enhance operational workflows.

2. Typical AIOps Application Scenarios at Ctrip

Two mature domains are highlighted:

Availability Assurance: anomaly metric detection, intelligent fault diagnosis, fault prediction, automatic remediation.

Cost Optimization: capacity planning, resource utilization improvement, performance tuning.

2.1 Application Anomaly Metric Detection

Traditional fixed‑threshold alerts suffer from manual threshold setting, high false‑positive/negative rates, inability to capture gradual trends, and high maintenance cost across thousands of services.

By building ARMA (autoregressive moving average) models on time-series data and applying a dynamic 3-sigma rule, Ctrip creates adaptive thresholds that identify anomalies automatically. The approach also uses alarm correlation and expert knowledge to classify alert types, achieving >90% precision and near-100% recall.

Dynamic thresholding uses the ARMA‑smoothed series as the statistical center and adds the 3‑sigma bound to form a time‑varying upper limit, making outliers easy to spot.
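The dynamic threshold can be illustrated with a minimal sketch. It is not Ctrip's implementation: to stay dependency-free it swaps the ARMA-smoothed center for a plain rolling mean over the preceding points, and the names `dynamic_upper_bounds`, `find_anomalies`, `window`, and `k` are hypothetical.

```python
from statistics import mean, stdev


def dynamic_upper_bounds(series, window=5, k=3.0):
    """Return a time-varying upper limit for each point.

    The center is a rolling mean of the previous `window` points
    (a stand-in for the ARMA-smoothed series, not an ARMA model),
    and the bound is center + k * sigma of that same history.
    Points without a full history get None. Requires window >= 2.
    """
    bounds = []
    for i in range(len(series)):
        history = series[max(0, i - window):i]
        if len(history) < window:
            bounds.append(None)
            continue
        bounds.append(mean(history) + k * stdev(history))
    return bounds


def find_anomalies(series, window=5, k=3.0):
    """Return the indices of points that exceed their dynamic upper limit."""
    bounds = dynamic_upper_bounds(series, window, k)
    return [
        i for i, (value, bound) in enumerate(zip(series, bounds))
        if bound is not None and value > bound
    ]


# Hypothetical per-minute error counts with one spike at index 7.
error_counts = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10]
anomalies = find_anomalies(error_counts)
```

Because the bound is recomputed from recent history at every step, a spike inflates the limit for the following points, so a production version would likely exclude flagged points from the history.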

Periodicity detection extracts hidden patterns by transforming a series from the time domain to the frequency domain via the fast Fourier transform (FFT), revealing seasonal cycles that guide proactive resource planning.
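As a rough illustration of the frequency-domain idea, the sketch below finds the dominant period of a metric. It uses a naive O(n^2) discrete Fourier transform from the standard library rather than an actual FFT routine, which gives the same spectrum but more slowly; the function name `dominant_period` and the sample data are assumptions for this example only.

```python
import cmath
import math


def dominant_period(series):
    """Return the period (in samples) of the strongest non-zero frequency,
    or None if the series has no variation."""
    n = len(series)
    avg = sum(series) / n
    centered = [x - avg for x in series]  # remove the zero-frequency (DC) term
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        # DFT coefficient for frequency bin k
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


# Hypothetical hourly traffic over four days with a 24-hour cycle.
hourly_traffic = [100 + 40 * math.sin(2 * math.pi * t / 24) for t in range(96)]
period = dominant_period(hourly_traffic)
```

For the synthetic series above the strongest bin corresponds to a 24-sample cycle, which is the kind of daily seasonality a capacity planner would want to surface.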

2.2 Intelligent Fault Diagnosis

Ctrip’s complex architecture spans networks, servers, load balancers, DNS, CDNs, databases, caches, applications, and third‑party services. Faults often cascade through the call chain, creating alarm storms that are hard to triage manually.

Ctrip aggregates alerts from monitoring, deployment, change, and configuration systems, then applies factor analysis, correlation analysis, decision trees, Markov chains, call-chain analysis, and expert knowledge bases to assign each candidate fault a confidence score; the highest-scoring candidate is identified as the root cause.

Correlation diagnosis computes Pearson coefficients between alarm time‑series and, combined with call‑graph relationships, distinguishes local anomalies from widespread failures.
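The correlation step can be sketched as follows. This covers only the Pearson computation between alarm series; the call-graph combination is not shown, and the service names, the `correlated_pairs` helper, and the 0.8 cutoff are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(alarm_series, threshold=0.8):
    """Return pairs of alarm sources whose series correlate at or above threshold."""
    names = sorted(alarm_series)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(alarm_series[a], alarm_series[b]) >= threshold:
                pairs.append((a, b))
    return pairs


# Hypothetical alarm counts per minute for three components.
alarms = {
    "db": [0, 1, 5, 9, 2, 0],
    "order-service": [0, 2, 10, 18, 4, 0],
    "search": [3, 3, 2, 3, 3, 2],
}
related = correlated_pairs(alarms)
```

Pairs that correlate strongly and are also adjacent in the call graph would point to a shared fault, while an alarm that correlates with nothing else is more likely a local anomaly.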

2.3 Resource Utilization Improvement

To operate more economically, Ctrip applies machine‑learning techniques:

Application Profiling: K-means and EM clustering on monitoring, release, and usage metrics label each service (CPU-intensive, memory-intensive, I/O-intensive, latency-sensitive, frequently released). Co-locating similar services on the same host raises overall utilization.

Online/Offline Co-location: During off-peak hours, idle online resources are allocated to batch Hadoop/Spark jobs, achieving multi-fold resource gains.

Intelligent Elastic Scaling: An SVM-based capacity model predicts CPU, memory, and network demand; the internal Spring platform automatically provisions or de-provisions resources without human intervention.
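The application profiling step described above can be illustrated with a small K-means sketch. The two features (average CPU and memory utilization), the sample values, and the deterministic initialization from the first `k` points are assumptions chosen to keep the example short; the article does not state which features or initialization Ctrip uses, and EM clustering is not shown.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points with plain K-means.

    Centroids start at the first k points (a simplification; random or
    k-means++ initialization is more common in practice).
    Returns (labels, centroids).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (avg CPU %, avg memory %) per service.
services = [(90, 20), (15, 85), (85, 25), (20, 90), (95, 15), (10, 80)]
labels, centroids = kmeans(services, k=2)
```

Here cluster 0 groups the CPU-heavy services and cluster 1 the memory-heavy ones; mapping cluster ids to human-readable labels such as "CPU-intensive" would be a separate step.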

3. Conclusion

The convergence of Dev and Ops created DevOps, reducing repetitive work and labor costs. The combination of AI and Ops gave rise to AIOps, which replaces rule-based operations with machine-learning and statistical methods, enabling smarter analysis and decision-making on massive operational data. Although still evolving, AIOps is poised to become indispensable for handling increasingly complex operational challenges.
