AIOps Practices and Exploration at Ctrip: Challenges, Solutions, and Future Outlook

This article presents Ctrip's extensive AIOps exploration, detailing operational challenges caused by massive monitoring data, the evolution of DevOps practices, the design of intelligent anomaly detection and diagnosis systems, practical use cases, and a forward‑looking perspective on the future of AI‑driven operations.

Ctrip Technology

Ctrip runs a massive, complex architecture that generates enormous volumes of operational data (logs, metrics, and application information). At this scale, traditional rule‑based monitoring can no longer keep up.

The author, a senior SRE at Ctrip, describes the company's AIOps journey. The goal is to give readers both a macro view of how the industry is developing and practical insights for anyone interested in AIOps.

1. Operational Challenges – Monitoring data volume is growing rapidly, and searching and retrieving metrics is increasingly expensive. It is also hard to balance the value of retained data against its storage cost.

2. Understanding AIOps – AIOps is a cross‑domain technology. The term was introduced in 2016, and adoption accelerated rapidly in 2018. It targets quality, efficiency, and cost, and covers areas such as anomaly detection, self‑healing, and capacity optimization. Ctrip's own operations evolved first from manual scripts to tools, then to automation, and now to intelligent operations.

2.1 Trends in Operations – Transition from script‑based to tool‑based, then to end‑to‑end automation, and currently to intelligent operations powered by AI.

2.2 Personnel Shift – Roles now include Operations Engineer, Operations Development Engineer, and Operations AI Engineer. Increasingly, a single person is expected to combine these skill sets.

2.3 AIOps Status – AIOps is still at an early stage and mostly addresses single, isolated application scenarios.

2.4 Development Challenges – AIOps demands deep knowledge of both operations and AI. The major hurdles are data quality, algorithm maturity, and the scarcity of people skilled in both fields.

3. Ctrip's AIOps Exploration

3.1 Monitoring Time‑Series Anomaly Detection – Machine‑learning models replace rule‑based alerts. The system has five components: data‑source configuration, dataset filtering, a library of anomaly detection algorithms, an alarm state machine, and alarm quality evaluation.
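The alarm state machine turns a stream of per‑point anomaly verdicts into alarms that fire and clear without flapping. The article does not describe Ctrip's states or transition rules. The sketch below is therefore an assumption. It uses four hypothetical states (`NORMAL`, `PENDING`, `FIRING`, `RECOVERING`) and two hypothetical debounce parameters, `trigger_after` and `recover_after`.

```python
from enum import Enum


class AlarmState(Enum):
    NORMAL = "normal"
    PENDING = "pending"        # anomalies seen, not yet enough to alert
    FIRING = "firing"          # alarm raised
    RECOVERING = "recovering"  # normal points seen while the alarm is up


class AlarmStateMachine:
    """Debounces per-point anomaly verdicts into alarm state transitions.

    trigger_after / recover_after are hypothetical knobs, not Ctrip's parameters.
    """

    def __init__(self, trigger_after=3, recover_after=5):
        self.trigger_after = trigger_after
        self.recover_after = recover_after
        self.state = AlarmState.NORMAL
        self._streak = 0  # consecutive points pushing toward the next transition

    def feed(self, is_anomaly):
        if self.state in (AlarmState.NORMAL, AlarmState.PENDING):
            if not is_anomaly:
                # An isolated blip is forgotten as soon as a normal point arrives.
                self._streak, self.state = 0, AlarmState.NORMAL
            else:
                self._streak += 1
                if self._streak >= self.trigger_after:
                    self._streak, self.state = 0, AlarmState.FIRING
                else:
                    self.state = AlarmState.PENDING
        else:  # FIRING or RECOVERING
            if is_anomaly:
                # Any new anomaly during recovery re-arms the alarm.
                self._streak, self.state = 0, AlarmState.FIRING
            else:
                self._streak += 1
                if self._streak >= self.recover_after:
                    self._streak, self.state = 0, AlarmState.NORMAL
                else:
                    self.state = AlarmState.RECOVERING
        return self.state
```

With `trigger_after=3`, a single isolated anomaly only moves the machine to `PENDING` and back, while three consecutive anomalies raise the alarm. Requiring several normal points before clearing keeps a noisy metric from flapping between firing and resolved.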

3.1.1 Anomaly Detection Algorithms – The algorithm library covers supervised, unsupervised, and semi‑supervised learning, as well as parametric and non‑parametric models. Many of the algorithms are adapted to run on streaming data.
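The following sketch shows what "adapted to streaming data" can mean in practice. It is a simple unsupervised detector that keeps an exponentially weighted moving average (EWMA) of the mean and variance, updated in constant time per point. It flags points that lie more than `k` standard deviations from the running mean. This is a stand‑in chosen for illustration, not one of Ctrip's algorithms; `EwmaDetector`, `alpha`, `k`, and `warmup` are hypothetical names.

```python
import math


class EwmaDetector:
    """Online k-sigma detector on an exponentially weighted mean and variance.

    Each update is O(1) in time and memory, which suits streaming metrics.
    alpha, k and warmup are hypothetical defaults.
    """

    def __init__(self, alpha=0.1, k=3.0, warmup=10):
        self.alpha = alpha    # weight of the newest point (higher = adapts faster)
        self.k = k            # how many standard deviations count as anomalous
        self.warmup = warmup  # points to observe before any verdict is trusted
        self.mean = None
        self.var = 0.0
        self.n = 0

    def update(self, x):
        """Score x against the current baseline, then fold it into the baseline."""
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        diff = x - self.mean
        std = math.sqrt(self.var)
        is_anomaly = self.n > self.warmup and std > 0 and abs(diff) > self.k * std
        # Incremental EWMA update of mean and variance (no history is stored).
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return bool(is_anomaly)
```

Each point is scored before the baseline is updated, so a spike is judged against the baseline it is about to disturb. Anomalous points still update the baseline, so after a permanent level shift the detector eventually stops alarming. Freezing the baseline while an alarm is active is an alternative design.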

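3.1.2 Statistical‑Based Detection – This approach uses distribution characteristics, such as the 3‑sigma rule and the interquartile range (IQR). It also transforms data from the time domain to the frequency domain with the fast Fourier transform (FFT). Together, these build dynamic baselines and identify periodic patterns.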
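The two distribution‑based bounds and the Fourier period search can be sketched in plain Python. This is a minimal illustration under several assumptions:

- The function names are hypothetical.
- The DFT is computed naively in O(n²) for readability. A real system would use an FFT routine such as `numpy.fft.rfft`.
- The "dynamic baseline" here simply reuses values at the same phase of earlier cycles once the period is known.

```python
import cmath
import math
import statistics


def three_sigma_bounds(window):
    """Normal range under the 3-sigma rule: mean +/- 3 population std devs."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window)
    return mu - 3 * sigma, mu + 3 * sigma


def iqr_bounds(window, k=1.5):
    """Tukey fences [Q1 - k*IQR, Q3 + k*IQR]; robust to outliers in the window."""
    q1, _, q3 = statistics.quantiles(window, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def dominant_period(series):
    """Period (in samples) of the strongest non-zero frequency, via a naive DFT."""
    n = len(series)
    mean = statistics.fmean(series)
    centered = [x - mean for x in series]  # remove the DC (zero-frequency) component
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n / best_k if best_k else None


def same_phase_history(series, t, period, cycles=4):
    """Hypothetical dynamic baseline: values at the same phase in earlier cycles."""
    return [series[t - c * period] for c in range(1, cycles + 1) if t - c * period >= 0]


def is_outlier(value, history, method="iqr"):
    """True if value falls outside the IQR (default) or 3-sigma bounds of history."""
    lo, hi = iqr_bounds(history) if method == "iqr" else three_sigma_bounds(history)
    return not lo <= value <= hi
```

Consider a metric sampled hourly with a daily cycle. Running `dominant_period` on a week of data should return a value near 24. `same_phase_history(series, t, 24)` then supplies the last few days' values at the same hour as the baseline for point `t`. The IQR fences are usually the safer default, because a single earlier spike inflates the standard deviation but barely moves the quartiles.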

3.2 Intelligent Diagnosis of Application Alarms – The system builds factor analyzers for candidate causes, such as releases, configuration changes, and call‑chain anomalies. It scores each factor's correlation with the alarm using Pearson correlation coefficients and Bayesian posterior probabilities. It then aggregates related alarms across services.
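The two scoring ideas can be sketched separately. The article does not say how Ctrip combines them, so this sketch keeps them apart:

- A Pearson coefficient scores metric‑type factors, for example a downstream call‑chain error count compared with the alarming metric.
- Bayes' rule scores event‑type factors, such as a release or a config change. The priors and likelihoods would come from a knowledge base of past incidents.

All function names and example numbers are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / (sx * sy)


def posterior(priors, likelihoods):
    """Bayes' rule: P(cause | alarm) is P(alarm | cause) * P(cause), normalized."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("evidence has zero probability under every cause")
    return {c: p / total for c, p in joint.items()}


# Hypothetical example: an alarm fires shortly after a release.
alarm_metric = [120, 118, 125, 240, 410, 395]   # e.g. response time in ms
downstream_errors = [2, 1, 3, 30, 55, 52]       # e.g. call-chain error count
r = pearson(alarm_metric, downstream_errors)

causes = posterior(
    priors={"release": 0.3, "config_change": 0.2, "other": 0.5},       # hypothetical
    likelihoods={"release": 0.8, "config_change": 0.3, "other": 0.1},  # hypothetical
)
ranked = sorted(causes.items(), key=lambda kv: kv[1], reverse=True)
```

A high `r` suggests the downstream errors move with the alarm but does not prove causation. This is one reason the article stresses a knowledge base and feedback: they calibrate the priors and likelihoods that the Bayesian score depends on.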

3.3 Summary of Diagnosis – Case studies show that the system can localize fault root causes quickly. The key lessons are the importance of knowledge‑base construction, factor analysis, and feedback loops that improve detection quality.
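A feedback loop of this kind can be illustrated together with the alarm quality evaluation mentioned in 3.1. It works in three steps:

1. Operators label each alarm, and each missed incident, as real or not.
2. The labels are turned into precision and recall.
3. Detector sensitivity is adjusted based on those figures.

The `Feedback` record, the target thresholds, and the tuning rule below are hypothetical. The article does not specify how Ctrip evaluates alarm quality or feeds results back.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    alarmed: bool   # did the detector raise an alarm for this window?
    incident: bool  # did an operator confirm a real problem in this window?


def alarm_quality(feedback):
    """Precision, recall and F1 of alarms against operator-confirmed incidents."""
    tp = sum(f.alarmed and f.incident for f in feedback)
    fp = sum(f.alarmed and not f.incident for f in feedback)
    fn = sum(f.incident and not f.alarmed for f in feedback)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def tune_k(k, quality, min_precision=0.8, min_recall=0.9, step=0.25, k_min=1.0):
    """Hypothetical feedback rule for a k-sigma style detector."""
    if quality["precision"] < min_precision:
        return k + step              # too many false alarms: widen the normal band
    if quality["recall"] < min_recall:
        return max(k_min, k - step)  # missed incidents: narrow the band
    return k
```

The rule gives precision priority over recall, reflecting the usual operational concern that noisy alarms get ignored. It also assumes enough labeled alarms per period; with only a handful of labels, `k` would oscillate.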

4. Future Outlook – AI is a tool for operations. Successful AIOps requires scenario‑driven adoption, continuous learning from industry best practices, and collaboration between academia (algorithms) and industry (data and use cases). The long‑term vision is fully autonomous, “unattended” operations.
