
How Ctrip Leverages AI to Revolutionize Application Operations: AIOps Practices and Insights

This article describes how Ctrip applies AIOps to application‑operations pain points: the evolution from manual scripts to intelligent automation, the implementation of anomaly detection, smart diagnosis, and online/offline mixed deployment, and open questions for scalable, cost‑effective operations.

dbaplus Community

Background

Large‑scale online services face three recurring operational problems: low‑quality alerts that generate false or missed alarms, slow fault‑diagnosis that prolongs outage recovery, and missing or incomplete application profiling that hampers resource‑aware scheduling. These issues affect service quality, operational efficiency, and cost control.

Evolution of Operations at Ctrip

The operational model progressed through four stages: (1) manual script‑based operations, (2) tool‑centric operations for specific scenarios, (3) end‑to‑end automated delivery, and (4) the current intelligent‑operations era where most tasks are automated but rule‑based methods struggle with the growing data volume.

AIOps Practice Overview

Ctrip’s AIOps platform addresses the three pain points by applying machine‑learning techniques in three dimensions: intelligent alerting, smart fault diagnosis, and online/offline mixed deployment.

1. Intelligent Alerting

Data are ingested from heterogeneous sources (monitoring systems, logs, deployment events). Before model inference a data‑quality filter evaluates statistical metrics – mean, variance, skewness, and information entropy – and discards low‑quality series.
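The data‑quality filter can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the histogram bin count, and the cut‑offs (`min_points`, `min_entropy`) are assumptions made for the example, not values reported by Ctrip.

```python
import math
from collections import Counter
from statistics import fmean, pvariance

def quality_metrics(series, bins=10):
    """Return mean, variance, skewness, and Shannon entropy (bits) of a series."""
    n = len(series)
    mu = fmean(series)
    var = pvariance(series, mu)
    std = math.sqrt(var)
    # Population skewness; defined as 0 for a constant series
    skew = 0.0 if std == 0 else sum((x - mu) ** 3 for x in series) / (n * std ** 3)
    # Entropy of the value histogram over equal-width bins
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0
    counts = Counter(min(int((x - lo) / width), bins - 1) for x in series)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"mean": mu, "variance": var, "skewness": skew, "entropy": entropy}

def is_low_quality(series, min_points=30, min_entropy=0.5):
    """Hypothetical rule: discard series that are too short, constant, or nearly constant."""
    if len(series) < min_points:
        return True
    m = quality_metrics(series)
    return m["variance"] == 0 or m["entropy"] < min_entropy
```

A flat‑lined metric (for example, a sensor stuck at one value) has zero variance and zero entropy, so a rule of this kind would drop it before it reaches the models.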

For each metric, a stable baseline is generated using time‑frequency analysis (Fast Fourier Transform): a window function is applied to the raw series, an FFT is performed, high‑frequency components are zeroed, and the result is inverse‑transformed back to the time domain. The resulting baseline is used to calculate dynamic thresholds:

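# Example (Python sketch; cutoff_ratio and sigma are illustrative values)
import numpy as np
from scipy.fft import rfft, irfft

def baseline(series, cutoff_ratio=0.1):
    x = np.asarray(series, dtype=float)
    n = len(x)
    mu = x.mean()
    # Apply a Hann window to the mean-centred series to limit spectral leakage.
    # The taper pulls the baseline toward the mean near both ends, so edge
    # samples should be padded or discarded in practice.
    s = (x - mu) * np.hanning(n)
    # Real FFT: bins run from 0 (DC) up to the Nyquist frequency
    freq = rfft(s)
    # Zero out frequencies above the cutoff (e.g., 0.1 * Nyquist)
    cutoff = max(1, int(cutoff_ratio * (len(freq) - 1)))
    freq[cutoff + 1:] = 0
    # Inverse FFT back to the time domain, then restore the mean
    return irfft(freq, n=n) + mu

def dynamic_threshold(series, base, sigma=3):
    # Assumption: the band follows the baseline point by point, and its
    # width comes from the spread of the residuals (series - baseline)
    resid = np.asarray(series, dtype=float) - base
    std = np.std(resid)
    lower = base - sigma * std
    upper = base + sigma * std
    return lower, upper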

Both Gaussian (3‑sigma) and non‑Gaussian (quantile‑based) thresholds are supported. Anomaly detection algorithms include:

Supervised models – Gradient Boosting Decision Trees (e.g., XGBoost), Random Forest, Neural Networks.

Unsupervised models – Isolation Forest, One‑Class SVM, DBSCAN clustering for anomaly region discovery.
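For the quantile‑based thresholds mentioned above, a minimal standard‑library sketch is shown here. The percentile choices (1st and 99th) and the function names are assumptions for illustration; the source does not state which quantiles are used.

```python
from statistics import quantiles

def quantile_thresholds(history, lower_pct=1, upper_pct=99):
    """Return (lower, upper) bounds taken from empirical percentiles of history."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def find_outliers(values, lower, upper):
    """Indices of values that fall outside the [lower, upper] band."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]
```

Unlike a 3‑sigma band, these bounds make no assumption about the shape of the distribution, which is why they suit heavy‑tailed metrics such as latency.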

Detected anomalies are fed into an alert state machine that aggregates consecutive outliers, suppresses duplicate notifications, and emits a business‑level alarm.
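One way to picture the alert state machine is the two‑state sketch below. The class name, the state names, and the counts of three consecutive points are hypothetical; the source only says that consecutive outliers are aggregated and duplicates suppressed.

```python
from dataclasses import dataclass

@dataclass
class AlertStateMachine:
    """Hypothetical two-state machine: OK <-> ALERTING."""
    trigger_after: int = 3   # consecutive outliers needed to raise an alarm
    recover_after: int = 3   # consecutive normal points needed to clear it
    alerting: bool = False
    _outliers: int = 0
    _normals: int = 0

    def update(self, is_outlier: bool):
        """Consume one detection result; return "ALARM", "RECOVERED", or None."""
        if is_outlier:
            self._outliers += 1
            self._normals = 0
            if not self.alerting and self._outliers >= self.trigger_after:
                self.alerting = True
                return "ALARM"       # emitted once; repeats are suppressed
        else:
            self._normals += 1
            self._outliers = 0
            if self.alerting and self._normals >= self.recover_after:
                self.alerting = False
                return "RECOVERED"
        return None
```

Requiring several consecutive outliers trades a short detection delay for fewer false alarms from isolated spikes.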

2. Smart Fault Diagnosis

When an alarm is raised, a set of factor analyzers evaluates possible root causes:

Correlation scoring : Pearson correlation coefficient between the alarm metric time series A(t) and candidate event series B(t).

r = cov(A, B) / (std(A) * std(B))
score = round(abs(r) * 100)   # 0‑100 scale; r lies in [-1, 1], so its sign gives only the direction

Bayesian inference : posterior probability of a root‑cause event E given an alarm A.

P(E|A) = P(A|E) * P(E) / P(A)

Historical event‑alarm pairs populate the likelihood P(A|E) and the prior P(E).

Feature‑based matching : error‑message text is tokenized, TF‑IDF weighted, and matched against a labeled knowledge base using cosine similarity. The top‑k matches are returned as candidate causes.
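The feature‑based matching step can be illustrated with a small pure‑Python sketch. The tokenizer, the smoothed IDF formula, and the `(error_text, cause_label)` knowledge‑base schema are assumptions chosen for the example; the source does not describe the actual storage format or weighting variant.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_idf(documents):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(documents)
    df = Counter(tok for doc in documents for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

def tfidf(tokens, idf):
    """Sparse TF-IDF vector; tokens unseen in the knowledge base are ignored."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_causes(message, knowledge_base, k=3):
    """knowledge_base: list of (error_text, cause_label) pairs (hypothetical schema)."""
    docs = [tokenize(text) for text, _ in knowledge_base]
    idf = build_idf(docs)
    query = tfidf(tokenize(message), idf)
    scored = [(cosine(query, tfidf(doc, idf)), label)
              for doc, (_, label) in zip(docs, knowledge_base)]
    return sorted(scored, reverse=True)[:k]
```

Each result is a `(similarity, cause_label)` pair, so the similarity can be passed on as one of the scores in the final ranking.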

The three scores are combined (weighted sum) to produce a final root‑cause ranking, typically within seconds.
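A compact sketch of the three scores and their weighted combination follows. The weights `(0.4, 0.4, 0.2)`, the function names, and the candidate dictionary layout are illustrative assumptions; the source does not publish the actual weighting.

```python
import math

def pearson_score(a, b):
    """|r| scaled to 0-100; returns 0 when either series is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    if sa == 0 or sb == 0:
        return 0
    return round(abs(cov / (sa * sb)) * 100)

def bayes_posterior(p_alarm_given_event, p_event, p_alarm):
    """P(E|A) = P(A|E) * P(E) / P(A), clipped to at most 1."""
    return min(1.0, p_alarm_given_event * p_event / p_alarm)

def rank_causes(candidates, weights=(0.4, 0.4, 0.2)):
    """candidates: {name: (corr_score 0-100, posterior 0-1, text_similarity 0-1)}."""
    w_corr, w_bayes, w_text = weights
    totals = {name: w_corr * corr / 100 + w_bayes * post + w_text * sim
              for name, (corr, post, sim) in candidates.items()}
    # Highest combined score first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Normalizing every score to the 0‑1 range before weighting keeps one analyzer from dominating the ranking simply because of its scale.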

3. Online/Offline Mixed Deployment

Application, resource, and batch‑job profiling creates three “portrait” vectors:

Application portrait : usage patterns of CPU, memory, I/O, and business‑level metrics.

Resource portrait : aggregated hardware metrics of each online node.

Job portrait : expected resource consumption and latency tolerance of batch jobs (e.g., Hadoop, Spark).

Clustering (K‑means or hierarchical) groups similar applications and resources. During off‑peak periods (e.g., night hours) the scheduler selects low‑priority batch jobs whose job portrait aligns with under‑utilized online nodes (CPU idle > 30 %). A rule engine enforces isolation:

CPU and network I/O limits are set via cgroups or container quotas.

If an online service’s latency or error rate exceeds a dynamic alert threshold, the scheduler automatically evicts the batch job from that node.
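The placement and eviction rules above can be sketched roughly as follows. The `Node` and `BatchJob` fields, the greedy strategy, and the priority convention are assumptions made for the example; only the 30 % idle threshold comes from the text, and a real scheduler would work through YARN or Kubernetes rather than plain Python objects.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_idle: float        # fraction of CPU idle, 0.0-1.0
    free_cores: float

@dataclass
class BatchJob:
    name: str
    priority: int          # larger number = lower priority (assumption)
    cores: float

IDLE_THRESHOLD = 0.30      # "CPU idle > 30 %" from the text

def place_jobs(nodes, jobs):
    """Greedy placement of the lowest-priority jobs onto under-utilized nodes."""
    placement = {}
    free = {n.name: n.free_cores for n in nodes if n.cpu_idle > IDLE_THRESHOLD}
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        # Among eligible nodes that fit the job, pick the one with most spare cores
        fits = [name for name, cores in free.items() if cores >= job.cores]
        if fits:
            target = max(fits, key=free.get)
            free[target] -= job.cores
            placement[job.name] = target
    return placement

def should_evict(latency_ms, error_rate, latency_limit_ms, error_limit):
    """Evict batch work when the online service breaches either dynamic threshold."""
    return latency_ms > latency_limit_ms or error_rate > error_limit
```

Jobs that fit on no eligible node are simply left unplaced, which matches the idea that batch work only borrows capacity the online services are not using.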

In production this strategy increased online CPU utilization by ~230 % and reduced offline cluster capacity requirements by roughly 45 %.

Key Technical Techniques

Time‑frequency analysis (FFT) for baseline generation and high‑frequency noise removal.

Dynamic thresholding based on 3‑sigma for Gaussian‑like metrics and quantile‑based thresholds for heavy‑tailed distributions.

Clustering: K‑means or hierarchical clustering for resource and application profiling, enabling fine‑grained scheduling decisions; DBSCAN for anomaly‑region discovery.

Pearson correlation and Bayesian inference for root‑cause scoring.

Text‑feature extraction (TF‑IDF, cosine similarity) for error‑message matching.

Alert state machine to de‑duplicate, aggregate, and route alarms.

Implementation Artifacts

The platform is built on a micro‑service architecture. Core services include:

Ingestion Service : pulls metric streams from Prometheus, OpenTSDB, or custom agents.

Quality Filter Service : computes statistical descriptors and discards noisy series.

Baseline Service : runs FFT‑based baseline computation in batch or streaming mode.

Detection Service : hosts supervised/unsupervised models; exposes a REST API for real‑time scoring.

Diagnosis Service : implements correlation, Bayesian, and text‑matching modules; stores knowledge base in a relational DB.

Scheduler Service : integrates with YARN / Kubernetes to place batch jobs on selected online nodes.

All services expose OpenAPI specifications and are containerized for easy deployment. Source code and CI/CD pipelines are hosted in the internal GitLab repository gitlab.ctrip.com/ops/aiops-platform (clone URL: git@gitlab.ctrip.com:ops/aiops-platform.git).

Evaluation and Outcomes

After deployment:

Alert precision and recall improved by > 30 % compared with static‑threshold alerts.

Mean time to identify root cause dropped from tens of minutes to < 5 seconds for most incidents.

Online CPU utilization increased by ~230 %; offline job throughput grew by ~45 % due to mixed deployment.

Key challenges that remain include:

Ensuring data quality for unlabeled metric streams.

Selecting the most appropriate anomaly‑detection algorithm per metric type.

Cross‑data‑center scheduling where network bandwidth becomes the bottleneck.

Future Directions

Continued work focuses on:

Closing the feedback loop by automatically feeding post‑mortem analysis results back into the knowledge base.

Exploring semi‑supervised learning to reduce labeling effort.

Extending mixed deployment to heterogeneous workloads (e.g., GPU‑accelerated jobs).

Overall, Ctrip’s AIOps implementation demonstrates that a systematic combination of statistical baselines, machine‑learning models, and resource‑aware scheduling can substantially improve operational quality, speed, and cost efficiency.

Figure: Application lifecycle
Figure: Anomaly detection workflow
Figure: Statistical distribution
Figure: Mixed‑deployment architecture