How Meituan Uses AIOps to Revolutionize Incident Management
This article details Meituan's two‑year exploration of AIOps for incident management, covering the challenges of massive, real‑time operational data, the AI‑driven modules for risk prevention, fault detection, diagnosis, and similar‑incident recommendation, and future directions such as intelligent log detection and change recognition.
Background
The article extends previous work on AIOps‑driven anomaly detection to cover the full incident‑management lifecycle. Incident management must ingest heterogeneous real‑time data (alerts, traces, metrics, logs, configuration changes) and apply domain knowledge to detect, diagnose, and remediate failures.
AI Capability Overview
Risk prevention : intelligent detection of change‑related risk.
Fault discovery : metric‑level anomaly detection.
Incident handling : root‑cause diagnosis and remediation‑plan recommendation.
Incident operation : recommendation of historically similar incidents.
Change‑Risk Detection (Pre‑prevention)
Change detection is split into three stages: pre‑change, mid‑change, and post‑change. Pre‑change alerts have high business value but limited reference data. The system integrates with the MCM online change‑management platform and applies two main techniques:
Constraint validation : Records of previously accepted (valid) changes are mined to generate rules on structure, delimiters, and consistency. New configuration values are checked against these rules.
Adaptive DBSCAN clustering : For the mid‑ and post‑change stages, gray‑release (canary) groups provide reference time series. An optimized adaptive DBSCAN removes outlier reference sequences, and the target series is then examined for point, contextual, or subsequence anomalies.
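As an illustration of the constraint-validation idea above, the following is a minimal sketch, not Meituan's implementation. The candidate delimiter set, the three rule types as coded here, and the function names `mine_rules` and `validate` are assumptions made for the example.

```python
import re

# Hypothetical set of delimiters to look for in historical values.
CANDIDATE_DELIMITERS = ",;|:"

def mine_rules(history):
    """Derive simple rules from configuration values that were accepted before."""
    # Delimiter rule: a delimiter that appears in every historical value.
    delimiter = next(
        (d for d in CANDIDATE_DELIMITERS if all(d in v for v in history)), None
    )
    fields = [v.split(delimiter) if delimiter else [v] for v in history]
    return {
        "delimiter": delimiter,
        # Structure rule: the field counts observed in history.
        "field_counts": {len(f) for f in fields},
        # Consistency rule: were all fields purely numeric in the past?
        "all_numeric": all(re.fullmatch(r"\d+", x) for f in fields for x in f),
    }

def validate(value, rules):
    """Return the list of rule types that a new value violates."""
    violations = []
    delimiter = rules["delimiter"]
    if delimiter and delimiter not in value:
        violations.append("delimiter")
    parts = value.split(delimiter) if delimiter else [value]
    if len(parts) not in rules["field_counts"]:
        violations.append("structure")
    if rules["all_numeric"] and not all(re.fullmatch(r"\d+", p) for p in parts):
        violations.append("consistency")
    return violations

history = ["10,200,3000", "15,250,3500", "12,220,3200"]
rules = mine_rules(history)
print(validate("11,210,3100", rules))  # passes all mined rules
print(validate("11;210;3100", rules))  # wrong delimiter breaks every rule
```

Rules mined this way are only as good as the history behind them, so a real system would likely need a minimum sample size before enforcing a rule.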
Detected anomalies are visualized with red vertical markers, host‑level details, and a “mark as false‑positive” button for feedback.
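The reference-pruning step for mid- and post-change detection can be sketched as follows. This is a simplified illustration under stated assumptions: the article does not say how its DBSCAN is made adaptive, so `adaptive_eps` (median distance to the k-th nearest neighbor, scaled) is a stand-in heuristic, and the median/MAD check in `point_anomalies` covers point anomalies only, not contextual or subsequence ones.

```python
import math
from statistics import median

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adaptive_eps(points, k=2, scale=1.5):
    """Assumed heuristic: scale the median distance to the k-th nearest neighbor."""
    kth = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        kth.append(dists[k - 1])
    return scale * median(kth)

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if euclidean(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:
                queue.extend(nb)  # core point: keep growing the cluster
    return labels

def point_anomalies(target, references, threshold=3.0):
    """Flag time steps where the target strays far from the reference median."""
    flagged = []
    for t, value in enumerate(target):
        column = [ref[t] for ref in references]
        center = median(column)
        spread = median(abs(v - center) for v in column) or 1.0
        if abs(value - center) / spread > threshold:
            flagged.append(t)
    return flagged

references = [
    [10, 11, 10, 12],
    [10, 10, 11, 12],
    [11, 11, 10, 11],
    [10, 12, 11, 12],
    [30, 31, 29, 35],  # an outlier reference series
]
labels = dbscan(references, adaptive_eps(references), min_pts=3)
kept = [r for r, lab in zip(references, labels) if lab != -1]
flagged = point_anomalies([10, 11, 25, 12], kept)
print(labels, flagged)
```

Pruning the outlier reference first matters because a single misbehaving reference group would otherwise widen the baseline and hide real deviations in the target series.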
Metric Anomaly Detection (Fault Discovery)
The detection pipeline consists of:
Pre‑filter : A fast filter discards clearly normal points.
Feature extraction : For remaining points, statistical and temporal features are computed.
Model inference : A machine‑learning classifier predicts anomaly probability.
Closed‑loop feedback : Mis‑classifications are fed back to continuously retrain the model.
In production the system maintains >98 % precision and recall on live data, with weekly sampling keeping core‑metric detection accuracy around 90 %.
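The four pipeline stages can be sketched end to end as below. This is a toy illustration, not the production system: the two features, the hand-set logistic weights standing in for a trained classifier, and all function names are assumptions made for the example.

```python
import math
from statistics import mean, pstdev

def prefilter(history, value, k=2.0):
    """Stage 1: return True when the point is clearly normal and can be skipped."""
    return abs(value - mean(history)) <= k * pstdev(history)

def extract_features(history, value):
    """Stage 2: a couple of statistical and temporal features (illustrative only)."""
    sigma = pstdev(history) or 1.0
    return [
        (value - mean(history)) / sigma,  # z-score against the window
        (value - history[-1]) / sigma,    # jump from the previous point
    ]

def anomaly_probability(features, weights, bias):
    """Stage 3: stand-in classifier, a logistic model with hand-set weights."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Stage 4: corrected samples collected for a later retraining run.
feedback = []

def detect(history, value, weights=(1.0, 0.5), bias=-4.0, threshold=0.5):
    if prefilter(history, value):
        return False
    features = extract_features(history, value)
    return anomaly_probability(features, weights, bias) >= threshold

def report_misclassification(history, value, true_label):
    feedback.append((extract_features(history, value), true_label))

history = [100, 102, 98, 101, 99, 100, 103, 97]
print(detect(history, 101))  # within normal variation, dropped by the pre-filter
print(detect(history, 130))  # large deviation, scored by the classifier
report_misclassification(history, 101, true_label=True)
```

The pre-filter exists to keep the expensive stages off the hot path: most points are normal, so only the small remainder pays for feature extraction and inference.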
Root‑Cause Localization (Incident Handling)
The Radar platform builds a service‑call graph from micro‑service traces, then applies adaptive DBSCAN to prune outlier link sequences. This yields a precise abnormal link sub‑graph used for root‑cause analysis.
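The sketch below shows one way to go from trace data to an abnormal sub-graph. It is an assumption-laden illustration: the article does not describe Radar's data model or ranking logic, so the `(caller, callee)` span format, the service names, and the "deepest abnormal callee" heuristic in `root_cause_candidates` are all hypothetical, and the set of abnormal links is taken as given.

```python
from collections import defaultdict

def build_call_graph(spans):
    """spans: iterable of (caller, callee) pairs extracted from traces."""
    graph = defaultdict(set)
    for caller, callee in spans:
        graph[caller].add(callee)
    return graph

def abnormal_subgraph(graph, abnormal_edges):
    """Keep only the links that were flagged as abnormal."""
    sub = defaultdict(set)
    for caller, callees in graph.items():
        for callee in callees:
            if (caller, callee) in abnormal_edges:
                sub[caller].add(callee)
    return sub

def root_cause_candidates(sub):
    """Assumed heuristic: services that are called abnormally but make
    no abnormal calls themselves sit at the end of the failure chain."""
    callers = set(sub)
    callees = set().union(*sub.values())
    return sorted(callees - callers)

spans = [
    ("gateway", "order"),
    ("order", "payment"),
    ("order", "inventory"),
    ("payment", "db"),
    ("gateway", "search"),
]
abnormal = {("gateway", "order"), ("order", "payment"), ("payment", "db")}

graph = build_call_graph(spans)
sub = abnormal_subgraph(graph, abnormal)
print(root_cause_candidates(sub))
```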
Performance on production data:
Precision ≈ 81 %
Recall ≈ 82 %
F1 ≈ 81 %
Processing speed ≈ 1.5‑3 ms per detection, handling millions of link records per minute.
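As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the other two. The function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.81, 0.82)
print(round(f1, 3))  # 0.815, in line with the reported F1 of about 81 %
```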
Similar‑Incident Recommendation (Incident Operation)
Historical Radar events are vectorized separately for structured fields and free text. Text is tokenized, stripped of stop words, and weighted with TF‑IDF; structured fields are encoded as attribute tokens. In the online stage, a new event goes through the same processing, and its cosine similarity (or another distance measure) to each stored vector is computed.
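The offline and online stages can be sketched with a small pure-Python TF-IDF and cosine similarity. This is a simplified illustration: for brevity it merges text tokens and attribute tokens into a single vector rather than keeping separate ones, and the stop-word list, smoothed IDF formula, and sample events are assumptions made for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "on", "of", "in", "to", "and", "after"}

def tokenize(text, fields):
    """Free text becomes word tokens; structured fields become key=value tokens."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    attrs = [f"{k}={v}" for k, v in sorted(fields.items())]
    return words + attrs

def fit_idf(docs):
    """Smoothed inverse document frequency over the historical events."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

def vectorize(doc, idf):
    """Sparse TF-IDF vector; tokens unseen in history are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
        math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Offline stage: vectorize historical events.
history = [
    ("payment service timeout on database connection", {"service": "payment"}),
    ("search index rebuild caused high latency", {"service": "search"}),
    ("disk full on log collector host", {"service": "logging"}),
]
docs = [tokenize(text, fields) for text, fields in history]
idf = fit_idf(docs)
stored = [vectorize(doc, idf) for doc in docs]

# Online stage: the new event goes through the same processing.
new_event = tokenize("payment database connection timeout after deploy",
                     {"service": "payment"})
query = vectorize(new_event, idf)
scores = [cosine(query, vec) for vec in stored]
best = max(range(len(scores)), key=scores.__getitem__)
print(best, scores)
```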
Candidate events are filtered and re‑ranked using the following engineered features:
Text richness : Amount of textual information available.
Timeliness : Temporal distance to the current event.
Root‑cause match : Overlap between diagnosed root causes.
Alarm match : Similarity of alarm lists, with lower weight for generic alarms.
The final score is a weighted sum of similarity scores and the above features. Evaluation on production data shows:
Coverage of similar‑case recommendations ≈ 70 %.
Average fault‑handling time reduced by 28 %.
Recommendation accuracy (ground‑truth root cause match) ≈ 76 %.
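The filter-and-re-rank step described in this section can be sketched as a weighted sum over the engineered features. The weights, the similarity cut-off, and the weighted-Jaccard form of `alarm_match` are hypothetical choices for illustration; the article does not give the actual values or formulas.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    event_id: str
    similarity: float        # vector similarity to the new event, in [0, 1]
    text_richness: float     # how much textual information the event carries
    timeliness: float        # closer in time to the current event scores higher
    root_cause_match: float  # overlap between diagnosed root causes
    alarm_match: float       # similarity of alarm lists

# Hypothetical weights; in practice these would be tuned on labeled incidents.
WEIGHTS = {
    "similarity": 0.40,
    "root_cause_match": 0.25,
    "alarm_match": 0.20,
    "timeliness": 0.10,
    "text_richness": 0.05,
}

def alarm_match(current, past, generic, generic_weight=0.2):
    """Weighted Jaccard overlap in which generic alarms count for less."""
    def weight(alarm):
        return generic_weight if alarm in generic else 1.0
    union = current | past
    if not union:
        return 0.0
    return sum(weight(a) for a in current & past) / sum(weight(a) for a in union)

def final_score(candidate, weights=WEIGHTS):
    return sum(w * getattr(candidate, name) for name, w in weights.items())

def rerank(candidates, min_similarity=0.2, top_k=3):
    """Drop weak matches, then order the rest by the weighted score."""
    kept = [c for c in candidates if c.similarity >= min_similarity]
    return sorted(kept, key=final_score, reverse=True)[:top_k]

candidates = [
    Candidate("A", 0.9, 0.5, 0.2, 0.0, 0.1),
    Candidate("B", 0.7, 0.8, 0.9, 1.0, 0.8),
    Candidate("C", 0.1, 0.9, 0.9, 1.0, 0.9),  # filtered out: similarity too low
]
ranked = rerank(candidates)
print([c.event_id for c in ranked])
print(alarm_match({"cpu_high", "db_timeout"}, {"cpu_high"}, generic={"cpu_high"}))
```

In this example candidate B outranks A despite lower raw similarity, because its root cause and alarms match. Down-weighting generic alarms keeps an alarm that fires in most incidents from making unrelated events look alike.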
Future Directions
Intelligent log detection : Collect logs, parse with tools such as Drain3 (https://github.com/IBM/Drain3), extract count/sequence/semantic features, and apply anomaly detection to template‑time‑series and semantic deviations.
Smart change recognition : Model configuration‑change patterns using statistical and data‑modeling techniques to flag erroneous entries before they affect production.
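The template-time-series idea behind intelligent log detection can be sketched as follows. To stay self-contained, the example uses a crude regex masker as a stand-in for a real parser such as Drain3; the masking rules, the 60-second window, and the median-based spike test are assumptions, and semantic features are not covered.

```python
import re
from collections import Counter, defaultdict
from statistics import median

def to_template(message):
    """Crude stand-in for a log parser: mask the variable parts of a message."""
    message = re.sub(r"\b0x[0-9a-fA-F]+\b", "<*>", message)  # hex values
    message = re.sub(r"\b\d+(\.\d+)*\b", "<*>", message)     # numbers, IPs
    return message

def template_counts(logs, window_seconds=60):
    """logs: iterable of (unix_timestamp, message).
    Returns {template: Counter({window_index: count})}."""
    series = defaultdict(Counter)
    for ts, message in logs:
        series[to_template(message)][int(ts // window_seconds)] += 1
    return series

def spikes(counts, windows, factor=3.0):
    """Flag windows whose count exceeds a multiple of the median count."""
    values = [counts.get(w, 0) for w in windows]
    baseline = max(median(values), 1)
    return [w for w, v in zip(windows, values) if v > factor * baseline]

# Synthetic logs: one message per minute, then a burst of eight in minute 3.
logs = [
    (w * 60 + i, f"Connection to 10.0.0.{i} timed out after {3000 + i} ms")
    for w, n in enumerate([1, 1, 1, 8])
    for i in range(n)
]
series = template_counts(logs)
template = "Connection to <*> timed out after <*> ms"
print(list(series))
print(spikes(series[template], windows=range(4)))
```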
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
