AIOps for Incident Management: Practices and Insights from Meituan
Meituan's service-operations team applies AIOps across the prevention, detection, and post-incident stages, using change-risk analysis, real-time graph-based anomaly detection, similarity-driven root-cause diagnosis, and NLP-powered incident recommendation. Results include millisecond-level link anomaly detection, over 98% precision and recall on core metrics, a 28% reduction in average fault-handling time, and planned extensions to intelligent log and change recognition.
0 Preface
Incidents include not only failures but also alerts and anomalies in operations.
"An incident is an unplanned interruption to an IT Service or a reduction in the Quality of an IT Service." – ITIL
1 Background
Meituan's service-operations team explores AIOps across three stages: pre-incident prevention, mid-incident detection and handling, and post-incident operation. Building on earlier work on time-series anomaly detection (the Horae platform), this article shares two years of practice in incident management.
Incident management is complex because its data is massive, diverse, and real-time, and its workflows are intricate:
- Data diversity: alerts, call links (traces), metrics, logs, changes, etc.
- Real-time, relational data: coupling between call links and metrics, etc.
- Strong domain knowledge is required to interpret the data.
On top of that, the workflow spans many steps, and each stage must be efficient.
2 AI Capability Overview in Incident Management
AIOps provides a capability framework covering risk prevention, fault detection, incident handling, and post‑incident operation.
3 AIOps Incident Management Scenarios
3.1 Pre‑incident Prevention
Change risk detection is performed in three phases (pre‑, mid‑, post‑change). Pre‑change risk warnings have high ROI but limited reference data; mid/post‑change detection leverages gray‑release metrics for higher accuracy.
Pre-change: Configuration change risk is assessed by mining patterns from historical legitimate changes and applying constraint rules (structure, delimiter, consistency).
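As a rough illustration of the pre-change idea, the sketch below checks a key-value config line against delimiter and value patterns assumed to have been mined from historical legitimate changes; the keys, patterns, and function name are hypothetical, not Meituan's actual rules.

```python
import re

# Hypothetical per-key rules mined from historical legitimate changes:
# which delimiter each key uses and what its values look like.
HISTORICAL_PATTERNS = {
    "thread.pool.size": {"delimiter": "=", "value_regex": r"^\d+$"},
    "host.whitelist":   {"delimiter": ":", "value_regex": r"^[\d.,]+$"},
}

def check_change_risk(line: str) -> list:
    """Return warnings for structure, delimiter, or value-consistency violations."""
    match = re.match(r"^\s*([\w.]+)\s*([=:])\s*(.*)$", line)
    if not match:
        return [f"structure: not a key-value line: {line!r}"]
    key, delim, value = match.groups()
    rule = HISTORICAL_PATTERNS.get(key)
    if rule is None:
        return [f"structure: key {key!r} never seen in historical changes"]
    warnings = []
    if delim != rule["delimiter"]:
        warnings.append(f"delimiter: {key!r} historically uses {rule['delimiter']!r}")
    if not re.match(rule["value_regex"], value):
        warnings.append(f"consistency: value {value!r} breaks the pattern for {key!r}")
    return warnings

print(check_change_risk("thread.pool.size = 64"))     # [] -> passes all rules
print(check_change_risk("thread.pool.size : sixty"))  # delimiter + consistency warnings
```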
Mid/Post-change: Detect anomalies in gray-release metrics (e.g., QPS, 4XX, 5XX) during and after the change. The reference (control) group is first cleaned with adaptive DBSCAN clustering to remove outlier hosts, then anomaly detection compares the gray-release group against it.
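A minimal sketch of the reference-group cleaning step, using scikit-learn's DBSCAN. The median-pairwise-distance eps heuristic below stands in for the adaptive parameter selection, which the article does not detail:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def filter_reference_group(metric_windows: np.ndarray, min_samples: int = 3) -> np.ndarray:
    """Drop outlier hosts from the reference group before gray-release comparison.

    metric_windows: shape (n_hosts, n_points), e.g. per-host QPS over the window.
    """
    # Pairwise distances between the hosts' metric curves.
    dists = np.linalg.norm(metric_windows[:, None, :] - metric_windows[None, :, :], axis=-1)
    eps = np.median(dists[dists > 0])  # crude adaptive eps choice (an assumption)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(metric_windows)
    return metric_windows[labels != -1]  # keep only clustered (non-outlier) hosts

rng = np.random.default_rng(0)
reference = rng.normal(100, 5, size=(10, 60))  # 10 reference hosts, 60 minutes of QPS
reference[0] += 80                             # one misbehaving reference host
clean = filter_reference_group(reference)      # the outlier host is removed
baseline = clean.mean(axis=0)                  # baseline the gray group is checked against
```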
Deployed in MCM, the system flags abnormal QPS/4XX/5XX with visual markers and detailed panels, allowing users to label false positives.
3.2 Mid‑incident Fast Recovery
Key metrics: MTTD (Mean Time to Detect), MTTT (Mean Time to Trace), MTTR (Mean Time to Repair).
3.2.1 Anomaly Detection
A similarity‑based time‑series algorithm identifies point, context, and pattern anomalies by comparing a candidate point with historical neighbors.
The pipeline includes pre‑detection (filtering normal points), feature extraction, model classification, and feedback‑driven model improvement, achieving >98% precision and recall on core metrics.
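The sketch below mirrors the pipeline's shape under simplifying assumptions: a cheap MAD-based pre-detection gate, features comparing the candidate point with its recent window (point and context anomalies) and with the same time slot on previous days (pattern anomalies), and a fixed threshold standing in for the trained, feedback-refined classifier.

```python
import numpy as np

def pre_detect(value: float, recent: np.ndarray, tol: float = 3.0) -> bool:
    """Pre-detection: cheaply pass obviously normal points; True means 'suspicious'."""
    median = np.median(recent)
    mad = np.median(np.abs(recent - median)) + 1e-9
    return abs(value - median) / mad > tol

def extract_features(value: float, recent: np.ndarray, seasonal: np.ndarray) -> np.ndarray:
    """Compare the candidate with its historical neighbors at three granularities."""
    return np.array([
        value - recent[-1],                                   # point-level jump
        value - recent.mean(),                                # context deviation
        value - seasonal.mean(),                              # seasonal deviation
        (value - seasonal.mean()) / (seasonal.std() + 1e-9),  # seasonal z-score
    ])

def classify(features: np.ndarray) -> bool:
    # A fixed threshold stands in for the classifier that is retrained on feedback.
    return abs(features[3]) > 4.0

recent = np.array([100.0, 102, 98, 101, 99])   # last few minutes of the metric
seasonal = np.array([100.0, 103, 97, 101])     # same minute on previous days
value = 160.0
if pre_detect(value, recent):
    print("anomaly:", classify(extract_features(value, recent, seasonal)))
```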
3.2.2 Root‑Cause Diagnosis
Meituan’s Radar platform builds real‑time service call graphs and applies large‑scale anomaly detection (millions of links per minute) using pre‑trained models and adaptive clustering. Detection latency is 1.5‑3 ms with ~81% precision/recall.
Combining multi-modal data with rule engines, the platform also recommends stop-loss (mitigation) plans.
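To make the graph-based diagnosis concrete, here is a minimal sketch (not Radar's actual algorithm) that walks anomalous edges of a service call graph from the alerting entry point and reports the deepest anomalous services as root-cause candidates; the graph and anomaly set are invented:

```python
# service -> downstream dependencies (hypothetical call graph)
CALL_GRAPH = {
    "gateway": ["order", "search"],
    "order": ["payment", "inventory"],
    "search": [], "payment": [], "inventory": [],
}
ANOMALOUS = {"gateway", "order", "payment"}  # output of per-link anomaly detection

def locate_root_causes(entry: str) -> list:
    """Follow anomalous dependencies; services with no anomalous children are candidates."""
    roots, stack = [], [entry]
    while stack:
        svc = stack.pop()
        anomalous_children = [c for c in CALL_GRAPH.get(svc, []) if c in ANOMALOUS]
        if anomalous_children:
            stack.extend(anomalous_children)
        else:
            roots.append(svc)  # deepest anomalous node on this path
    return roots

print(locate_root_causes("gateway"))  # ['payment']
```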
3.2.3 Similar Incident Recommendation
Historical Radar events are vectorized (separate models for structured and textual data) using tokenization, TF-IDF weighting, and NLP embeddings. Real-time recommendation matches new events to the top-k similar historical cases, re-ranked by features such as text richness, timeliness, root-cause match, and alarm similarity.
Offline, events are split, tokenized, and stored as two TF‑IDF vectors. Online, similarity scores are combined with rule‑based features to produce a final recommendation score. The system yields ~70% coverage of similar cases and reduces average fault handling time by 28% with ~76% recommendation accuracy.
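A compact sketch of the text-similarity half of this pipeline, built on scikit-learn's TfidfVectorizer and cosine similarity. The incident texts, the single timeliness feature, and the 0.8/0.2 blending weights are assumptions; the article does not publish the real re-ranking features or weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Historical incident texts (alarm text plus root-cause notes); content is illustrative.
history = [
    "payment service 5xx surge thread pool exhausted",
    "order service timeout database connection leak",
    "search latency spike cache cluster failover",
]
vectorizer = TfidfVectorizer()
hist_vecs = vectorizer.fit_transform(history)

def recommend(new_event: str, timeliness: np.ndarray, k: int = 2):
    """Blend TF-IDF cosine similarity with a rule-based re-ranking feature."""
    sim = cosine_similarity(vectorizer.transform([new_event]), hist_vecs)[0]
    score = 0.8 * sim + 0.2 * timeliness          # final recommendation score
    top = score.argsort()[::-1][:k]
    return [(history[i], round(float(score[i]), 3)) for i in top]

print(recommend("payment 5xx errors rising fast",
                timeliness=np.array([0.9, 0.5, 0.2])))  # recency of each case
```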
3.3 Post‑incident Operation
COE (Correction Of Error) records fault post‑mortems. Topic modeling and NLP enable thematic display and similar‑case recommendation to aid knowledge reuse.
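Since LDA is the topic model cited in the references, a minimal sketch of thematic grouping over COE texts could look like this (scikit-learn's LatentDirichletAllocation; the post-mortem snippets are invented, and real input would be segmented Chinese text):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

coes = [  # illustrative COE post-mortem snippets
    "config change broke routing rollback restored traffic",
    "database slow query caused order timeout added index",
    "config push missing field rollback fixed service",
    "cache eviction storm database overload added capacity",
]
vec = CountVectorizer()
counts = vec.fit_transform(coes)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Top words per topic drive the thematic display and similar-case grouping.
words = vec.get_feature_names_out()
for t, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {t}:", [words[i] for i in top])
```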
4 Summary and Future Outlook
The article summarizes Meituan’s AIOps applications across pre‑prevention, mid‑handling, and post‑operation, and outlines future directions such as intelligent log detection and smart change recognition.
Intelligent Log Detection
Four modules: collection, parsing, feature extraction, and anomaly detection. Two research tracks: template time-series anomalies (counting occurrences of templates mined by parsers such as Drain3) and template semantic anomalies (ML/DL on template content).
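For the template-time-series track, a minimal sketch with the open-source Drain3 library (pip install drain3) mines templates and counts their occurrences; per-template counts per time window would then feed an anomaly detector like the one in Section 3.2.1. The log lines are invented.

```python
from collections import Counter
from drain3 import TemplateMiner

miner = TemplateMiner()  # default in-memory configuration
logs = [
    "connected to 10.0.0.1",
    "connected to 10.0.0.2",
    "disk error on /dev/sda",
    "connected to 10.0.0.3",
]
template_counts = Counter()
for line in logs:
    result = miner.add_log_message(line)           # cluster the line into a template
    template_counts[result["template_mined"]] += 1

# A sudden spike in one template's count (e.g. the disk-error template)
# is a template-time-series anomaly candidate.
print(template_counts)
```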
Intelligent Change Recognition
Detect configuration errors by learning key‑value change patterns from historical normal changes and matching new changes against learned distributions.
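A toy sketch of the idea: learn a per-key value pattern (here just "is the value numeric") from historical normal changes, then flag new changes that break the learned distribution; keys and values are invented.

```python
from collections import defaultdict

history = [  # historical normal changes as key-value snapshots
    {"timeout_ms": "3000", "region": "beijing"},
    {"timeout_ms": "5000", "region": "shanghai"},
    {"timeout_ms": "2000", "region": "beijing"},
]
learned = defaultdict(set)
for change in history:
    for key, value in change.items():
        learned[key].add(value.isdigit())  # learned pattern: numeric or not

def suspicious(change: dict) -> list:
    """Keys whose new value does not match any historically observed pattern."""
    return [k for k, v in change.items()
            if k in learned and v.isdigit() not in learned[k]]

print(suspicious({"timeout_ms": "fast", "region": "beijing"}))  # ['timeout_ms']
```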
References
Ester et al., “A Density‑Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” AAAI Press, 1996.
Li et al., “Generic and Robust Localization of Multi‑dimensional Root Causes,” IEEE ISSRE, 2019.
He et al., “Drain: An Online Log Parsing Approach with Fixed Depth Tree,” IEEE ICWS, 2017.
Du & Li, “Spell: Streaming Parsing of System Event Logs,” IEEE ICDM, 2016.
IBM, Drain3, https://github.com/IBM/Drain3
Aizawa, A., "An Information-Theoretic Perspective of TF-IDF Measures," Information Processing and Management, 2003.
Blei, D., Ng, A., Jordan, M., "Latent Dirichlet Allocation," JMLR, 2003.
Cao Zhen, Wei Yuan, "Design and Implementation of a Database Anomaly Monitoring System Based on AI Algorithms" (基于AI算法的数据库异常监测系统的设计与实现).
Meituan Technology Team
Over 10,000 engineers power China's leading lifestyle services e-commerce platform, serving hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
