AIOps at Meituan: Architecture and Practice of Time‑Series Anomaly Detection (Part 1)
Meituan’s AIOps initiative replaces manual, rule‑based monitoring with the Horae platform, which automatically classifies time‑series metrics with a CNN and detects anomalies in periodic metrics with XGBoost. The pipeline achieves over 90% precision in production and paves the way for broader metric types, forecasting, and more advanced fault localization.
AIOps was originally defined as Algorithmic IT Operations: using operations algorithms to achieve automated and eventually unmanned IT operations. As the technology matured, the term came to stand for Artificial Intelligence for IT Operations. It applies AI techniques to the operations domain, combining existing operational data (logs, monitoring metrics, application information, etc.) with machine‑learning methods to solve problems that pure automation cannot address.
This article presents the first part of Meituan’s exploration of AIOps, focusing on automatic fault discovery. It introduces the architecture and design of Horae, the time‑series anomaly detection system.
1. Background
Traditional operations were largely manual, which became unsustainable as internet services expanded rapidly and labor costs rose. Automated operations emerged, using triggerable scripts and predefined rules to reduce human effort. However, rule‑based expert systems struggle with the growing complexity and scale of services. DevOps partially alleviates this by promoting end‑to‑end value delivery, but AIOps goes further on the operations side by replacing hand‑crafted rules with machine‑learning models that continuously learn from massive operational data (events, human‑generated logs, etc.). AIOps therefore requires three knowledge domains: industry/business knowledge, operational knowledge (monitoring, anomaly detection, fault handling, cost optimization, capacity planning, performance tuning), and algorithm/machine‑learning knowledge.
Meituan’s technical teams have accumulated extensive experience in these areas and have built a series of tools and products for automated operations. By continuously investing in AIOps, they aim to embed this knowledge into intelligent operational workflows, thereby improving productivity across development, product, and operations teams.
2. Technical Roadmap
2.1 AIOps Capability Building – The evolution proceeds from isolated AI pilots to a fully integrated AI‑driven operations platform, eventually forming a core AI hub that can balance cost, quality, and efficiency across business lifecycles.
2.2 Team Structure – The AIOps effort is divided among three groups: SRE (responsible for extracting intelligent requirements from operational scenarios), development engineers (building platform features and reducing user friction), and algorithm engineers (researching and implementing machine‑learning solutions). The relationship among these groups is illustrated in Figure 2.
2.3 Evolution Path – Fault management is the initial focus, covering four core capabilities: fault discovery, alarm delivery, fault localization, and fault recovery. Figure 3 shows the relationship among these capabilities.
3. Fault Discovery
3.1 Overview – Most of Meituan’s monitoring data are time‑series metrics. Existing systems (CAT, MT‑Falcon, Digger, Radar, etc.) rely on fixed‑threshold rules, which cannot cope with the diversity and volume of metrics.
3.2 Automatic Time‑Series Classification – Metrics are categorized into three types: periodic, stable, and irregular. Classification methods include unsupervised clustering (e.g., Yading, DBSCAN) and supervised learning (e.g., SVM, Logistic Regression). After experiments, a CNN‑based classifier achieved >95% accuracy and was adopted for production.
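To make the three categories concrete, here is a minimal sketch that separates periodic, stable, and irregular series with a simple autocorrelation heuristic. It is not the CNN classifier described above: the function names, the thresholds, and the assumption that the period is known in advance are all illustrative.

```python
import random
from math import pi, sin
from statistics import mean, pstdev

def autocorrelation(series, lag):
    """Autocorrelation of `series` at the given lag (0.0 for a constant series)."""
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0
    num = sum((series[i] - mu) * (series[i + lag] - mu)
              for i in range(len(series) - lag))
    return num / denom

def classify_metric(series, period, cv_threshold=0.05, acf_threshold=0.6):
    """Return 'stable', 'periodic', or 'irregular'. Thresholds are illustrative."""
    mu, sigma = mean(series), pstdev(series)
    # Stable: fluctuation is small relative to the level of the metric.
    if sigma == 0 or (mu != 0 and sigma / abs(mu) < cv_threshold):
        return "stable"
    # Periodic: the series strongly resembles itself one period earlier.
    if autocorrelation(series, period) > acf_threshold:
        return "periodic"
    return "irregular"

# Hypothetical hourly data covering one week (period = 24 samples).
periodic = [100 + 50 * sin(2 * pi * i / 24) for i in range(24 * 7)]
stable = [100.0, 100.5] * 84
rng = random.Random(0)
irregular = [rng.uniform(0, 100) for _ in range(24 * 7)]

labels = [classify_metric(s, period=24) for s in (periodic, stable, irregular)]
```

A learned classifier such as the CNN replaces these hand-picked thresholds with boundaries fitted to labeled examples, which matters when metrics vary widely in scale and noise.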
3.3 Periodic Metric Anomaly Detection – For the most common periodic metrics, a supervised learning pipeline is employed:
Data labeling and sample generation.
Feature extraction (statistical, fitting, and correlation features).
Model training using XGBoost.
Real‑time detection with pre‑filtering to reduce computational load.
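The feature extraction and pre-filtering steps can be sketched as follows. This is an illustration under stated assumptions, not Horae's actual feature set: the feature names, the same-phase baseline, and the 30% pre-filter limit are hypothetical. In the pipeline described above, the resulting feature dictionary would be the input to the trained XGBoost model.

```python
from statistics import mean, pstdev

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0

def extract_features(history, current, period):
    """Build features for the newest point `current`.

    history: past values, oldest first, at a fixed sampling interval.
    period:  number of samples in one cycle.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full cycles of history")
    window = history[-period:]              # most recent full cycle
    same_phase = history[-period::-period]  # same phase in each earlier cycle
    full = history + [current]
    recent = full[-period:]                 # latest cycle, ending at `current`
    previous = full[-2 * period:-period]    # the cycle exactly one period earlier
    mu, sigma = mean(window), pstdev(window)
    baseline = mean(same_phase)
    residual = current - baseline
    return {
        "zscore": (current - mu) / sigma if sigma else 0.0,      # statistical
        "residual": residual,                                    # fitting
        "rel_deviation": abs(residual) / max(abs(baseline), 1e-9),
        "corr_prev_cycle": pearson(recent, previous),            # correlation
    }

def needs_model(features, rel_limit=0.3):
    """Cheap pre-filter: only points far from the same-phase baseline reach the model."""
    return features["rel_deviation"] >= rel_limit

history = [10, 20, 30, 20, 10, 20, 30, 20]  # two cycles, period = 4
normal = extract_features(history, 10, period=4)
suspect = extract_features(history, 25, period=4)
```

Because most points sit close to their baseline, a pre-filter of this kind lets the more expensive model run only on the small fraction of points that look suspicious.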
Special scenarios such as low‑peak periods, holidays, overall upward/downward trends, and phase‑shifted patterns are handled with tailored strategies (window expansion, holiday‑specific thresholds, multi‑metric correlation, DTW‑based similarity detection, etc.).
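The DTW-based similarity check for phase-shifted patterns can be illustrated with the textbook dynamic time warping recurrence. This is a minimal sketch: the absolute-difference cost and the sample series are assumptions, and a production system would likely constrain the warping window for speed.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance with |x - y| cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match with b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a match with a[i-1]
                                 cost[i - 1][j - 1])  # both series advance together
    return cost[n][m]

def pointwise_distance(a, b):
    """Sum of absolute differences without any time alignment."""
    return sum(abs(x - y) for x, y in zip(a, b))

today = [0, 0, 1, 2, 1, 0, 0]      # peak arrives one step later than usual
reference = [0, 1, 2, 1, 0, 0, 0]  # usual shape
spike = [0, 0, 1, 5, 1, 0, 0]      # same timing as `today`, abnormal peak height

shifted_dtw = dtw_distance(today, reference)
shifted_pointwise = pointwise_distance(today, reference)
spike_dtw = dtw_distance(spike, reference)
```

A point-by-point comparison penalizes the shifted curve even though its shape is normal, while DTW aligns the two peaks and reports no difference. The abnormal peak height still yields a positive DTW distance, so a genuine deviation is not hidden by the alignment.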
3.4 Platformization – The resulting detection logic is encapsulated in the Horae system, which provides a modular pipeline consisting of data ingestion, real‑time detection, experimentation, and algorithm modules. Users register data sources, configure detection flows, and receive anomaly alerts via message queues. Figures 17‑21 illustrate the system architecture, workflow composition, and training results.
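The register, configure, and alert workflow can be pictured with a small sketch. Every name here (`DetectionFlow`, `Platform`, the step functions, the metric name) is hypothetical and does not reflect Horae's real API. An in-process `queue.Queue` stands in for the message queue.

```python
import queue
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Step = Callable[[dict], dict]  # a module takes a data point and returns it enriched

@dataclass
class DetectionFlow:
    """An ordered chain of modules; a module may set 'skip' to stop the flow early."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, point: dict) -> dict:
        for step in self.steps:
            point = step(point)
            if point.get("skip"):
                break
        return point

class Platform:
    """Maps registered data sources to detection flows and queues anomaly alerts."""

    def __init__(self) -> None:
        self.flows: Dict[str, DetectionFlow] = {}
        self.alerts = queue.Queue()  # stand-in for a real message queue

    def register(self, source: str, flow: DetectionFlow) -> None:
        self.flows[source] = flow

    def ingest(self, source: str, point: dict) -> dict:
        result = self.flows[source].run(dict(point))
        if result.get("anomaly"):
            self.alerts.put({"source": source, **result})
        return result

def prefilter(point: dict) -> dict:
    """Skip points that are within 10% of their baseline (illustrative limit)."""
    deviation = abs(point["value"] - point["baseline"]) / max(abs(point["baseline"]), 1e-9)
    return {**point, "deviation": deviation, "skip": deviation < 0.1}

def threshold_detector(point: dict) -> dict:
    """Placeholder for a trained model: flag deviations above 50%."""
    return {**point, "anomaly": point["deviation"] > 0.5}

platform = Platform()
platform.register("orders.per_minute",
                  DetectionFlow("periodic-default", [prefilter, threshold_detector]))
platform.ingest("orders.per_minute", {"value": 102.0, "baseline": 100.0})  # filtered out
platform.ingest("orders.per_minute", {"value": 30.0, "baseline": 100.0})   # queued as alert
```

Keeping each module a plain function over a data point is one way to let a flow be recomposed per metric type without changing the ingestion or alerting code.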
3.5 Results – On a test set of 28,000 labeled samples, the classification model achieved 94% accuracy and 89% recall. In production, the periodic‑metric detection pipeline reaches >90% precision and outperforms traditional shape‑analysis methods.
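Accuracy, precision, and recall measure different things, so it helps to see how each is computed from labeled predictions. The sketch below uses made-up labels purely for illustration, and the function name is hypothetical.

```python
def detection_metrics(predicted, actual):
    """Compute accuracy, precision, and recall for binary anomaly labels (1 = anomaly)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # anomalies correctly flagged
    fp = sum(1 for p, a in pairs if p and not a)      # false alarms
    fn = sum(1 for p, a in pairs if not p and a)      # missed anomalies
    tn = sum(1 for p, a in pairs if not p and not a)  # normal points left alone
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

metrics = detection_metrics(predicted=[1, 1, 0, 1, 0], actual=[1, 0, 0, 1, 1])
```

For alerting, precision reflects how many raised alarms were real, while recall reflects how many real anomalies were caught. Because anomalies are rare, accuracy alone can look high even for a detector that misses most of them.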
4. Conclusion and Outlook
Time‑series anomaly detection is a core component of AIOps fault discovery. Meituan has successfully deployed a solution for periodic metrics and plans to extend it to stable and irregular metrics, add forecasting capabilities, and further improve detection performance. Future work will also address alarm convergence, noise reduction, and more comprehensive fault‑localization techniques (knowledge graphs, root‑cause analysis).
5. References
[1] Zhou Zhihua. Machine Learning: Development and Future. 2016.
[2] Meituan CAT Monitoring System. https://tech.meituan.com/CAT_in_Depth_Java_Application_Monitoring.html
[3] Meituan MT‑Falcon Monitoring System. https://tech.meituan.com/Mt-Falcon_Monitoring_System.html
[4] Ding et al., “Yading: Fast clustering of large‑scale time series data”, VLDB Endowment, 2015.
[5] Paparrizos & Gravano, “k‑shape: Efficient and accurate clustering of time series”, SIGMOD 2015.
[6] Ren et al., “Time‑series anomaly detection service at Microsoft”, KDD 2019.
[7] Brander, “Time series classification with Tensorflow”, 2017.
[8] Liu et al., “Opprentice: Towards practical and automatic anomaly detection through machine learning”, IMC 2015.
[9] Metis – Learnware platform for AIOps, https://github.com/Tencent/Metis
[10] Li Hang, “Statistical Learning Methods”, 2nd ed., 2019.
[11] Curve – Tool for labeling time‑series anomalies, https://github.com/baidu/Curve
6. Authors
Hu Yuan, Jin Dong, Jun Feng (Infrastructure Technology – Service Operations); Chang Wei, Yong Qiang (Delivery Business Group – Transaction System Platform).
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
