AIOps at Meituan: Architecture, Design, and Practice of the Horae Time‑Series Anomaly Detection System
This article presents Meituan's AIOps exploration, focusing on the design and implementation of the Horae time‑series anomaly detection platform, covering background, technical roadmap, fault‑discovery workflow, time‑series classification, feature engineering, model training, real‑time detection, and future directions.
Background – Manual operations cannot keep pace with rapid internet growth, and the labor they require is costly. This has driven a shift from rule‑based automation to AIOps, which applies AI and machine learning to massive operational data (logs, metrics, events) to achieve intelligent, near‑zero‑touch operations.
Technical Roadmap – Meituan plans a staged AIOps capability: initial AI pilots, single‑scenario AI components, integrated AI workflows, and finally a core AI engine that balances cost, quality, and efficiency across business lifecycles.
Fault‑Discovery Focus – Fault management spans discovery, alert delivery, localization, and recovery. Fault discovery, and in particular automatic time‑series anomaly detection, was chosen as the first target because it can sharply reduce manual alert‑rule configuration and improve detection accuracy.
Time‑Series Classification – Metrics are grouped into three types: periodic, stable, and irregular. The classification methods evaluated were single models (SVM, DBSCAN, One‑Class‑SVM), ensemble voting (which raised accuracy to 87 %), and a CNN‑based classifier that achieved >95 % accuracy without extensive feature engineering.
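The ensemble-voting idea can be illustrated with a minimal sketch. It assumes simple majority (hard) voting over the labels produced by several classifiers. The article does not say which voting scheme was used, and the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    `predictions` holds one label per classifier for a single time series.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


# Three hypothetical classifiers vote on the type of one metric.
label = majority_vote(["periodic", "stable", "periodic"])
```

Weighted or probability-based (soft) voting would be equally plausible readings of the text.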
Classification Process – Steps: missing‑value imputation, variance standardization, dimensionality reduction (PAA preferred over SAX for 144‑dimensional representation), and CNN model training on labeled samples.
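The preprocessing steps above can be sketched as follows. This is an illustration only: it assumes one day of minute-level points (1,440 values) reduced to the 144 dimensions mentioned in the text, linear interpolation for imputation, and z-score standardization. None of these specifics are confirmed by the article, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def impute_linear(values):
    """Fill None gaps by linear interpolation; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed points")
    out = list(values)
    for i in range(len(out)):
        if out[i] is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


def standardize(values):
    """Scale to zero mean and unit variance; a constant series maps to zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def paa(values, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width frame."""
    if len(values) % segments != 0:
        raise ValueError("length must be a multiple of the segment count")
    width = len(values) // segments
    return [mean(values[i * width:(i + 1) * width]) for i in range(segments)]


# Assumed input: 1,440 minute-level points with one missing value.
day = [float(i % 10) for i in range(1440)]
day[5] = None
vector = paa(standardize(impute_linear(day)), 144)
```

The resulting 144-value vector is the kind of fixed-length input a CNN classifier could consume.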
Periodic‑Metric Anomaly Detection – Uses supervised learning (sample labeling, feature extraction, XGBoost model) to detect anomalies in periodic metrics. An automatic anomaly‑injection algorithm creates balanced training data by randomly inserting upward or downward spikes with ripple effects.
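A minimal sketch of anomaly injection, under stated assumptions: the spike position and direction are drawn at random, the spike height is a fraction of the series' largest absolute value, and the "ripple" is modeled as a geometrically decaying tail on the following points. The article does not give these details, so the parameters `magnitude`, `ripple_len`, and `decay` are hypothetical.

```python
import random


def inject_anomaly(series, rng, magnitude=0.5, ripple_len=3, decay=0.5):
    """Return (new_series, labels) with one spike and a decaying ripple injected.

    Labels are 1 for the spike and ripple points and 0 elsewhere (an assumption).
    """
    if len(series) <= ripple_len:
        raise ValueError("series too short for the requested ripple length")
    out = list(series)
    labels = [0] * len(series)
    start = rng.randrange(len(series) - ripple_len)
    direction = rng.choice((1, -1))  # upward or downward spike
    scale = max(abs(v) for v in series) or 1.0
    for offset in range(ripple_len + 1):
        idx = start + offset
        out[idx] += direction * magnitude * scale * (decay ** offset)
        labels[idx] = 1
    return out, labels


rng = random.Random(0)  # seeded for reproducibility
base = [10.0] * 20
noisy, labels = inject_anomaly(base, rng)
```

Repeating the injection over many normal windows yields as many positive samples as needed to balance the training set.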
Feature Engineering – Combines statistical, fitting, and correlation features; then abstracts them with an Isolation Forest layer to improve model generalization.
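One feature from each of the three families might look like the sketch below. The concrete choices are assumptions, not the article's feature set: a z-score as the statistical feature, the residual against an exponentially weighted moving average as the fitting feature, and the Pearson correlation with the same window from a previous period as the correlation feature. The Isolation Forest layer is omitted to keep the example dependency-free.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def extract_features(window, reference, alpha=0.5):
    """Compute features for the last point of `window`.

    `reference` is the same time span taken from a previous period (assumed).
    """
    history, current = window[:-1], window[-1]
    mu, sigma = mean(history), pstdev(history)
    ewma = history[0]
    for value in history[1:]:
        ewma = alpha * value + (1 - alpha) * ewma
    return {
        "zscore": (current - mu) / sigma if sigma else 0.0,  # statistical
        "ewma_residual": current - ewma,                     # fitting
        "period_corr": pearson(window, reference),           # correlation
    }


features = extract_features([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```

In a fuller version, groups of such raw features would be passed through an isolation-forest model and its anomaly scores used as inputs to the final classifier.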
Model Training & Real‑Time Detection – Offline training produces classification models that are stored for inference. The real‑time detection pipeline first pre‑filters incoming points, then extracts features, applies the trained model, and emits alerts via message queues. User feedback on alerts is added to the training set for continuous improvement.
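The detection flow described here (pre-filter, feature extraction, model scoring, alert emission) can be sketched as a single function. Everything concrete is an assumption made for illustration: the pre-filter is a relative tolerance around the recent mean, the model is any callable returning a score, and Python's in-process `queue.Queue` stands in for a real message queue.

```python
import queue


def detect_point(window, model, alerts, prefilter_tol=0.05, threshold=0.5):
    """Score the last point of `window`; put an alert on `alerts` if anomalous."""
    history, current = window[:-1], window[-1]
    baseline = sum(history) / len(history)
    # Pre-filter: skip the model entirely for points close to the baseline.
    if abs(current - baseline) <= prefilter_tol * max(abs(baseline), 1e-9):
        return False
    # Toy features: absolute and relative deviation from the baseline.
    features = [current - baseline, current / baseline if baseline else 0.0]
    score = model(features)
    if score >= threshold:
        alerts.put({"value": current, "score": score})
        return True
    return False


def toy_model(features):
    """Hypothetical stand-in for a trained classifier."""
    return 1.0 if abs(features[0]) > 10 else 0.0


alerts = queue.Queue()
results = [
    detect_point([100.0, 100.0, 100.0, 101.0], toy_model, alerts),  # pre-filtered
    detect_point([100.0, 100.0, 100.0, 108.0], toy_model, alerts),  # scored, normal
    detect_point([100.0, 100.0, 100.0, 150.0], toy_model, alerts),  # scored, alert
]
```

The pre-filter keeps the more expensive feature extraction and inference off the hot path for the large majority of unremarkable points.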
Special‑Scenario Optimizations – Tailored strategies address low‑peak volatility (larger comparison windows), holiday periods (multi‑metric isolation‑forest checks and sensitivity reduction), overall upward/downward trends (threshold‑based checks), and shifted periodic patterns (DTW distance with isolation‑forest outlier detection).
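For the shifted-pattern case, dynamic time warping (DTW) is useful because it tolerates small time offsets that would inflate a point-by-point distance. Below is a minimal sketch of the classic quadratic-time DTW with absolute difference as the local cost. The follow-up isolation-forest step is not shown, and the sample data is invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


pattern = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, one step later
pointwise = sum(abs(x - y) for x, y in zip(pattern, shifted))
warped = dtw_distance(pattern, shifted)
```

Here the point-by-point distance is 4.0 while the DTW distance is 0.0, so a shifted but otherwise normal day would not look like an outlier.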
Horae System Architecture – Horae consists of four modules: Data Ingestion (topic‑based time‑series collection into an Elasticsearch‑backed TSDB), Real‑Time Detection (per‑point anomaly scoring), Experiment Module (sample management, algorithm registration, workflow orchestration, model training/evaluation), and Algorithm Module (pre‑processing, feature extraction, ML models such as RF, SVM, XGBoost, CNN, clustering, anomaly detectors, predictors, and custom algorithms).
Algorithm Registration & Workflow Orchestration – Algorithms are registered with metadata (type, interface, parameters). Workflows are built by chaining algorithm components into execution or training pipelines, allowing automated hyper‑parameter tuning (Bayesian optimization) and model training (XGBoost, CNN, etc.).
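Registration and chaining can be illustrated with a small registry. This is a sketch of the general pattern, not Horae's actual interface: the `register` decorator, the `REGISTRY` dictionary, the metadata fields, and the three toy algorithms are all hypothetical.

```python
REGISTRY = {}


def register(name, kind, **default_params):
    """Register a function as an algorithm component with metadata."""
    def decorator(fn):
        REGISTRY[name] = {"kind": kind, "fn": fn, "params": default_params}
        return fn
    return decorator


@register("fill_missing", kind="preprocess", fill_value=0.0)
def fill_missing(series, fill_value):
    return [fill_value if v is None else v for v in series]


@register("diff", kind="feature")
def diff(series):
    return [b - a for a, b in zip(series, series[1:])]


@register("max_abs", kind="detector")
def max_abs(series):
    return max(abs(v) for v in series)


def run_workflow(steps, data):
    """Run registered components in order; a step is a name or (name, overrides)."""
    for step in steps:
        name, overrides = (step, {}) if isinstance(step, str) else step
        entry = REGISTRY[name]
        data = entry["fn"](data, **{**entry["params"], **overrides})
    return data


score = run_workflow(["fill_missing", "diff", "max_abs"], [1.0, None, 4.0])
tuned = run_workflow(
    [("fill_missing", {"fill_value": 1.0}), "diff", "max_abs"], [1.0, None, 4.0]
)
```

Because each step's parameters are overridable per workflow, a tuner such as Bayesian optimization could search over the override dictionaries without touching the component code.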
Training & Execution Pipelines – Training pipelines handle parameter search and model fitting on labeled or synthetically generated samples; execution pipelines apply the trained models to streaming data, producing anomaly scores and alerts. Periodic retraining ensures adaptation to evolving data distributions.
Evaluation Results – On 28 000 labeled samples (75 % train, 25 % test), the CNN classifier achieved 94 % accuracy and 89 % recall. The end‑to‑end detection workflow for periodic metrics reached >90 % precision in production, outperforming legacy shape‑analysis methods.
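For reference, the metrics quoted above are computed from confusion-matrix counts as in the sketch below. The labels are made-up toy data, not the article's samples. With a 75 %/25 % split, 28 000 samples give 21 000 for training and 7 000 for testing.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary labeling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels: 1 = anomalous, 0 = normal.
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
train_size = int(28000 * 0.75)
test_size = 28000 - train_size
```

Precision measures how many raised alerts were real, while recall measures how many real anomalies were caught, so the two should be read together.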
Conclusion & Outlook – Horae demonstrates that AI‑driven fault discovery can be reliably deployed at scale. Future work includes extending detection to stable and irregular metrics, adding forecasting capabilities, improving alert‑convergence, and enhancing fault‑localization with knowledge‑graph techniques.
High Availability Architecture
Official account for High Availability Architecture.