How Meituan Built a Scalable Autonomous Database System to Slash MTTR
This article details Meituan's journey from rapid database growth and operational bottlenecks to a multi‑year roadmap that combines platform‑level monitoring, rule‑based and AI‑enhanced root‑cause analysis, and automated remediation, ultimately delivering measurable improvements in alert accuracy, recall rates, and overall database reliability.
Background
Rapid growth of Meituan’s MySQL instances outpaced DBA capacity, causing long MTTR for incidents. Incident analysis showed that ~80% of failure time was spent on diagnosis, highlighting the need for automated data collection, observability, and root‑cause identification.
Solution Overview
The team defined a four‑phase evolution: in the short term, improve alert analysis for quick ROI; in the medium term, build a self‑service platform; in the long term, add AI‑driven automation; and ultimately reach full autonomy. The design combines expert rules (the GRAI methodology) with AI models for anomaly detection and root‑cause classification.
Technical Architecture
Top‑Level Design
The layered architecture diagram shows four evolutionary steps: platformization, self‑service, intelligence, and automation.
Data Collection Layer
Initially a pcap‑based packet capture was used to avoid kernel‑level dependencies; a future transition to kernel agents is planned. The agent runs with minimal impact on MySQL instances, focusing on high‑quality data ingestion.
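As an illustration of what a packet‑capture agent can recover without touching the server, the sketch below pulls the SQL text out of a captured client payload. It relies only on the classic MySQL wire format (3‑byte little‑endian payload length, 1‑byte sequence ID, then a command byte, where 0x03 is COM_QUERY). The function name `extract_query` is hypothetical, and the article does not describe how Meituan's agent actually parses traffic; TLS, protocol compression, query attributes, and packets split across TCP segments are not handled here.

```python
from typing import Optional

COM_QUERY = 0x03  # command byte for a text-protocol query


def extract_query(payload: bytes) -> Optional[str]:
    """Return the SQL text if `payload` starts with a complete COM_QUERY packet."""
    if len(payload) < 5:
        return None
    body_len = int.from_bytes(payload[0:3], "little")  # 3-byte payload length
    # payload[3] is the sequence ID; it is not needed to read the query text.
    body = payload[4:4 + body_len]
    if len(body) < body_len or body_len == 0 or body[0] != COM_QUERY:
        return None  # truncated capture, empty packet, or another command
    return body[1:].decode("utf-8", errors="replace")


# Build a sample packet by hand to exercise the parser.
sql = b"SELECT 1"
body = bytes([COM_QUERY]) + sql
packet = len(body).to_bytes(3, "little") + b"\x00" + body
print(extract_query(packet))
```

Returning `None` for anything that is not a complete query packet keeps the parser cheap and lets the agent skip traffic it cannot interpret rather than guess.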
Compute & Storage Layer
All computation is performed in‑memory within a single thread/process to maximize throughput.
Raw data is reported with aggressive compression; memory usage is bounded to prevent overflow.
Design principles: full‑in‑memory computation, raw data reporting, extreme compression, and controlled memory consumption.
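The principles above (in‑memory processing, compressed raw reporting, and a hard memory bound) can be sketched with a small batcher. This is a minimal illustration, not the team's implementation: `BoundedBatcher`, the JSON‑lines encoding, the use of zlib, and the byte cap are all assumptions made for the example.

```python
import json
import zlib
from typing import Callable, List


class BoundedBatcher:
    """Buffers raw records in memory and flushes a compressed batch at a byte cap."""

    def __init__(self, max_bytes: int, sink: Callable[[bytes], None]) -> None:
        self.max_bytes = max_bytes
        self.sink = sink  # e.g. a function that ships bytes to a collector
        self._records: List[bytes] = []
        self._size = 0

    def add(self, record: dict) -> None:
        line = json.dumps(record, separators=(",", ":")).encode("utf-8")
        # Flush first if this record would push the buffer past the cap.
        # A single record larger than the cap is still accepted on its own.
        if self._records and self._size + len(line) > self.max_bytes:
            self.flush()
        self._records.append(line)
        self._size += len(line)

    def flush(self) -> None:
        if not self._records:
            return
        self.sink(zlib.compress(b"\n".join(self._records), 9))
        self._records.clear()
        self._size = 0


batches: List[bytes] = []
batcher = BoundedBatcher(max_bytes=64, sink=batches.append)
for i in range(10):
    batcher.add({"sql_id": i, "latency_us": 100 + i})
batcher.flush()  # ship whatever is left
```

Flushing before the cap is exceeded, instead of after, is what keeps the buffer's footprint predictable on a host that also runs MySQL.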
Analysis & Decision Layer
Two complementary approaches:
Rule‑based (GRAI): expert knowledge distilled into actionable rules (e.g., detecting master‑slave replication lag).
AI models: models are trained offline from historical metrics and serialized; Flink streaming jobs then load them to detect anomalies in real time.
The pipeline includes data preprocessing, feature extraction (time‑series, text, domain‑specific), classification, root‑cause ranking, and expansion (SQL behavior analysis, log correlation, dimension drill‑down).
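The offline‑train, serialize, online‑apply pattern described for the AI models can be illustrated without Flink. The sketch below swaps in a plain z‑score baseline as the "model"; the article does not say which algorithms are used, so `train_baseline`, `ZScoreDetector`, the JSON format, and the threshold of 3 are all assumptions for illustration.

```python
import json
import statistics
from typing import List


def train_baseline(history: List[float]) -> str:
    """Offline step: fit mean/stddev on historical values and serialize to JSON."""
    model = {"mean": statistics.fmean(history), "std": statistics.pstdev(history)}
    return json.dumps(model)


class ZScoreDetector:
    """Online step: load the serialized model and score points one at a time."""

    def __init__(self, serialized: str, threshold: float = 3.0) -> None:
        model = json.loads(serialized)
        self.mean = model["mean"]
        self.std = model["std"] or 1e-9  # guard against a flat history
        self.threshold = threshold

    def is_anomaly(self, value: float) -> bool:
        return abs(value - self.mean) / self.std > self.threshold


# Offline: fit on historical samples of one metric (values are made up).
serialized = train_baseline([10, 11, 9, 10, 12, 10, 9, 11])
# Online: a streaming operator would hold one detector per metric.
detector = ZScoreDetector(serialized)
```

In a streaming job the detector would live inside an operator and be refreshed whenever a new serialized model is published; keeping the model a small, self‑describing blob makes that swap cheap.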
Key Technical Details
Agent Impact Evaluation
Performance tests measured CPU, memory, and latency overhead on MySQL instances, confirming negligible impact.
Full‑SQL Aggregation
To handle massive SQL traffic, messages are aggregated per minute using the key:
AggKey = RDS_IP + '_' + DBName + '_' + SQL_Template_ID + '_' + MinuteTimestamp
Aggregated data is then compressed through message compression, pre‑aggregation, dictionary encoding, and minute‑level aggregation.
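The keying scheme above can be sketched as follows. How the SQL template ID is actually derived is not stated in the article; here it is assumed to be a truncated hash of the statement with string and numeric literals replaced by `?`. The function names and the aggregated fields (`count`, `total_ms`) are likewise illustrative.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals


def template_id(sql: str) -> str:
    """Hypothetical template ID: hash of the SQL with literals replaced by '?'."""
    normalized = _NUMBER.sub("?", _STRING.sub("?", sql))
    normalized = " ".join(normalized.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def agg_key(rds_ip: str, db_name: str, sql: str, ts: float) -> str:
    minute = int(ts // 60) * 60  # truncate epoch seconds to the minute
    return f"{rds_ip}_{db_name}_{template_id(sql)}_{minute}"


def aggregate(
    events: Iterable[Tuple[str, str, str, float, float]]
) -> Dict[str, Dict[str, float]]:
    """events: (rds_ip, db_name, sql, epoch_seconds, latency_ms)."""
    out: Dict[str, Dict[str, float]] = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for ip, db, sql, ts, latency_ms in events:
        bucket = out[agg_key(ip, db, sql, ts)]
        bucket["count"] += 1
        bucket["total_ms"] += latency_ms
    return dict(out)


events = [
    ("10.0.0.1", "shop", "SELECT * FROM orders WHERE id = 1", 120.0, 2.0),
    ("10.0.0.1", "shop", "select * from orders where id = 42", 150.0, 4.0),
    ("10.0.0.1", "shop", "SELECT * FROM orders WHERE id = 7", 185.0, 1.0),
]
result = aggregate(events)
```

The first two events share a template and a minute, so they collapse into one bucket; the third falls into the next minute and gets its own key. This collapse is where most of the volume reduction comes from.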
Compensation Mechanism
Messages delayed beyond one minute are placed into a compensation queue (Mafka) to avoid data loss during traffic spikes. The compensated data is later merged to ensure completeness for downstream analysis.
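A minimal sketch of this late‑data path, assuming lateness is measured as processing time minus event time: on‑time messages update the live minute bucket, late ones are parked, and a later merge folds them back into their original minute. An in‑memory deque stands in for the Mafka compensation queue, and `MinuteAggregator` is a hypothetical name.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

LATE_AFTER_S = 60  # from the text: messages delayed beyond one minute


class MinuteAggregator:
    def __init__(self) -> None:
        self.counts: Dict[Tuple[str, int], int] = defaultdict(int)
        # Stand-in for the compensation queue; a real system would use a broker.
        self.compensation: Deque[Tuple[str, int]] = deque()

    def on_message(self, key: str, event_ts: float, now: float) -> None:
        minute = int(event_ts // 60) * 60
        if now - event_ts > LATE_AFTER_S:
            # Too late for the live path: park it instead of dropping it.
            self.compensation.append((key, minute))
        else:
            self.counts[(key, minute)] += 1

    def merge_compensation(self) -> int:
        """Fold parked messages back into their original minute buckets."""
        merged = 0
        while self.compensation:
            key, minute = self.compensation.popleft()
            self.counts[(key, minute)] += 1
            merged += 1
        return merged


agg = MinuteAggregator()
agg.on_message("k", event_ts=100.0, now=110.0)  # on time
agg.on_message("k", event_ts=100.0, now=200.0)  # 100 s late, parked
```

Because the parked message keeps its original minute, the merge restores the bucket to the value it would have had with no delay, which is the completeness property downstream analysis depends on.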
Root‑Cause Classification Pipeline
Data Collection: performance metrics, status snapshots, system metrics, logs, and hardware signals.
Feature Extraction: time‑series features, textual features, and domain‑specific features derived from database knowledge.
Root‑Cause Classification: preprocessing → feature selection → model inference → ranking.
Root‑Cause Expansion: deeper analysis such as SQL behavior, expert rule correlation, and log mining.
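The classification stage (preprocessing → feature selection → model inference → ranking) can be sketched end to end with a toy scorer. Everything specific here is an assumption: the feature names, the candidate causes, the weights, and the use of a linear score in place of a trained model. The article does not describe the actual model.

```python
from typing import Dict, List, Tuple

# Hypothetical feature weights per candidate root cause; a trained model
# would replace this table. Listing features per cause doubles as feature selection.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "slow_query": {"slow_sql_ratio": 3.0, "cpu_util": 1.0},
    "replication_lag": {"seconds_behind_master": 2.5, "write_qps": 0.5},
    "disk_saturation": {"io_util": 3.0, "slow_sql_ratio": 0.5},
}


def preprocess(
    raw: Dict[str, float], bounds: Dict[str, Tuple[float, float]]
) -> Dict[str, float]:
    """Min-max scale each metric into [0, 1]; metrics without bounds are dropped."""
    scaled = {}
    for name, value in raw.items():
        if name not in bounds:
            continue
        lo, hi = bounds[name]
        scaled[name] = min(1.0, max(0.0, (value - lo) / (hi - lo)))
    return scaled


def rank_root_causes(features: Dict[str, float]) -> List[Tuple[str, float]]:
    """Score every candidate and return them sorted from highest to lowest score."""
    scores = {
        cause: sum(w * features.get(f, 0.0) for f, w in weights.items())
        for cause, weights in WEIGHTS.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


bounds = {
    "slow_sql_ratio": (0, 1), "cpu_util": (0, 100), "io_util": (0, 100),
    "seconds_behind_master": (0, 600), "write_qps": (0, 10000),
}
raw = {"slow_sql_ratio": 0.1, "cpu_util": 30, "io_util": 20,
       "seconds_behind_master": 480, "write_qps": 5000, "unknown": 5}
ranked = rank_root_causes(preprocess(raw, bounds))
```

Returning a ranked list rather than a single label matches the expansion step that follows: the top candidates can each be checked against SQL behavior, expert rules, and logs before one is reported.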
Results
Metrics show steady improvement in user‑feedback accuracy and overall root‑cause recall after deploying the platform.
Future Plans
Scale compute and storage capacity to support 3‑5 years of growth.
Gradually roll out autonomous capabilities for low‑risk scenarios.
Build a flexible anomaly‑replay system to continuously validate and improve root‑cause algorithms.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
