How Meituan’s Database Autonomy Service Tackles Scale and Reliability Challenges
This article outlines the evolution of Meituan’s Database Autonomy Service (DAS), describing the growing scale‑vs‑operations imbalance, the strategic roadmap for self‑service and AI‑driven diagnostics, detailed architectural designs across data collection, compute/storage, and analysis layers, and the measurable outcomes and future plans for full database autonomy.
Current Situation
Rapid growth in the number of Meituan database instances created a pronounced imbalance between scale and operational capacity, leading to frequent failures and long MTTR. Analysis of past incidents shows that about 80% of MTTR is spent on manual analysis and fault localization.
Solution Overview
The short‑term focus is to reduce analysis time, the highest‑ROI target given where MTTR is spent. The long‑term vision is a platform that enables self‑service, automation, and AI‑assisted root‑cause analysis, forming a virtuous “flywheel” for operations.
Architecture Evolution
The system evolves through four stages: platformization → self‑service → intelligence (expert rules + AI) → automation.
Data Collection Layer
Data quality is critical, so the design balances collection fidelity against instance stability. As an interim measure, a pcap‑based packet capture solution is used until collection migrates to kernel‑level agents.
Impact tests on MySQL instances show negligible performance degradation.
Compute & Storage Layer
Design principles include full‑memory computation, raw data reporting with aggressive compression, controlled memory usage, and minimal impact on MySQL instances.
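Full‑SQL statements are aggregated per minute under the composite key RDS_IP_DBName_SQL_Template_ID_Time_Minute, which combines the RDS instance IP, database name, SQL template ID, and minute timestamp.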
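The per-minute aggregation can be sketched in Python as follows. This is a minimal illustration, assuming each captured statement has already been mapped to a template ID; the `SqlEvent` fields, the epoch-second timestamp, and the choice of count and latency statistics are hypothetical and not taken from the DAS implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SqlEvent:
    """One captured statement (hypothetical shape, for illustration only)."""
    rds_ip: str
    db_name: str
    template_id: str
    epoch_seconds: int
    latency_ms: float


def minute_key(event: SqlEvent) -> str:
    # Truncate the timestamp to the start of its minute.
    minute = event.epoch_seconds // 60 * 60
    return f"{event.rds_ip}_{event.db_name}_{event.template_id}_{minute}"


def aggregate(events):
    """Fold raw events into one bucket per composite key."""
    buckets = defaultdict(
        lambda: {"count": 0, "total_latency_ms": 0.0, "max_latency_ms": 0.0}
    )
    for event in events:
        bucket = buckets[minute_key(event)]
        bucket["count"] += 1
        bucket["total_latency_ms"] += event.latency_ms
        bucket["max_latency_ms"] = max(bucket["max_latency_ms"], event.latency_ms)
    return dict(buckets)
```

Two executions of the same template within the same minute collapse into a single bucket, so the volume stored grows with the number of distinct templates rather than with raw query volume.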
The compression pipeline (message compression → pre‑aggregation → dictionary → minute‑level aggregation) reduces both bandwidth and storage consumption.
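To make the dictionary and message-compression steps concrete, here is a rough Python sketch. It is an assumption-laden illustration: the `TemplateDictionary` class, the JSON row layout, and the use of `zlib` are stand-ins chosen for brevity, and the article does not say which codec or wire format DAS uses.

```python
import json
import zlib


class TemplateDictionary:
    """Maps repeated SQL template text to small integer IDs (hypothetical helper)."""

    def __init__(self):
        self._ids = {}

    def encode(self, template: str) -> int:
        # Reuse the existing ID, or assign the next free one.
        return self._ids.setdefault(template, len(self._ids))

    def templates(self):
        # Dicts preserve insertion order, so the list index equals the ID.
        return list(self._ids)


def pack(records, dictionary: TemplateDictionary) -> bytes:
    """records: iterable of (template_text, count) pairs."""
    rows = [[dictionary.encode(template), count] for template, count in records]
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def unpack(blob: bytes, templates):
    rows = json.loads(zlib.decompress(blob).decode("utf-8"))
    return [(templates[template_id], count) for template_id, count in rows]
```

Replacing long template strings with integers before compressing means each distinct template text is shipped once, in the dictionary, rather than once per record.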
Delayed messages beyond one minute are placed into a compensation queue (Mafka) to avoid data loss.
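The late-message handling can be pictured with a small Python sketch. It assumes lateness is measured as arrival time minus event time, and it uses an in-memory `deque` as a stand-in for the Mafka compensation queue; the class name `MinuteWindowRouter` and its interface are hypothetical.

```python
from collections import deque


class MinuteWindowRouter:
    """Routes messages either to normal aggregation or to a compensation queue."""

    def __init__(self, max_delay_seconds: int = 60):
        self.max_delay_seconds = max_delay_seconds
        self.on_time = []
        # Stand-in for a Mafka topic; a real system would publish to the broker.
        self.compensation = deque()

    def route(self, message, event_time: int, arrival_time: int) -> str:
        delay = arrival_time - event_time
        if delay > self.max_delay_seconds:
            # Too late for its minute window: park it for later reprocessing.
            self.compensation.append(message)
            return "compensation"
        self.on_time.append(message)
        return "on_time"
```

A separate consumer can then drain the compensation queue and merge late records into the already-written minute buckets, so a slow reporter delays data rather than dropping it.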
Analysis & Decision Layer
Root‑cause inference combines expert rules (GRAI methodology) with AI models. The roadmap progresses from rule‑only to AI‑dominant stages.
Rule‑Based Approach – Using the GRAI framework, reliable rules (e.g., master‑slave latency) are extracted through continuous replay and refinement.
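An expert rule of this kind can be expressed as a named predicate over a metrics snapshot. The Python sketch below is illustrative only: the `Rule` structure, the metric name `replication_lag_seconds`, and the 30-second threshold are assumptions, not values from the article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    root_cause: str
    predicate: Callable[[Dict[str, float]], bool]


def diagnose(metrics: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the root causes of every rule whose predicate fires."""
    return [rule.root_cause for rule in rules if rule.predicate(metrics)]


# Hypothetical rule set; thresholds would be tuned through incident replay.
RULES = [
    Rule(
        name="replication_lag",
        root_cause="master-slave replication delay",
        predicate=lambda m: m.get("replication_lag_seconds", 0) > 30,
    ),
]
```

Keeping each rule as data makes it straightforward to replay historical incidents against the rule set and adjust a threshold without touching the evaluation logic.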
AI‑Based Detection – A model is built from historical metrics (pre‑processing, feature extraction, classification) and then serialized so that it can run inside Flink for streaming detection.
Evaluation Metrics
A closed‑loop process (alert → replay → root‑cause analysis → improvement → verification) has improved both recall and precision of root‑cause detection.
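Recall and precision can be tracked from replayed incidents with a short Python helper. This sketch assumes each case is a pair of (predicted root cause, confirmed root cause), with `None` for a missed prediction; that data shape is hypothetical, and the article does not report the actual figures.

```python
def precision_recall(cases):
    """cases: list of (predicted, actual) root-cause labels; predicted may be None."""
    true_positives = sum(1 for p, a in cases if p is not None and p == a)
    predicted = sum(1 for p, _ in cases if p is not None)
    actual = sum(1 for _, a in cases if a is not None)
    # Precision: share of predictions that were right.
    precision = true_positives / predicted if predicted else 0.0
    # Recall: share of confirmed root causes that were found.
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

Recomputing both numbers after every rule or model change gives the verification step of the loop something objective to compare against.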
Key User Cases
Integration with Meituan’s internal IM (DaXiang) enables automatic alert push, root‑cause view, one‑click remediation, and feedback tracking.
Future Outlook
After two years, DAS has solidified core capabilities (pre‑deployment SQL risk detection, index‑change safety checks). Future work focuses on three directions:
Enhance compute and storage capacity to support 3‑5 years of growth.
Roll out autonomous SOP workflows in three stages: (1) link root‑cause diagnosis to SOP documents; (2) platformize SOP management; (3) automate low‑risk actions to achieve full database autonomy.
Build a flexible anomaly replay system for continuous model validation and improvement.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
