Database Autonomy Service (DAS): Architecture, Design, and Implementation
The Database Autonomy Service (DAS) is a platform that combines big-data techniques, machine learning, and expert knowledge to automatically collect, compress, and analyze MySQL metrics. It provides self-service fault detection, root-cause diagnosis, and security management, which reduces manual effort, shortens MTTR (mean time to recovery), and supports Meituan's rapid database growth.
DAS (Database Autonomy Service) is a platform-oriented service for developers and DBAs that provides database performance analysis, fault diagnosis, and security management. It leverages big‑data techniques, machine learning, and expert knowledge to reduce the complexity of database operations and minimize human‑induced failures.
1. Current Situation and Problems
Rapid growth of Meituan’s business has led to a fast‑increasing number of database instances. The imbalance between instance scale and operational capability causes a surge in incidents, making manual analysis and troubleshooting unsustainable. The article shows a growth trend chart of database instances (Fig. 1).
Stability requirements are high, yet key metrics are missing, so troubleshooting depends on hands-on work by DBAs and MTTR is long. The team therefore aims to provide self-service or automated fault localization to shorten handling time.
2. Solution Thinking
The proposed strategy addresses both the short-term pressure and long-term development:
Short‑term: automate frequent operational tasks, provide self‑service tools.
Long‑term: build a solid foundation, empower upper‑level services, and achieve database autonomy.
A scientific evaluation system is introduced to continuously track product quality, using controllable input metrics and output metrics to guide improvements.
3. Technical Scheme
3.1 Top‑Level Architecture Design
The architecture follows a four-step evolution: platformization → self-service → intelligence (expert knowledge + AI) → automation. The top-level diagram (Fig. 7) illustrates the progression from the status as of 2021 to the envisioned future.
3.2 Data Collection Layer
The collection layer is the foundation; it must guarantee data quality while preserving instance stability. A hybrid approach is adopted: in the short term, packet capture (pcap) serves as a transitional method until kernel-level collection is widely available. The Agent design (Fig. 8) and its impact on databases (Fig. 9) are presented.
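To make the packet-capture idea concrete, here is a minimal sketch of how an agent could recover SQL text from a captured TCP payload. It is not the DAS Agent's implementation. It relies only on the MySQL client/server packet layout (a 3-byte little-endian body length, a 1-byte sequence id, then a command byte, where 0x03 is COM_QUERY). It assumes an unencrypted, uncompressed connection, a payload holding one complete packet, and no query attributes negotiated. The function names are hypothetical.

```python
from typing import Optional

COM_QUERY = 0x03  # MySQL command byte for a text-protocol query


def parse_com_query(payload: bytes) -> Optional[str]:
    """Return the SQL text if payload is one complete COM_QUERY packet, else None."""
    if len(payload) < 5:
        return None
    # Packet header: 3-byte little-endian body length + 1-byte sequence id.
    body_len = int.from_bytes(payload[0:3], "little")
    seq_id = payload[3]
    # A client command starts a new exchange, so its sequence id is 0.
    if seq_id != 0 or body_len < 1 or len(payload) < 4 + body_len:
        return None
    if payload[4] != COM_QUERY:
        return None
    return payload[5:4 + body_len].decode("utf-8", errors="replace")


def build_com_query(sql: str) -> bytes:
    """Helper for illustration only: build a COM_QUERY packet from SQL text."""
    body = bytes([COM_QUERY]) + sql.encode("utf-8")
    return len(body).to_bytes(3, "little") + b"\x00" + body
```

A real agent would also have to reassemble TCP streams and handle packets split across segments, which this sketch deliberately leaves out.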
3.3 Compute & Storage Layer
The design follows four principles:
Compute entirely in memory to achieve high performance.
Report raw data with aggressive compression.
Control memory consumption carefully to avoid OOM (out-of-memory) failures.
Minimize impact on MySQL instances by deferring heavy computation.
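As an illustration of reporting raw data in compressed form, the sketch below batches records and compresses them before sending. The source does not name the codec or the wire format; zlib and newline-delimited JSON are stand-ins chosen because both ship with the Python standard library, and `pack_batch` / `unpack_batch` are hypothetical names.

```python
import json
import zlib
from typing import Dict, List


def pack_batch(records: List[Dict], level: int = 6) -> bytes:
    """Serialize records as newline-delimited JSON and compress with zlib (assumed codec)."""
    lines = (json.dumps(r, separators=(",", ":"), sort_keys=True) for r in records)
    return zlib.compress("\n".join(lines).encode("utf-8"), level)


def unpack_batch(blob: bytes) -> List[Dict]:
    """Inverse of pack_batch, run on the receiving side."""
    text = zlib.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]
```

Batching matters here: SQL traffic is highly repetitive, so compressing many records together generally yields a much better ratio than compressing each record alone.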
Full‑SQL data is aggregated per minute using a composite key (RDS_IP + DBName + SQL_Template_ID + Minute). Aggregation and compression designs (Figs. 11‑13) reduce bandwidth and storage. A compensation mechanism (Fig. 14) handles delayed messages to ensure completeness.
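The per-minute aggregation can be sketched as follows. Only the composite key (RDS_IP + DBName + SQL_Template_ID + Minute) comes from the text above; everything else is an assumption made for illustration: the way a template ID is derived (literals replaced by `?`, then hashed), the statistics kept per bucket, and the late-message handling, which here routes messages for already-flushed minutes into a separate buffer whose deltas are merged downstream. The actual compensation design in Fig. 14 may differ.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple

Key = Tuple[str, str, str, int]  # (rds_ip, db_name, sql_template_id, minute)


def template_id(sql: str) -> str:
    """Hypothetical template ID: strip literals, normalize, hash."""
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)   # quoted string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)     # numeric literals
    s = re.sub(r"\s+", " ", s).strip().lower()
    return hashlib.sha256(s.encode("utf-8")).hexdigest()[:16]


@dataclass
class Stats:
    count: int = 0
    total_ms: float = 0.0
    max_ms: float = 0.0

    def add(self, latency_ms: float) -> None:
        self.count += 1
        self.total_ms += latency_ms
        self.max_ms = max(self.max_ms, latency_ms)

    def merge(self, other: "Stats") -> None:
        self.count += other.count
        self.total_ms += other.total_ms
        self.max_ms = max(self.max_ms, other.max_ms)


class MinuteAggregator:
    def __init__(self) -> None:
        self.buckets: Dict[Key, Stats] = defaultdict(Stats)
        self.late: Dict[Key, Stats] = defaultdict(Stats)  # assumed compensation buffer
        self.watermark = 0  # minutes below this have already been flushed

    def add(self, rds_ip: str, db_name: str, sql: str,
            ts_sec: float, latency_ms: float) -> Key:
        minute = int(ts_sec) // 60
        key = (rds_ip, db_name, template_id(sql), minute)
        # Messages for an already-flushed minute go to the late buffer.
        target = self.late if minute < self.watermark else self.buckets
        target[key].add(latency_ms)
        return key

    def flush(self, before_minute: int) -> Dict[Key, Stats]:
        """Pop and return every bucket whose minute is older than before_minute."""
        self.watermark = max(self.watermark, before_minute)
        done = {k: v for k, v in self.buckets.items() if k[3] < before_minute}
        for k in done:
            del self.buckets[k]
        return done

    def flush_late(self) -> Dict[Key, Stats]:
        """Pop late deltas; the consumer merges them into stored rows."""
        late, self.late = dict(self.late), defaultdict(Stats)
        return late
```

Because each bucket holds only counters, a late delta can be merged into the stored row with `Stats.merge` without replaying the original messages.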
3.4 Analysis & Decision Layer
The layer combines rule‑based expert knowledge (GRAI methodology) and AI algorithms. Four development stages are defined: pure rule, rule + AI pilot, rule + AI parallel, and AI‑dominant. The AI‑driven anomaly detection pipeline (Fig. 17) includes preprocessing, feature extraction, model training, and Flink‑based streaming detection. Root‑cause diagnosis (still under construction) will subscribe to alerts, collect data, infer causes, and provide actionable recommendations.
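The streaming detection step can be illustrated with a deliberately simple stand-in. The source does not describe the model used, so the sketch below swaps in a rolling median/MAD (robust z-score) rule and runs in plain Python rather than Flink; preprocessing and feature extraction are collapsed into "one numeric metric per observation". The class name, window size, and threshold are all hypothetical.

```python
from collections import deque
from statistics import median


class RollingMadDetector:
    """Flag a point whose robust z-score against a sliding window is too large."""

    def __init__(self, window: int = 60, threshold: float = 3.5,
                 min_points: int = 10) -> None:
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.min_points = min_points

    def observe(self, x: float) -> bool:
        is_anomaly = False
        if len(self.values) >= self.min_points:
            med = median(self.values)
            mad = median(abs(v - med) for v in self.values)
            # 1.4826 scales MAD to match the standard deviation of normal data;
            # the floor avoids division by zero on a perfectly flat series.
            scale = max(1.4826 * mad, 1e-9)
            is_anomaly = abs(x - med) / scale > self.threshold
        if not is_anomaly:
            # Keep anomalies out of the baseline so a spike does not mask the next one.
            self.values.append(x)
        return is_anomaly
```

A usage example would feed one value per metric per minute, for instance the per-minute query latency of a single instance, and raise an alert whenever `observe` returns `True`.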
4. Achievements
Metrics such as user‑feedback accuracy (Fig. 19) and overall recall rate (Fig. 20) demonstrate continuous improvement. User cases show end‑to‑end alert handling, automatic group creation, and slow‑query optimization suggestions (Figs. 21‑23).
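The two metrics can be read as simple ratios. The source does not give their exact definitions, so the sketch below is an assumption: accuracy is taken as the share of user feedback that rates a diagnosis as correct, and recall as the share of real incidents that the system detected. Both function names are hypothetical.

```python
def feedback_accuracy(correct: int, incorrect: int) -> float:
    """Assumed definition: correct ratings / all ratings received."""
    total = correct + incorrect
    return correct / total if total else 0.0


def recall_rate(detected: int, missed: int) -> float:
    """Standard recall: detected incidents / all real incidents."""
    total = detected + missed
    return detected / total if total else 0.0
```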
5. Future Outlook
Enhance compute‑storage capacity to support 3‑5 years of growth.
Roll out autonomy in three steps: link root‑cause diagnosis with SOP documents, platformize SOPs, and automate low‑risk scenarios.
Build a flexible anomaly replay system for continuous validation of root‑cause algorithms.
Author
Jin Long, Meituan Basic Technology Department / Database R&D Center / Database Platform Team.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
