Database Autonomy Service (DAS): Architecture, Design, and Implementation

The Database Autonomy Service (DAS) is a platform that combines big-data techniques, machine learning, and expert knowledge to automatically collect, compress, and analyze MySQL metrics. It provides self-service fault detection, root-cause diagnosis, and security management, which reduces manual effort, shortens mean time to repair (MTTR), and supports the rapid growth of Meituan's database fleet.

Meituan Technology Team

May 5, 2022

DAS (Database Autonomy Service) is a platform-oriented service for developers and DBAs that provides database performance analysis, fault diagnosis, and security management. It leverages big‑data techniques, machine learning, and expert knowledge to reduce the complexity of database operations and minimize human‑induced failures.

1. Current Situation and Problems

The rapid growth of Meituan's business has driven a fast increase in the number of database instances. Operational capacity has not kept pace with the instance count, so incidents have surged and manual analysis and troubleshooting are no longer sustainable. Fig. 1 shows the growth trend of database instances.

High stability requirements, combined with missing key metrics, force teams to rely on hands-on troubleshooting by DBAs, which results in a long MTTR. The team therefore aims to provide self-service or automated fault location to shorten handling time.

2. Solution Thinking

The proposed strategy addresses both the short-term pain points and long-term development:

Short‑term: automate frequent operational tasks, provide self‑service tools.

Long‑term: build a solid foundation, empower upper‑level services, and achieve database autonomy.

A scientific evaluation system is introduced to continuously track product quality, using controllable input metrics and output metrics to guide improvements.

3. Technical Scheme

3.1 Top‑Level Architecture Design

The architecture follows a four-step evolution: platformization → self-service → intelligence (expert knowledge + AI) → automation. The top-level diagram (Fig. 7) illustrates the progression from the status as of 2021 to the envisioned future.

3.2 Data Collection Layer

The collection layer is the foundation; it must guarantee data quality without compromising instance stability. A hybrid approach is adopted: packet capture (pcap) serves as a short-term transition until kernel-level collection is widely deployed. The Agent design (Fig. 8) and its impact on databases (Fig. 9) are presented.

3.3 Compute & Storage Layer

Compute entirely in memory to achieve high performance.

Report raw data with aggressive compression.

Control memory consumption carefully to avoid out-of-memory (OOM) failures.

Minimize impact on MySQL instances by postponing heavy computation.
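
As an illustration of reporting raw data in compressed form, here is a minimal Python sketch. It assumes records are batched as JSON and compressed with zlib; the source does not state the serialization format or compression codec DAS uses, and the names `pack_batch`, `unpack_batch`, and the record fields are hypothetical.

```python
import json
import zlib


def pack_batch(records, level=9):
    """Serialize a batch of raw records and compress it before reporting.

    JSON + zlib are assumptions made for this sketch, not the DAS wire format.
    """
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level)


def unpack_batch(payload):
    """Reverse of pack_batch: decompress and deserialize on the receiving side."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Hypothetical raw records: the SQL text repeats heavily, so it compresses well.
records = [
    {"db": "orders", "sql": "SELECT * FROM t WHERE id = ?", "latency_us": 100 + i}
    for i in range(1000)
]

raw_size = len(json.dumps(records, separators=(",", ":")).encode("utf-8"))
payload = pack_batch(records)
print(f"raw={raw_size} bytes, compressed={len(payload)} bytes")
```

Batching before compressing matters here: repeated SQL text within a batch is what gives the compressor something to exploit, and the heavy work (parsing, aggregation) can then happen off the database host.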

Full‑SQL data is aggregated per minute using a composite key (RDS_IP + DBName + SQL_Template_ID + Minute). Aggregation and compression designs (Figs. 11‑13) reduce bandwidth and storage. A compensation mechanism (Fig. 14) handles delayed messages to ensure completeness.
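
The per-minute aggregation on the composite key can be sketched as follows. This is a minimal Python illustration under stated assumptions: the event field names (`rds_ip`, `db_name`, `template_id`, `ts`, `latency_us`) and the statistics kept per bucket are hypothetical, and the SQL template ID is assumed to be computed upstream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Stats:
    """Per-bucket statistics; the chosen fields are illustrative only."""
    count: int = 0
    total_latency_us: int = 0
    max_latency_us: int = 0


def minute_bucket(ts_seconds):
    """Truncate a Unix timestamp (seconds) to the start of its minute."""
    return int(ts_seconds) // 60 * 60


def aggregate(events):
    """Group events by (RDS_IP, DBName, SQL_Template_ID, Minute)."""
    buckets = defaultdict(Stats)
    for e in events:
        key = (e["rds_ip"], e["db_name"], e["template_id"], minute_bucket(e["ts"]))
        stats = buckets[key]
        stats.count += 1
        stats.total_latency_us += e["latency_us"]
        stats.max_latency_us = max(stats.max_latency_us, e["latency_us"])
    return dict(buckets)


events = [
    {"rds_ip": "10.0.0.1", "db_name": "orders", "template_id": "t1", "ts": 120, "latency_us": 100},
    {"rds_ip": "10.0.0.1", "db_name": "orders", "template_id": "t1", "ts": 150, "latency_us": 300},
    {"rds_ip": "10.0.0.1", "db_name": "orders", "template_id": "t1", "ts": 185, "latency_us": 50},
]
result = aggregate(events)
for key, stats in sorted(result.items()):
    print(key, stats)
```

Because the minute is derived from the event's own timestamp rather than its arrival time, a delayed message still maps to the correct key. How already-flushed buckets are updated afterward (the compensation mechanism of Fig. 14) is not modeled in this sketch.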

3.4 Analysis & Decision Layer

The layer combines rule‑based expert knowledge (GRAI methodology) and AI algorithms. Four development stages are defined: pure rule, rule + AI pilot, rule + AI parallel, and AI‑dominant. The AI‑driven anomaly detection pipeline (Fig. 17) includes preprocessing, feature extraction, model training, and Flink‑based streaming detection. Root‑cause diagnosis (still under construction) will subscribe to alerts, collect data, infer causes, and provide actionable recommendations.
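
To make the streaming-detection step concrete, here is a minimal Python sketch of a detector that scores each new metric point against a rolling baseline. The source does not name the model DAS trains, so a rolling z-score is swapped in as a simple statistical stand-in, and plain Python replaces Flink. The class name `RollingZScoreDetector` and its `window`, `threshold`, and `min_points` parameters are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScoreDetector:
    """Flags a point as anomalous if it deviates strongly from recent history.

    A stand-in for a trained model; not the algorithm described in the source.
    """

    def __init__(self, window=30, threshold=3.0, min_points=10):
        self.values = deque(maxlen=window)  # recent non-anomalous points
        self.threshold = threshold          # z-score above which we alert
        self.min_points = min_points        # warm-up length before scoring

    def observe(self, value):
        """Score one point from the stream and return True if it is anomalous."""
        is_anomaly = False
        if len(self.values) >= self.min_points:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            else:
                # Flat baseline: any change at all is treated as a deviation.
                is_anomaly = value != mu
        if not is_anomaly:
            # Keep anomalies out of the baseline so they do not mask later ones.
            self.values.append(value)
        return is_anomaly


detector = RollingZScoreDetector()
stream = [10, 11] * 10 + [100]
flags = [detector.observe(v) for v in stream]
print(flags[-1])
```

In a rule-plus-AI setup like the one described, a detector of this kind would run alongside expert rules, with its alerts feeding the downstream root-cause diagnosis.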

4. Achievements

Metrics such as user‑feedback accuracy (Fig. 19) and overall recall rate (Fig. 20) demonstrate continuous improvement. User cases show end‑to‑end alert handling, automatic group creation, and slow‑query optimization suggestions (Figs. 21‑23).
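
The two headline metrics can be computed along the following lines. This Python sketch assumes that user-feedback accuracy is the share of diagnoses that users marked as correct, and that recall is the share of real incidents the system detected; the source does not give the exact definitions, and the function names and sample data are hypothetical.

```python
def feedback_accuracy(feedback):
    """Share of diagnoses that users confirmed as correct.

    feedback: list of booleans, True when the user marked the diagnosis correct.
    Returns None when there is no feedback to evaluate.
    """
    if not feedback:
        return None
    return sum(feedback) / len(feedback)


def recall(detected_ids, actual_ids):
    """Share of real incidents that the system detected.

    Returns None when there were no real incidents in the period.
    """
    actual = set(actual_ids)
    if not actual:
        return None
    return len(actual & set(detected_ids)) / len(actual)


# Hypothetical sample data for one evaluation period.
accuracy = feedback_accuracy([True, True, False, True])
coverage = recall(detected_ids={"a", "b", "x"}, actual_ids={"a", "b", "c", "d"})
print(accuracy, coverage)
```

Tracking both matters because they pull in opposite directions: alerting on more cases tends to raise recall while lowering the share of diagnoses users agree with.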

5. Future Outlook

Enhance compute‑storage capacity to support 3‑5 years of growth.

Roll out autonomy in three steps: link root‑cause diagnosis with SOP documents, platformize SOPs, and automate low‑risk scenarios.

Build a flexible anomaly replay system for continuous validation of root‑cause algorithms.

Author

Jin Long, Meituan Basic Technology Department / Database R&D Center / Database Platform Team.
