
How Meituan Built a Scalable Autonomous Database System to Slash MTTR

This article details Meituan's journey from rapid database growth and operational bottlenecks to a multi‑year roadmap that combines platform‑level monitoring, rule‑based and AI‑enhanced root‑cause analysis, and automated remediation, ultimately delivering measurable improvements in alert accuracy, recall rates, and overall database reliability.

dbaplus Community

Background

Rapid growth of Meituan’s MySQL instances outpaced DBA capacity, causing long MTTR for incidents. Incident analysis showed that ~80% of failure time was spent on diagnosis, highlighting the need for automated data collection, observability, and root‑cause identification.

Solution Overview

The team defined a four‑phase evolution: improve alert analysis for short‑term ROI, build a self‑service platform in the medium term, move to AI‑driven automation in the long term, and ultimately reach full autonomy. The design combines expert rules (the GRAI methodology) with AI models for anomaly detection and root‑cause classification.

Technical Architecture

Top‑Level Design

Four evolutionary steps – platformization, self‑service, intelligence, automation – are illustrated in the layered diagram.

Data Collection Layer

Data collection initially relied on pcap‑based packet capture to avoid kernel‑level dependencies; a transition to kernel agents is planned. The agent is designed to run with minimal impact on MySQL instances while still delivering high‑quality data.
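As a rough illustration of what a capture‑based agent has to do once it holds a TCP payload, the sketch below extracts the SQL text from a MySQL COM_QUERY packet. It relies only on the public MySQL wire format (3‑byte little‑endian length, 1‑byte sequence id, command byte 0x03). The function names are hypothetical, and the sketch assumes an unencrypted, uncompressed connection with no query attributes and a packet that fits in one TCP segment; it is not Meituan's agent code.

```python
from typing import Optional

COM_QUERY = 0x03  # MySQL client command byte for a text-protocol query


def parse_com_query(payload: bytes) -> Optional[str]:
    """Return the SQL text if `payload` starts with a complete COM_QUERY packet."""
    if len(payload) < 5:
        return None
    # Packet header: 3-byte little-endian body length, then 1-byte sequence id.
    body_len = int.from_bytes(payload[0:3], "little")
    body = payload[4:4 + body_len]
    if body_len == 0 or len(body) < body_len or body[0] != COM_QUERY:
        return None  # truncated packet or a different command
    return body[1:].decode("utf-8", errors="replace")


def make_packet(sql: str) -> bytes:
    """Hypothetical helper that builds a COM_QUERY packet for demonstration."""
    body = bytes([COM_QUERY]) + sql.encode("utf-8")
    return len(body).to_bytes(3, "little") + b"\x00" + body


if __name__ == "__main__":
    print(parse_com_query(make_packet("SELECT 1")))
```

A production agent would additionally need TCP stream reassembly and handling for multi‑packet queries, which are omitted here.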

Evolution strategy diagram

Compute & Storage Layer

All computation is performed in memory within a single thread or process to maximize throughput.

Raw data is reported with aggressive compression, and memory usage is capped to prevent overflow.

Design principles: full‑in‑memory computation, raw data reporting, extreme compression, and controlled memory consumption.
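One way to picture how these principles fit together is a buffer that holds raw records in memory, flushes a compressed batch once a byte budget would be exceeded, and never grows past roughly that budget. The class name, the JSON encoding, the use of zlib, and the `sink` callback are all assumptions made for this sketch; the article does not describe the actual encoding or transport.

```python
import json
import zlib
from typing import Callable, List


class BoundedBatcher:
    """Buffers raw records in memory and emits compressed batches."""

    def __init__(self, max_bytes: int, sink: Callable[[bytes], None]) -> None:
        self.max_bytes = max_bytes  # approximate cap on buffered payload bytes
        self.sink = sink            # hypothetical reporter, e.g. a queue producer
        self._records: List[bytes] = []
        self._size = 0

    def add(self, record: dict) -> None:
        encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
        # Flush first if this record would push the buffer over budget.
        # A single record larger than max_bytes is still accepted on its own.
        if self._records and self._size + len(encoded) > self.max_bytes:
            self.flush()
        self._records.append(encoded)
        self._size += len(encoded)

    def flush(self) -> None:
        if not self._records:
            return
        self.sink(zlib.compress(b"\n".join(self._records)))
        self._records = []
        self._size = 0


def decode_batch(blob: bytes) -> List[dict]:
    """Inverse of the batch encoding, as a consumer might apply it."""
    return [json.loads(line) for line in zlib.decompress(blob).split(b"\n")]
```

Compressing whole batches rather than single records lets the compressor exploit repetition across records, which is usually where most of the savings come from for highly repetitive SQL traffic.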

Analysis & Decision Layer

Two complementary approaches:

Rule‑based GRAI: expert knowledge distilled into actionable rules (e.g., master‑slave latency detection).

AI models: models are trained offline on historical metrics and serialized; Flink‑driven streaming jobs then load them and apply them in real time to detect anomalies.

The pipeline includes data preprocessing, feature extraction (time‑series, text, domain‑specific), classification, root‑cause ranking, and expansion (SQL behavior analysis, log correlation, dimension drill‑down).
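The offline‑train, serialize, apply‑online pattern described for the AI models can be sketched in a few lines. A mean and standard deviation baseline with a z‑score threshold stands in for whatever model is actually used, JSON stands in for the serialization format, and a plain Python iterator stands in for the Flink stream; all three are assumptions chosen for brevity.

```python
import json
import statistics
from typing import Iterable, Iterator, List, Tuple


def fit_baseline(history: List[float]) -> str:
    """Offline step: learn a baseline from historical metrics and serialize it."""
    model = {
        "mean": statistics.fmean(history),
        "std": statistics.pstdev(history),
    }
    return json.dumps(model)


def detect(
    serialized: str, stream: Iterable[float], threshold: float = 3.0
) -> Iterator[Tuple[float, bool]]:
    """Online step: load the serialized model and flag anomalous points."""
    model = json.loads(serialized)
    std = model["std"] or 1e-9  # avoid division by zero on a flat history
    for value in stream:
        z_score = abs(value - model["mean"]) / std
        yield value, z_score > threshold


if __name__ == "__main__":
    blob = fit_baseline([10, 11, 9, 10, 10, 11, 9, 10])
    for value, is_anomaly in detect(blob, [10, 10.5, 25]):
        print(value, is_anomaly)
```

Separating the two steps means the streaming side only needs the serialized parameters, so models can be retrained and swapped without redeploying the detection job.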

Key Technical Details

Agent Impact Evaluation

Performance tests measured CPU, memory, and latency overhead on MySQL instances, confirming negligible impact.

Agent impact test

Full‑SQL Aggregation

To handle massive SQL traffic, messages are aggregated per minute using the key:

AggKey = RDS_IP + '_' + DBName + '_' + SQL_Template_ID + '_' + MinuteTimestamp

Data volume is further reduced through four techniques: message compression, pre‑aggregation, dictionary encoding, and minute‑level aggregation.
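A minimal sketch of minute‑level aggregation using the key above follows. The event fields and the aggregated statistics (count, total and maximum latency) are assumptions for illustration; the article does not list which measures are kept per key.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class SqlEvent:
    rds_ip: str
    db_name: str
    template_id: str
    timestamp: int   # Unix seconds
    latency_us: int


@dataclass
class Agg:
    count: int = 0
    total_latency_us: int = 0
    max_latency_us: int = 0


def agg_key(event: SqlEvent) -> str:
    """RDS_IP + '_' + DBName + '_' + SQL_Template_ID + '_' + MinuteTimestamp."""
    minute = event.timestamp - event.timestamp % 60  # floor to the minute
    return f"{event.rds_ip}_{event.db_name}_{event.template_id}_{minute}"


def aggregate(events: Iterable[SqlEvent]) -> Dict[str, Agg]:
    """Collapse individual SQL events into one record per key per minute."""
    out: Dict[str, Agg] = defaultdict(Agg)
    for event in events:
        agg = out[agg_key(event)]
        agg.count += 1
        agg.total_latency_us += event.latency_us
        agg.max_latency_us = max(agg.max_latency_us, event.latency_us)
    return dict(out)
```

Because database names can themselves contain underscores, a key built this way is safe to use for grouping but is not reliably splittable back into its parts; a consumer that needs the components would carry them as separate fields.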

SQL template aggregation design

Compensation Mechanism

Messages delayed beyond one minute are placed into a compensation queue (Mafka) to avoid data loss during traffic spikes. The compensated data is later merged to ensure completeness for downstream analysis.
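The routing and merge logic can be illustrated as follows. An in‑memory deque stands in for the Mafka compensation queue, the aggregate is reduced to a simple per‑minute count, and the class and method names are hypothetical; only the one‑minute lateness rule comes from the text.

```python
from collections import Counter, deque
from typing import Deque, Tuple

LATE_AFTER_SECONDS = 60  # "delayed beyond one minute", per the text

# (aggregation key without the minute part, event timestamp in Unix seconds)
Message = Tuple[str, int]


class MinuteAggregator:
    def __init__(self) -> None:
        self.counts: Counter = Counter()
        # Stand-in for the compensation queue; a real system would use a broker.
        self.compensation: Deque[Message] = deque()

    @staticmethod
    def _bucket(msg: Message) -> Tuple[str, int]:
        key, ts = msg
        return key, ts - ts % 60  # floor to the event's own minute

    def on_message(self, msg: Message, now: int) -> None:
        """Aggregate timely messages; divert late ones instead of dropping them."""
        if now - msg[1] > LATE_AFTER_SECONDS:
            self.compensation.append(msg)
        else:
            self.counts[self._bucket(msg)] += 1

    def merge_compensation(self) -> int:
        """Fold late messages back into the minute they belong to."""
        merged = 0
        while self.compensation:
            self.counts[self._bucket(self.compensation.popleft())] += 1
            merged += 1
        return merged
```

Note that late messages are merged into the bucket of their original event time, not their arrival time, which is what keeps downstream per‑minute statistics complete.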

Full‑SQL compensation design

Root‑Cause Classification Pipeline

Data Collection: performance metrics, status snapshots, system metrics, logs, and hardware signals.

Feature Extraction: time‑series features, textual features, and domain‑specific features derived from database knowledge.

Root‑Cause Classification: preprocessing → feature selection → model inference → ranking.

Root‑Cause Expansion: deeper analysis such as SQL behavior, expert rule correlation, and log mining.
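To make the classification stage concrete, the toy sketch below walks through feature extraction, feature selection, scoring, and ranking on metric time series. The metric names, candidate causes, weights, and the 0.2 selection cutoff are all invented for illustration, and a hand‑written weighted sum stands in for model inference; the article does not disclose the real features or model.

```python
from statistics import fmean
from typing import Dict, List, Tuple

# Hypothetical weights standing in for a trained classifier.
CAUSE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "slow_sql": {"slow_queries": 1.0, "cpu_util": 0.5},
    "connection_surge": {"threads_connected": 1.0, "cpu_util": 0.2},
    "disk_io_bottleneck": {"io_wait": 1.0},
}


def extract_features(
    series: Dict[str, List[float]], recent: int = 3
) -> Dict[str, float]:
    """Relative change of the last `recent` points against the earlier baseline.

    Each series must contain more than `recent` points.
    """
    features = {}
    for name, values in series.items():
        baseline = fmean(values[:-recent])
        current = fmean(values[-recent:])
        features[name] = (current - baseline) / (abs(baseline) or 1.0)
    return features


def rank_causes(series: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    features = extract_features(series)
    # Feature selection: keep only metrics that moved noticeably.
    selected = {k: v for k, v in features.items() if abs(v) >= 0.2}
    # "Inference": weighted sum of selected features per candidate cause.
    scores = {
        cause: sum(w * selected.get(f, 0.0) for f, w in weights.items())
        for cause, weights in CAUSE_WEIGHTS.items()
    }
    # Ranking: highest score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The ranked list is what a root‑cause expansion step would then take as input, drilling into the top candidates with SQL behavior analysis or log correlation.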

Results

Metrics show steady improvement in user‑feedback accuracy and overall root‑cause recall after deploying the platform.

User feedback accuracy
Root‑cause recall

Future Plans

Scale compute and storage capacity to support 3‑5 years of growth.

Gradually roll out autonomous capabilities for low‑risk scenarios.

Build a flexible anomaly‑replay system to continuously validate and improve root‑cause algorithms.
