How Meituan’s Database Autonomy Service Tackles Scale and Reliability Challenges

This article outlines the evolution of Meituan’s Database Autonomy Service (DAS), describing the growing scale‑vs‑operations imbalance, the strategic roadmap for self‑service and AI‑driven diagnostics, detailed architectural designs across data collection, compute/storage, and analysis layers, and the measurable outcomes and future plans for full database autonomy.

ITPUB

Current Situation

Rapid growth in the number of Meituan database instances created a pronounced imbalance between scale and operational capability, leading to frequent failures and a long mean time to repair (MTTR). Analysis of past incidents shows that about 80% of MTTR is spent on manual analysis and fault localization.

Database instance growth trend

Solution Overview

The short‑term focus is to reduce analysis time, which offers a high ROI. The long‑term vision is a platform that enables self‑service, automation, and AI‑assisted root‑cause analysis, forming a virtuous “flywheel” for operations.

Architecture Evolution

The system evolves through four stages: platformization → self‑service → intelligence (expert rules + AI) → automation.

Top‑level architecture design

Data Collection Layer

Data quality is critical; the design balances collection fidelity with instance stability. A temporary pcap‑based packet capture solution is used before migrating to kernel‑level agents.

Agent technical design

Impact tests on MySQL instances show negligible performance degradation.

Agent impact test

Compute & Storage Layer

Design principles include full‑memory computation, raw data reporting with aggressive compression, controlled memory usage, and minimal impact on MySQL instances.

Compute and storage architecture

Full‑SQL statements are aggregated per minute using the key RDS_IP_DBName_SQL_Template_ID_Time_Minute.
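A minimal sketch of this kind of per-minute aggregation is shown here. It is not Meituan's implementation: the templating regexes, the MD5-based template ID, and the names `sql_template`, `template_id`, `aggregation_key`, and `aggregate` are all assumptions chosen for illustration; only the key layout (IP, database name, template ID, minute) comes from the text.

```python
import hashlib
import re
from collections import defaultdict


def sql_template(sql: str) -> str:
    """Replace literals with '?' so statements that differ only in values share a template."""
    sql = re.sub(r"'(?:[^'\\]|\\.)*'", "?", sql)      # quoted string literals
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "?", sql)      # numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()   # normalize whitespace and case


def template_id(template: str) -> str:
    """Hypothetical template ID: a short hash of the normalized template text."""
    return hashlib.md5(template.encode("utf-8")).hexdigest()[:16]


def aggregation_key(rds_ip: str, db_name: str, sql: str, epoch_seconds: int) -> str:
    """Build a key shaped like RDS_IP_DBName_SQL_Template_ID_Time_Minute."""
    minute = int(epoch_seconds) // 60 * 60            # truncate to the start of the minute
    return f"{rds_ip}_{db_name}_{template_id(sql_template(sql))}_{minute}"


def aggregate(events):
    """events: iterable of (rds_ip, db_name, sql, epoch_seconds, latency_ms)."""
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    for rds_ip, db_name, sql, ts, latency_ms in events:
        s = stats[aggregation_key(rds_ip, db_name, sql, ts)]
        s["count"] += 1
        s["total_ms"] += latency_ms
        s["max_ms"] = max(s["max_ms"], latency_ms)
    return dict(stats)
```

Keying on the template rather than the raw statement keeps the number of aggregation buckets bounded by the number of distinct query shapes, not by the number of distinct parameter values.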

SQL template aggregation

Compression pipeline (message compression → pre‑aggregation → dictionary → minute‑level aggregation) reduces bandwidth and storage.
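To make the idea concrete, the following sketch combines three of these stages on a single batch: pre-aggregation, dictionary encoding, and message compression. It assumes JSON payloads and zlib, neither of which is confirmed by the source, and the function names `encode_batch` and `decode_batch` are hypothetical.

```python
import json
import zlib
from collections import Counter


def encode_batch(templates):
    """Pre-aggregate a batch of SQL templates, dictionary-encode them, then compress."""
    counts = Counter(templates)        # pre-aggregation: one entry per distinct template
    dictionary = sorted(counts)        # dictionary: each template text is stored once
    payload = {
        "dict": dictionary,
        # refer to templates by their dictionary index instead of repeating the text
        "counts": [[i, counts[t]] for i, t in enumerate(dictionary)],
    }
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)          # message compression (zlib is an assumption)


def decode_batch(blob):
    """Inverse of encode_batch: returns {template: count}."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {payload["dict"][i]: n for i, n in payload["counts"]}
```

This toy version keeps only per-template counts; a real pipeline would presumably carry latency and row statistics as well, which the source does not detail.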

Full‑SQL compression results

Messages delayed by more than one minute are placed into a compensation queue (Mafka) so that late data is not lost.
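The routing decision can be sketched as follows. This is an illustration under stated assumptions: an in-process `deque` stands in for the Mafka topic, and the class `MinuteAggregator` and its methods are hypothetical names, not part of DAS.

```python
from collections import defaultdict, deque

LATE_THRESHOLD_SECONDS = 60  # from the text: delays beyond one minute go to compensation


class MinuteAggregator:
    def __init__(self):
        self.counts = defaultdict(int)     # in-memory aggregates for the real-time path
        self.compensation_queue = deque()  # stand-in for a Mafka topic (assumption)

    def on_message(self, key, event_ts, arrival_ts):
        """Route a message to the real-time path or to the compensation queue."""
        if arrival_ts - event_ts > LATE_THRESHOLD_SECONDS:
            self.compensation_queue.append((key, event_ts))
            return "compensation"
        self.counts[(key, event_ts // 60 * 60)] += 1
        return "realtime"

    def drain_compensation(self):
        """Merge late messages into their original minute buckets; return how many were merged."""
        merged = 0
        while self.compensation_queue:
            key, event_ts = self.compensation_queue.popleft()
            self.counts[(key, event_ts // 60 * 60)] += 1
            merged += 1
        return merged
```

Late messages are merged into the bucket of the minute in which they occurred, not the minute in which they arrived, so the aggregates converge to the correct values once the queue is drained.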

Full‑SQL compensation design

Analysis & Decision Layer

Root‑cause inference combines expert rules (GRAI methodology) with AI models. The roadmap progresses from rule‑only to AI‑dominant stages.

Analysis decision design

Rule‑Based Approach – Using the GRAI framework, reliable rules (e.g., master‑slave latency) are extracted through continuous replay and refinement.
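As an illustration of replay-driven rule refinement, the sketch below evaluates one candidate rule against labeled historical cases. The 30-second threshold, the metric name `seconds_behind_master`, and the function names are assumptions made for the example; the source does not describe the actual rules or their thresholds.

```python
def replication_lag_rule(metrics, threshold_seconds=30):
    """Hypothetical expert rule: fire when replica lag exceeds a threshold."""
    return metrics.get("seconds_behind_master", 0) > threshold_seconds


def replay(rule, cases):
    """Replay a rule over labeled cases.

    cases: list of (metrics_dict, is_real_incident).
    Returns (hits, false_alarms, misses).
    """
    hits = false_alarms = misses = 0
    for metrics, is_real_incident in cases:
        fired = rule(metrics)
        if fired and is_real_incident:
            hits += 1
        elif fired and not is_real_incident:
            false_alarms += 1
        elif not fired and is_real_incident:
            misses += 1
    return hits, false_alarms, misses
```

Replaying the same case set after each threshold change gives a direct before-and-after comparison, which is one way a rule could be judged reliable enough to keep.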

Expert rule refinement

AI‑Based Detection – Models are built from historical metrics (pre‑processing, feature extraction, classification), then serialized so that Flink jobs can load them for streaming detection.
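The train, serialize, load, and detect cycle can be sketched with a deliberately simple model. A z-score detector is swapped in here for whatever classifier DAS actually uses, JSON is an assumed serialization format, and the streaming side is reduced to a plain function call in place of a Flink operator.

```python
import json
import statistics


class ZScoreDetector:
    """Toy stand-in for a trained model: flags values far from the historical mean."""

    def __init__(self, threshold=3.0, mean=0.0, std=1.0):
        self.threshold = threshold
        self.mean = mean
        self.std = std

    @staticmethod
    def preprocess(series):
        # Pre-processing: drop missing samples and coerce to float
        return [float(x) for x in series if x is not None]

    def fit(self, history):
        clean = self.preprocess(history)
        self.mean = statistics.fmean(clean)
        self.std = statistics.pstdev(clean) or 1.0   # avoid division by zero
        return self

    def feature(self, value):
        # Feature extraction: distance from the mean in standard deviations
        return (float(value) - self.mean) / self.std

    def is_anomaly(self, value):
        # Classification: a simple threshold on the extracted feature
        return abs(self.feature(value)) > self.threshold

    def to_json(self):
        return json.dumps({"threshold": self.threshold, "mean": self.mean, "std": self.std})

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))
```

In this sketch the offline side would call `fit` and `to_json`, and the streaming side would call `from_json` once at startup and `is_anomaly` per incoming point.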

AI anomaly detection design

Evaluation Metrics

A closed‑loop process (alert → replay → root‑cause analysis → improvement → verification) has improved both recall and precision of root‑cause detection.
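For reference, precision and recall for root-cause detection can be computed from replayed cases as sketched below. The definitions used here are the standard ones and are an assumption about how DAS measures them: precision is the share of issued diagnoses that were correct, and recall is the share of all incidents for which the correct root cause was given.

```python
def precision_recall(diagnoses):
    """diagnoses: list of (predicted_cause, actual_cause).

    predicted_cause is None when no root cause was reported for the incident.
    """
    issued = [(p, a) for p, a in diagnoses if p is not None]
    correct = sum(1 for p, a in issued if p == a)
    precision = correct / len(issued) if issued else 0.0
    recall = correct / len(diagnoses) if diagnoses else 0.0
    return precision, recall
```

Tracking both numbers matters because tightening rules to cut false diagnoses tends to raise precision at the cost of recall, and loosening them does the opposite.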

User feedback accuracy
Root‑cause recall rate

Key User Cases

Integration with Meituan’s internal IM (DaXiang) enables automatic alert push, root‑cause view, one‑click remediation, and feedback tracking.

Lock contention alert example
Slow‑query optimization suggestion

Future Outlook

After two years, DAS has solidified core capabilities (pre‑deployment SQL risk detection, index‑change safety checks). Future work focuses on three directions:

Enhance compute and storage capacity to support 3‑5 years of growth.

Roll out autonomous SOP workflows in three stages: (1) link root‑cause diagnosis to SOP documents; (2) platformize SOP management; (3) automate low‑risk actions to achieve full database autonomy.

Build a flexible anomaly replay system for continuous model validation and improvement.
