
How Meituan’s Database Autonomy Service Tackles Scale and Reliability Challenges

This article outlines the evolution of Meituan’s Database Autonomy Service (DAS), describing the growing scale‑vs‑operations imbalance, the strategic roadmap for self‑service and AI‑driven diagnostics, detailed architectural designs across data collection, compute/storage, and analysis layers, and the measurable outcomes and future plans for full database autonomy.

ITPUB

May 9, 2022

Current Situation

Rapid growth in the number of Meituan database instances created a pronounced imbalance between scale and operational capability, leading to frequent failures and long MTTR. Analysis of past incidents shows that about 80% of MTTR is spent on manual analysis and fault localization.

Solution Overview

The short‑term focus is to reduce analysis time, since that is where most of the MTTR goes and the ROI is high. The long‑term vision is to build a platform that enables self‑service, automation, and AI‑assisted root‑cause analysis, forming a virtuous “flywheel” for operations.

Architecture Evolution

The system evolves through four stages: platformization → self‑service → intelligence (expert rules + AI) → automation.

Data Collection Layer

Data quality is critical; the design balances collection fidelity with instance stability. A temporary pcap‑based packet capture solution is used before migrating to kernel‑level agents.
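As a rough illustration of what a packet‑capture collector has to do once it holds the bytes of a client request, the sketch below extracts the SQL text from a single MySQL protocol packet. It assumes the classic packet layout (3‑byte little‑endian payload length, 1‑byte sequence id, then a command byte, with 0x03 meaning COM_QUERY) and ignores TLS, compression, query attributes, and packets split across TCP segments. The function name `parse_com_query` is hypothetical; the article does not describe how Meituan's collector is implemented.

```python
COM_QUERY = 0x03  # MySQL command byte for a text-protocol query


def parse_com_query(data: bytes):
    """Return the SQL text of one COM_QUERY packet, or None for anything else.

    Assumes `data` starts at a packet boundary and holds the whole packet.
    """
    if len(data) < 5:
        return None
    # 3-byte little-endian payload length, then a 1-byte sequence id.
    length = int.from_bytes(data[0:3], "little")
    payload = data[4:4 + length]
    if length < 1 or len(payload) < length:
        return None  # empty or truncated packet
    if payload[0] != COM_QUERY:
        return None  # some other command (ping, quit, prepare, ...)
    return payload[1:].decode("utf-8", errors="replace")
```

Parsing outside the database process is what keeps this approach off the query path, which matches the stated goal of protecting instance stability.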

Impact tests on MySQL instances show negligible performance degradation.

Compute & Storage Layer

Design principles include full‑memory computation, raw data reporting with aggressive compression, controlled memory usage, and minimal impact on MySQL instances.

Full‑SQL statements are aggregated per minute using the key RDS_IP_DBName_SQL_Template_ID_Time_Minute.
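A minimal sketch of per‑minute aggregation under that key shape follows. The record fields (`rds_ip`, `db`, `template_id`, `ts`, `latency_ms`), the minute format, and the aggregated statistics are assumptions made for illustration; the article names only the components of the key.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_key(rds_ip, db_name, template_id, ts):
    """Build a key shaped like RDS_IP_DBName_SQL_Template_ID_Time_Minute."""
    # Truncate the epoch timestamp to the minute (UTC, assumed format).
    minute = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H%M")
    return f"{rds_ip}_{db_name}_{template_id}_{minute}"


def aggregate(records):
    """Sum execution count and total latency for every per-minute key."""
    buckets = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for r in records:
        key = minute_key(r["rds_ip"], r["db"], r["template_id"], r["ts"])
        buckets[key]["count"] += 1
        buckets[key]["total_ms"] += r["latency_ms"]
    return dict(buckets)
```

Two executions of the same template within the same minute collapse into one bucket, so storage grows with the number of distinct templates per minute rather than with raw query volume.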

Compression pipeline (message compression → pre‑aggregation → dictionary → minute‑level aggregation) reduces bandwidth and storage.
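The sketch below illustrates three of these ideas in miniature: turning statements into templates, replacing each template with a small dictionary id, pre‑aggregating counts, and compressing the resulting message. It is not the pipeline's actual stage order or encoding. The regex‑based templating, the `TemplateDictionary` class, and the use of JSON plus zlib are all assumptions chosen to keep the example self‑contained.

```python
import json
import re
import zlib


def to_template(sql):
    """Replace quoted strings and standalone numbers with '?' placeholders."""
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+\b", "?", sql)
    return " ".join(sql.split())  # normalize whitespace


class TemplateDictionary:
    """Map each distinct template to a small integer id (hypothetical helper)."""

    def __init__(self):
        self.ids = {}

    def encode(self, template):
        return self.ids.setdefault(template, len(self.ids))


def pre_aggregate(statements, dictionary):
    """Count executions per template id before anything is sent upstream."""
    counts = {}
    for sql in statements:
        tid = dictionary.encode(to_template(sql))
        counts[tid] = counts.get(tid, 0) + 1
    return counts


def pack(counts):
    """Serialize and compress the aggregated counts for transport."""
    return zlib.compress(json.dumps(counts, sort_keys=True).encode("utf-8"))


def unpack(blob):
    """Inverse of pack(); JSON turns integer keys into strings, so restore them."""
    return {int(k): v for k, v in json.loads(zlib.decompress(blob)).items()}
```

Sending ids and counts instead of raw SQL text is where most of the saving comes from in this toy version; general‑purpose compression then works on an already compact payload.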

Delayed messages beyond one minute are placed into a compensation queue (Mafka) to avoid data loss.
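The routing decision can be sketched as follows. An in‑memory `deque` stands in for the Mafka compensation queue, and the `MinuteRouter` class and `event_ts` field are hypothetical names; the only detail taken from the article is the one‑minute cutoff.

```python
from collections import deque

WINDOW_SECONDS = 60  # messages older than this miss the live minute window


class MinuteRouter:
    """Send on-time messages to live aggregation and late ones to compensation."""

    def __init__(self):
        self.live = []
        self.compensation = deque()  # stand-in for a Mafka topic

    def route(self, message, now):
        delay = now - message["event_ts"]
        if delay > WINDOW_SECONDS:
            # Too late for the in-memory window: keep it for a later merge.
            self.compensation.append(message)
            return "compensation"
        self.live.append(message)
        return "live"
```

Keeping late data on a separate path lets the in‑memory window close on schedule while still allowing the stored per‑minute aggregates to be corrected afterwards.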

Analysis & Decision Layer

Root‑cause inference combines expert rules (GRAI methodology) with AI models. The roadmap progresses from rule‑only to AI‑dominant stages.

Rule‑Based Approach – Using the GRAI framework, reliable rules (e.g., master‑slave latency) are extracted through continuous replay and refinement.
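To make the idea of an expert rule concrete, here is a minimal sketch of a replication‑lag rule. The metric names, thresholds, and candidate causes are invented for illustration and are not taken from Meituan's rule set.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Metrics sampled at alert time (all field names are hypothetical)."""
    replication_lag_s: float
    longest_txn_s: float
    write_qps: float
    baseline_write_qps: float


def diagnose_replication_lag(s, lag_threshold_s=10.0):
    """Return a suspected root cause, or None if lag is within tolerance."""
    if s.replication_lag_s < lag_threshold_s:
        return None
    # Check the most specific explanation first.
    if s.longest_txn_s >= 60.0:
        return "large transaction on primary"
    if s.baseline_write_qps > 0 and s.write_qps >= 3 * s.baseline_write_qps:
        return "write burst on primary"
    return "unknown"
```

Replaying past incidents through a function like this and comparing its answer with the confirmed cause is one way such rules can be refined over time.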

AI‑Based Detection – Models are built from historical metrics (pre‑processing, feature extraction, classification), and the trained model is serialized so that Flink can apply it to streaming data.
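The train‑offline, serialize, score‑online pattern can be sketched with a deliberately simple stand‑in model. A z‑score threshold replaces the real classifier, JSON replaces whatever serialization format is actually used, and a plain function call stands in for the Flink operator; none of these choices come from the article.

```python
import json
import statistics


def fit(history):
    """Offline step: summarize historical metric values."""
    return {"mean": statistics.fmean(history), "std": statistics.pstdev(history)}


def save_model(model):
    """Serialize the fitted model so a streaming job can load it."""
    return json.dumps(model)


def load_model(blob):
    return json.loads(blob)


def is_anomaly(model, value, z_threshold=3.0):
    """Online step: classify one incoming point as anomalous or normal."""
    if model["std"] == 0:
        return value != model["mean"]
    return abs(value - model["mean"]) / model["std"] > z_threshold
```

Separating fitting from scoring means the streaming side only needs to load a small artifact and evaluate each point in constant time.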

Evaluation Metrics

A closed‑loop process (alert → replay → root‑cause analysis → improvement → verification) has improved both recall and precision of root‑cause detection.
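For reference, precision and recall of root‑cause detection can be computed as in the sketch below. It assumes each incident has one confirmed root cause and that the system may decline to give a diagnosis; the data shapes are illustrative.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) for root-cause diagnosis.

    predicted: incident id -> diagnosed cause (incidents may be missing)
    actual:    incident id -> confirmed cause from replay or postmortem
    """
    correct = sum(1 for k, v in predicted.items() if actual.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall
```

Precision answers "when the system names a cause, how often is it right", while recall answers "of all real incidents, how many did it diagnose correctly"; the closed loop needs both to move together.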

Key User Cases

Integration with Meituan’s internal IM (DaXiang) enables automatic alert push, root‑cause view, one‑click remediation, and feedback tracking.

Future Outlook

After two years, DAS has solidified core capabilities (pre‑deployment SQL risk detection, index‑change safety checks). Future work focuses on three directions:

Enhance compute and storage capacity to support 3‑5 years of growth.

Roll out autonomous SOP workflows in three stages: (1) link root‑cause diagnosis to SOP documents; (2) platformize SOP management; (3) automate low‑risk actions to achieve full database autonomy.

Build a flexible anomaly replay system for continuous model validation and improvement.
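The staged SOP rollout described above can be pictured as a lookup from diagnosed root cause to an SOP entry, with automation gated on risk. The sketch below is a hypothetical illustration: the `Sop` fields, the example entries, and the document paths are invented, not taken from DAS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sop:
    doc: str     # where the written procedure lives (stage 1)
    risk: str    # "low" or "high", maintained on the platform (stage 2)
    action: str  # name of the automated remediation, if any (stage 3)


# Hypothetical catalogue keyed by root-cause label.
SOPS = {
    "slow query": Sop("sop/slow-query", "low", "throttle_sql_template"),
    "disk full": Sop("sop/disk-full", "high", "expand_volume"),
}


def plan(root_cause):
    """Decide how to respond to a diagnosed root cause."""
    sop = SOPS.get(root_cause)
    if sop is None:
        return ("escalate", None)   # no SOP yet: hand to an engineer
    if sop.risk == "low":
        return ("auto", sop.action)  # only low-risk actions run unattended
    return ("manual", sop.doc)       # high risk: show the document instead
```

Gating on risk keeps a human in the loop for anything destructive while still removing toil from the routine cases.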


