
How Meituan Uses AI to Detect Database Anomalies in Real Time

Meituan's database platform team built an AI‑driven anomaly detection service that automatically extracts feature patterns, selects appropriate statistical algorithms, trains models, and performs both offline and online monitoring to quickly locate and mitigate database issues across diverse production scenarios.

dbaplus Community

Background

Database systems are critical to Meituan’s core services, demanding high stability and near‑zero tolerance for anomalies. Traditional static‑threshold alerts rely on expert‑defined rules and cannot adapt to varying workloads, often allowing minor issues to evolve into major outages.

To address these limitations, the team developed an AI‑based detection service that learns from historical performance data, flags emerging risks as they appear, and helps engineers diagnose and mitigate problems early.

Feature Analysis

The first step is to uncover regularities in time‑series metrics collected from production databases. Visual inspection of metric distributions reveals three dominant patterns: periodic, drift, and stationary behavior.

1) Periodic Variation

Many metrics exhibit regular cycles due to daily traffic peaks or scheduled jobs. Detecting these cycles involves three steps: extracting the trend component with a moving average, computing a rolling autocorrelation of the residual series (the original minus the trend), and locating peaks in the autocorrelation to estimate the period T. The original article illustrates this process with a diagram.
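The three steps can be sketched in plain Python. This is a minimal illustration, not Meituan's implementation: the function names, the `trend_window`, `max_lag`, and `min_corr` parameters, and the synthetic series are all assumptions, and a plain sample autocorrelation over a range of lags stands in for the rolling autocorrelation mentioned in the article.

```python
import math


def moving_average(xs, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out


def autocorrelation(xs, lag):
    """Sample autocorrelation of xs at one lag (0.0 for a constant series)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var


def estimate_period(xs, trend_window, max_lag, min_corr=0.5):
    """Return the lag of the strongest autocorrelation peak, or None."""
    trend = moving_average(xs, trend_window)
    residual = [x - t for x, t in zip(xs, trend)]
    acf = [autocorrelation(residual, lag) for lag in range(max_lag + 1)]
    best = None
    for lag in range(2, max_lag):
        # A peak is a lag whose correlation beats both neighbours.
        is_peak = acf[lag] > acf[lag - 1] and acf[lag] >= acf[lag + 1]
        if is_peak and acf[lag] >= min_corr:
            if best is None or acf[lag] > acf[best]:
                best = lag
    return best


# Synthetic hourly metric (assumed data): slow upward trend plus a 24-sample cycle.
series = [10 + 0.01 * i + 5 * math.sin(2 * math.pi * i / 24) for i in range(240)]
period = estimate_period(series, trend_window=49, max_lag=60)
```

On this synthetic series the estimate should land at or next to 24 samples. The `min_corr` floor is what keeps a flat or noisy series from being labelled periodic; its value here is illustrative.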

2) Drift Variation

A series that shows a gradual shift in its mean or an abrupt global change is considered to be drifting. The team applies a median‑filter‑based drift detector. It first computes a sliding median to obtain a smoothed trend. It then either checks whether the smoothed sequence is strictly monotonic, which indicates a long‑term trend, or applies two rules on the windows to the left and right of a point to spot sudden upward or downward jumps.
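A rough sketch of such a detector follows. The article does not spell out the two left/right window rules, so the jump rule below (the medians of the two windows differ by at least `min_shift`, with the sign giving the direction) is a hypothetical stand-in, as are the function names and parameters.

```python
from statistics import median


def sliding_median(xs, window):
    """Median filter; the window shrinks near the edges."""
    half = window // 2
    return [median(xs[max(0, i - half): i + half + 1]) for i in range(len(xs))]


def is_strictly_monotonic(xs):
    """True if every step goes up, or every step goes down."""
    increasing = all(b > a for a, b in zip(xs, xs[1:]))
    decreasing = all(b < a for a, b in zip(xs, xs[1:]))
    return increasing or decreasing


def find_jump(xs, window, min_shift):
    """Hypothetical rule: largest left/right median gap of at least min_shift."""
    best = None
    for i in range(window, len(xs) - window + 1):
        shift = median(xs[i:i + window]) - median(xs[i - window:i])
        if abs(shift) >= min_shift and (best is None or abs(shift) > abs(best[1])):
            best = (i, shift)
    return best  # (index, signed shift) or None


def classify_drift(xs, window, min_shift):
    """Return 'trend', 'jump_up', 'jump_down', or 'none'."""
    smoothed = sliding_median(xs, window)
    if is_strictly_monotonic(smoothed):
        return "trend"
    jump = find_jump(smoothed, window, min_shift)
    if jump is not None:
        return "jump_up" if jump[1] > 0 else "jump_down"
    return "none"


ramp = [float(i) for i in range(50)]   # steady growth
step = [10.0] * 30 + [20.0] * 30       # sudden level change
flat = [10.0] * 60                     # no drift
labels = [classify_drift(s, window=5, min_shift=5.0) for s in (ramp, step, flat)]
```

Using medians rather than means keeps isolated spikes from being mistaken for a level shift, which is presumably why a median filter was chosen.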

3) Stationary Variation

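Stationarity is tested with the Augmented Dickey‑Fuller (ADF) test, whose null hypothesis is that the series has a unit root and is therefore non‑stationary. If the p‑value on a recent 1‑day or 7‑day window falls below 0.05, the null hypothesis is rejected, and the series is deemed stationary and can be modeled directly.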
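To show the idea without external dependencies, the sketch below implements a simplified Dickey‑Fuller test rather than the full ADF test: it uses no lagged difference terms, and instead of computing a p‑value it compares the test statistic with -2.86, the asymptotic 5 % critical value for the constant‑only regression. The function names and synthetic series are assumptions; in practice a statistics library's ADF routine would normally be used.

```python
import math
import random

# Asymptotic 5% critical value for the Dickey-Fuller regression with a constant.
DF_CRITICAL_5PCT = -2.86


def dickey_fuller_stat(ys):
    """t-statistic of rho in the regression: y[t] - y[t-1] = alpha + rho * y[t-1] + e."""
    x = ys[:-1]
    d = [b - a for a, b in zip(ys, ys[1:])]
    n = len(x)
    mean_x, mean_d = sum(x) / n, sum(d) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxd = sum((xi - mean_x) * (di - mean_d) for xi, di in zip(x, d))
    rho = sxd / sxx
    alpha = mean_d - rho * mean_x
    sse = sum((di - alpha - rho * xi) ** 2 for xi, di in zip(x, d))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return rho / std_err


def looks_stationary(ys):
    """Reject the unit-root null at roughly the 5% level."""
    return dickey_fuller_stat(ys) < DF_CRITICAL_5PCT


rng = random.Random(42)
noise = [rng.gauss(0, 1) for _ in range(500)]                    # stationary
trending = [0.1 * i + rng.gauss(0, 0.1) for i in range(500)]     # mean keeps growing
results = (looks_stationary(noise), looks_stationary(trending))
```

A very negative statistic means the series is pulled back toward its mean quickly, which is the behaviour the test treats as evidence of stationarity.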

Algorithm Selection

Based on the observed distribution shape, three statistical methods are chosen:

Low‑skew, high‑symmetry: Median Absolute Deviation (MAD)

Moderate skew: Boxplot

High skew: Extreme Value Theory (EVT)

The original article shows the selection flow as a diagram.
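The mapping from distribution shape to method could be coded along the following lines. The article does not give numeric boundaries, so the skewness cut‑offs `low=0.5` and `high=1.5` are purely illustrative assumptions, as is the use of the moment‑based skewness coefficient as the shape measure.

```python
def skewness(xs):
    """Moment-based sample skewness (0.0 for a constant series)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def choose_detector(xs, low=0.5, high=1.5):
    """Pick a method by absolute skewness; the cut-offs are hypothetical."""
    s = abs(skewness(xs))
    if s < low:
        return "MAD"
    if s < high:
        return "boxplot"
    return "EVT"


symmetric = [1, 2, 3, 4, 5]
moderate = [1, 1, 2, 2, 3, 3, 4, 5, 6, 8]
heavy_tail = [1] * 20 + [100]
choices = [choose_detector(s) for s in (symmetric, moderate, heavy_tail)]
```

The ordering reflects how much each method relies on symmetry: MAD assumes the most, EVT the least.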

Model Training & Real‑Time Detection

Data Flow

Real‑time detection runs on Apache Flink, consuming messages from Meituan’s internal Mafka queue, writing results to Elasticsearch, and generating anomaly records.

Offline training uses Squirrel (a KV store) as a task queue, reads training data from the MOD data warehouse, applies configuration parameters, trains models, stores them in Elasticsearch, and supports both scheduled and manual triggers.

Anomaly Detection Process

The detection pipeline follows a divide‑and‑conquer approach: offline, historical data are pre‑processed, classified (drift, stationary, periodic), and modeled with the chosen algorithm; online, the trained model is loaded to evaluate incoming metrics in real time.
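The offline/online split can be pictured with a small sketch. Everything here is an assumption made for illustration: the model is reduced to a pair of bounds, a JSON string stands in for the stored model, and `fit_bounds` is a placeholder for whichever statistical method was selected.

```python
import json


def train_offline(history, fit_bounds, pattern):
    """Offline step: fit bounds on historical data and serialize the model."""
    lower, upper = fit_bounds(history)
    return json.dumps({"pattern": pattern, "lower": lower, "upper": upper})


class OnlineDetector:
    """Online step: load a trained model and score incoming points."""

    def __init__(self, serialized_model):
        model = json.loads(serialized_model)
        self.pattern = model["pattern"]
        self.lower = model["lower"]
        self.upper = model["upper"]

    def is_anomaly(self, value):
        return value < self.lower or value > self.upper


# Placeholder fit function (hypothetical): historical range with a margin of 1.
stored = train_offline([1, 2, 3, 4, 5], lambda h: (min(h) - 1, max(h) + 1), "stationary")
detector = OnlineDetector(stored)
flags = [detector.is_anomaly(v) for v in (3, 10)]
```

Keeping the expensive fitting offline means the online path only performs a lookup and a comparison per data point.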

Product Operation

The service is integrated into Horae, Meituan’s extensible time‑series anomaly platform, enabling a closed loop of detection, case storage, analysis, optimization, evaluation, and deployment. Current performance metrics are:

Precision: 81 % (randomly sampled anomalies manually verified)

Recall: 82 % (based on known fault cases)

F1‑score: 81 %
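As a quick consistency check, the F1‑score is the harmonic mean of precision P and recall R. Plugging in the reported values:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.81 \times 0.82}{0.81 + 0.82}
    = \frac{1.3284}{1.63}
    \approx 0.815
```

This is consistent with the reported 81 %, allowing for rounding of the published figures.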

Future Outlook

Introduce anomaly type classification (mean shift, variance change, spikes) to support subscription‑based alerts and downstream diagnosis.

Build a Human‑in‑the‑Loop feedback system for continuous model improvement.

Extend support to additional database scenarios such as end‑to‑end error reporting and node‑level network monitoring.

Appendix

1) Median Absolute Deviation (MAD)

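MAD measures dispersion robustly: MAD = median(|x_i - median(x)|). For normally distributed data, a scaling factor C≈1.4826 is applied so that C × MAD estimates the standard deviation. A point is flagged when it lies more than k × C × MAD from the median, with k≈3 a typical threshold.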
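A minimal sketch of the resulting bounds, using the constants quoted above; the function name and sample data are illustrative.

```python
from statistics import median


def mad_bounds(xs, k=3.0, c=1.4826):
    """Return (lower, upper) bounds: median -/+ k * c * MAD."""
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    spread = c * mad
    return med - k * spread, med + k * spread


samples = [10, 11, 9, 10, 12, 10, 9, 11, 10, 50]
lower, upper = mad_bounds(samples)
outliers = [x for x in samples if x < lower or x > upper]
```

Because both the centre and the spread are medians, the single extreme value in `samples` barely moves the bounds, whereas it would inflate a mean and standard deviation.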

2) Boxplot

A boxplot visualizes five summary statistics (min, Q1, median, Q3, max). With IQR = Q3 - Q1, it defines outliers as points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
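The fences can be computed with the standard library, as in this short sketch. The choice of the "inclusive" quartile method is an assumption; other quartile conventions give slightly different fences.

```python
from statistics import quantiles


def boxplot_fences(xs, whisker=1.5):
    """Return (lower, upper) fences: Q1 - w * IQR and Q3 + w * IQR."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
lower, upper = boxplot_fences(samples)
outliers = [x for x in samples if x < lower or x > upper]
```

Unlike the MAD bounds, the two fences are measured from different quartiles, so they need not be symmetric around the median, which is what makes the method usable on moderately skewed data.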

3) Extreme Value Theory (EVT)

EVT models the tail of a distribution without assuming a specific underlying form. The excesses over a high initial threshold are fitted with a Generalized Pareto Distribution (GPD), whose parameters are estimated via maximum likelihood, and the anomaly threshold is then derived from a chosen risk level q.
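The sketch below illustrates the peaks‑over‑threshold idea under stated assumptions. To stay dependency‑free it estimates the GPD parameters by the method of moments instead of maximum likelihood, so it is a simplification of what the article describes. The initial quantile of 0.98, the default q, the function name, and the synthetic data are all illustrative choices.

```python
import math
import random
from statistics import mean, pvariance


def pot_threshold(xs, q=1e-3, init_quantile=0.98):
    """Return (t, z_q): the initial threshold and the extreme threshold.

    z_q is chosen so that, under the fitted GPD tail, P(X > z_q) is about q.
    """
    data = sorted(xs)
    n = len(data)
    t = data[min(n - 1, int(init_quantile * n))]
    excesses = [x - t for x in data if x > t]
    if len(excesses) < 2:
        raise ValueError("not enough excesses over the initial threshold")
    m, v = mean(excesses), pvariance(excesses)
    if v == 0:
        raise ValueError("excesses have zero variance")
    # Method-of-moments estimates (a stand-in for maximum likelihood).
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)
    scale = 0.5 * m * (ratio + 1.0)
    r = q * n / len(excesses)
    if abs(shape) < 1e-9:
        z = t - scale * math.log(r)  # exponential-tail limit
    else:
        z = t + (scale / shape) * (r ** (-shape) - 1.0)
    return t, z


rng = random.Random(0)
latencies = [rng.expovariate(1.0) for _ in range(5000)]  # synthetic data
t, z_q = pot_threshold(latencies, q=1e-3)
```

Only the points above `t` influence the fit, which is why the approach copes with heavily skewed metrics: the bulk of the distribution is never modelled.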

