
Meituan Database Fault Detection, Diagnosis, and Kernel Observability Practices

Meituan’s Database Autonomy Platform integrates a four‑layer architecture—data collection agents, Kafka/Flink processing, storage back‑ends, and user interfaces—to continuously gather MySQL metrics, apply dynamic statistical thresholds (MAD, boxplot, EVT) for anomaly detection, and diagnose issues through kernel code‑path tracing, transaction instrumentation, and core‑dump analysis for rapid fault recovery.

Meituan Technology Team

1. Background & Goal

MySQL failures and SQL performance are daily concerns for DBAs and developers. Meituan’s Database Autonomy Platform provides end‑to‑end capabilities for anomaly detection, diagnosis, and recovery across multiple scenarios.

2. Platform Evolution Strategy

The platform consists of four layers:

Data Collection Layer: an rds‑agent runs on each instance to collect key metrics and SQL text (a sketch of the kind of sample such an agent might emit follows this list).

Data Computation & Storage Layer: Kafka buffers the collected stream, Flink/Spark jobs process it, and results are stored in ES, Blade, Hive, etc.

Platform Function Layer: serves both DBA and developer users with modules for observability, anomaly detection, root‑cause analysis, fault handling, SQL optimization, workload‑based index advice, and SQL lifecycle governance.

Interface & Presentation Layer: features are exposed via a portal and OpenAPI.
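
To make the collection layer concrete, here is a minimal sketch of the kind of per‑instance sample an rds‑agent could publish toward Kafka. The struct name and fields are illustrative assumptions based on the metrics named in this article, not the platform's actual wire format.

#include <cstdint>
#include <string>

// Hypothetical shape of one collection-layer sample; the fields mirror the
// MySQL status variables discussed later in this article.
struct MetricSample {
    std::string instance_id;           // which MySQL instance the rds-agent scraped
    int64_t     collected_at;          // Unix timestamp of the scrape
    int64_t     seconds_behind_master; // replication lag reported by the instance
    int64_t     slow_queries;          // cumulative slow-query counter
    int64_t     threads_connected;     // current Threads_connected gauge
    std::string sample_sql;            // optional SQL text captured for diagnosis
};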

3. Anomaly Detection

Key metrics (e.g., seconds_behind_master, slow_queries, Threads_connected) are monitored. Static thresholds are simple but cause many false alarms; dynamic thresholds are built per‑scenario using statistical models.

3.1 Data Distribution & Algorithm Choice

Three typical time‑series distributions are handled with different algorithms (a sketch of the corresponding threshold rules follows this list):

Low‑skew symmetric distribution → Median Absolute Deviation (MAD)

Moderate skew → Boxplot

High skew → Extreme Value Theory (EVT)
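
As a rough illustration of how the first two rules translate into alert thresholds, here is a minimal sketch of MAD and boxplot upper bounds. The multipliers (3 for MAD, 1.5 for the IQR) are common textbook defaults rather than values confirmed by the article, and EVT is omitted because fitting a tail distribution does not reduce to a few lines.

#include <algorithm>
#include <cmath>
#include <vector>

// Median of a copy of the window (fine for an offline sketch; assumes non-empty input).
static double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

// MAD bound for low-skew, roughly symmetric series: flag points above
// median + k * 1.4826 * MAD (k = 3 is a common default).
double mad_upper_bound(const std::vector<double>& x, double k = 3.0) {
    double med = median(x);
    std::vector<double> dev;
    dev.reserve(x.size());
    for (double v : x) dev.push_back(std::fabs(v - med));
    return med + k * 1.4826 * median(dev);
}

// Boxplot bound for moderately skewed series: Q3 + 1.5 * IQR
// (rough quartile positions, good enough for a sketch).
double boxplot_upper_bound(std::vector<double> x) {
    std::sort(x.begin(), x.end());
    double q1 = x[x.size() / 4];
    double q3 = x[(x.size() * 3) / 4];
    return q3 + 1.5 * (q3 - q1);
}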

3.2 Model Selection

After drift detection, the series is segmented. If the resulting segment is stationary, boxplot or MAD is applied to it directly; if it is periodic, each time bucket is modeled separately. The final algorithm is then chosen by the segment's skewness, following the mapping in Section 3.1 (see the sketch below).
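
A schematic version of that last routing step might look like the following; the skewness cutoffs (1.0 and 3.0) are placeholders to show the shape of the decision, not the platform's tuned values.

#include <cmath>
#include <vector>

enum class Detector { MAD, Boxplot, EVT };

// Sample skewness of a (stationary or per-bucket) segment.
double skewness(const std::vector<double>& x) {
    double n = static_cast<double>(x.size()), mean = 0, m2 = 0, m3 = 0;
    for (double v : x) mean += v / n;
    for (double v : x) { double d = v - mean; m2 += d * d / n; m3 += d * d * d / n; }
    return m3 / std::pow(m2, 1.5);
}

// Placeholder cutoffs: low skew -> MAD, moderate -> boxplot, heavy tail -> EVT.
Detector choose_detector(const std::vector<double>& segment) {
    double s = std::fabs(skewness(segment));
    if (s < 1.0) return Detector::MAD;
    if (s < 3.0) return Detector::Boxplot;
    return Detector::EVT;
}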

3.3 Offline Training & Online Inference

Models are trained offline on historical data, stored in a database, and loaded at runtime by Flink to generate real‑time alerts.
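
A minimal sketch of that hand‑off, under the assumption that a trained model can be reduced to an upper bound per metric and time bucket: the record below stands in for whatever is persisted offline, and the check is what a streaming job would apply to each incoming point. The layout is illustrative; the article does not describe the actual schema or Flink job.

#include <map>
#include <string>

// Hypothetical persisted form of one trained model: an alert threshold per
// (metric, hour-of-day) bucket, produced offline and loaded by the stream job.
struct ThresholdModel {
    std::string metric;                  // e.g. "seconds_behind_master"
    std::map<int, double> upper_by_hour; // hour bucket -> learned upper bound
};

// Online inference step: look up the bucket for the point's timestamp and
// compare; anything above the learned bound becomes a real-time alert.
bool is_anomalous(const ThresholdModel& m, int hour_of_day, double value) {
    auto it = m.upper_by_hour.find(hour_of_day);
    return it != m.upper_by_hour.end() && value > it->second;
}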

4. Anomaly Diagnosis

Root‑cause analysis is performed by tracing kernel code paths, logs, and core dumps.

4.1 Master‑Slave Lag Diagnosis (Kernel Code Path)

The lag value is computed as:

seconds_behind_master = (long)(time(0) - mi->rli->last_master_timestamp) - mi->clock_diff_with_master

Key variables: last_master_timestamp comes from rli->gaq->head_queue()->ts, which is derived from common_header->when.tv_sec + exec_time. clock_diff_with_master reflects the time difference between master and slave.
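
As a worked example with hypothetical numbers: if the event header carries when.tv_sec = 1000 and exec_time = 20, then last_master_timestamp is 1020; if the slave clock reads 1045 when the status is computed and clock_diff_with_master is 5 seconds, the reported lag is (1045 - 1020) - 5 = 20 seconds.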

Factors such as slave_checkpoint_period and sql_delay can enlarge the lag if they delay timestamp updates.

4.1.1 Kernel Code Path Example

bool mts_checkpoint_routine(Relay_log_info *rli, ulonglong period, bool force, bool need_data_lock) {
    do {
        // Advance the checkpoint: pop the groups already completed by workers off the GAQ head.
        cnt = rli->gaq->move_queue_head(&rli->workers);
    } while (0);
    // The ts of the group now at the GAQ head (the oldest group still pending)
    // becomes the new last_master_timestamp; an empty queue resets it to 0.
    ts = rli->gaq->empty() ? 0 : reinterpret_cast<Slave_job_group*>(rli->gaq->head_queue())->ts;
    rli->reset_notified_checkpoint(cnt, ts, need_data_lock, true);
    // ...
}

Slave_worker *Log_event::get_slave_worker(Relay_log_info *rli) {
    if (ends_group() || (!rli->curr_group_seen_begin && (get_type_code() == binary_log::QUERY_EVENT || !rli->curr_group_seen_gtid))) {
        // ...
        // Group timestamp = event header timestamp on the master plus the event's execution time.
        ptr_group->ts = common_header->when.tv_sec + (time_t)exec_time; // seconds_behind_master related
        // ...
    }
}

4.2 Large‑Transaction Diagnosis (Kernel Enhancements)

Challenges:

The full list of SQL statements belonging to one transaction is not readily available.

The split between SQL execution time and the time the session waits outside SQL is unclear.

Solutions:

Assign a unique trx_id at transaction start and attach it to every SQL statement in the transaction.

Instrument per-statement start/end timestamps to separate execution time from Sleep time, enabling a precise breakdown of large-transaction latency (a sketch of this instrumentation follows).
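
The sketch below illustrates the shape of that instrumentation rather than Meituan's actual kernel patch: a per-statement record carries the assigned trx_id plus start/end timestamps, so the total span of a large transaction can later be split into SQL execution time and the idle (Sleep) gaps between statements. All names here are hypothetical.

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-statement trace: the trx_id assigned when the transaction
// began is attached to every SQL, and timestamps bracket each statement.
struct StmtTrace {
    uint64_t    trx_id;    // unique id assigned at transaction start
    std::string sql;       // statement text
    int64_t     start_us;  // statement start, microseconds
    int64_t     end_us;    // statement end, microseconds
};

// Breakdown of one large transaction: total span vs. time actually spent
// executing SQL; the remainder is time the session sat idle in the transaction.
struct TrxBreakdown { int64_t total_us; int64_t exec_us; int64_t idle_us; };

TrxBreakdown summarize(const std::vector<StmtTrace>& stmts) {
    TrxBreakdown b{0, 0, 0};
    if (stmts.empty()) return b;
    b.total_us = stmts.back().end_us - stmts.front().start_us;
    for (const auto& s : stmts) b.exec_us += s.end_us - s.start_us;
    b.idle_us = b.total_us - b.exec_us;
    return b;
}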

4.3 MySQL Crash Analysis (Core Dump)

Crashes are triggered either by MySQL aborting itself (e.g., on data corruption or a latch timeout) or by the OS. Typical signals:

Signal 6 (SIGABRT) – self-abort, e.g., disk full/read-only or data corruption.

Signal 7 (SIGBUS) – hardware memory error.

Signal 9 (SIGKILL) – process killed; cannot be caught by the process.

Signal 11 (SIGSEGV) – usually a MySQL bug.

Core dump analysis extracts the failing thread’s stack, the SQL in THD::m_query_string, and relevant kernel state.
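
For orientation, the snippet below maps those signals to names and shows the kind of minimal fatal-signal hook a server process can install to leave a marker before the core file is written. It is a simplified, hypothetical stand-in for MySQL's own fatal-signal handling, not a copy of it; SIGKILL (9) cannot be hooked at all.

#include <csignal>
#include <cstddef>
#include <unistd.h>

// Async-signal-safe marker written just before the process dies and the kernel
// produces the core dump that is analyzed afterwards.
static void fatal_signal_hook(int sig) {
    const char* msg =
        (sig == SIGABRT) ? "fatal: SIGABRT (6), self-abort\n" :
        (sig == SIGBUS)  ? "fatal: SIGBUS (7), hardware memory error\n" :
        (sig == SIGSEGV) ? "fatal: SIGSEGV (11), invalid memory access\n" :
                           "fatal: unexpected signal\n";
    // Only async-signal-safe calls belong here; write(2) is safe, printf is not.
    size_t len = 0;
    while (msg[len] != '\0') ++len;
    (void)write(STDERR_FILENO, msg, len);
    // Restore the default action and re-raise so the core dump is still produced.
    std::signal(sig, SIG_DFL);
    std::raise(sig);
}

void install_fatal_signal_hooks() {
    std::signal(SIGABRT, fatal_signal_hook);
    std::signal(SIGBUS,  fatal_signal_hook);
    std::signal(SIGSEGV, fatal_signal_hook);
}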

5. Author & References

Author: Yu‑Feng, Meituan Basic R&D Platform – Database Autonomy Team.

References include GitHub links to Percona Server source, related research papers, and tooling.

Tags: anomaly detection, MySQL, fault detection, root cause analysis, Database Operations, Kernel Observability
Written by Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
