Databases 16 min read

How Meituan Merges Cost‑Based and AI‑Driven Index Recommendation for Billions of Slow Queries

Facing over a hundred million daily slow queries, Meituan’s Database Autonomous Service combines traditional cost‑based index selection with an AI‑powered, data‑driven classifier to automatically generate and validate missing indexes, achieving a 12.16% uplift in index adoption and saving thousands of hours of query execution time.

Architect

Jul 16, 2023

How Meituan Merges Cost‑Based and AI‑Driven Index Recommendation for Billions of Slow Queries

Background

Meituan’s DAS platform processes over 100 million slow queries per day. Manually analyzing each query and creating appropriate indexes is infeasible, prompting an automated solution.

Problems with Pure Cost‑Based Recommendations

Cost‑based advice depends on the optimizer’s cost estimator, which is not perfectly accurate (Leis et al., 2015) and can miss or mis‑select indexes.

Evaluating many candidate indexes requires creating and dropping real indexes. MySQL lacks built‑in fake‑index support, so a custom storage engine (see HypoPG project) would be needed, incurring high development cost.

AI + Data‑Driven Approach

The collaboration with East China Normal University treats index recommendation as a binary classification problem. For each candidate index the model predicts whether adding the index will improve query performance, without invoking the optimizer or fake‑index machinery.

Data Collection and Labeling

Each day the cost‑based module generates candidate indexes and validates them on a sampled replica. Validated indexes become positive examples; the rest become negatives. This yields a large, trustworthy training set for the AI models.

Model Architecture

Three XGBoost classifiers are trained separately for single‑column, two‑column, and three‑column indexes. Each classifier consumes a feature vector built from:

Statement features : one‑hot encoding of the column’s position in SELECT, WHERE, JOIN, ORDER BY, GROUP BY, etc.

Statistical features : table row count, column cardinality, selectivity, and other metrics that indicate the potential benefit of indexing.

Candidate Generation

Columns appearing in aggregation, WHERE, JOIN, ORDER BY, and GROUP BY clauses become single‑column candidates. All pairwise and triple‑wise combinations are formed, then existing indexes are removed following the left‑most prefix rule.

Training Example

Consider a query joining customer and warehouse. Columns c_w_id, c_id, c_d_id, c_last from customer and w_id from warehouse are extracted, yielding four single‑column candidates for customer and one for warehouse. The query uses the index Idx(w_id), so the feature vector for (w_id) is labeled positive; all others are negative. Because the query only uses a three‑column index Idx(c_d_id, c_id, c_last), two‑column candidates are omitted to avoid severe class imbalance.

Prediction and Validation Pipeline

At inference time, the system generates all single‑, double‑, and triple‑column candidates for a slow query, builds their feature vectors, and feeds them to the three classifiers. From each classifier the candidate with the highest probability is selected (one per column‑arity). Before presenting recommendations to users, the selected indexes are created on a sampled replica; only those that demonstrably reduce execution time are forwarded.

Operational Results

In March, the AI‑augmented pipeline contributed an additional 12.16 % of index recommendations that users accepted, improving roughly 52 billion query executions and shaving about 4 632 hours of total runtime.

Future Directions

Planned work includes fine‑tuning open‑source large language models (e.g., Google’s T5) to generate index suggestions directly from query text, and falling back on textual SQL‑tuning advice (e.g., removing unnecessary columns, replacing sub‑queries with joins) when index creation cannot improve performance.

References

Leis V, Gubichev A, Mirchev A, et al. 2015. How good are query optimizers, really? VLDB Endow. 9, 3: 204‑215.

HypoPG project: https://github.com/HypoPG/hypopg

Kossmann J, Halfpap S, Jankrift M, et al. 2020. Magic mirror in my hand… VLDB Endow. 13, 12: 2382‑2395.

Piatetsky‑Shapiro G. 1983. The optimal selection of secondary indices is NP‑complete. SIGMOD Rec. 13, 2: 72‑75.

Zhou X, Liu L, Li W, et al. 2022. AutoIndex: An incremental index management system for dynamic workloads. ICDE: 2196‑2208.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI MySQL Cost-Based Optimization Index Recommendation database autotuning

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

Problems with Pure Cost‑Based Recommendations

AI + Data‑Driven Approach

Data Collection and Labeling

Model Architecture

Candidate Generation

Training Example

Prediction and Validation Pipeline

Operational Results

Future Directions

References

Architect

How this landed with the community

Was this worth your time?

0 Comments

AI + Data‑Driven Approach