Databases 15 min read

AI-Driven and Cost-Based Index Recommendation for Slow Queries at Meituan

This article describes Meituan's collaboration with East China Normal University to improve slow‑query performance by combining traditional cost‑based index recommendation with a novel AI‑plus‑data‑driven approach, detailing the overall architecture, modeling process, experimental results, and future plans for large‑model integration.

DataFunSummit

Apr 13, 2023

AI-Driven and Cost-Based Index Recommendation for Slow Queries at Meituan

1 Background

Meituan generates over a hundred million slow queries daily, making manual index tuning infeasible; the DAS platform already provides cost‑based index suggestions but suffers from inaccurate cost estimates and the lack of fake‑index support in MySQL.

To address these issues, Meituan partnered with East China Normal University to integrate an AI‑plus‑data‑driven index recommendation method alongside the existing cost‑based approach.

2 Index Recommendation Overview

Index recommendation can be performed at the workload level (optimizing a set of queries under storage constraints) or at the query level (suggesting missing indexes for a single slow query).

2.1 Cost‑Based Index Recommendation

Cost‑based methods evaluate the optimizer’s estimated query cost for each candidate index, searching for a set that minimizes overall workload cost; the problem is NP‑hard and typically solved with greedy heuristics, which may yield sub‑optimal solutions.

2.2 AI‑plus‑Data‑Driven Index Recommendation

The AI approach treats index recommendation as a binary classification problem, training XGBoost models (single‑column, two‑column, and three‑column) on labeled data collected from the cost‑based system, allowing predictions without relying on optimizer estimates or fake indexes.

3 Overall Architecture

The system consists of two main components: model training and model deployment.

3.1 Model Training

Training data are collected from daily cost‑based recommendations (queries and validated indexes). Candidate single‑, double‑, and triple‑column indexes are generated, features are engineered, and the data are labeled according to whether the index proved effective.

3.2 Model Deployment

For each incoming slow query, candidate indexes are generated, feature vectors are built, and the corresponding classification models predict which candidates are effective; predicted indexes are created on a sampling database to verify actual performance improvement before being presented to users.

4 Modeling Process

4.1 Candidate Index Generation

Columns appearing in aggregation, WHERE, JOIN, ORDER BY, and GROUP BY clauses become single‑column candidates; permutations produce double‑ and triple‑column candidates, while existing indexes are removed following the left‑most prefix rule.

4.2 Feature Engineering

Features include one‑hot encoded statement positions and statistical metrics such as table row count, cardinality, and selectivity; multi‑column features are constructed by concatenating single‑column features and adding joint cardinality.

4.3 Modeling Example

An example query involving tables customer and warehouse illustrates how five single‑column samples (one positive, four negative) and six triple‑column samples (one positive, five negative) are generated and labeled.

4.4 Model Prediction and Index Evaluation

All candidate indexes are scored by their respective models; the highest‑probability single, double, and triple indexes are selected, then validated on the sampling database to ensure true performance gains before recommendation.

5 Project Operation

In March, the AI model contributed an additional 12.16% of recommended indexes that were adopted by users, improving roughly 5.2 billion query executions and saving about 4,632 hours of execution time.

6 Future Plans

The team intends to fine‑tune open‑source large language models (e.g., Google’s T5) to generate index suggestions directly from slow queries, and to provide textual SQL optimization advice when index creation is insufficient.

7 Authors

Peng Gan, Meituan Basic R&D Platform Engineer, responsible for DAS SQL optimization.

8 Acknowledgments

Special thanks to Prof. Cai Peng of East China Normal University for his contributions; Meituan encourages further collaborations across AI, big data, IoT, and autonomous driving.

9 References

[1] Leis et al., 2015. How good are query optimizers, really? VLDB. [2] https://github.com/HypoPG/hypopg [3] Kossmann et al., 2020. Magic mirror in my hand… VLDB. [4] Piatetsky‑Shapiro, 1983. The optimal selection of secondary indices is NP‑complete. [5] Zhou et al., 2022. AutoIndex: An incremental index management system for dynamic workloads. ICDE.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

machine learning .ai database optimization Meituan cost-based Index Recommendation

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.