Databases 10 min read

How Machine Learning Can Automate MySQL Index Optimization

This article explains how applying machine learning to database operations—specifically AIOps for MySQL—can automate index recommendation by parsing SQL, extracting semantic and statistical features, generating candidate index combinations, and training an XGBoost model to predict optimal indexes, reducing reliance on manual DBA work.

Efficient Ops
Efficient Ops
Efficient Ops
How Machine Learning Can Automate MySQL Index Optimization

1. Background

In the DBA community, it is often said that mastering indexes solves 80% of database problems, highlighting the importance of indexes for relational databases. Traditional DBA work focuses on SQL optimization, which mainly involves index tuning and query rewriting. When query volumes reach billions of slow queries per day, manual tuning becomes impractical.

Rule‑based and cost‑based automatic index recommendation tools have emerged, but they struggle with diverse workloads, varying developer skill levels, and sometimes inaccurate statistics. To address these limitations, the project introduces machine learning into the traditionally manual DBA domain, known as AIOps.

2. Principles and Key Process

SQL optimization traditionally follows two paths: (1) analyzing fields used in queries, conditions, joins, and aggregations to decide which indexes to create; (2) rewriting queries or using hints to guide the optimizer. This project focuses on the first path—index recommendation.

Key points for index recommendation include identifying which fields are used for retrieval, which fields appear in multi‑table joins, and the statistical distribution of tables and columns.

2.1 Syntax Parsing & Semantic Feature Extraction

The project uses the Python

sqlparse

library to parse SQL statements and extract semantic features such as WHERE, JOIN, aggregation functions, ORDER BY, GROUP BY, and common equality or range predicates.

2.2 Statistical Feature Extraction

Beyond semantic features, the system extracts table and column statistics from the database metadata, such as total row count, column type, null count, distinct value count, and selectivity. For MySQL, statistics are stored in

mysql.innodb_index_stats

and

mysql.innodb_table_stats

, with histograms available in

information_schema.column_statistics

for MySQL 8.0.

2.3 Index Combination Generation

After extracting features, the system generates candidate index combinations, including single‑column, two‑column, and three‑column indexes (excluding >3 columns). Each SQL can use only one index per table, so all relevant fields are combined to produce candidate sets.

2.4 Feature Engineering & Modeling

Features for each candidate index consist of semantic features and statistical features. For multi‑column indexes, features of each column are concatenated in order, preserving column sequence. Positive samples are indexes actually used by a query; all others are negative.

2.5 Model Training and Prediction

The prepared training set undergoes preprocessing (missing value handling, zero‑row removal, type normalization). Separate XGBoost binary classification models are trained for single‑column and multi‑column indexes. Trained models are saved with

joblib

and loaded online for real‑time index recommendation. Feedback from deployments can be fed back to retrain and improve the model.

3. Future Plans

The current implementation supports only MySQL; future work includes extending to other relational databases such as openGauss.

With the rise of large language models, the project also plans to explore fine‑tuning open‑source LLMs for index recommendation to further enhance user experience.

machine learningFeature EngineeringSQLMySQLIndex OptimizationAIOps
Efficient Ops
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.