Optimizing Massive MySQL Tables: Index Tuning, Online DDL & Batch Deletion
Facing SLA alerts due to slow queries on a billion‑row MySQL table, the article analyzes slow‑query logs, reveals index misuse, and demonstrates how removing a poor composite index, adding targeted indexes, employing online DDL or pt‑osc, and using batch deletions dramatically reduce query times and maintenance overhead.
Background
In a primary‑secondary MySQL instance, SLA alerts were triggered nightly because a large arrival_record table (over 100 million rows) generated slow queries that delayed replication.
Analysis
pt‑query‑digest was used to examine the mysql‑slow.log for the past week.
pt-query-digest --since=148h mysql-slow.log | lessResult part 1 showed a total slow‑query execution time of 25 403 s, the longest query lasting 266 s, an average of 5 s per query, and about 17.6 million rows scanned on average.
Result part 2 displayed the most frequent slow query:
select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0This query scanned roughly 32 million rows, averaging 5 s per execution, with a maximum execution time of 266 s.
Execution Plan
explain select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0\GThe plan used the composite index IXFK_arrival_record, but only the first column ( product_id) was effective; its low selectivity caused the optimizer to scan over 32 million rows.
According to the left‑most prefix rule, the query could not benefit from the remaining indexed columns ( station_no, sequence, receive_time).
Recommended optimizations:
Drop the composite index IXFK_arrival_record.
Create a new composite index
idx_sequence_station_no_product_id(product_id,sequence,station_no).
Create a single‑column index idx_receive_time(receive_time) because receive_time has high cardinality.
Delete Statement
The delete operation
delete from arrival_record where receive_time < STR_TO_DATE('2019-02-23','%Y-%m-%d')scanned 109 million rows and took about 262 s, as no suitable index existed.
explain select * from arrival_record where receive_time < STR_TO_DATE('2019-02-23','%Y-%m-%d')\GThe plan shows a full table scan (type=ALL) with no possible keys.
Testing and Backup
Table statistics:
SELECT COUNT(*) FROM cq_new_cimiss.arrival_record; -- 112,294,946 rows
-- Disk usage: 48 GB (InnoDB size 31 GB, ~17 GB fragmentation)A parallel backup was performed with mydumper (32 threads), producing a 1.2 GB compressed dump in 52 s.
mydumper -u root -p xxxx -S /datas/mysql/data/3316/mysqld.sock -B cq_new_cimiss -T arrival_record -o /datas/dump_arrival_record -t 32 -r 2000000Online DDL vs pt‑osc
Two approaches were used to rebuild indexes on the test instance:
Online DDL : executed an ALTER TABLE that dropped the foreign key, removed the old composite index, added the new indexes, and later re‑added the foreign key. Total time ≈ 34 min.
pt‑osc : performed the same changes using the Percona Toolkit's pt‑osc. Total time ≈ 57 min.
# Example online DDL
ALTER TABLE cq_new_cimiss.arrival_record DROP FOREIGN KEY FK_arrival_record_product,
DROP INDEX IXFK_arrival_record,
ADD INDEX idx_product_id_sequence_station_no(product_id,sequence,station_no),
ADD INDEX idx_receive_time(receive_time);
# pt‑osc command omitted for brevityPost‑Optimization Results
After adding idx_receive_time, both the select and delete queries use the new index, dramatically reducing the number of scanned rows.
explain select count(*) from arrival_record where receive_time < STR_TO_DATE('2019-03-10','%Y-%m-%d')\G
-- rows: 7,540,948 (type=range, key=idx_receive_time)
explain select count(*) from arrival_record where product_id=26 and receive_time between '2019-03-25 14:00:00' and '2019-03-25 15:00:00' and receive_spend_ms>=0\G
-- rows: 291,448 (type=range, key=idx_receive_time)Even with the new index, the delete of 300 million rows still required about 77 s.
Batch Deletion Strategy
To avoid long pauses and replication lag, records are deleted in small batches using the primary key range:
# Determine the maximum id to delete
SELECT MAX(id) INTO @need_delete_max_id FROM arrival_record WHERE receive_time < '2019-03-01';
# Delete in batches of 20 000 rows
DELETE FROM arrival_record WHERE id < @need_delete_max_id LIMIT 20000;
SELECT ROW_COUNT(); -- returns number of rows deletedThe loop repeats until ROW_COUNT() returns 0, ensuring the table is cleaned without overwhelming the primary instance.
Conclusion
Large tables need carefully designed indexes; low‑selectivity composite indexes should be avoided.
Online DDL or pt‑osc can safely rebuild indexes with minimal downtime.
Batch deletions reduce impact on the primary server and keep replication lag low.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
