Alibaba Interview: Key Considerations for Indexing Tens‑Millions‑Row Tables
The article explains how to safely add indexes to a tens‑of‑millions‑row MySQL table, covering lock duration, disk‑space impact, write‑performance degradation, and six practical principles—selective columns, proper order, covering indexes, avoiding redundancy, short‑lock tools, and ongoing monitoring—plus pros, cons, and suitable use cases.
Introduction
During an interview a candidate was asked how to safely and efficiently add indexes to a tens‑of‑millions‑row order table. The interviewer expects knowledge beyond the basic ALTER TABLE ADD INDEX syntax.
Key Risks on Large Tables
Long table‑lock time : Even with MySQL 5.6+ online DDL there is a brief exclusive metadata lock and the operation consumes significant I/O and CPU.
Disk‑space explosion : Adding an index creates a shadow copy of the table, which can require dozens of gigabytes of temporary space.
Write‑performance drop : Every INSERT, UPDATE and DELETE must also modify the new index, which is especially harmful on large tables.
Six Practical Principles
“列要精,序要对,成覆盖,忌重复,锁要短,频监控”
1. Column selectivity (“列要精”)
Choose columns with high selectivity ( COUNT(DISTINCT col) / COUNT(*) close to 1). High‑cardinality columns such as user_id or order_id are good candidates; low‑cardinality columns like status are not.
Wrong example :
ALTER TABLE orders ADD INDEX idx_status (status);On a tens‑of‑millions table the condition status='PAID' may match millions of rows and be slower than a full table scan.
Correct approach : Put low‑selectivity columns at the end of a composite index, e.g.
ALTER TABLE orders ADD INDEX idx_user_status (user_id, status);2. Index column order (“序要对”)
Follow the left‑most prefix rule. An index on (create_time, user_id) can be used for queries that filter by create_time first, but a query that only filters by user_id cannot use it.
ALTER TABLE orders ADD INDEX idx_createtime_user (create_time, user_id);Range columns ( > / <) should be placed after equality columns.
-- Good: INDEX (user_id, create_time)
-- Bad: INDEX (create_time, user_id)3. Covering index (“成覆盖”)
A covering index contains all columns required by the query, eliminating the need to read the clustered table.
SELECT user_id, status FROM orders WHERE user_id = 123;Creating idx_user_status (user_id, status) makes the query use the index only, showing Using index in the EXPLAIN output and improving performance several‑fold.
4. Avoid duplicate/redundant indexes (“忌重复”)
Duplicate indexes have identical column lists; redundant indexes are prefixes of another index. Use sys.schema_redundant_indexes to detect them and drop the unnecessary one.
SELECT * FROM sys.schema_redundant_indexes
WHERE table_schema='your_db' AND table_name='orders';5. Keep locks short (“锁要短”)
Prefer online tools such as pt-online-schema-change or gh-ost that create a shadow table, apply the index, and swap tables with minimal locking.
pt-online-schema-change --alter "ADD INDEX idx_user (user_id)" \
D=test,t=orders --executeAdvantages: non‑blocking reads/writes; disadvantages: extra disk space and longer execution time (hours for tens of millions of rows).
6. Monitor index usage (“频监控”)
Periodically query performance_schema.table_io_waits_summary_by_index_usage to see COUNT_READ. Indexes with zero reads for a long period can be removed. Run OPTIMIZE TABLE or rebuild indexes to eliminate fragmentation.
SELECT * FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE object_schema='your_db' AND object_name='orders';Pros, Cons and Suitable Scenarios
Pros: dramatically faster queries, covering indexes reduce I/O, complex queries can drop from minutes to milliseconds.
Cons: extra disk space (30‑100% of data size), slower writes, higher operational risk.
Best for read‑heavy workloads with stable, high‑frequency filter columns; not suitable for write‑heavy log tables or tiny tables.
Conclusion
When asked about indexing large tables, answer by evaluating necessity, selecting high‑selectivity columns, ordering composite indexes correctly, using covering indexes, removing redundancy, applying changes with online tools, and continuously monitoring usage.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
