Why MySQL DDL Changes Can Double Table Size: B+Tree Page‑Split Bug Explained
After a normal DDL change on OneDBA, the table size grew by nearly 100 % and some queries became slow because InnoDB’s B+Tree page‑splitting logic mishandles rows larger than 8 KB, causing each page to hold only one record and corrupting statistics.
Background
During a routine DDL change on the OneDBA platform, developers observed that some SQL statements became slow and the table’s storage size almost doubled. Reproducing the incident showed a defect in MySQL’s handling of B+Tree page splits when a single row exceeds the page size.
InnoDB B+Tree Structure
InnoDB tables are index‑organized. The index is a B+Tree consisting of a root page, internal nodes, and leaf pages. Each page is 16 KB and can store at least two rows (the minimum is a storage guarantee, not a hard row count). Rows larger than 8 KB are stored on an overflow page . Leaf pages hold the actual row data; internal nodes contain child‑page pointers.
Every page contains two virtual records: Infimum (the smallest possible key) and Supremum (the largest possible key). These records are created when the page is allocated and are never removed.
Page‑split mechanisms
Mid‑point split : moves roughly 50 % of the records to a newly allocated page.
Insert‑point split : when inserts are sequential, the engine may split at the current insert point, moving some records to the right page and inserting the new record into the left page.
The core functions implementing these mechanisms are btr_page_split_and_insert, btr_page_get_split_rec_to_right and btr_page_get_split_rec_to_left.
Current lock‑free DDL workflow (dw‑osc)
The platform uses a container‑based, lock‑free DDL tool (internal name dw‑osc ) that is based on the open‑source ghost project (GitHub: https://github.com/github/ghost). Its simplified workflow is:
Create a temporary table with the same schema.
Copy all data from the original table to the temporary table.
Buffer incremental DML statements.
After the full copy finishes, replay the buffered binlog events.
Rename the original table and swap in the temporary table.
Because the binlog replay (step 4) can insert rows with larger primary‑key values before the full copy (step 2) finishes, new rows may be written to the table out‑of‑order.
Why the tablespace inflated
If a large row (≥ 8 KB) is inserted during step 4, InnoDB performs a right‑hand page split that moves the single large record to a new page, leaving the original page with only one record. A 16 KB page that stores a 5 KB row is only ~30 % full, so the tablespace roughly doubles.
Reproduction steps:
CREATE TABLE sbtest (
id INT NOT NULL AUTO_INCREMENT,
pad VARCHAR(12000),
PRIMARY KEY (id)
) ENGINE=InnoDB; INSERT INTO sbtest VALUES (10000, REPEAT('a',5120));
INSERT INTO sbtest VALUES (10001, REPEAT('a',5120));A helper program ( ./peng) shows both rows on page 3. Subsequent sequential inserts trigger the faulty split, eventually leaving many pages with a single record.
Why index statistics became inaccurate
InnoDB gathers index statistics by sampling a fixed number of leaf pages (default 20) and counting distinct keys per page. After counting, the code subtracts 1 from the distinct‑key count to avoid double‑counting adjacent pages. When a page contains only one row, this subtraction reduces the count to zero, dramatically under‑estimating the index cardinality.
Example query shows the drop:
SELECT table_name, index_name, stat_value, sample_size
FROM mysql.innodb_index_stats
WHERE database_name='sbtest' AND table_name='table_1' AND stat_name='n_diff_pfx01';The n_diff_pfx01 statistic fell from ~200 M to ~200 K while the table still held ~400 M rows.
Impact on query performance
MySQL’s optimizer can use either the (incorrect) statistics or a real‑time “index‑dive” estimate. For a query such as:
SELECT * FROM tbl
WHERE user_id=? AND biz=? AND is_del=? AND status IN (?)
ORDER BY modify_time LIMIT 5;the underestimated row count makes the covering index idx_modify_time appear cheaper than the filtering index idx_user_biz_del. The optimizer chooses the covering index, causing excessive row scans and slow response.
Temporary remediation
Running a native ALTER TABLE … ENGINE=InnoDB during a low‑traffic window forces MySQL to rebuild the table and its pages, eliminating the one‑row‑per‑page condition. This restores correct statistics and query performance, but the operation cannot be throttled and may cause replication lag.
Long‑term solution
Modify the DDL tool so that the full data copy completes before any binlog replay, matching the native DDL workflow. This prevents interleaved insertion of large rows that trigger the faulty split path. The change is analogous to the approach taken in gh‑ost’s PR‑1378.
Key code references
btr_cur_optimistic_insert– fast insert when the current page has space. btr_cur_pessimistic_insert – triggers page split. btr_root_raise_and_insert – handles root‑page split. btr_page_split_and_insert – core page‑split logic. btr_page_get_split_rec_to_right – decides right‑hand split (contains the special handling that moves a large record to a new page).
Summary
The issue originates from the interaction between a lock‑free DDL tool’s parallel copy‑and‑replay workflow and InnoDB’s page‑split logic for rows larger than 8 KB. The resulting one‑row‑per‑page pages double the tablespace and cause InnoDB’s index‑statistics algorithm to severely underestimate cardinality, which in turn leads the optimizer to choose sub‑optimal indexes and produce slow queries. A quick fix is to rebuild the table with ALTER TABLE … ENGINE=InnoDB. A permanent fix is to adjust the DDL tool to copy all data before applying binlog events, eliminating the problematic interleaved inserts.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
