Databases 25 min read

Why MySQL DDL Changes Can Double Table Size: B+Tree Page‑Split Bug Explained

After a normal DDL change on OneDBA, the table size grew by nearly 100 % and some queries became slow because InnoDB’s B+Tree page‑splitting logic mishandles rows larger than 8 KB, causing each page to hold only one record and corrupting statistics.

dbaplus Community
dbaplus Community
dbaplus Community
Why MySQL DDL Changes Can Double Table Size: B+Tree Page‑Split Bug Explained

Background

During a routine DDL change on the OneDBA platform, developers observed that some SQL statements became slow and the table’s storage size almost doubled. Reproducing the incident showed a defect in MySQL’s handling of B+Tree page splits when a single row exceeds the page size.

InnoDB B+Tree Structure

InnoDB tables are index‑organized. The index is a B+Tree consisting of a root page, internal nodes, and leaf pages. Each page is 16 KB and can store at least two rows (the minimum is a storage guarantee, not a hard row count). Rows larger than 8 KB are stored on an overflow page . Leaf pages hold the actual row data; internal nodes contain child‑page pointers.

Every page contains two virtual records: Infimum (the smallest possible key) and Supremum (the largest possible key). These records are created when the page is allocated and are never removed.

Page‑split mechanisms

Mid‑point split : moves roughly 50 % of the records to a newly allocated page.

Insert‑point split : when inserts are sequential, the engine may split at the current insert point, moving some records to the right page and inserting the new record into the left page.

The core functions implementing these mechanisms are btr_page_split_and_insert, btr_page_get_split_rec_to_right and btr_page_get_split_rec_to_left.

Current lock‑free DDL workflow (dw‑osc)

The platform uses a container‑based, lock‑free DDL tool (internal name dw‑osc ) that is based on the open‑source ghost project (GitHub: https://github.com/github/ghost). Its simplified workflow is:

Create a temporary table with the same schema.

Copy all data from the original table to the temporary table.

Buffer incremental DML statements.

After the full copy finishes, replay the buffered binlog events.

Rename the original table and swap in the temporary table.

Because the binlog replay (step 4) can insert rows with larger primary‑key values before the full copy (step 2) finishes, new rows may be written to the table out‑of‑order.

Why the tablespace inflated

If a large row (≥ 8 KB) is inserted during step 4, InnoDB performs a right‑hand page split that moves the single large record to a new page, leaving the original page with only one record. A 16 KB page that stores a 5 KB row is only ~30 % full, so the tablespace roughly doubles.

Reproduction steps:

CREATE TABLE sbtest (
  id INT NOT NULL AUTO_INCREMENT,
  pad VARCHAR(12000),
  PRIMARY KEY (id)
) ENGINE=InnoDB;
INSERT INTO sbtest VALUES (10000, REPEAT('a',5120));
INSERT INTO sbtest VALUES (10001, REPEAT('a',5120));

A helper program ( ./peng) shows both rows on page 3. Subsequent sequential inserts trigger the faulty split, eventually leaving many pages with a single record.

Why index statistics became inaccurate

InnoDB gathers index statistics by sampling a fixed number of leaf pages (default 20) and counting distinct keys per page. After counting, the code subtracts 1 from the distinct‑key count to avoid double‑counting adjacent pages. When a page contains only one row, this subtraction reduces the count to zero, dramatically under‑estimating the index cardinality.

Example query shows the drop:

SELECT table_name, index_name, stat_value, sample_size
FROM mysql.innodb_index_stats
WHERE database_name='sbtest' AND table_name='table_1' AND stat_name='n_diff_pfx01';

The n_diff_pfx01 statistic fell from ~200 M to ~200 K while the table still held ~400 M rows.

Impact on query performance

MySQL’s optimizer can use either the (incorrect) statistics or a real‑time “index‑dive” estimate. For a query such as:

SELECT * FROM tbl
WHERE user_id=? AND biz=? AND is_del=? AND status IN (?)
ORDER BY modify_time LIMIT 5;

the underestimated row count makes the covering index idx_modify_time appear cheaper than the filtering index idx_user_biz_del. The optimizer chooses the covering index, causing excessive row scans and slow response.

Temporary remediation

Running a native ALTER TABLE … ENGINE=InnoDB during a low‑traffic window forces MySQL to rebuild the table and its pages, eliminating the one‑row‑per‑page condition. This restores correct statistics and query performance, but the operation cannot be throttled and may cause replication lag.

Long‑term solution

Modify the DDL tool so that the full data copy completes before any binlog replay, matching the native DDL workflow. This prevents interleaved insertion of large rows that trigger the faulty split path. The change is analogous to the approach taken in gh‑ost’s PR‑1378.

Key code references

btr_cur_optimistic_insert

– fast insert when the current page has space. btr_cur_pessimistic_insert – triggers page split. btr_root_raise_and_insert – handles root‑page split. btr_page_split_and_insert – core page‑split logic. btr_page_get_split_rec_to_right – decides right‑hand split (contains the special handling that moves a large record to a new page).

Summary

The issue originates from the interaction between a lock‑free DDL tool’s parallel copy‑and‑replay workflow and InnoDB’s page‑split logic for rows larger than 8 KB. The resulting one‑row‑per‑page pages double the tablespace and cause InnoDB’s index‑statistics algorithm to severely underestimate cardinality, which in turn leads the optimizer to choose sub‑optimal indexes and produce slow queries. A quick fix is to rebuild the table with ALTER TABLE … ENGINE=InnoDB. A permanent fix is to adjust the DDL tool to copy all data before applying binlog events, eliminating the problematic interleaved inserts.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

statisticsInnoDBmysqlB+TreeDDL
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.