Understanding InnoDB Page Structure, Row Formats, and B+Tree Indexes
This article explains InnoDB's 16KB page structure, row formats, record metadata, primary key strategies, and how B+Tree indexes—including clustered, secondary, and covering indexes—organize and retrieve data, with practical SQL examples and performance considerations.
InnoDB stores data in pages of typically 16 KB, which serve as the basic unit of interaction between disk and memory; each I/O operation reads or writes at least one full page.
Rows are stored according to a row format (Compact in the examples) that includes extra information: a variable‑length field length list, a NULL‑value bitmap, and a record header.
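For variable-length columns such as VARCHAR, VARBINARY, TEXT and BLOB, the record must store both the actual data and the byte length of each value. The data itself lives in the column area of the record, while the lengths are kept in the variable-length field length list at the beginning of the record, in reverse column order.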
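The reverse-ordered length list can be sketched as follows. This is a simplified model, not InnoDB's actual byte encoding: the function name `length_list` and the `(max_bytes, value)` input shape are hypothetical, and the 1-byte versus 2-byte rule is the commonly described Compact-format behavior (1 byte if the column can never exceed 255 bytes or the actual value is at most 127 bytes, otherwise 2 bytes).

```python
def length_list(columns):
    """Model the variable-length field length list of one record.

    columns: (max_bytes, value) pairs for the variable-length columns,
    in table order; value is bytes, or None for NULL.
    Returns (actual_length, bytes_used_for_length) pairs in the order
    they would appear in the record, i.e. reverse column order.
    """
    entries = []
    for max_bytes, value in columns:
        if value is None:
            continue  # NULL values get no entry in the length list
        n = len(value)
        # Assumed rule: short columns or short values need one byte.
        width = 1 if (max_bytes <= 255 or n <= 127) else 2
        entries.append((n, width))
    entries.reverse()  # lengths are stored in reverse column order
    return entries


# Two VARCHAR(10)-like columns around one VARCHAR(10000)-like column.
example = length_list([(10, b"aaaa"), (10000, b"b" * 200), (10, None)])
```

Here `example` lists the 200-byte value first (needing a 2-byte length) and the 4-byte value second, while the NULL column contributes nothing.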
The NULL‑value list records which columns are NULL, allowing the engine to omit storage for those values. The record header contains flags such as delete_mask, min_rec_mask, n_owned, heap_no, record_type and next_record, which together describe the record’s state and its position in the page.
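As a rough illustration of the NULL-value list, the sketch below packs one bit per nullable column into whole bytes. The function name `null_bitmap` is hypothetical, and the bit layout (first nullable column in the lowest bit, padded to full bytes) is a simplifying assumption that mirrors the reverse-order convention rather than a byte-exact copy of InnoDB's format.

```python
def null_bitmap(nullable_values):
    """Pack NULL flags for the nullable columns of one record.

    nullable_values: values of the columns that allow NULL, in table
    order (NOT NULL columns such as the primary key are excluded).
    A set bit means the column is NULL and stores no data.
    """
    bits = 0
    for i, value in enumerate(nullable_values):
        if value is None:
            bits |= 1 << i  # first nullable column -> lowest bit
    nbytes = (len(nullable_values) + 7) // 8  # round up to whole bytes
    return bits.to_bytes(nbytes, "big")


# page_demo has two nullable columns, c2 and c3 (c1 is the primary key).
c2_is_null = null_bitmap([None, "aaaa"])
c3_is_null = null_bitmap([100, None])
```

With two nullable columns the list occupies a single byte; a ninth nullable column would grow it to two bytes.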
Primary-key selection follows a hierarchy: user-defined primary key → first UNIQUE key whose columns are all NOT NULL → hidden row_id column if neither exists.
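The selection order can be written down as a small decision function. This is only a sketch: `choose_clustered_key` and its argument shapes are hypothetical, and it assumes that a UNIQUE key qualifies only when all of its columns are declared NOT NULL.

```python
def choose_clustered_key(primary_key, unique_keys):
    """Pick what the table is clustered on.

    primary_key: list of column names, or None if not declared.
    unique_keys: (key_name, all_columns_not_null) pairs in
    declaration order.
    """
    if primary_key:
        return ("primary", primary_key)
    for name, all_not_null in unique_keys:
        if all_not_null:  # assumption: nullable UNIQUE keys are skipped
            return ("unique", name)
    return ("hidden", "row_id")  # engine-generated hidden column


with_pk = choose_clustered_key(["c1"], [])
with_unique = choose_clustered_key(None, [("uk_a", False), ("uk_b", True)])
with_nothing = choose_clustered_key(None, [])
```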
Example table creation (Compact row format):
CREATE TABLE page_demo(
    c1 INT,
    c2 INT,
    c3 VARCHAR(10000),
    PRIMARY KEY (c1)
) CHARSET=ascii ROW_FORMAT=Compact;
When inserting rows, InnoDB allocates space from the Free Space area to the User Records area; once the page is full, a new page is allocated. Deleting a row sets delete_mask to 1 and may leave a reusable space that can be reclaimed by later inserts.
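The insert and delete behavior described above can be modeled with a toy page. Everything here is illustrative: the `Page` class, the 128-byte `overhead` figure standing in for headers and directory, and the record sizes are assumptions, and the model does not attempt to reuse deleted space.

```python
PAGE_SIZE = 16 * 1024  # bytes in one page


class Page:
    def __init__(self, overhead=128):
        # overhead is a made-up stand-in for page headers, trailer
        # and directory; the rest starts out as Free Space.
        self.free = PAGE_SIZE - overhead
        self.records = {}   # key -> {"size": ..., "delete_mask": ...}
        self.reusable = 0   # bytes held by delete-marked records

    def insert(self, key, size):
        """Move `size` bytes from Free Space to User Records."""
        if size > self.free:
            return False    # page is full: caller must use a new page
        self.free -= size
        self.records[key] = {"size": size, "delete_mask": 0}
        return True

    def delete(self, key):
        """Mark the record deleted; its space becomes reusable."""
        record = self.records[key]
        record["delete_mask"] = 1
        self.reusable += record["size"]


page = Page()
first = page.insert(1, 8000)
second = page.insert(2, 8000)
third = page.insert(3, 8000)   # does not fit any more
page.delete(1)
```

The third insert fails because only 256 bytes of Free Space remain, which is the point at which a new page would be allocated.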
The page directory groups records into slots; each slot points to the last record of a group, whose n_owned indicates how many records belong to that group. Binary search on the directory quickly locates the appropriate slot, after which the linked list of records (ordered by primary key) is traversed.
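A minimal sketch of this two-step lookup, assuming a simplified page: records are a sorted Python list instead of a linked list, every group has a fixed hypothetical size of 4, and the special Infimum and Supremum records are left out. The names `build_directory` and `find` are illustrative.

```python
import bisect


def build_directory(keys, group_size=4):
    """Group sorted keys into slots.

    Each slot is (index_of_last_record_in_group, n_owned).
    """
    slots = []
    for start in range(0, len(keys), group_size):
        end = min(start + group_size, len(keys))
        slots.append((end - 1, end - start))
    return slots


def find(keys, slots, target):
    """Binary-search the slots, then walk the one matching group."""
    last_keys = [keys[last] for last, _ in slots]
    s = bisect.bisect_left(last_keys, target)  # first slot with last key >= target
    if s == len(slots):
        return None  # target is larger than every key in the page
    last, n_owned = slots[s]
    for i in range(last - n_owned + 1, last + 1):  # traverse the group in key order
        if keys[i] == target:
            return i
    return None


keys = [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]
slots = build_directory(keys)
position = find(keys, slots, 13)
```

Only the records of one group are scanned linearly, so the cost per page is a binary search over the slots plus at most `group_size` comparisons.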
B+Tree indexes consist of leaf nodes that store either full rows (clustered index) or only indexed columns plus the primary key (secondary index). The leaf nodes are linked, and internal nodes contain directory entries that map key ranges to child pages.
A clustered index is the primary storage of the table: the table's rows are stored in the leaf nodes of the primary-key B+Tree. Secondary indexes store a copy of the indexed columns plus the primary-key value, so retrieving non-indexed columns requires a "back-lookup" into the clustered index by primary key.
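The relationship between the two index types can be sketched with plain Python containers, using the page_demo columns. A dict stands in for the clustered B+Tree and a sorted list of `(c2, c1)` pairs for a hypothetical secondary index on c2; the sample rows and function names are made up for illustration.

```python
import bisect

# Clustered index: primary key c1 -> full row.
clustered = {
    1: {"c1": 1, "c2": 40, "c3": "aaaa"},
    2: {"c1": 2, "c2": 10, "c3": "bbbb"},
    3: {"c1": 3, "c2": 40, "c3": "cccc"},
}

# Secondary index on c2: indexed column plus primary key, sorted.
idx_c2 = sorted((row["c2"], pk) for pk, row in clustered.items())


def primary_keys_for(c2_value):
    """Search the secondary index only; returns matching c1 values."""
    start = bisect.bisect_left(idx_c2, (c2_value,))
    pks = []
    for c2, pk in idx_c2[start:]:
        if c2 != c2_value:
            break
        pks.append(pk)
    return pks


def select_rows(c2_value):
    """Fetch full rows: one extra clustered-index lookup per match."""
    return [clustered[pk] for pk in primary_keys_for(c2_value)]


matching_pks = primary_keys_for(40)
matching_c3 = [row["c3"] for row in select_rows(40)]
```

`primary_keys_for` never touches `clustered`, which is the situation a query enjoys when it needs only c1 and c2; `select_rows` pays for one additional lookup per matching entry because c3 is stored only in the clustered index.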
Index usage patterns include full‑value matches, left‑most prefix matches, range scans, and ORDER BY optimizations. Queries can benefit from covering indexes when the SELECT list contains only indexed columns, eliminating the need for a back‑lookup.
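The left-most prefix rule and the covering condition can be expressed as two small checks. This is a deliberately simplified model of what an optimizer decides, not MySQL's actual logic: the function names, the column names `a`, `b`, `c`, and the rule that a range predicate ends the usable prefix are assumptions made for illustration.

```python
def usable_prefix(index_columns, equality_columns, range_column=None):
    """Return the leading index columns a query can use for lookup."""
    used = []
    for col in index_columns:
        if col in equality_columns:
            used.append(col)      # equality: keep matching further columns
        elif col == range_column:
            used.append(col)      # range: usable, but ends the prefix
            break
        else:
            break                 # gap in the prefix: stop here
    return used


def is_covering(index_columns, primary_key, needed_columns):
    """True if the secondary index alone holds every needed column."""
    available = set(index_columns) | set(primary_key)
    return set(needed_columns) <= available


index = ["a", "b", "c"]
full_match = usable_prefix(index, {"a", "b", "c"})
gap = usable_prefix(index, {"a", "c"})          # b is missing
with_range = usable_prefix(index, {"a"}, range_column="b")
no_prefix = usable_prefix(index, {"b", "c"})    # a is missing
```

For page_demo with a hypothetical index on c2, `is_covering(["c2"], ["c1"], ["c1", "c2"])` holds, while adding c3 to the needed columns makes a back-lookup necessary.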
Performance tips: create indexes only on columns used for search, sort, or grouping; prefer high‑cardinality columns; keep indexed column types small; avoid expressions or functions on indexed columns; and insert primary‑key values in increasing order (e.g., AUTO_INCREMENT) to reduce page splits.
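The effect of insert order on page splits can be seen in a toy simulation. It assumes a hypothetical leaf capacity of 4 records and a naive split-in-half policy; real InnoDB pages hold far more records and use more refined split heuristics, so only the trend is meaningful.

```python
import bisect


def count_splits(keys, capacity=4):
    """Insert keys into ordered leaf pages and count page splits."""
    pages = [[]]  # leaf pages in key order, each a sorted list
    splits = 0
    for k in keys:
        # Target page: first page whose largest key is >= k, else the last page.
        idx = len(pages) - 1
        for i, p in enumerate(pages):
            if p and k <= p[-1]:
                idx = i
                break
        page = pages[idx]
        if len(page) < capacity:
            bisect.insort(page, k)
        elif idx == len(pages) - 1 and k > page[-1]:
            pages.append([k])  # append at the right edge: no records move
        else:
            bisect.insort(page, k)
            mid = len(page) // 2
            pages[idx:idx + 1] = [page[:mid], page[mid:]]  # split in half
            splits += 1
    return splits


increasing = count_splits(range(1, 101))
decreasing = count_splits(range(100, 0, -1))
```

Increasing keys always land at the right edge of the last page, so no existing records have to move; keys arriving in any other order keep hitting full pages in the middle of the key range and force splits.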
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
