InnoDB Storage Formats, System Pages, IO Subsystem, and Buffer Pool Management in MySQL 5.7
This article explains MySQL 5.7 InnoDB's traditional compressed storage format, system page structures, external storage pages, encrypted and R‑TREE pages, the IO subsystem including asynchronous AIO handling and read‑ahead strategies, as well as buffer pool initialization, management, and concurrency control.
Traditional Compressed Storage Format
When a table is created or altered with row_format=compressed and a key_block_size of 1, 2, 4, or 8 KB, the resulting .ibd file is divided into blocks of the specified size. Each compressed page carries a modification log (mlog) that records DML operations, so the page does not have to be recompressed on every change.
Insert: write the full record to the mlog.
Update: either a delete‑insert update (mark old dense slot as deleted and write a new record) or an in‑place update (write the new record directly).
Delete: mark the dense slot as deleted.
Compression and decompression are performed by the functions page_zip_compress and page_zip_decompress.
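The idea behind the modification log can be illustrated with a toy model. This is a minimal sketch, not InnoDB's page layout: the class ToyCompressedPage, its record encoding, and its size accounting are all hypothetical, and zlib stands in for the real compression path. It only shows that changes are appended to a log until the block runs out of room, at which point the page is recompressed and the log is emptied.

```python
import zlib


class ToyCompressedPage:
    """Toy model: a compressed image followed by an uncompressed modification log."""

    def __init__(self, records, block_size=8192):
        self.block_size = block_size
        self.records = dict(records)   # uncompressed copy of the page content
        self.mlog = []                 # (op, key, value) entries since last compression
        self.recompressions = 0
        self._compress()

    def _compress(self):
        payload = "\n".join(f"{k}={v}" for k, v in sorted(self.records.items()))
        self.image = zlib.compress(payload.encode())
        self.mlog.clear()              # the new image already contains every change
        self.recompressions += 1

    def _log(self, op, key, value=""):
        used = len(self.image) + sum(1 + len(k) + len(v) for _, k, v in self.mlog)
        if used + 1 + len(key) + len(value) > self.block_size:
            self._compress()           # log is full: recompress the whole page
        else:
            self.mlog.append((op, key, value))

    def insert(self, key, value):
        self.records[key] = value
        self._log("I", key, value)     # insert: log the full record

    def delete(self, key):
        del self.records[key]
        self._log("D", key)            # delete: log only a marker for the slot
```

With a roomy block, inserts and deletes only grow the log; with a block that is already full, every change forces a recompression, which is the cost the log is meant to avoid.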
System Data Pages
Pages in ibdata that do not belong to an individual table are called system data pages. Important system pages include:
FSP_IBUF_HEADER_PAGE_NO: header page for the change buffer (type FIL_PAGE_TYPE_SYS).
FSP_IBUF_TREE_ROOT_PAGE_NO: root page of the change‑buffer B‑tree.
FSP_TRX_SYS_PAGE_NO / FSP_FIRST_RSEG_PAGE_NO: the transaction system page and the first rollback segment page; the transaction system page stores transaction IDs, segment headers, rollback‑segment locations, double‑write buffer info, etc.
FSP_DICT_HDR_PAGE_NO: dictionary header page that holds metadata for system tables (e.g., SYS_TABLES, SYS_COLUMNS).
Relevant creation functions are btr_create, trx_sysf_create, and dict_hdr_create.
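To locate these pages in an ibdata file, the constants can be turned into byte offsets. The sketch below assumes the page numbers defined in the 5.7 source header fsp0types.h and the default 16 KB page size; the helper page_offset is hypothetical, and both assumptions should be checked against the build in use.

```python
UNIV_PAGE_SIZE = 16 * 1024  # assumed default innodb_page_size

# Page numbers inside the system tablespace (assumed from fsp0types.h).
SYSTEM_PAGES = {
    "FSP_IBUF_HEADER_PAGE_NO": 3,
    "FSP_IBUF_TREE_ROOT_PAGE_NO": 4,
    "FSP_TRX_SYS_PAGE_NO": 5,
    "FSP_FIRST_RSEG_PAGE_NO": 6,
    "FSP_DICT_HDR_PAGE_NO": 7,
}


def page_offset(name, page_size=UNIV_PAGE_SIZE):
    """Return the byte offset of a system page within ibdata."""
    return SYSTEM_PAGES[name] * page_size
```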
External Storage Pages
Large column values may be stored in external pages. Three types exist:
FIL_PAGE_TYPE_BLOB: uncompressed external page.
FIL_PAGE_TYPE_ZBLOB: first page of a compressed blob chain.
FIL_PAGE_TYPE_ZBLOB2: subsequent pages of a compressed blob chain.
Only a 20‑byte pointer is kept in the record to reference these pages.
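The 20‑byte pointer can be decoded with a few lines of code. The sketch below assumes the field order used by the BTR_EXTERN_* offsets in btr0cur.h (space id, page number, offset within the page, then an 8‑byte length whose top two bits are the owner and inherited flags); the function and dictionary keys are hypothetical names chosen for illustration.

```python
import struct

# Big-endian: space id (4), page no (4), offset (4), length plus flags (8).
EXTERN_REF = struct.Struct(">IIIQ")
OWNER_FLAG = 0x80 << 56      # set when this record does NOT own the external value
INHERITED_FLAG = 0x40 << 56  # set when the value was inherited from an older version


def parse_extern_ref(ref: bytes) -> dict:
    """Split a 20-byte external reference into its fields."""
    if len(ref) != EXTERN_REF.size:
        raise ValueError("external reference must be 20 bytes")
    space_id, page_no, offset, len_and_flags = EXTERN_REF.unpack(ref)
    return {
        "space_id": space_id,
        "page_no": page_no,
        "offset": offset,
        "length": len_and_flags & ~(OWNER_FLAG | INHERITED_FLAG),
        "not_owner": bool(len_and_flags & OWNER_FLAG),
        "inherited": bool(len_and_flags & INHERITED_FLAG),
    }
```

The page_no field points at the first external page; the remaining pages are reached by following the chain from there.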
Encrypted and R‑TREE Pages (MySQL 5.7)
MySQL 5.7 adds three encrypted page types: FIL_PAGE_ENCRYPTED, FIL_PAGE_COMPRESSED_AND_ENCRYPTED, and FIL_PAGE_ENCRYPTED_RTREE. Encryption is applied after compression (if any) using the functions os_file_encrypt_page → Encryption::encrypt and os_file_io_complete → Encryption::decrypt. Key information is stored in the first page of each .ibd file and can be rotated with ALTER INSTANCE ROTATE INNODB MASTER KEY.
Temporary Tablespace ibtmp
MySQL 5.7 introduces a dedicated temporary tablespace (ibtmp1) for non‑compressed temporary tables. Its default size is 12 MB and can be changed via innodb_temp_data_file_path. Rollback segments 1 through 32 also reside here; segment 0 stays in ibdata.
Log Files (ib_logfile)
Redo logs now use CRC32 checksums, controlled by innodb_log_checksums. A version header is stored at the beginning of each log file and is updated by log_group_file_header_flush. Because the log format changed, a server upgraded to 5.7 cannot be downgraded unless it is first shut down cleanly.
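The role of the block checksum can be sketched as follows. This assumes 512‑byte log blocks with the checksum in the last 4 bytes; the function names are hypothetical, the block header is ignored, and zlib.crc32 is used only as a stand‑in because it is in the standard library (InnoDB's own CRC32 routine uses a different polynomial, so the values will not match real log files).

```python
import struct
import zlib

LOG_BLOCK_SIZE = 512     # assumed size of one redo log block
LOG_BLOCK_TRL_SIZE = 4   # assumed trailer size holding the checksum


def seal_block(payload: bytes) -> bytes:
    """Pad the payload to a full block body and append its checksum."""
    body_len = LOG_BLOCK_SIZE - LOG_BLOCK_TRL_SIZE
    if len(payload) > body_len:
        raise ValueError("payload does not fit in one log block")
    body = payload.ljust(body_len, b"\0")
    return body + struct.pack(">I", zlib.crc32(body))


def block_is_valid(block: bytes, checksums_enabled: bool = True) -> bool:
    """Verify the trailer checksum; skip the check when checksums are disabled."""
    if len(block) != LOG_BLOCK_SIZE:
        return False
    if not checksums_enabled:
        return True
    body, trailer = block[:-LOG_BLOCK_TRL_SIZE], block[-LOG_BLOCK_TRL_SIZE:]
    return struct.unpack(">I", trailer)[0] == zlib.crc32(body)
```

The checksums_enabled switch mirrors what innodb_log_checksums controls: with it off, a damaged block is accepted without verification.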
IO Subsystem
InnoDB separates read and write operations. Synchronous reads/writes are performed directly by the calling thread, while asynchronous operations are queued in AIO task lists (AIO::s_reads, AIO::s_writes, AIO::s_log, AIO::s_ibuf) and processed by dedicated IO backend threads.
IO Backend Threads
IO READ threads – handle asynchronous file reads.
IO WRITE threads – handle asynchronous file writes.
LOG thread – writes checkpoint information.
IBUF thread – processes change‑buffer pages.
IO Request Initiation
The entry point is os_aio_func. For synchronous requests the thread calls os_file_read_func or os_file_write_func. For asynchronous requests the thread reserves a slot in the appropriate AIO queue, fills in file, offset, and data information, and may compress or encrypt the page before dispatch.
Asynchronous AIO Handling
IO threads invoke io_handler_thread → fil_aio_wait, which calls os_aio_handler. Native AIO uses os_aio_linux_handle and polls with io_getevents. Simulated AIO merges adjacent requests (disabled in 5.7) and processes slots based on age and offset.
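The split between synchronous and asynchronous requests can be modeled in a few lines. The sketch below is a toy, not the os_aio_func implementation: ToyAioArray, toy_aio_read, and the dictionary standing in for a file are hypothetical. It shows a synchronous request being served by the caller, and an asynchronous one being placed in a slot queue and completed later by a background handler thread.

```python
import queue
import threading


class ToyAioArray:
    """A slot queue served by background handler threads (toy model)."""

    def __init__(self, storage, n_threads=2):
        self.storage = storage   # dict offset -> bytes, stands in for a file
        self.slots = queue.Queue()
        self.threads = [
            threading.Thread(target=self._handler, daemon=True)
            for _ in range(n_threads)
        ]
        for t in self.threads:
            t.start()

    def _handler(self):
        while True:
            slot = self.slots.get()
            try:
                if slot is None:          # shutdown marker
                    return
                offset, on_complete = slot
                on_complete(offset, self.storage[offset])
            finally:
                self.slots.task_done()

    def shutdown(self):
        for _ in self.threads:
            self.slots.put(None)
        for t in self.threads:
            t.join()


def toy_aio_read(array, offset, on_complete, sync):
    if sync:
        # Synchronous: the calling thread performs the read itself.
        on_complete(offset, array.storage[offset])
    else:
        # Asynchronous: reserve a slot and return immediately.
        array.slots.put((offset, on_complete))
```

A caller that needs the asynchronous results can wait with array.slots.join(), which returns once every queued slot has been handled.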
IO Concurrency Control
Extending a file sets fil_node_t::being_extended so that two threads cannot extend the same file concurrently. Deleting, truncating, or renaming a table sets flags (stop_new_ops, is_being_truncated, etc.) that block new IO until pending operations complete.
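The stop‑flag pattern can be sketched with a counter and a condition variable. This is a simplified model under assumptions: the class ToySpace and its methods are hypothetical, and only the two ideas from the text are kept, namely that new operations are refused once the flag is set and that the dropping thread waits for pending operations to drain.

```python
import threading


class ToySpace:
    """Toy tablespace handle with a stop flag and a pending-operation counter."""

    def __init__(self):
        self._cond = threading.Condition()
        self.stop_new_ops = False
        self.n_pending_ops = 0

    def acquire_for_io(self):
        """Register a new IO; refuse if the space is being dropped."""
        with self._cond:
            if self.stop_new_ops:
                return False
            self.n_pending_ops += 1
            return True

    def release_for_io(self):
        """Mark one IO as finished and wake any waiting dropper."""
        with self._cond:
            self.n_pending_ops -= 1
            self._cond.notify_all()

    def begin_drop(self):
        """Block new IO, then wait until all pending IO has completed."""
        with self._cond:
            self.stop_new_ops = True
            self._cond.wait_for(lambda: self.n_pending_ops == 0)
```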
File Read‑Ahead
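Two read‑ahead strategies exist: random (buf_read_ahead_random) and linear (buf_read_ahead_linear). Random read‑ahead is triggered when enough pages of the same extent are already in the buffer pool; linear read‑ahead is triggered when the number of sequentially accessed pages in an extent exceeds a threshold. Facebook's MySQL branch also implements logical read‑ahead via row_search_for_mysql → row_read_ahead_logical.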
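A simplified version of the linear trigger is sketched below. It assumes 64‑page extents and a threshold of 56 (the documented default of innodb_read_ahead_threshold), considers only ascending access order, and uses a hypothetical function name and a plain dictionary of access sequence numbers; the real buf_read_ahead_linear performs more checks than this.

```python
EXTENT_PAGES = 64  # assumed pages per extent with 16 KB pages


def linear_read_ahead_target(page_no, access_seq, threshold=56):
    """Return the range of pages to prefetch, or None.

    access_seq maps page_no -> sequence number of its first access,
    for pages of this extent that are already in the buffer pool.
    """
    low = (page_no // EXTENT_PAGES) * EXTENT_PAGES
    high = low + EXTENT_PAGES
    if page_no != high - 1:
        return None                      # only the extent's last page triggers
    in_order = 0
    prev = None
    for p in range(low, high):
        seq = access_seq.get(p)
        if seq is not None and (prev is None or seq > prev):
            in_order += 1                # page was touched after its predecessor
            prev = seq
    if in_order < threshold:
        return None
    return range(high, high + EXTENT_PAGES)  # prefetch the next extent
```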
Log Write Padding
To avoid read‑on‑write, MySQL 5.7 aligns redo‑log writes to the block size using the parameter innodb_log_write_ahead_size, padding the tail of the log with zeros (see log_write_up_to).
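The padding arithmetic amounts to rounding the end of a write up to the next write‑ahead boundary. The sketch below shows only that rounding step, assuming the documented default of 8192 bytes for innodb_log_write_ahead_size; the function name is hypothetical, and the real log_write_up_to also tracks how far ahead it has already written.

```python
def write_ahead_padding(offset, length, write_ahead_size=8192):
    """Number of zero bytes needed so the write ends on a write-ahead boundary."""
    end = offset + length
    padded_end = -(-end // write_ahead_size) * write_ahead_size  # ceiling to a multiple
    return padded_end - end
```

For example, a 512‑byte write at offset 0 would be followed by 7680 zero bytes, so the operating system can write a whole block without first reading it from disk.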
Buffer Pool Memory Management
From 5.6 to 5.7 the buffer pool changed to support multiple chunks per instance (the default chunk size is 128 MB). Each instance is allocated in whole‑chunk units, so configuring many instances can cause large memory overallocation.
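The overallocation follows from simple rounding: the total size is adjusted up to a multiple of the chunk size times the number of instances. The sketch below assumes the documented default of 128 MB for innodb_buffer_pool_chunk_size; the function name is hypothetical.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def adjusted_buffer_pool_size(requested, instances, chunk_size=128 * MB):
    """Round the requested size up to a multiple of instances * chunk_size."""
    unit = instances * chunk_size
    return -(-requested // unit) * unit
```

With 16 instances the allocation unit is 2 GB, so a requested size of 9 GB becomes 10 GB.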
Initialization and Structures
Each instance maintains several linked lists (LRU, free list, unzip LRU, etc.) and structures such as buf_pool_t, buf_page_t, and buf_block_t. The diagram of these objects is omitted for brevity.
Concurrency Control
Read operations acquire a shared block lock and increment buf_fix_count ; write operations acquire an exclusive lock. Flushes skip pages with non‑zero buf_fix_count . When multiple threads request the same page, the first thread loads it while others wait on the block’s X‑lock via buf_wait_for_read .
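The "first thread loads, the others wait" behavior can be modeled as follows. This is a toy sketch under assumptions: ToyPageHash and its fields are hypothetical, and a threading.Event stands in for the block lock that waiting threads block on.

```python
import threading


class ToyPageHash:
    """Toy page cache: one thread reads a missing page, the rest wait for it."""

    def __init__(self, loader):
        self._mutex = threading.Lock()
        self._pages = {}
        self._loader = loader            # function page_no -> page content

    def get(self, page_no):
        with self._mutex:
            entry = self._pages.get(page_no)
            is_reader = entry is None
            if is_reader:
                entry = {"ready": threading.Event(), "data": None, "fix": 0}
                self._pages[page_no] = entry
            entry["fix"] += 1            # pin the page, like buf_fix_count
        if is_reader:
            entry["data"] = self._loader(page_no)   # only this thread does the IO
            entry["ready"].set()
        else:
            entry["ready"].wait()        # wait until the read has completed
        return entry["data"]
```

However many threads ask for the same missing page at once, the loader runs a single time.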
Page Eviction and Flush
If the free list is empty, InnoDB evicts pages using buf_LRU_free_from_unzip_LRU_list (which frees the decompressed copy of a compressed page) and buf_flush_ready_for_replace (which checks whether a page on the LRU list can be replaced). When many dirty pages exist, a single‑page flush (buf_flush_single_page_from_LRU) or multiple page cleaner threads are used to write them back.
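The eviction order can be sketched with a small LRU model. Everything here is hypothetical (ToyBufferPool, its fields, and the flushed list that stands in for a disk write); the sketch only illustrates that pinned pages are skipped, clean pages are preferred, and a dirty page is flushed before it is freed when no clean page is available.

```python
from collections import OrderedDict


class ToyBufferPool:
    def __init__(self, capacity):
        self.free = capacity            # number of free frames
        self.lru = OrderedDict()        # page_no -> state, coldest first
        self.flushed = []               # pages written back before eviction

    def get(self, page_no):
        if page_no not in self.lru:
            if self.free == 0 and not self._evict_one():
                raise RuntimeError("no replaceable page")
            self.free -= 1
            self.lru[page_no] = {"fix": 0, "dirty": False}
        self.lru.move_to_end(page_no)   # most recently used goes to the hot end
        self.lru[page_no]["fix"] += 1   # pin while in use
        return self.lru[page_no]

    def unfix(self, page_no):
        self.lru[page_no]["fix"] -= 1

    def _evict_one(self):
        # First choice: the coldest page that is clean and not pinned.
        for page_no, page in self.lru.items():
            if page["fix"] == 0 and not page["dirty"]:
                del self.lru[page_no]
                self.free += 1
                return True
        # Otherwise flush the coldest unpinned dirty page, then free it.
        for page_no, page in self.lru.items():
            if page["fix"] == 0:
                self.flushed.append(page_no)
                del self.lru[page_no]
                self.free += 1
                return True
        return False
```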
Source: 云栖学院博客 (original article: https://yq.aliyun.com/articles/5586)