Inside MySQL InnoDB Buffer Pool: Architecture, Data Structures, and Optimization
This article provides an in‑depth technical walkthrough of MySQL InnoDB's Buffer Pool, covering its core data structures, instance layout, LRU and Flush list management, memory allocation strategies, read‑ahead/write‑ahead mechanisms, double‑write buffering, and the specialized threads that keep the pool efficient.
Introduction
The Buffer Pool is essential for MySQL performance: it caches data pages in memory so that most reads and writes avoid slow storage. This article analyzes the Buffer Pool implementation in the Alibaba Cloud RDS MySQL 5.6 branch (derived from AliSQL), whose source resides in the buf directory and spans roughly 20,000 lines of code.
Fundamental Knowledge
Buffer Pool Instance : The pool is divided into instances, each of size innodb_buffer_pool_size / innodb_buffer_pool_instances. Every instance owns its own locks, semaphores, buffer chunks, and logical lists, so threads working on different instances do not contend with each other. When the total size is under 1 GB, the instance count is forced to 1.
Data Page : The smallest unit of storage (default 16 KB). Compressed pages (4 KB–16 KB) are decompressed on read, and both compressed and decompressed pages are cached. The LRU list may evict either type depending on I/O pressure.
Buffer Chunks : Contiguous regions of memory that hold the data pages together with their control structures. They are allocated at startup and released only on shutdown.
Logical Lists : Various linked lists (Free List, LRU List, FLU (Flush) List, Quick List, etc.) that organize pages for allocation, eviction, and flushing.
Core Data Structures
The three central structures are buf_pool_t, buf_block_t, and buf_page_t.
buf_pool_t stores instance‑level information such as mutexes, the page hash, and the roots of all logical lists, including the two‑dimensional Zip Free array.
buf_block_t is the control body of a page. Its first field is a buf_page_t, which is required so that a pointer to one can be converted to a pointer to the other. It also holds a frame pointer to the actual page data and a block‑level mutex.
buf_page_t contains most page metadata (space_id, page_no, state, oldest and newest modification LSNs, compression info, etc.). The state field can take one of eight values, each representing a distinct lifecycle stage.
Buffer Pool Memory Initialization
Memory is allocated via buf_chunk_init and os_mem_alloc_large. Two allocation paths exist: HugeTLB (2 MB pages) and traditional mmap. After allocation, the memory is split into control structures (buf_block_t, 424 bytes each) and the actual data pages (e.g., 16 KB each). Because every 16 KB page needs one 424‑byte control structure, the control structures occupy roughly 1 GB for a 40 GB pool.
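The 1 GB figure can be checked with a short calculation. The following is a minimal sketch that uses only the 424-byte and 16 KB values quoted above; the function name is hypothetical, and the estimate assumes the whole pool size is used for page frames.

```python
# Back-of-the-envelope estimate of control-structure overhead.
# 424 bytes and 16 KB come from the text; the helper name is illustrative.
BLOCK_CTRL_BYTES = 424        # size of one buf_block_t, as quoted above
PAGE_BYTES = 16 * 1024        # default data page size
GIB = 1024 ** 3


def control_overhead_bytes(pool_bytes: int) -> int:
    """Bytes of control structures needed if pool_bytes is all page frames."""
    n_pages = pool_bytes // PAGE_BYTES
    return n_pages * BLOCK_CTRL_BYTES


overhead = control_overhead_bytes(40 * GIB)
print(f"pages: {40 * GIB // PAGE_BYTES}, overhead: {overhead / GIB:.3f} GiB")
```

The ratio is 424 / 16384, or about 2.6% of the pool, which should be budgeted on top of innodb_buffer_pool_size when sizing a host.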
After initialization, each buf_block_t is linked into the Free List, and auxiliary structures (watch pages, page hash, zip hash) are set up.
buf_page_get Function Analysis
buf_page_get is a macro that expands to buf_page_get_gen. It accepts space_id, page_no, lock_type, mode, and mtr. The mode argument determines the retrieval strategy (e.g., BUF_GET, BUF_GET_IF_IN_POOL, BUF_PEEK_IF_IN_POOL, etc.).
The function first computes the target instance as ((space_id << 20) + space_id + (page_no >> 6)) % instance_num, so that all 64 pages of an extent map to the same instance, and then looks up the page hash. Depending on the mode, it may set a watch, return NULL, or read the page from disk.
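The instance selection above can be sketched as follows. This is an illustrative Python rendering with explicit parentheses, not the InnoDB source; the function name is hypothetical, and the C code works on unsigned integers, which makes no difference for the small values used here.

```python
def buf_pool_instance_index(space_id: int, page_no: int, n_instances: int) -> int:
    """Pick a buffer pool instance for (space_id, page_no).

    The low 6 bits of page_no are dropped first, so the 64 pages of one
    extent always land in the same instance.
    """
    ignored_page_no = page_no >> 6
    fold = (space_id << 20) + space_id + ignored_page_no
    return fold % n_instances


# Pages 0..63 of tablespace 5 share one instance; page 64 starts the next extent.
same_extent = {buf_pool_instance_index(5, p, 8) for p in range(64)}
next_extent = buf_pool_instance_index(5, 64, 8)
print(same_extent, next_extent)
```

Keeping an extent inside one instance means read-ahead and neighbor flushing for that extent only ever need a single instance's locks.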
When reading from disk, a free page is taken from the Free List; if none is available, one is obtained by evicting or flushing a page from the LRU List. Compressed pages receive a temporary buf_page_t, later replaced by a permanent block via buf_relocate. After successful I/O, the page may be moved to the young list, added to the Quick List, or have its access time updated.
Young List and Old List Maintenance
The LRU List is split into a young list (hot pages) and an old list (cold pages) once its length exceeds 512 entries. The split ratio is controlled by innodb_old_blocks_pct. New pages are inserted at the head of the old list; if they are accessed again after innodb_old_blocks_time has elapsed since the first access, they are promoted to the young list. A page already in the young list is moved to the head only when it has drifted beyond roughly the first quarter of that list, which reduces churn for pages that are already hot.
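The insertion and promotion rules above can be modeled in a few lines. This is a simplified sketch, not InnoDB's implementation: the class and method names are hypothetical, the 512-entry threshold is ignored, and the young/old boundary is only rebalanced in one direction. The defaults of 37 and 1000 ms mirror the stock values of innodb_old_blocks_pct and innodb_old_blocks_time.

```python
from collections import deque


class MidpointLRU:
    """Toy model of an LRU list with a young and an old sublist."""

    def __init__(self, old_pct=37, old_blocks_time_ms=1000):
        self.young = deque()        # index 0 is the hottest page
        self.old = deque()          # index 0 is the midpoint (head of old list)
        self.first_access = {}      # page -> time of first access
        self.old_pct = old_pct
        self.old_blocks_time_ms = old_blocks_time_ms

    def _rebalance(self):
        # Keep the old sublist at roughly old_pct percent of all pages.
        total = len(self.young) + len(self.old)
        target_old = total * self.old_pct // 100
        while len(self.old) < target_old and self.young:
            self.old.appendleft(self.young.pop())

    def access(self, page, now_ms):
        if page not in self.first_access:
            # New page: insert at the midpoint, not at the hot end.
            self.first_access[page] = now_ms
            self.old.appendleft(page)
        elif page in self.old:
            # Promote only if re-accessed after the time window has passed.
            if now_ms - self.first_access[page] >= self.old_blocks_time_ms:
                self.old.remove(page)
                self.young.appendleft(page)
        elif self.young.index(page) > len(self.young) // 4:
            # Young page outside the first quarter: move it back to the head.
            self.young.remove(page)
            self.young.appendleft(page)
        self._rebalance()

    def evict(self):
        victim = self.old.pop() if self.old else self.young.pop()
        del self.first_access[victim]
        return victim


lru = MidpointLRU()
for t in (0, 10):                   # a scan touches pages 0..9 twice within 10 ms
    for page in range(10):
        lru.access(page, t)
lru.access(3, 2000)                 # page 3 is touched again much later
print(list(lru.young), lru.evict())
```

In this model the scanned pages never reach the young list because both accesses fall inside the time window, which is the scan resistance the two parameters are meant to provide.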
buf_LRU_get_free_block Analysis
This function locates a free page for allocation. It first checks the Quick List (for ENGINE_NO_CACHE queries), then the Free List. If the Free List is empty, it attempts to evict pages from the LRU List, preferring uncompressed pages unless the Unzip LRU is large or I/O pressure is high. Eviction respects page state (dirty pages are not evicted) and may involve moving compressed pages to the Zip Free array.
If eviction fails, a single dirty page is flushed via buf_flush_single_page_from_LRU. The algorithm may iterate up to three times, expanding the scan depth each round, and finally sleeps briefly before retrying.
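The search order described above can be summarized in a small model. This is a sketch under stated assumptions, not the real buf_LRU_get_free_block: pages are plain dictionaries with a hypothetical "dirty" flag, the Quick List and compressed pages are left out, and flush_page stands in for buf_flush_single_page_from_LRU.

```python
def get_free_block(free_list, lru, flush_page, scan_depth=1024, max_iterations=3):
    """Return a reusable page or None. lru[-1] is the tail (coldest page)."""
    for n_iter in range(max_iterations):
        # 1. The Free List is always tried first.
        if free_list:
            return free_list.pop()
        # 2. Scan the LRU tail for a clean page; later rounds scan the whole list.
        depth = scan_depth if n_iter == 0 else len(lru)
        start = max(0, len(lru) - depth)
        for i in range(len(lru) - 1, start - 1, -1):
            if not lru[i]["dirty"]:
                return lru.pop(i)
        # 3. Nothing evictable: flush one dirty page so the next round can take it.
        if lru:
            victim = lru[-1]
            flush_page(victim)
            victim["dirty"] = False
    return None   # a real caller would sleep briefly and retry


flushed = []
pages = [{"id": 1, "dirty": True}, {"id": 2, "dirty": True}]
block = get_free_block([], pages, lambda p: flushed.append(p["id"]))
print(block, flushed)
```

The example starts with only dirty pages, so the first round flushes the tail page and the second round evicts it.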
Scanning without Caching to Buffer Pool
Full‑table scans can pollute the Buffer Pool. Alibaba Cloud RDS introduces the ENGINE_NO_CACHE hint, which directs pages read by the statement into a temporary Quick List that is cleared when the statement finishes. Parameters innodb_rds_trx_own_block_max and innodb_rds_quick_lru_limit_per_instance limit per‑transaction page usage and Quick List length.
Removing Pages of a Specific Tablespace
The function buf_LRU_remove_pages supports three modes:
BUF_REMOVE_ALL_NO_WRITE: removes all pages of the tablespace from the LRU and Flush lists without writing them back (used for RENAME TABLE).
BUF_REMOVE_FLUSH_NO_WRITE: removes only Flush‑list pages without writing them (used for DROP TABLE).
BUF_REMOVE_FLUSH_WRITE: flushes dirty pages but does not remove any list entries (used during normal shutdown).
To avoid long pauses, the operation releases the instance mutex after every BUF_LRU_DROP_SEARCH_SIZE (default 1024) pages processed.
LRU Manager Thread
This background thread is started at server startup and periodically moves a configurable number of pages (innodb_lru_scan_depth per instance) from the LRU List to the Free List. Its sleep interval adapts to the current free‑list length, ensuring enough free pages under heavy load.
Hazard Pointer
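Hazard pointers allow a thread to traverse the Flush List safely without holding the list lock for the entire scan. Before releasing the lock to flush a page, the scanning thread stores the next node to visit in the hazard pointer; any thread that removes that node must first advance the pointer. The scan can therefore resume where it stopped, even when I/O is slow, instead of restarting from the tail after every flush. This keeps a flush pass at O(N) instead of O(N²).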
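The idea can be illustrated with a single-threaded simulation. This is a minimal sketch, not the InnoDB code: all class and function names are hypothetical, locking is reduced to comments, and a callback plays the role of a concurrent thread that removes a node while the scanner is busy with I/O.

```python
class Node:
    def __init__(self, page_id):
        self.page_id = page_id
        self.prev = None    # towards the head (newer pages)
        self.next = None    # towards the tail (older pages)


class FlushList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.hazard = None  # node the scanner intends to visit next

    def push_front(self, node):
        node.next = self.head
        if self.head:
            self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node

    def remove(self, node):
        # Rule: whoever removes the hazard node must advance the pointer first.
        if self.hazard is node:
            self.hazard = node.prev
        if node.prev:
            node.prev.next = node.next
        else:
            self.head = node.next
        if node.next:
            node.next.prev = node.prev
        else:
            self.tail = node.prev
        node.prev = node.next = None


def flush_batch(flist, flush_page):
    """Scan from the tail, resuming from the hazard pointer after each flush."""
    node = flist.tail
    while node is not None:
        flist.hazard = node.prev   # remember where to continue
        # (the list lock would be released here while the I/O runs)
        flush_page(node)
        flist.remove(node)
        # (the list lock would be re-acquired here)
        node = flist.hazard
    flist.hazard = None


flist = FlushList()
nodes = {i: Node(i) for i in range(1, 6)}
for i in range(1, 6):
    flist.push_front(nodes[i])     # tail is page 1, head is page 5

flushed = []

def flush_page(node):
    flushed.append(node.page_id)
    if node.page_id == 1:
        flist.remove(nodes[2])     # "another thread" removes the next node

flush_batch(flist, flush_page)
print(flushed)
```

Because the removal of page 2 advanced the hazard pointer to page 3, the scan continues from there and never touches the unlinked node.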
Page Cleaner Thread
Another background thread flushes dirty pages from the Flush List. Its adaptive sleep time depends on the gap between the current LSN and the oldest modification LSN, as well as on the configured innodb_io_capacity limits. The thread can be forced to a fixed 1‑second interval via rds_page_cleaner_adaptive_sleep.
Read‑Ahead and Write‑Ahead
InnoDB implements two read‑ahead strategies:
Random read‑ahead (buf_read_ahead_random) triggers when enough pages of an extent (13 with the default 64‑page extent) are already hot, that is, located in the first quarter of the young list; the remaining pages of the same extent are then fetched asynchronously.
Linear read‑ahead (buf_read_ahead_linear) activates on a boundary page of an extent whose pages show strictly increasing or decreasing access times, pulling the next extent into memory.
Write‑ahead (neighbor flushing) is performed by buf_flush_try_neighbors, which flushes dirty neighbors of a page in the same extent together with it. It can be disabled by setting innodb_flush_neighbors to 0, which is the usual choice on SSDs.
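The linear read-ahead rule described above can be sketched as a pure function. This is a strict, simplified model with hypothetical names: it requires every page of the extent to be ordered, whereas the server tolerates some out-of-order pages through the innodb_read_ahead_threshold setting.

```python
EXTENT_PAGES = 64


def access_direction(access_times):
    """+1 if strictly increasing, -1 if strictly decreasing, 0 otherwise."""
    pairs = list(zip(access_times, access_times[1:]))
    if all(a < b for a, b in pairs):
        return 1
    if all(a > b for a, b in pairs):
        return -1
    return 0


def next_extent_to_prefetch(page_no, access_times):
    """Return the first page number of the extent to prefetch, or None.

    access_times holds one access time per page of the extent containing
    page_no, in page order.
    """
    offset = page_no % EXTENT_PAGES
    first = page_no - offset
    direction = access_direction(access_times)
    if offset == EXTENT_PAGES - 1 and direction == 1:
        return first + EXTENT_PAGES          # ascending scan hit the upper border
    if offset == 0 and direction == -1 and first >= EXTENT_PAGES:
        return first - EXTENT_PAGES          # descending scan hit the lower border
    return None


ascending = list(range(1, EXTENT_PAGES + 1))
print(next_extent_to_prefetch(127, ascending))
print(next_extent_to_prefetch(100, ascending))
```

Only border pages trigger a prefetch, so a sequential scan issues one read-ahead request per extent rather than one per page.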
Double Write Buffer (dblwr)
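The double‑write buffer (default 2 MB, 128 pages) protects against half‑written (torn) pages on power loss. Dirty pages are first staged in the in‑memory buffer; when it is full, the batch is written synchronously to the doublewrite area of the tablespace, and only then are the pages written asynchronously to their final locations. After a crash, a torn page can be restored from its doublewrite copy. Parameters such as innodb_doublewrite_batch_size control the batch size.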
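The two-step write and the recovery path can be modeled as follows. This is an illustrative sketch only: the class is hypothetical, dictionaries stand in for the on-disk doublewrite area and data files, and a CRC32 stored next to each page stands in for the page checksum.

```python
import zlib


class DoublewriteBuffer:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.pending = []       # in-memory staging area
        self.dblwr_area = {}    # stands in for the doublewrite area on disk
        self.datafile = {}      # stands in for the pages' final locations

    def add(self, page_id, payload: bytes):
        self.pending.append((page_id, payload))
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self):
        # Step 1: write the whole batch to the doublewrite area (then sync).
        self.dblwr_area = {pid: (zlib.crc32(p), p) for pid, p in self.pending}
        # Step 2: only now write each page to its final location.
        for pid, p in self.pending:
            self.datafile[pid] = (zlib.crc32(p), p)
        self.pending.clear()

    def recover(self):
        """Restore pages whose stored checksum does not match their content."""
        repaired = []
        for pid, (crc, payload) in self.dblwr_area.items():
            stored = self.datafile.get(pid)
            if stored is None or zlib.crc32(stored[1]) != stored[0]:
                self.datafile[pid] = (crc, payload)
                repaired.append(pid)
        return repaired


dw = DoublewriteBuffer(capacity=2)
dw.add(1, b"a" * 16)
dw.add(2, b"b" * 16)                                   # buffer full, batch flushed
crc, _ = dw.datafile[2]
dw.datafile[2] = (crc, b"b" * 8 + b"\x00" * 8)         # simulate a torn write
print(dw.recover())
```

The ordering matters: because a complete copy exists in the doublewrite area before the in-place write begins, at least one intact version of each page survives a crash.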
Buddy System
Compressed pages are allocated via a buddy allocator (buf_buddy_alloc) that maintains free lists for 1 K, 2 K, 4 K, 8 K, and 16 K blocks. Allocation may split larger blocks, while buf_buddy_free merges adjacent free blocks, optionally relocating buddies to reduce fragmentation.
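Splitting and merging can be shown with a toy allocator over a single 16 K frame. This is a sketch with hypothetical names; it tracks blocks by byte offset and does not model relocation or the way the real allocator obtains frames from the buffer pool.

```python
MIN_SIZE = 1024
MAX_SIZE = 16 * 1024


class BuddyAllocator:
    def __init__(self):
        # One free list per block size: 1 K, 2 K, 4 K, 8 K, 16 K.
        self.free = {MIN_SIZE << i: set() for i in range(5)}
        self.free[MAX_SIZE].add(0)          # a single free 16 K frame at offset 0

    def alloc(self, size):
        """Allocate a block of a power-of-two size; return its offset or None."""
        s = size
        while s <= MAX_SIZE and not self.free[s]:
            s *= 2                          # look for a larger block to split
        if s > MAX_SIZE:
            return None
        offset = min(self.free[s])
        self.free[s].remove(offset)
        while s > size:
            s //= 2
            self.free[s].add(offset + s)    # upper half returns to the free list
        return offset

    def free_block(self, offset, size):
        """Free a block and merge it with its buddy as long as that is free."""
        while size < MAX_SIZE:
            buddy = offset ^ size           # buddy differs only in the size bit
            if buddy not in self.free[size]:
                break
            self.free[size].remove(buddy)
            offset = min(offset, buddy)
            size *= 2
        self.free[size].add(offset)


buddy = BuddyAllocator()
a = buddy.alloc(1024)
b = buddy.alloc(1024)
buddy.free_block(a, 1024)
buddy.free_block(b, 1024)
print(a, b, buddy.free[MAX_SIZE])
```

After both 1 K blocks are freed, the merges cascade through every size and the original 16 K frame is whole again.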
Buffer Pool Warmup
MySQL 5.6 supports dumping the current Buffer Pool state (buf_dump) to a file and reloading it on restart (buf_load), allowing a fast warm‑up by pre‑populating pages based on space_id and page_no.
Conclusion
InnoDB's Buffer Pool combines a classic LRU/Flush design with numerous optimizations, including per‑instance hashing, compressed‑page handling, hazard pointers, adaptive background threads, and buddy allocation, which makes its implementation intricate but highly performant.
ITPUB