How InnoDB Recovers After a Crash: Deep Dive into Redo, Binlog, and Undo Logs
After an unexpected crash, InnoDB restores data in stages: it first replays the redo log starting from the latest checkpoint, then uses the binlog and undo logs to decide the fate of uncommitted transactions. This article walks through each step, the related optimisations, and how checkpoints are handled.
1. Redo‑log based recovery
When InnoDB starts after an unexpected crash it reads the most recent checkpoint stored in the first 2048 bytes of ib_logfile0. Two alternating checkpoints are kept; the newer one is identified by its checkpoint number. From the checkpoint the engine obtains the Log Sequence Number (LSN) and the offset inside the redo‑log file where recovery must begin.
checkpoint no: identifier of the newer checkpoint (two checkpoints alternate).
checkpoint lsn: LSN of the flush that created the checkpoint; all pages with LSN ≤ this value are guaranteed to be on disk.
checkpoint offset: byte offset in the redo‑log file where the recovery scan starts.
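The choice between the two alternating checkpoints can be sketched as follows. This is a minimal illustration, not InnoDB's actual code: the `Checkpoint` class and `pick_latest_checkpoint` function are hypothetical names, and a slot that could not be read is modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical in-memory view of one checkpoint slot."""
    no: int      # checkpoint number; the larger value is the newer checkpoint
    lsn: int     # everything up to this LSN is already on disk
    offset: int  # byte offset in the redo-log file where scanning starts


def pick_latest_checkpoint(slots: Sequence[Optional[Checkpoint]]) -> Optional[Checkpoint]:
    """Return the slot with the highest checkpoint number, or None if no slot is usable."""
    usable = [cp for cp in slots if cp is not None]
    if not usable:
        return None
    return max(usable, key=lambda cp: cp.no)


# Illustrative values only.
latest = pick_latest_checkpoint([Checkpoint(7, 1000, 2048), Checkpoint(8, 1500, 4096)])
print(latest.lsn, latest.offset)
```

Recovery would then start scanning at `latest.offset` and treat `latest.lsn` as the point before which nothing needs to be replayed.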
The redo log is scanned in up to three passes (MySQL 5.7 and later):
First pass locates the MLOG_CHECKPOINT record. If it is missing, no recovery is needed.
Second pass parses redo records and inserts them into the hash table recv_sys->addr_hash, keyed by (space, offset). If the whole log is scanned before the hash table fills, the third pass is skipped and the collected records are applied directly.
Third pass runs only when the hash table filled during the second pass while redo records remained. It keeps scanning in a loop, applying the stored records each time the hash table fills, until all redo records have been applied.
During parsing, the redo log is read in chunks of 4 × page_size (with the default 16 KB page size, 64 KB per read), and each chunk is processed one 512‑byte block at a time. The relevant part of each block (its body) is stored in recv_sys->buf, and the parsed records are then inserted into the hash table under their (space, offset) key. Collisions are resolved with linked lists, so a single key can hold multiple record bodies.
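The read-size arithmetic above can be checked with a short sketch. The 12-byte header and 4-byte trailer per block are assumptions taken from the InnoDB 5.7 block layout, not from this article, and `block_bodies` is a hypothetical helper name.

```python
PAGE_SIZE = 16 * 1024        # default InnoDB page size
READ_CHUNK = 4 * PAGE_SIZE   # bytes fetched per read (64 KB)
BLOCK_SIZE = 512             # size of one redo block
BLOCK_HEADER = 12            # assumption: header bytes at the start of each block
BLOCK_TRAILER = 4            # assumption: trailer bytes at the end of each block


def block_bodies(chunk: bytes):
    """Yield the body of every complete 512-byte block in one read chunk."""
    usable = len(chunk) - len(chunk) % BLOCK_SIZE  # ignore a trailing partial block
    for start in range(0, usable, BLOCK_SIZE):
        block = chunk[start:start + BLOCK_SIZE]
        yield block[BLOCK_HEADER:BLOCK_SIZE - BLOCK_TRAILER]


chunk = bytes(READ_CHUNK)    # stand-in for one 64 KB read from the log file
bodies = list(block_bodies(chunk))
print(READ_CHUNK // BLOCK_SIZE, len(bodies[0]))
```

Under these assumptions one read covers 128 blocks, and each block contributes 496 bytes of body to the parse buffer.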
After the hash table is built, InnoDB iterates over it, reads the corresponding data pages from the tablespace files, and applies the redo operations, thereby persisting modifications that were only in the log.
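The build-then-apply flow can be sketched as below. It is a simplified model under stated assumptions: a Python dict of lists stands in for recv_sys->addr_hash and its collision chains, pages are plain dicts, and the comparison against the page's own LSN is an assumption added here so that changes already on disk are not applied twice. The names `build_addr_hash` and `apply_all` are hypothetical.

```python
from collections import defaultdict


def build_addr_hash(records):
    """Group redo records by (space, offset), preserving log order within each key.

    records: iterable of (space, offset, lsn, body) tuples in log order.
    """
    addr_hash = defaultdict(list)
    for space, offset, lsn, body in records:
        addr_hash[(space, offset)].append((lsn, body))
    return addr_hash


def apply_all(addr_hash, pages):
    """Apply every stored record to its page, skipping changes the page already has.

    pages: dict mapping (space, offset) to {"lsn": int, "data": list}.
    """
    for key, recs in addr_hash.items():
        page = pages[key]                 # stand-in for reading the page from disk
        for lsn, body in recs:
            if lsn > page["lsn"]:         # assumption: older changes are already on the page
                page["data"].append(body)
                page["lsn"] = lsn


records = [(1, 3, 100, "a"), (1, 3, 120, "b"), (2, 7, 110, "c")]
pages = {(1, 3): {"lsn": 100, "data": []}, (2, 7): {"lsn": 0, "data": []}}
apply_all(build_addr_hash(records), pages)
print(pages)
```

In this run the record with LSN 100 is skipped for page (1, 3) because the page already carries that LSN, while the later record is applied.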
Optimisation 1: When a page is fetched into the buffer pool, InnoDB also pre‑fetches the 32 neighbouring pages, on the assumption that nearby pages are likely to be needed soon.
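A rough sketch of the pre-fetch window follows. The article only says that 32 neighbouring pages are read; treating the neighbourhood as the aligned 32-page area containing the requested page is an assumption made for this example, and `read_ahead_window` is a hypothetical name.

```python
READ_AHEAD_AREA = 32  # number of pages pre-fetched together


def read_ahead_window(page_no: int) -> range:
    """Return the page numbers of the aligned 32-page area that contains page_no."""
    low = page_no - page_no % READ_AHEAD_AREA   # round down to the area boundary
    return range(low, low + READ_AHEAD_AREA)


window = read_ahead_window(70)
print(window.start, window.stop - 1)
```

For page 70 the window covers pages 64 through 95, so a later request for page 71 or 90 would already be in the buffer pool.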
Optimisation 2: Prior to MySQL 5.7, recovery relied on the data dictionary to map space IDs to .ibd files, which required all tablespaces to be opened. Starting with 5.7, the redo log contains two new record types: MLOG_FILE_NAME (stores the space ID and file path) and MLOG_CHECKPOINT (marks the end of the file‑name list). This allows recovery to open only the tablespaces referenced in the redo log, eliminating the dictionary dependency. Multiple MLOG_CHECKPOINT records after a checkpoint indicate redo‑log corruption.
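How these two record types could drive tablespace discovery is sketched below. The tuple layout of the records, the example path, and the `collect_tablespaces` function are all hypothetical; only the record type names come from the text.

```python
def collect_tablespaces(records):
    """Build a space-ID to file-path map from MLOG_FILE_NAME records.

    records: iterable of tuples whose first element is the record type.
    Raises ValueError when more than one MLOG_CHECKPOINT record is seen,
    which the article describes as a sign of redo-log corruption.
    """
    names = {}
    checkpoints = 0
    for rec in records:
        kind = rec[0]
        if kind == "MLOG_FILE_NAME":
            _, space, path = rec
            names[space] = path
        elif kind == "MLOG_CHECKPOINT":
            checkpoints += 1
            if checkpoints > 1:
                raise ValueError("multiple MLOG_CHECKPOINT records: redo log looks corrupted")
        # any other record type is ignored by this sketch
    return names


log = [
    ("MLOG_FILE_NAME", 5, "./db/t1.ibd"),   # illustrative path
    ("MLOG_CHECKPOINT", 1500),
    ("MLOG_REC_INSERT", 5, 3),
]
print(collect_tablespaces(log))
```

Only the files named in the resulting map would need to be opened before redo records are applied.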
2. Binlog and undo‑log participation
The second recovery stage handles transactions that were written to the binary log but whose commit was not made durable in the redo log (e.g., a crash occurred after the binlog write but before the redo flush).
Read the latest binary log file and collect all transaction IDs (XIDs) that appear, building an xid_list.
Scan the undo logs to reconstruct the list of uncommitted transactions, producing an undo_list. InnoDB maintains 128 rollback segments; each segment points to undo‑log pages. By traversing the undo slots the engine builds trx_sys->trx_list, which contains all transactions that have not been committed.
Decision rule: for each transaction in the undo_list, if its XID is present in the xid_list extracted from the binlog, the transaction must be committed; otherwise it is rolled back. Because replicas replay exactly what is in the binlog, this guarantees consistency between master and replica after recovery.
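The decision rule reduces to a set-membership test, sketched here under the simplifying assumption that both lists hold plain XID values. The `resolve_transactions` name is hypothetical.

```python
def resolve_transactions(undo_list, xid_list):
    """Split the transactions found in the undo logs into commit and rollback sets.

    undo_list: XIDs of transactions rebuilt from the undo logs.
    xid_list:  XIDs collected from the latest binary log file.
    """
    in_binlog = set(xid_list)
    to_commit = [xid for xid in undo_list if xid in in_binlog]
    to_rollback = [xid for xid in undo_list if xid not in in_binlog]
    return to_commit, to_rollback


commit, rollback = resolve_transactions(undo_list=[11, 12, 13], xid_list=[10, 11, 13])
print(commit, rollback)
```

Here transactions 11 and 13 are committed because the binlog recorded them, and 12 is rolled back. XID 10 appears only in the binlog, so it was already fully committed and needs no action.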
3. Potential further optimisations
After the hash table is populated, recovery of independent hash nodes could be parallelised because each node corresponds to a distinct (space, offset) key. Additionally, the pre‑fetch of 32 contiguous pages could be replaced by a red‑black tree ordered by (space, offset), allowing the engine to read only the pages that are actually required.
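The ordered-read idea can be illustrated as follows. Python has no built-in red-black tree, so `sorted()` stands in for the ordered structure here; the point is only that visiting keys in (space, offset) order lets adjacent required pages be merged into one sequential read. The `required_page_runs` name is hypothetical.

```python
def required_page_runs(keys):
    """Merge the required (space, offset) keys into runs of consecutive pages.

    Returns a list of (space, first_offset, last_offset) tuples in key order.
    """
    runs = []
    for space, offset in sorted(set(keys)):    # stand-in for in-order tree traversal
        last = runs[-1] if runs else None
        if last and last[0] == space and last[2] + 1 == offset:
            last[2] = offset                   # extend the current run
        else:
            runs.append([space, offset, offset])
    return [tuple(run) for run in runs]


print(required_page_runs([(1, 5), (1, 3), (1, 4), (2, 9), (1, 10)]))
```

Each run could be issued as one read covering exactly the pages that have redo records, instead of a fixed 32-page window.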
dbaplus Community
