Why Does MySQL Hang on Startup? Deep Dive into InnoDB Truncate Bug and Fix
A MySQL DBA recounts how a 4 TB InnoDB instance stalled during startup. The analysis uses top, stack traces, and the source code to pinpoint a bug involving truncate operations and buf_flush_event, then presents three practical workarounds to restore normal operation: adjusting innodb_flush_sync, using GDB to break the wait loop, or removing the truncate logs.
Background
A colleague killed the MySQL process of a replica holding close to 4 TB of data and then restarted it. The instance failed to start: it hung in what looked like a deadlock, writing no new log entries after the last Note-level message.
Initial Analysis
top showed the MySQL process (PID 84448) at 0% CPU, indicating that it was idle rather than spinning in a busy loop.
Stack traces revealed that all I/O threads were idle and the main startup thread was stuck in nanosleep. Both major thread groups were idle, matching the top observation.
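The idle-versus-busy check above can also be done per thread. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/task/<tid>/stat is readable; the function names are hypothetical and not part of any MySQL tooling.

```python
from collections import Counter
from pathlib import Path


def parse_thread_state(stat_line: str) -> str:
    """Return the one-letter scheduler state from a /proc stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    return stat_line.rpartition(")")[2].split()[0]


def summarize_thread_states(pid: int) -> Counter:
    """Count the threads of a process by scheduler state."""
    states = Counter()
    for stat_file in Path(f"/proc/{pid}/task").glob("*/stat"):
        try:
            states[parse_thread_state(stat_file.read_text())] += 1
        except (OSError, IndexError):
            continue  # the thread exited while we were reading it
    return states
```

Calling summarize_thread_states(84448) on the stalled server would be expected to show threads only in S (sleeping) or D (uninterruptible sleep) and none in R (running), which matches a process that is waiting rather than spinning.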
Further Investigation
The stack showed the flush thread buf_flush_page_cleaner_coordinator waiting, while the startup thread was inside buf_flush_wait_flushed. The log contained a Note about InnoDB completing a truncate for a table, suggesting a link between the truncate operation and the hang.
The source shows that truncate_t::fixup_tables_in_non_system_tablespace is invoked during startup and eventually calls log_make_checkpoint_at, generating many dirty pages that need to be flushed.
Source Code Analysis
In the Percona 5.7.26 build, the flush thread waits on os_event_wait(buf_flush_event) at line 3212 of buf0flu.cc. This event is set later in srv0start.cc:2892, but the startup thread reaches the truncate fix‑up before the event is set, causing a mismatch: the flush thread believes REDO replay is complete and stays idle, while the startup thread expects more pages to be flushed.
The flush thread leaves its recovery-phase loop once recv_sys->heap == NULL, which becomes true only after recv_recovery_from_checkpoint_finish runs, and then blocks on buf_flush_event. Because the truncate fix-up runs after that point but before the event is set, the flush thread never receives the signal while the startup thread keeps waiting for pages to be flushed, leading to an indefinite wait.
Two places set buf_flush_event: one in srv0start.cc and another in buf_flush_request_force. The latter is guarded by the parameter srv_flush_sync (mapped to innodb_flush_sync). In the observed case, srv_flush_sync was 0, preventing the event from being set.
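The interaction described above can be reproduced with a toy model. This is a sketch, not InnoDB code: threading.Event objects stand in for buf_flush_event and for "the dirty pages have been flushed", the function name simulate_startup and its timeout are hypothetical, and the real startup thread polls without any timeout.

```python
import threading


def simulate_startup(flush_sync: bool, timeout: float = 0.5) -> bool:
    """Return True if the modelled startup completes, False if it stalls."""
    buf_flush_event = threading.Event()  # stand-in for InnoDB's buf_flush_event
    pages_flushed = threading.Event()    # stand-in for "dirty pages flushed"
    stop = threading.Event()             # only used to shut the model down

    def page_cleaner_coordinator():
        # Redo replay is over, so the coordinator sleeps until signalled.
        buf_flush_event.wait()
        if not stop.is_set():
            pages_flushed.set()

    cleaner = threading.Thread(target=page_cleaner_coordinator)
    cleaner.start()

    # Truncate fix-up on the startup thread: the event is set only when the
    # flush_sync flag is on, and then the thread waits for the flush.
    if flush_sync:
        buf_flush_event.set()
    completed = pages_flushed.wait(timeout)

    # Tear the model down so the helper thread always exits.
    stop.set()
    buf_flush_event.set()
    cleaner.join()
    return completed
```

With flush_sync=True the coordinator is woken and the startup wait returns; with flush_sync=False nobody sets the event, so the wait runs into the timeout, which models the hang.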
Solutions
Enable innodb_flush_sync: Set innodb_flush_sync=ON to ensure the event is set and the flush thread proceeds.
Break the wait loop with GDB: Attach GDB to the stalled process and modify the new_oldest argument from LSN_MAX to a value that satisfies the loop exit condition, allowing the startup to complete.
Remove truncate logs: Delete the *_trunc.log files under the data directory before restart, preventing the truncate fix-up from being triggered. This is risky and should be used with caution.
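Because the third option is risky, a more cautious variant is to move the truncate logs aside instead of deleting them. The helper below is a minimal sketch under stated assumptions: the server is fully stopped, the logs sit directly in the data directory, and both the function name and the backup location are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine_truncate_logs(datadir, backup_dir):
    """Move *_trunc.log files out of datadir and return their new paths."""
    datadir = Path(datadir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    # Only the top level of datadir is searched (an assumption about layout).
    for log_file in sorted(datadir.glob("*_trunc.log")):
        target = backup_dir / log_file.name
        shutil.move(str(log_file), str(target))
        moved.append(target)
    return moved
```

Keeping the files in a backup directory means they can be put back if the restart exposes a problem with the truncated table.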
Further Questions
After applying one of the fixes, the instance starts normally. Subsequent shutdowns and restarts do not reproduce the issue because the truncate logs have been cleared, so truncate_t::fixup_tables_in_non_system_tablespace is no longer invoked.
Conclusion
The root cause is a bug in Percona’s InnoDB implementation where the flush thread and startup thread have inconsistent expectations about dirty pages after a truncate operation. The bug can be mitigated by enabling innodb_flush_sync, manually breaking the wait loop, or removing the problematic truncate logs. Reporting the bug to the upstream MySQL team is advisable, but DBAs should also understand the source code to diagnose similar issues.
ITPUB
