Why a 4‑TB MySQL Instance Hangs on Startup and How to Fix It
A forensic analysis shows that a MySQL 5.7 instance stalls during startup because of a mismatch between the InnoDB flush-sync setting and the pages produced by truncate-related logs. The article explains how to diagnose the hang with top, stack traces, and source-code inspection, then offers three practical workarounds.
The article describes a real-world incident in which a 4 TB MySQL replica failed to start after being killed with kill -9 and restarted. The process stayed up but wrote no new log entries. Initial diagnostics used top to confirm that the process was consuming no CPU (0 %), and stack traces showed that all I/O threads were idle while the main startup thread waited in nanosleep.
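The idle check that top performs can be sketched in a few lines. The following is a minimal illustration for Linux, not the author's tooling: it samples the utime and stime counters from /proc/<pid>/stat twice and converts the difference to a percentage. The function names are hypothetical.

```python
from __future__ import annotations

import os
import time


def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) may contain spaces or parentheses,
    # so split after the LAST ')' before tokenizing the rest.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the state (field 3); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_percent(before: int, after: int, interval_s: float, ticks_per_s: int) -> float:
    """Convert a tick delta over an interval into a CPU percentage."""
    return 100.0 * (after - before) / (ticks_per_s * interval_s)


def sample_cpu_percent(pid: int, interval_s: float = 1.0) -> float:
    """Sample a live process twice and report its CPU usage (Linux only)."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, os.sysconf("SC_CLK_TCK"))
```

A result that stays at 0.0 across several samples of the mysqld PID matches the symptom described above: the process is alive but doing no work.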
Root‑cause analysis
Log inspection revealed a Note about InnoDB completing a truncate operation during startup. Stack traces showed the flush thread waiting on buf_flush_event while the startup thread was blocked in buf_flush_wait_flushed. The mismatch stemmed from InnoDB treating pages generated by truncate_t::fixup_tables_in_non_system_tablespace as redo‑replay pages, causing the flush thread to consider its work done, whereas the startup thread still expected those pages to be flushed.
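With dozens of threads in a dump, it helps to filter for the few frames that matter. The sketch below is a hypothetical helper, not part of the original analysis: it scans text in the usual gdb "thread apply all bt" layout (a "Thread N ..." header followed by "#k ..." frame lines, which is an assumption about the input) and reports which threads contain a given function name.

```python
from __future__ import annotations

import re

# Matches gdb-style thread headers such as "Thread 12 (Thread 0x7f... (LWP 345)):"
THREAD_HEADER = re.compile(r"^Thread\s+(\d+)\b")


def threads_in_functions(dump: str, wanted: list[str]) -> dict[str, list[int]]:
    """Map each wanted function name to the thread numbers whose stacks mention it."""
    hits: dict[str, list[int]] = {name: [] for name in wanted}
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        header = THREAD_HEADER.match(line)
        if header:
            current = int(header.group(1))
        elif line.startswith("#") and current is not None:
            # Frame line: record the current thread for every name it mentions.
            for name in wanted:
                if name in line and current not in hits[name]:
                    hits[name].append(current)
    return hits
```

Calling it with names taken from the article, such as buf_flush_wait_flushed and nanosleep, would single out the blocked startup thread from the idle I/O threads.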
Further code review identified that the flush thread exits its loop once recv_sys->heap == NULL; that pointer is cleared by recv_sys_debug_free after recv_recovery_from_checkpoint_finish. Because srv_flush_sync (the internal variable behind the innodb_flush_sync parameter) was set to 0, nothing set buf_flush_event afterward, so the flush thread was never woken and the startup thread waited indefinitely.
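The missed wake-up can be reproduced in miniature. The following is a toy model only, not InnoDB code: two Python threads stand in for the startup thread and the flush thread, the event name is borrowed from the article, and a short timeout replaces the real unbounded wait so the script terminates.

```python
from __future__ import annotations

import threading


def simulate_startup(flush_sync: bool, timeout_s: float = 0.2) -> bool:
    """Return True if the simulated startup thread sees its pages flushed."""
    buf_flush_event = threading.Event()  # wakes the flush thread
    pages_flushed = threading.Event()    # what the startup thread waits for

    def flush_thread() -> None:
        # Recovery is over, so the flush thread sleeps until someone wakes it.
        if buf_flush_event.wait(timeout_s):
            pages_flushed.set()

    worker = threading.Thread(target=flush_thread)
    worker.start()

    # Startup thread: the wake-up is only sent when flush-sync is enabled
    # (this mirrors the behaviour the article describes).
    if flush_sync:
        buf_flush_event.set()

    # In the real server this wait has no timeout, hence the hang.
    started = pages_flushed.wait(timeout_s)
    worker.join()
    return started
```

With flush_sync=True the call returns True almost immediately; with flush_sync=False both waits time out and it returns False, which corresponds to the hang in the real server.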
Solution approaches
Enable flush-sync: Set innodb_flush_sync=ON (srv_flush_sync is the corresponding internal variable in the source) so that buf_flush_event is set, the flush thread wakes, and startup can proceed.
Force the wait loop to exit via GDB: Attach a debugger and change the new_oldest argument from LSN_MAX to a lower value, so that the exit condition in buf_flush_wait_flushed is satisfied and the loop breaks.
Remove truncate logs: Delete the *_trunc.log files under the data directory before restarting, which prevents the problematic truncate fix-up from running (use with caution, as it physically alters files in the data directory).
After applying any of these fixes, the instance starts normally, and subsequent restarts no longer encounter the hang because the truncate‑related logs are cleared.
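Before choosing the third option, it is worth listing the truncate logs that are actually present. The sketch below is a read-only illustration, not the author's procedure: it deletes nothing, the function name is hypothetical, and the recursive search is an assumption, since the article only says the files live under the data directory.

```python
from __future__ import annotations

from pathlib import Path


def find_truncate_logs(datadir: str) -> list[Path]:
    """Return all *_trunc.log files found under datadir, sorted by path."""
    root = Path(datadir)
    # rglob searches recursively; only regular files are reported.
    return sorted(p for p in root.rglob("*_trunc.log") if p.is_file())
```

An empty result on a hung instance would suggest that the truncate fix-up is not the cause and that the other diagnostics deserve a second look.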
Additional observations
The bug appears specific to Percona-modified MySQL 5.7.26 and 5.7.33; upstream MySQL does not exhibit the same behavior because its code path avoids the dead-wait when new_oldest is LSN_MAX. The author suggests filing a bug report upstream and emphasizes the value of DBA familiarity with the source code when troubleshooting such deep-seated issues.
dbaplus Community
