InnoDB Startup Hang Analysis: Truncate, Flush Sync, and Debugging Strategies
An in-depth investigation of a MySQL InnoDB startup deadlock caused by truncate operations and the innodb_flush_sync parameter, detailing log observations, stack traces, source code examination, and three practical solutions including GDB intervention, parameter adjustment, and data directory cleanup.
Author Wang Zhufeng, database director at Qunar.com, specializes in MySQL source code research and has contributed to the Inception open‑source project and the book "MySQL Operations Reference".
Background : A colleague killed a stalled MySQL instance (nearly 4 TB) with kill -9 and attempted to restart it, but the server remained in a dead‑locked state with no further log output.
Initial Analysis : Using the top command, the process (PID 84448) showed 0 % CPU, indicating it was idle rather than in a busy loop.
Stack Inspection : The stack trace revealed that all I/O threads were idle and the startup thread ( mysqld_main ) was stuck in nanosleep , confirming the instance was doing nothing.
Further Analysis : The log contained a note about InnoDB completing a truncate operation during startup. The function truncate_t::fixup_tables_in_non_system_tablespace was identified as the point where the deadlock originated, suggesting a mismatch between the redo‑playback phase and the normal flush phase.
Source‑Code Investigation : By compiling a debug build of the same MySQL version and reproducing the issue, the author pinpointed the exact line os_event_wait(buf_flush_event) in buf0flu.cc:3212 as the waiting point. The startup thread called innobase_start_or_create_for_mysql , which in turn invoked the truncate fix‑up function, leading to dirty pages that required flushing.
Root Cause : The flush thread considered the dirty pages part of redo replay and entered idle state, while the startup thread expected them to be flushed, causing both threads to wait indefinitely. The parameter srv_flush_sync (mapped to innodb_flush_sync ) was set to 0, preventing the flush event from being signaled.
Solution 1 : Use GDB to modify the waiting condition (e.g., adjust the new_oldest LSN) so the loop exits, instantly unblocking the startup.
Solution 2 : Before restarting, set innodb_flush_sync=ON so the flush thread will be signaled correctly and the startup completes normally.
Solution 3 : Delete the *_trunc.log files under the data directory to prevent the truncate fix‑up from being invoked on the next start (use with caution).
Further Questions : After a successful restart, subsequent shutdowns and restarts no longer encounter the issue because the truncate logs have been cleared.
Clarification : The bug appears to be introduced by Percona's modifications (e.g., version 5.7.26) and is not present in the upstream MySQL source, where the code path avoids the deadlock.
Conclusion : This is a clear InnoDB bug that can be mitigated by adjusting innodb_flush_sync or using debugging techniques; reporting the bug to the upstream project is recommended. The case also highlights the importance for DBAs to understand database source code.
Recruitment Notice : Qunar.com is hiring for positions such as "Security R&D Engineer", "PostgreSQL DBA", and "Cloud Platform Development Engineer" with competitive compensation, benefits, and a strong technical team.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.