Why MySQL Relay Logs Lost Thousands of Rows: A Real‑World Investigation
A senior MySQL engineer recounts a puzzling case where a custom script flushing relay logs caused the first transaction in each relay log to be skipped, leading to daily loss of nearly ten thousand rows in a rapidly growing production environment.
In a fast‑growing service where user count surged from a few thousand to hundreds of thousands, the MySQL master‑slave replication began losing data. One day the master (10.0.0.1) spiked to over 80% CPU, causing Keepalived to move the VIP to the slave (10.0.0.2). Only a few seconds of data were expected to be lost, but later the business reported missing user records that prevented logins.
Initial Investigation
The engineer selected a recent missing row (ID 849791, timestamp 2017‑06‑09 13:36:01) and checked the master’s binlog, confirming the INSERT existed. However, the slave’s binlog did not contain the record, and the relay logs had been purged due to relay_log_purge and log‑slave‑updates settings.
Data Pattern Analysis
By extracting all lost rows, it was observed that each missing entry was inserted within the first two seconds of a minute, and the loss occurred intermittently every few days.
Deep Dive into Relay Logs
The team disabled log‑slave‑updates and relay_log_purge to obtain clean logs. Examination showed that relay logs were generated every minute, each with varying file sizes, indicating a script that forcibly split relay logs.
Searching the relay logs for the missing timestamps revealed that the first transaction of each relay log was absent, while later transactions were present. Extending the search window to 50 seconds still showed no record, confirming that the first transaction was being dropped.
Root Cause Identification
Further random checks of multiple relay logs showed a consistent pattern: the first transaction—whether INSERT, UPDATE, or other—was never executed on the slave. The suspicion fell on the custom script used to split relay logs, which employed unconventional commands and sql_slave_skip_counter to force log rotation.
Final Resolution
By enabling the general query log on the slave and monitoring the first few seconds of each minute, the engineer captured the exact commands the script used to skip transactions. The script’s use of STOP SLAVE; START SLAVE; combined with sql_slave_skip_counter caused the first transaction in every newly created relay log to be ignored, leading to the massive data loss.
Removing the faulty script and restoring proper relay‑log handling eliminated the loss, highlighting the danger of “creative” operational shortcuts in replication environments.
Takeaway
Even well‑intentioned automation can break fundamental replication guarantees; always verify that scripts manipulating relay logs do not skip or discard transactions.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
