How to Systematically Diagnose and Fix MySQL Master‑Slave Replication Lag
This article walks through a real‑world case of MySQL 5.7 replication delay after switching to TokuDB, presenting a step‑by‑step methodology that examines network, hardware, large transactions, locks, configuration parameters, and multi‑threaded replication to identify and resolve the root cause.
As a DBA, you often encounter MySQL master‑slave replication lag caused by various factors such as network issues, bandwidth limits, large transactions, or single‑threaded replication. The author shares a recent, typical case involving a monitoring database that imports massive data via LOAD DATA each minute. After converting the InnoDB tables to TokuDB on MySQL 5.7, the replica began lagging by about 50 binlog files (≈7.5 hours) while the master generated roughly 160 binlogs daily.
Background
The database stores monitoring data; after the engine switch, daily lag grew steadily. Binlog statistics are shown in the accompanying chart.
Thinking Process
The author emphasizes a systematic troubleshooting mindset: generate hypotheses, verify them with evidence, and eliminate possibilities. The following potential causes were examined:
Network : Bandwidth saturation or high latency could delay binlog transmission. In this case, the IO thread already pulled binlogs in near real‑time, and network monitoring confirmed no issue.
Machine Performance :
Is the replica using inferior hardware? The author notes that using SSD on the master while the replica runs on SATA can cause lag.
Is the replica overloaded? High CPU or I/O load from analytics can be detected with top.
Disk problems? Issues with RAID cards or scheduler settings can cause high I/O latency. iostat showed low latency and IOPS, and disk scheduler settings matched the master, so disk was ruled out.
Large Transactions : Big DELETE statements, ALTER operations, or statements with LIMIT can stall replication. Processlist and mysqlbinlog checks found no such transactions.
Locks : Row‑level locks or use of MyISAM could slow the SQL thread. Examination of information_schema revealed no lock contention.
Parameters : For InnoDB, tuning innodb_flush_log_at_trx_commit and sync_binlog helps. With TokuDB, the relevant variables are tokudb_commit_sync, tokudb_fsync_log_period, and sync_binlog. Adjustments can improve replication speed but may affect data safety.
Multi‑Threaded Replication : MySQL 5.6+ supports parallel replication. The author first verified the number of SQL threads with SHOW PROCESSLIST and SHOW VARIABLES LIKE '%slave_parallel%'. The replica was already configured for multi‑threading, yet only one thread was active.
Measuring Thread Utilization
To quantify thread usage, the author enabled performance‑schema consumers and created a view to aggregate transaction counts per replication worker:
UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_transactions%';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction';
CREATE VIEW rep_thread_count AS
SELECT a.THREAD_ID, a.COUNT_STAR
FROM performance_schema.events_transactions_summary_by_thread_by_event_name a
WHERE a.THREAD_ID IN (SELECT b.THREAD_ID FROM performance_schema.replication_applier_status_by_worker b);
SELECT SUM(COUNT_STAR) FROM rep_thread_count INTO @total;
SELECT 100*(COUNT_STAR/@total) AS thread_usage FROM rep_thread_count;The results showed a single thread handling the majority of work, confirming a bottleneck.
Group Commit Optimization
Understanding MySQL 5.7's logical‑clock parallel replication, the author adjusted group‑commit parameters to increase the chance that multiple transactions commit simultaneously:
SET GLOBAL binlog_group_commit_sync_delay = 1000000; -- 1 second (microseconds)
SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;These settings force a commit after 20 transactions even if the delay threshold hasn't been reached, reducing latency for high‑throughput load‑data scenarios. After applying the changes and increasing slave_parallel_workers to 16, thread utilization rose dramatically, and replication lag disappeared, as confirmed by SHOW SLAVE STATUS.
Conclusion
When facing MySQL replication lag, consider the following checklist:
Network conditions
Hardware performance
Configuration parameters (especially binlog and engine‑specific settings)
Large transactions
Lock contention
Parallel replication setup and thread utilization
Group commit tuning
Applying a systematic, hypothesis‑driven approach can quickly pinpoint the root cause and restore real‑time replication.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
