Databases 12 min read

How to Systematically Diagnose and Fix MySQL Master‑Slave Replication Lag

This article walks through a real‑world case of MySQL 5.7 replication delay after switching to TokuDB, presenting a step‑by‑step methodology that examines network, hardware, large transactions, locks, configuration parameters, and multi‑threaded replication to identify and resolve the root cause.

dbaplus Community
dbaplus Community
dbaplus Community
How to Systematically Diagnose and Fix MySQL Master‑Slave Replication Lag

As a DBA, you often encounter MySQL master‑slave replication lag caused by various factors such as network issues, bandwidth limits, large transactions, or single‑threaded replication. The author shares a recent, typical case involving a monitoring database that imports massive data via LOAD DATA each minute. After converting the InnoDB tables to TokuDB on MySQL 5.7, the replica began lagging by about 50 binlog files (≈7.5 hours) while the master generated roughly 160 binlogs daily.

Background

The database stores monitoring data; after the engine switch, daily lag grew steadily. Binlog statistics are shown in the accompanying chart.

Thinking Process

The author emphasizes a systematic troubleshooting mindset: generate hypotheses, verify them with evidence, and eliminate possibilities. The following potential causes were examined:

Network : Bandwidth saturation or high latency could delay binlog transmission. In this case, the IO thread already pulled binlogs in near real‑time, and network monitoring confirmed no issue.

Machine Performance :

Is the replica using inferior hardware? The author notes that using SSD on the master while the replica runs on SATA can cause lag.

Is the replica overloaded? High CPU or I/O load from analytics can be detected with top.

Disk problems? Issues with RAID cards or scheduler settings can cause high I/O latency. iostat showed low latency and IOPS, and disk scheduler settings matched the master, so disk was ruled out.

Large Transactions : Big DELETE statements, ALTER operations, or statements with LIMIT can stall replication. Processlist and mysqlbinlog checks found no such transactions.

Locks : Row‑level locks or use of MyISAM could slow the SQL thread. Examination of information_schema revealed no lock contention.

Parameters : For InnoDB, tuning innodb_flush_log_at_trx_commit and sync_binlog helps. With TokuDB, the relevant variables are tokudb_commit_sync, tokudb_fsync_log_period, and sync_binlog. Adjustments can improve replication speed but may affect data safety.

Multi‑Threaded Replication : MySQL 5.6+ supports parallel replication. The author first verified the number of SQL threads with SHOW PROCESSLIST and SHOW VARIABLES LIKE '%slave_parallel%'. The replica was already configured for multi‑threading, yet only one thread was active.

Measuring Thread Utilization

To quantify thread usage, the author enabled performance‑schema consumers and created a view to aggregate transaction counts per replication worker:

UPDATE performance_schema.setup_consumers SET ENABLED = 'YES' WHERE NAME LIKE 'events_transactions%';
UPDATE performance_schema.setup_instruments SET ENABLED = 'YES', TIMED = 'YES' WHERE NAME = 'transaction';
CREATE VIEW rep_thread_count AS
  SELECT a.THREAD_ID, a.COUNT_STAR
  FROM performance_schema.events_transactions_summary_by_thread_by_event_name a
  WHERE a.THREAD_ID IN (SELECT b.THREAD_ID FROM performance_schema.replication_applier_status_by_worker b);
SELECT SUM(COUNT_STAR) FROM rep_thread_count INTO @total;
SELECT 100*(COUNT_STAR/@total) AS thread_usage FROM rep_thread_count;

The results showed a single thread handling the majority of work, confirming a bottleneck.

Group Commit Optimization

Understanding MySQL 5.7's logical‑clock parallel replication, the author adjusted group‑commit parameters to increase the chance that multiple transactions commit simultaneously:

SET GLOBAL binlog_group_commit_sync_delay = 1000000;  -- 1 second (microseconds)
SET GLOBAL binlog_group_commit_sync_no_delay_count = 20;

These settings force a commit after 20 transactions even if the delay threshold hasn't been reached, reducing latency for high‑throughput load‑data scenarios. After applying the changes and increasing slave_parallel_workers to 16, thread utilization rose dramatically, and replication lag disappeared, as confirmed by SHOW SLAVE STATUS.

Conclusion

When facing MySQL replication lag, consider the following checklist:

Network conditions

Hardware performance

Configuration parameters (especially binlog and engine‑specific settings)

Large transactions

Lock contention

Parallel replication setup and thread utilization

Group commit tuning

Applying a systematic, hypothesis‑driven approach can quickly pinpoint the root cause and restore real‑time replication.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

mysqlReplicationPerformanceTuningDBATokuDBLagParallelReplication
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.