How to Diagnose and Fix MySQL Replication Lag in Production
This guide explains why MySQL replication lag spikes and how to distinguish IO‑thread pull problems from SQL‑thread apply bottlenecks. It provides step‑by‑step commands, configuration examples, real‑world case studies, best‑practice recommendations, and monitoring setups to reliably troubleshoot and prevent replication delays.
Overview
When replication lag spikes, Seconds_Behind_Master (or Seconds_Behind_Source in MySQL 8.0.22 and later) only indicates that delay exists. Production‑grade diagnosis must first separate the IO thread and the SQL/Applier thread, then determine whether the bottleneck is network pull, relay‑log accumulation, slow apply, or a large transaction/DDL that blocks parallel replication.
Technical Characteristics
Replication chain: Source → Network → IO Thread → Relay Log → SQL/Applier Thread.
MySQL 8.0.22 and later use SHOW REPLICA STATUS (with Seconds_Behind_Source); older versions use SHOW SLAVE STATUS and expose Seconds_Behind_Master.
Goal: detect and prevent long‑running transactions, DDL, or hotspot tables that stall replication.
Applicable Scenarios
Read traffic hits a replica and stale data appears after a sudden lag increase.
Relay log accumulates on the replica while the master continues writing.
Post‑deployment DDL or batch jobs cause minutes‑long lag.
Environment Requirements
MySQL 8.0+ (recommended).
GTID or traditional binlog‑position replication.
mysqld_exporter (or equivalent) for metric collection.
Replication and performance_schema privileges.
Procedure
Preparation
System checks (run on the replica):
SHOW REPLICA STATUS\G
SHOW PROCESSLIST;
SHOW VARIABLES LIKE 'server_id';
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'slave_parallel_workers';
Key questions to answer:
Is the IO thread pulling binlog events?
Is the SQL/Applier thread executing?
Is the lag growing or catching up?
Is parallel replication enabled and fully utilized?
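This first pass can be partly automated by parsing the SHOW REPLICA STATUS\G text. A minimal sketch, assuming MySQL 8.0.22+ field names; the 1 GiB relay‑backlog threshold is an arbitrary assumption, tune it for your workload:

```shell
#!/usr/bin/env bash
# Sketch: classify lag as pull-slow vs apply-slow from SHOW REPLICA STATUS\G
# text read on stdin. Field names assume MySQL 8.0.22+.
classify_replica_status() {
  local status io sql lag relay
  status=$(cat)
  io=$(awk -F': *'  '/Replica_IO_Running:/  {print $2}' <<<"$status")
  sql=$(awk -F': *' '/Replica_SQL_Running:/ {print $2}' <<<"$status")
  lag=$(awk -F': *' '/Seconds_Behind_Source:/ {print $2}' <<<"$status")
  relay=$(awk -F': *' '/Relay_Log_Space:/ {print $2}' <<<"$status")
  [[ "$lag"   =~ ^[0-9]+$ ]] || lag=0      # NULL while threads reconnect
  [[ "$relay" =~ ^[0-9]+$ ]] || relay=0
  if [[ "$io" != "Yes" ]]; then
    echo "pull-slow: IO thread is not running"
  elif [[ "$sql" != "Yes" ]]; then
    echo "apply-slow: SQL/applier thread is not running"
  elif (( lag > 0 && relay > 1073741824 )); then
    echo "apply-slow: relay log backlog of ${relay} bytes, lag ${lag}s"
  elif (( lag > 0 )); then
    echo "pull-slow: lag ${lag}s with little relay backlog"
  else
    echo "healthy: no measurable lag"
  fi
}
# Example: mysql -e "SHOW REPLICA STATUS\G" | classify_replica_status
```

A large Relay_Log_Space with both threads running points at the applier; lag with an empty relay log points at the pull side.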
Install required client tools (apt or yum):
sudo apt update || true
sudo apt install -y mysql-client jq || true
sudo yum install -y mysql jq || true
First confirmation queries:
SHOW REPLICA STATUS\G
SELECT * FROM performance_schema.replication_connection_status\G
SELECT * FROM performance_schema.replication_applier_status_by_worker\G
Core Configuration
Step 1 – Distinguish “pull‑slow” vs “apply‑slow”
Key status fields to examine:
Replica_IO_Running / Slave_IO_Running
Replica_SQL_Running / Slave_SQL_Running
Seconds_Behind_Source / Seconds_Behind_Master
Relay_Log_Space
Last_IO_Error, Last_SQL_Error
If the IO thread is down, check Last_IO_Errno and Last_IO_Error in the SHOW REPLICA STATUS\G output. If the SQL thread is running but cannot keep up, inspect per‑worker apply progress:
SELECT WORKER_ID, LAST_APPLIED_TRANSACTION, APPLYING_TRANSACTION,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;
Step 2 – Standard configuration example
# /etc/my.cnf.d/replication.cnf
[mysqld]
server_id=102
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
relay_log_recovery=ON
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
read_only=ON
super_read_only=ON
Important parameters:
binlog_format=ROW – ensures row‑based replication consistency.
relay_log_recovery=ON – automatically recovers the relay log after a crash.
slave_parallel_workers=8 – number of parallel apply workers; must match the write pattern. (Since MySQL 8.0.26, slave_parallel_workers and slave_parallel_type are deprecated aliases of replica_parallel_workers and replica_parallel_type.)
super_read_only=ON – prevents accidental writes to the replica.
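Before restarting mysqld it can help to confirm the fragment still carries the critical settings. A minimal sketch; the required list mirrors the example above, extend it as needed:

```shell
#!/usr/bin/env bash
# Sketch: verify a replication config fragment contains the critical
# settings, reporting any that are missing. Exact-line matching is assumed.
check_replication_cnf() {
  local cnf="$1" kv missing=0
  local required=(
    gtid_mode=ON
    enforce_gtid_consistency=ON
    relay_log_recovery=ON
    super_read_only=ON
  )
  for kv in "${required[@]}"; do
    grep -qx "$kv" "$cnf" || { echo "missing: $kv"; missing=1; }
  done
  return "$missing"
}
# Example: check_replication_cnf /etc/my.cnf.d/replication.cnf
```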
Step 3 – Three‑level deep dive
Inspect replication status and binary log:
SHOW REPLICA STATUS\G
SHOW BINARY LOG STATUS;
Identify hot transactions and lock contention:
SHOW PROCESSLIST;
SELECT * FROM information_schema.innodb_trx\G
SHOW ENGINE INNODB STATUS\G
Detect whether a single transaction is holding back replication:
SELECT THREAD_ID, EVENT_NAME, TIMER_WAIT
FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE 'stage/sql/%';
Start and Verify
STOP REPLICA;
START REPLICA;
SHOW REPLICA STATUS\G
Verify by querying performance_schema.replication_connection_status and performance_schema.replication_applier_status_by_worker.
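After restarting replication, catch‑up can be watched in a loop rather than polled by hand. A sketch; MYSQL_CMD is an assumption, point it at your client invocation:

```shell
#!/usr/bin/env bash
# Sketch: poll Seconds_Behind_Source until the replica catches up or a
# timeout expires. MYSQL_CMD is an assumption for your environment.
MYSQL_CMD="${MYSQL_CMD:-mysql -uroot -p}"
watch_catchup() {
  local timeout="${1:-600}" interval="${2:-5}" waited=0 lag
  while (( waited <= timeout )); do
    lag=$($MYSQL_CMD -e "SHOW REPLICA STATUS\G" \
          | awk -F': *' '/Seconds_Behind_Source:/ {print $2}')
    if [[ "$lag" == "0" ]]; then
      echo "caught up"
      return 0
    fi
    echo "still behind: ${lag:-unknown}s"
    sleep "$interval"
    (( waited += interval ))
  done
  echo "timed out at ${lag:-unknown}s behind"
  return 1
}
# Example: watch_catchup 600 5
```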
Helper Script
#!/usr/bin/env bash
set -euo pipefail
MYSQL_CMD="${MYSQL_CMD:-mysql -uroot -p}"
OUT_DIR="/tmp/mysql-replica-$(date +%F-%H%M%S)"
mkdir -p "$OUT_DIR"
$MYSQL_CMD -e "SHOW REPLICA STATUS\G" > "$OUT_DIR/replica-status.txt"
$MYSQL_CMD -e "SHOW PROCESSLIST" > "$OUT_DIR/processlist.txt"
$MYSQL_CMD -e "SELECT * FROM performance_schema.replication_connection_status\G" > "$OUT_DIR/connection-status.txt"
$MYSQL_CMD -e "SELECT * FROM performance_schema.replication_applier_status_by_worker\G" > "$OUT_DIR/applier-status.txt"
$MYSQL_CMD -e "SHOW ENGINE INNODB STATUS\G" > "$OUT_DIR/innodb-status.txt"
echo "artifacts saved to $OUT_DIR"
Real‑World Cases
Case 1 – Large transaction blocks SQL thread
Symptoms: replica lag jumps from milliseconds to 18 minutes; master shows no errors; network is fine.
SHOW REPLICA STATUS\G
SHOW PROCESSLIST;
SELECT * FROM information_schema.innodb_trx\G
Result shows Seconds_Behind_Source = 1087, both IO and SQL threads running, and a huge Relay_Log_Space. Root cause is a massive batch update on the primary that keeps the replica’s SQL thread busy.
Split the batch into smaller chunks.
Execute large‑table updates in segments.
Keep parallel replication enabled but understand it cannot split a single huge transaction.
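The chunking advice above can be sketched as a driver script that walks the primary‑key range in fixed‑size slices, committing between slices so the replica applies many small transactions instead of one huge one. The table, column, and predicate are hypothetical, and MYSQL_CMD is an assumption:

```shell
#!/usr/bin/env bash
# Sketch: run a large UPDATE in primary-key chunks. Each chunk is its own
# transaction, so the replica's SQL thread never has to replay one giant one.
MYSQL_CMD="${MYSQL_CMD:-mysql -uroot -p mydb}"
chunk_ranges() {
  # Emit "start end" pairs covering [min_id, max_id] in chunk_size steps.
  local min_id="$1" max_id="$2" chunk_size="$3" start end
  for (( start = min_id; start <= max_id; start += chunk_size )); do
    end=$(( start + chunk_size - 1 ))
    if (( end > max_id )); then end="$max_id"; fi
    echo "$start $end"
  done
}
run_chunked_update() {
  local start end
  while read -r start end; do
    # Hypothetical table and predicate; substitute your own statement.
    $MYSQL_CMD -e "UPDATE orders SET status = 'archived'
                   WHERE id BETWEEN $start AND $end AND status = 'done';"
    sleep 0.5   # pause so the replica can apply between chunks
  done < <(chunk_ranges "$@")
}
# Example: run_chunked_update 1 5000000 10000
```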
Case 2 – IO thread cannot pull new binlog
Symptoms: lag grows while Relay_Log_Space stays small.
SHOW REPLICA STATUS\G
Investigation steps:
Check IO/SQL thread status.
Inspect Last_IO_Error and Last_SQL_Error fields.
Test network connectivity and replication account permissions:
mysql -h source-db -e "SHOW BINARY LOG STATUS;"
telnet source-db 3306
Root cause: network jitter and permission issues caused the IO thread to reconnect repeatedly, accumulating lag.
Fix network stability or adjust firewall rules.
Ensure the replication account has the minimum required privileges.
Schedule heavy DDL or batch jobs during low‑traffic windows.
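Since telnet is often absent on minimal hosts, the connectivity test can be done with bash's built‑in /dev/tcp instead. A sketch; the host, port, and account name are assumptions for your topology:

```shell
#!/usr/bin/env bash
# Sketch: check TCP reachability of the source without telnet, using
# bash's /dev/tcp with a timeout.
port_open() {
  local host="$1" port="$2"
  if timeout 3 bash -c ">/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
# Example: port_open source-db 3306
# The replication account needs only REPLICATION SLAVE, e.g. (hypothetical
# account and host mask):
#   GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';
```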
Case 3 – DDL and metadata lock stall SQL thread
Symptoms: after a schema change, replica lag stays > 40 minutes; IO thread runs, Relay Log grows, but SQL thread cannot catch up.
SHOW REPLICA STATUS\G
SELECT * FROM performance_schema.metadata_locks\G
SELECT * FROM information_schema.innodb_trx\G
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS\G
Result: "Waiting for table metadata lock". A long‑running query on the replica held a metadata lock, so the DDL replicated from the primary could not proceed.
Terminate the blocking query.
Run DDL on large tables during off‑peak periods.
Pre‑test DDL impact on replication in a staging replica.
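Finding which session to terminate can be scripted by scanning the processlist for long‑running queries while another session waits on a metadata lock. A sketch over tab‑separated SHOW PROCESSLIST output; the KILL statements are printed for review, never executed automatically:

```shell
#!/usr/bin/env bash
# Sketch: from tab-separated SHOW PROCESSLIST output (as produced by
# mysql -e), print KILL statements for queries running longer than a
# threshold while some session waits on a table metadata lock.
suggest_kills() {
  local threshold="${1:-60}"
  awk -F'\t' -v t="$threshold" '
    # Columns: Id User Host db Command Time State Info
    $5 == "Query" && $7 == "Waiting for table metadata lock" { waiting = 1; next }
    $5 == "Query" && $6 + 0 > t { cand[n++] = $1 }
    END { if (waiting) for (i = 0; i < n; i++) print "KILL " cand[i] ";" }
  '
}
# Example: mysql -uroot -p -e "SHOW PROCESSLIST" | suggest_kills 60
```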
Best Practices
Enable parallel replication but verify that the write pattern actually benefits; large transactions do not magically speed up.
Throttle batch jobs, DDL, and huge transactions; use maintenance windows.
Monitor not only Seconds_Behind_* but also IO/SQL thread health and Relay Log growth.
Set super_read_only=ON on replicas and grant the replication account only minimal privileges.
Avoid binding all read traffic to a single replica; enable automatic read‑traffic failover when lag spikes.
Even if Seconds_Behind_Master or Seconds_Behind_Source is zero, the replica may still be unhealthy (threads disconnected, temporary catch‑up, etc.).
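The last two points can be combined into a health gate for read‑traffic routing: the replica leaves rotation when either thread stops or lag crosses a threshold, even if Seconds_Behind_* alone looks fine. A sketch; the 10‑second default is an assumption:

```shell
#!/usr/bin/env bash
# Sketch: health gate for a load balancer. Returns 0 only when both
# threads run and lag is numeric and below the threshold.
replica_healthy() {
  local max_lag="${1:-10}" status io sql lag
  status=$(cat)   # pipe SHOW REPLICA STATUS\G output in
  io=$(awk -F': *'  '/Replica_IO_Running:/  {print $2}' <<<"$status")
  sql=$(awk -F': *' '/Replica_SQL_Running:/ {print $2}' <<<"$status")
  lag=$(awk -F': *' '/Seconds_Behind_Source:/ {print $2}' <<<"$status")
  [[ "$io" == "Yes" && "$sql" == "Yes" ]] || return 1
  [[ "$lag" =~ ^[0-9]+$ ]] || return 1    # NULL lag: threads reconnecting
  (( lag <= max_lag ))
}
# Example: mysql -e "SHOW REPLICA STATUS\G" | replica_healthy 10 \
#   || echo "drain this replica"
```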
Monitoring and Alerting
Log inspection (example for Linux):
grep -Ei 'replica|slave|relay|error' /var/log/mysqld.log | tail -50
Prometheus alert rules (example):
groups:
- name: mysql-replication
rules:
- alert: MySQLReplicationLagHigh
expr: mysql_slave_status_seconds_behind_master > 10
for: 5m
- alert: MySQLReplicationSQLThreadDown
expr: mysql_slave_status_slave_sql_running == 0
for: 1m
- alert: MySQLReplicationIOThreadDown
expr: mysql_slave_status_slave_io_running == 0
for: 1m
Backup and Recovery
#!/usr/bin/env bash
set -euo pipefail
mysql -uroot -p -e "SHOW REPLICA STATUS\G" > /backup/replica-status-$(date +%F).txt
mysql -uroot -p -e "SHOW MASTER STATUS\G" > /backup/master-status-$(date +%F).txt # run on the source; MySQL 8.2+ uses SHOW BINARY LOG STATUS
Collect current state (e.g., run the helper script).
STOP REPLICA;
Fix the root cause (e.g., kill long queries, adjust DDL schedule).
START REPLICA; verify catch‑up with SHOW REPLICA STATUS\G.
Conclusion
Do not rely solely on replication‑lag seconds.
First separate IO and SQL thread health, then examine Relay Log size and transaction characteristics.
Large transactions, DDL, and hotspot tables are the most frequent lag amplifiers.
Parallel replication helps but cannot split a single massive transaction.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.