
How to Diagnose and Fix MySQL Replication Lag in Production

This guide explains why MySQL replication lag spikes and how to distinguish IO‑thread pull problems from SQL‑thread apply bottlenecks. It provides step‑by‑step commands, configuration examples, real‑world case studies, best‑practice recommendations, and monitoring setups to reliably troubleshoot and prevent replication delays.

MaGe Linux Operations

Overview

When replication lag spikes, Seconds_Behind_Master (or Seconds_Behind_Source in MySQL 8.0) only indicates that delay exists. Production‑grade diagnosis must first separate the IO thread and the SQL/Applier thread, then determine whether the bottleneck is network pull, relay‑log accumulation, slow apply, or a large transaction/DDL that blocks parallel replication.

Technical Characteristics

Replication chain: Source → Network → IO Thread → Relay Log → SQL/Applier Thread.

MySQL 8.0.22+ uses SHOW REPLICA STATUS and Seconds_Behind_Source; older versions use SHOW SLAVE STATUS, which exposes Seconds_Behind_Master.

Goal: detect and prevent long‑running transactions, DDL, or hotspot tables that stall replication.

Applicable Scenarios

Read traffic hits a replica and stale data appears after a sudden lag increase.

Relay log accumulates on the replica while the master continues writing.

Post‑deployment DDL or batch jobs cause minutes‑long lag.

Environment Requirements

MySQL 8.0+ (recommended).

GTID or traditional binlog position replication.

mysqld_exporter (or equivalent) for metric collection.

Replication and performance_schema privileges.

Procedure

Preparation

System checks (run on the replica):

SHOW REPLICA STATUS\G
SHOW PROCESSLIST;
SHOW VARIABLES LIKE 'server_id';
SHOW VARIABLES LIKE 'gtid_mode';
SHOW VARIABLES LIKE 'slave_parallel_workers';

Is the IO thread pulling binlog events?

Is the SQL/Applier thread executing?

Is the lag growing or catching up?

Is parallel replication enabled and fully utilized?
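The third question, whether lag is growing or catching up, can only be answered from two samples taken a short interval apart. A minimal sketch; the numeric arguments below are sample readings standing in for two real Seconds_Behind_Source values:

```shell
# Trend check: sample Seconds_Behind_Source twice, ~30s apart, e.g. with
#   mysql -e "SHOW REPLICA STATUS\G" | awk '/Seconds_Behind_Source/{print $2}'
# and compare the two readings. The values passed below are sample data.
lag_trend() {
  local first=$1 second=$2
  if   [ "$second" -gt "$first" ]; then echo "growing"
  elif [ "$second" -lt "$first" ]; then echo "catching-up"
  else                                  echo "steady"
  fi
}

lag_trend 120 95   # prints "catching-up": the second sample is lower
```

A "growing" result points at an active bottleneck; "catching-up" suggests the cause (a batch job, a burst of writes) has already passed.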

Install required client tools (apt or yum):

sudo apt update || true
sudo apt install -y mysql-client jq || true
sudo yum install -y mysql jq || true

First confirmation queries:

SHOW REPLICA STATUS\G
SELECT * FROM performance_schema.replication_connection_status\G
SELECT * FROM performance_schema.replication_applier_status_by_worker\G

Core Configuration

Step 1 – Distinguish “pull‑slow” vs “apply‑slow”

Key status fields to examine:

Replica_IO_Running / Slave_IO_Running
Replica_SQL_Running / Slave_SQL_Running
Seconds_Behind_Source / Seconds_Behind_Master
Relay_Log_Space
Last_IO_Error, Last_SQL_Error

If the IO thread is down, check the Last_IO_Error field in SHOW REPLICA STATUS\G. If the SQL thread is running but cannot keep up, inspect per‑worker apply progress:

SELECT WORKER_ID, LAST_APPLIED_TRANSACTION, APPLYING_TRANSACTION,
       LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
       LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;
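The two timestamps in that query let you compute how long a transaction took from committing on the source to finishing on the replica. A rough shell sketch, assuming GNU date; the timestamps below are sample values standing in for real per‑worker output:

```shell
# Apply latency = end-apply timestamp minus original commit timestamp.
# Timestamps are sample values; real ones come from
# replication_applier_status_by_worker. Assumes GNU date (-d flag).
apply_latency() {
  local commit_ts=$1 applied_ts=$2
  echo $(( $(date -d "$applied_ts" +%s) - $(date -d "$commit_ts" +%s) ))
}

apply_latency "2024-05-01 10:00:00" "2024-05-01 10:03:20"   # prints 200 (seconds)
```

A latency far larger than the transaction's own execution time indicates queuing behind earlier events rather than a slow apply.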

Step 2 – Standard configuration example

# /etc/my.cnf.d/replication.cnf
[mysqld]
server_id=102
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
relay_log_recovery=ON
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
read_only=ON
super_read_only=ON

Important parameters:

binlog_format=ROW – ensures row‑based replication consistency.

relay_log_recovery=ON – automatically recovers the relay log after a crash.

slave_parallel_workers=8 – number of parallel apply workers; tune to match the write pattern.

super_read_only=ON – prevents accidental writes to the replica.
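After a restart it is worth confirming the settings actually took effect. A small sketch that filters mysql's tab‑separated batch output; the printf line stands in for a real `mysql -Ne "SHOW VARIABLES LIKE 'slave_parallel%'"` call:

```shell
# Extract the configured worker count from SHOW VARIABLES batch output
# (tab-separated Variable_name/Value pairs).
workers_configured() {
  awk -F'\t' '$1 == "slave_parallel_workers" {print $2}'
}

# Sample output standing in for: mysql -Ne "SHOW VARIABLES LIKE 'slave_parallel%'"
printf 'slave_parallel_type\tLOGICAL_CLOCK\nslave_parallel_workers\t8\n' |
  workers_configured   # prints 8
```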

Step 3 – Three‑level deep dive

Inspect replication status and binary log:

SHOW REPLICA STATUS\G
SHOW MASTER STATUS;  -- renamed to SHOW BINARY LOG STATUS in MySQL 8.2+

Identify hot transactions and lock contention:

SHOW PROCESSLIST;
SELECT * FROM information_schema.innodb_trx\G
SHOW ENGINE INNODB STATUS\G

Detect whether a single transaction is holding back replication:

SELECT THREAD_ID, EVENT_NAME, TIMER_WAIT
FROM performance_schema.events_stages_current
WHERE EVENT_NAME LIKE 'stage/sql/%';

Start and Verify

STOP REPLICA;
START REPLICA;
SHOW REPLICA STATUS\G

Verify by querying performance_schema.replication_connection_status and performance_schema.replication_applier_status_by_worker.

Helper Script

#!/usr/bin/env bash
set -euo pipefail

MYSQL_CMD="${MYSQL_CMD:-mysql -uroot -p}"
OUT_DIR="/tmp/mysql-replica-$(date +%F-%H%M%S)"
mkdir -p "$OUT_DIR"

$MYSQL_CMD -e "SHOW REPLICA STATUS\G" > "$OUT_DIR/replica-status.txt"
$MYSQL_CMD -e "SHOW PROCESSLIST" > "$OUT_DIR/processlist.txt"
$MYSQL_CMD -e "SELECT * FROM performance_schema.replication_connection_status\G" > "$OUT_DIR/connection-status.txt"
$MYSQL_CMD -e "SELECT * FROM performance_schema.replication_applier_status_by_worker\G" > "$OUT_DIR/applier-status.txt"
$MYSQL_CMD -e "SHOW ENGINE INNODB STATUS\G" > "$OUT_DIR/innodb-status.txt"

echo "artifacts saved to $OUT_DIR"

Real‑World Cases

Case 1 – Large transaction blocks SQL thread

Symptoms: replica lag jumps from milliseconds to 18 minutes; master shows no errors; network is fine.

SHOW REPLICA STATUS\G
SHOW PROCESSLIST;
SELECT * FROM information_schema.innodb_trx\G

Result shows Seconds_Behind_Source = 1087, both IO and SQL threads running, and a huge Relay_Log_Space. Root cause is a massive batch update on the primary that keeps the replica’s SQL thread busy.

Split the batch into smaller chunks.

Execute large‑table updates in segments.

Keep parallel replication enabled but understand it cannot split a single huge transaction.
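Splitting can be as simple as generating bounded UPDATE statements over primary‑key ranges instead of one giant statement. A sketch; the table name (orders), columns (status, id), and range are hypothetical:

```shell
# Emit chunked UPDATEs over id ranges so each replicated transaction
# stays small. Table/column names and the range are hypothetical.
emit_chunks() {
  local start=$1 end=$2 step=$3
  local lo=$start hi
  while [ "$lo" -lt "$end" ]; do
    hi=$(( lo + step ))
    echo "UPDATE orders SET status='archived' WHERE id >= $lo AND id < $hi;"
    lo=$hi
  done
}

emit_chunks 1 10001 5000   # two 5000-row statements instead of one 10000-row transaction
```

Each emitted statement commits independently, so the replica's applier workers can interleave them with other traffic instead of stalling behind one huge event.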

Case 2 – IO thread cannot pull new binlog

Symptoms: lag grows while Relay_Log_Space stays small.

SHOW REPLICA STATUS\G

Investigation steps:

Check IO/SQL thread status.

Inspect Last_IO_Error and Last_SQL_Error fields.

Test network connectivity and replication account permissions:

mysql -h source-db -e "SHOW MASTER STATUS;"   # SHOW BINARY LOG STATUS in MySQL 8.2+
telnet source-db 3306

Root cause: network jitter and permission issues caused the IO thread to reconnect repeatedly, accumulating lag.

Fix network stability or adjust firewall rules.

Ensure the replication account has the minimum required privileges.

Schedule heavy DDL or batch jobs during low‑traffic windows.
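A minimal replication account needs only the REPLICATION SLAVE privilege. A sketch; the user name, host pattern, and password are illustrative:

```sql
-- Account name, host pattern, and password are placeholders.
CREATE USER 'repl'@'10.0.0.%' IDENTIFIED BY 'change-me';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'10.0.0.%';
```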

Case 3 – DDL and metadata lock stall SQL thread

Symptoms: after a schema change, replica lag stays > 40 minutes; IO thread runs, Relay Log grows, but SQL thread cannot catch up.

SHOW REPLICA STATUS\G
SELECT * FROM performance_schema.metadata_locks\G
SELECT * FROM information_schema.innodb_trx\G
SHOW PROCESSLIST;
SHOW ENGINE INNODB STATUS\G

Result: "Waiting for table metadata lock". The primary’s DDL caused a long‑running query on the replica to hold a metadata lock.

Terminate the blocking query.

Run DDL on large tables during off‑peak periods.

Pre‑test DDL impact on replication in a staging replica.
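Finding the session to terminate can be scripted from a saved SHOW PROCESSLIST dump. A sketch; the printf line is sample data standing in for real tab‑separated mysql batch output:

```shell
# Print the Id column of any session waiting on a table metadata lock.
find_mdl_waiters() {
  awk -F'\t' '/Waiting for table metadata lock/ {print $1}'
}

# Sample dump line (Id, User, Host, db, Command, Time, State, Info):
printf '12\tapp\tlocalhost\tshop\tQuery\t300\tWaiting for table metadata lock\tSELECT 1\n' |
  find_mdl_waiters   # prints 12
```

The returned Id can then be terminated with KILL (here, KILL 12;), matching the "terminate the blocking query" step above.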

Best Practices

Enable parallel replication but verify that the write pattern actually benefits; large transactions do not magically speed up.

Throttle batch jobs, DDL, and huge transactions; use maintenance windows.

Monitor not only Seconds_Behind_* but also IO/SQL thread health and Relay Log growth.

Set super_read_only=ON on replicas and grant the replication account only minimal privileges.

Avoid binding all read traffic to a single replica; enable automatic read‑traffic failover when lag spikes.

Even if Seconds_Behind_Master or Seconds_Behind_Source is zero, the replica may still be unhealthy (threads disconnected, temporary catch‑up, etc.).
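That last caveat can be encoded directly into a health check that requires both threads running and low lag, not just a small lag number. A sketch that parses a saved SHOW REPLICA STATUS\G dump; the printf sample stands in for real output, and the 10‑second threshold is an assumption:

```shell
# "healthy" only if both threads run AND lag is under the threshold (10s assumed).
replica_healthy() {
  awk -F': ' '
    /Replica_IO_Running:/    {io=$2}
    /Replica_SQL_Running:/   {sql=$2}
    /Seconds_Behind_Source:/ {lag=$2}
    END {
      if (io == "Yes" && sql == "Yes" && lag + 0 < 10) print "healthy"
      else                                             print "unhealthy"
    }'
}

# Lag reads 0, but the SQL thread is stopped: still unhealthy.
printf 'Replica_IO_Running: Yes\nReplica_SQL_Running: No\nSeconds_Behind_Source: 0\n' |
  replica_healthy   # prints "unhealthy"
```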

Monitoring and Alerting

Log inspection (example for Linux):

grep -Ei 'replica|slave|relay|error' /var/log/mysqld.log | tail -50

Prometheus alert rules (example):

groups:
- name: mysql-replication
  rules:
  - alert: MySQLReplicationLagHigh
    expr: mysql_slave_status_seconds_behind_master > 10
    for: 5m
  - alert: MySQLReplicationSQLThreadDown
    expr: mysql_slave_status_sql_running == 0
    for: 1m
  - alert: MySQLReplicationIOThreadDown
    expr: mysql_slave_status_slave_io_running == 0
    for: 1m

Backup and Recovery

#!/usr/bin/env bash
set -euo pipefail
# Run on the replica:
mysql -uroot -p -e "SHOW REPLICA STATUS\G" > /backup/replica-status-$(date +%F).txt
# Run on the source:
mysql -uroot -p -e "SHOW MASTER STATUS\G" > /backup/master-status-$(date +%F).txt

Collect current state (e.g., run the helper script).

STOP REPLICA;

Fix the root cause (e.g., kill long queries, adjust DDL schedule).

START REPLICA; verify catch‑up with SHOW REPLICA STATUS\G.

Conclusion

Do not rely solely on replication‑lag seconds.

First separate IO and SQL thread health, then examine Relay Log size and transaction characteristics.

Large transactions, DDL, and hotspot tables are the most frequent lag amplifiers.

Parallel replication helps but cannot split a single massive transaction.

Tags: database, MySQL, replication, troubleshooting, lag
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
