Comprehensive MySQL Replication Lag Troubleshooting Beyond Seconds_Behind_Master
This guide walks through a complete MySQL master‑slave lag diagnosis process, explaining why relying solely on Seconds_Behind_Master is insufficient and showing how to separate IO and SQL thread issues, examine relay logs, detect long transactions, DDL locks, and apply best‑practice configurations and monitoring.
1. Overview
When replication lag spikes, many people first look at the Seconds_Behind_Master metric. While useful for confirming that lag exists, it does not reveal whether the bottleneck is network pull, relay‑log accumulation, slow SQL execution, insufficient parallel replication, or downstream long transactions, DDL, or hot‑table contention.
Production-grade troubleshooting starts by distinguishing the IO thread from the SQL thread and then determining whether the replica is slow to pull binlog events, slow to write them to the relay log, or slow to apply them.
Technical Characteristics
Replication chain: Source → Network → IO Thread → Relay Log → SQL/Applier Thread.
MySQL 8.0.22 and later provide SHOW REPLICA STATUS, which reports Seconds_Behind_Source; older versions use SHOW SLAVE STATUS and report Seconds_Behind_Master.
Practical guidance includes not only how to investigate but also how to avoid long‑running transactions that block replication.
Applicable Scenarios
Read traffic hits the replica and stale data is returned after a sudden lag increase.
Replica’s relay log piles up while the source writes normally.
After a deployment, DDL, or batch job, replication lag persists for minutes or even hours.
Environment Requirements
MySQL 8.0+ (recommended).
GTID or traditional binlog position replication mode.
Metrics collected via mysqld_exporter.
Privileges to inspect replication: REPLICATION CLIENT for SHOW REPLICA STATUS, plus SELECT access to performance_schema.
2. Detailed Steps
2.1 Preparation
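System checks: verify server_id, log_bin, binlog_format=ROW, gtid_mode=ON, enforce_gtid_consistency=ON, relay_log_recovery=ON, slave_parallel_type=LOGICAL_CLOCK, slave_parallel_workers=8, read_only=ON, super_read_only=ON. On MySQL 8.0.26 and later, the two parallel-replication variables are also available under the names replica_parallel_type and replica_parallel_workers.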
Install dependencies: mysql-client, jq (apt or yum).
First round of verification using SHOW REPLICA STATUS\G and performance_schema views.
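For scripted checks it can help to turn the vertical output of SHOW REPLICA STATUS into a dictionary. The following is a minimal sketch, assuming the text was captured from the mysql client; the function name and the sample values are illustrative, not taken from a real server, and multi-line error messages are not handled.

```python
def parse_vertical_status(text: str) -> dict[str, str]:
    """Parse `Field: value` lines from mysql's vertical output into a dict."""
    status = {}
    for line in text.splitlines():
        # Skip the "*** 1. row ***" separator and lines without a field name.
        if line.lstrip().startswith("*") or ":" not in line:
            continue
        # Split on the first colon only, so values containing colons survive.
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


# Hypothetical sample output, trimmed to the fields this guide discusses.
SAMPLE = """\
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
        Seconds_Behind_Source: 1087
              Relay_Log_Space: 5368709120
                Last_IO_Error:
               Last_SQL_Error:
"""

status = parse_vertical_status(SAMPLE)
print(status["Seconds_Behind_Source"], status["Relay_Log_Space"])
```

All values come back as strings, so callers need to convert numeric fields and treat the literal NULL separately.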
2.2 Core Configuration
Step 1 – Identify pull‑slow vs. execution‑slow: Check Replica_IO_Running / Slave_IO_Running, Replica_SQL_Running / Slave_SQL_Running, Seconds_Behind_Source / Seconds_Behind_Master, Relay_Log_Space, and error fields Last_IO_Error, Last_SQL_Error.
If the IO thread is not running (No or Connecting), inspect Last_IO_Errno and Last_IO_Error in SHOW REPLICA STATUS\G for connection or authentication errors.
If the SQL thread is running but cannot keep up, query performance_schema.replication_applier_status_by_worker for worker progress.
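The triage in Step 1 can be sketched as a small decision function. This is a rough heuristic under stated assumptions, not an official classification: the function name, the labels, and the `relay_growth_bytes` input (the change in Relay_Log_Space between two samples, computed by the caller) are all hypothetical.

```python
def classify_lag(status: dict, relay_growth_bytes: int) -> str:
    """Return a coarse label for where replication appears to be stalling."""

    def field(new: str, old: str):
        # Accept both the 8.0.22+ field names and the legacy ones.
        return status.get(new, status.get(old))

    if field("Replica_IO_Running", "Slave_IO_Running") != "Yes":
        return "io-thread-down: check Last_IO_Error, network, account"
    if field("Replica_SQL_Running", "Slave_SQL_Running") != "Yes":
        return "sql-thread-down: check Last_SQL_Error"

    lag = field("Seconds_Behind_Source", "Seconds_Behind_Master")
    # A missing or NULL lag value is treated as lagging, to stay on the safe side.
    lagging = lag in (None, "NULL") or int(lag) > 0
    if not lagging:
        return "healthy"
    if relay_growth_bytes > 0:
        return "apply-slow: relay log piling up, inspect the SQL/applier thread"
    return "pull-slow: relay log not growing, inspect IO thread and network"


print(classify_lag(
    {"Replica_IO_Running": "Yes", "Replica_SQL_Running": "Yes",
     "Seconds_Behind_Source": "1087"},
    relay_growth_bytes=512 * 1024 * 1024,
))
```

Using relay log growth rather than absolute size is a design choice for this sketch: a large but shrinking relay log usually means the replica is already catching up.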
Step 2 – Standard configuration example (saved in /etc/my.cnf.d/replication.cnf):
# replication.cnf
[mysqld]
server_id=102
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
relay_log_recovery=ON
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
read_only=ON
super_read_only=ON
Parameter notes:
binlog_format=ROW – ensures consistent replication.
relay_log_recovery=ON – easier recovery after crashes.
slave_parallel_workers=8 – parallel workers (tune per workload).
super_read_only=ON – prevents accidental writes to the replica.
Step 3 – Deep dive: Examine replication status, binary log status, processlist, InnoDB transaction table, and stage events to see if a single large transaction or metadata lock is holding the SQL thread.
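As one illustration of the deep dive, the sketch below filters rows from information_schema.innodb_trx for transactions that have been open longer than a threshold. The query string is shown for context only and is not executed; the sample rows, the 60-second default, and the function name are assumptions made for this example.

```python
from datetime import datetime, timedelta

# Query a client would run on the replica to obtain the rows (not executed here).
INNODB_TRX_QUERY = """
SELECT trx_id, trx_mysql_thread_id, trx_started, trx_rows_modified, trx_query
FROM information_schema.innodb_trx
"""


def long_transactions(rows, now, max_age=timedelta(seconds=60)):
    """Return rows whose transaction has been open longer than max_age, oldest first."""
    old = [row for row in rows if now - row["trx_started"] > max_age]
    return sorted(old, key=lambda row: row["trx_started"])


# Hypothetical rows, shaped like the columns selected above.
now = datetime(2024, 1, 1, 12, 0, 0)
rows = [
    {"trx_id": "101", "trx_mysql_thread_id": 12,
     "trx_started": datetime(2024, 1, 1, 11, 40, 0), "trx_rows_modified": 250000},
    {"trx_id": "102", "trx_mysql_thread_id": 57,
     "trx_started": datetime(2024, 1, 1, 11, 59, 50), "trx_rows_modified": 3},
]

for trx in long_transactions(rows, now):
    print(trx["trx_id"], trx["trx_mysql_thread_id"], now - trx["trx_started"])
```

The thread id of a flagged row can then be matched against the processlist and performance_schema.metadata_locks to see whether it is the applier itself or a session blocking it.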
2.3 Real‑World Cases
Case 1 – Large transaction blocks SQL thread: Replica lag grew to 18 minutes while the master showed no errors. SHOW REPLICA STATUS\G revealed Seconds_Behind_Source: 1087 and a huge Relay_Log_Space. Solution: split batch updates, apply large table changes in smaller chunks, keep parallel replication enabled but understand it cannot split a single massive transaction.
Case 2 – IO thread cannot fetch new binlog: Relay_Log_Space stayed small while lag increased. Investigation of SHOW REPLICA STATUS\G, network connectivity (telnet to source), and replication account permissions identified intermittent network drops and permission issues causing the IO thread to reconnect repeatedly.
Case 3 – DDL and metadata lock stall SQL thread: After a schema change, replica lag persisted for >40 minutes. Queries on performance_schema.metadata_locks and information_schema.innodb_trx showed a waiting metadata lock. Fix: terminate the blocking query, schedule DDL during low‑traffic windows, and pre‑test DDL impact on replication.
These examples illustrate why looking only at Seconds_Behind_Master can hide the true cause.
3. Best Practices & Precautions
3.1 Performance Optimisation
Enable parallel replication but verify that the write pattern benefits from it; hot tables and single huge transactions will not speed up automatically.
Throttle batch jobs, DDL, and large transactions; use windowed execution.
Monitor not just the lag seconds but also IO/SQL thread states and relay‑log growth.
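The advice on throttling batch jobs usually comes down to replacing one huge statement with many small ones. A minimal sketch of that idea follows, assuming an integer primary key; the table `orders`, the column `archived`, and the chunk size are hypothetical, and the statements are only printed, not executed.

```python
def chunk_ranges(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (start, end) id ranges that together cover min_id..max_id."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + chunk_size - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical statement template; parameters are passed separately by the driver.
UPDATE_TEMPLATE = "UPDATE orders SET archived = 1 WHERE id BETWEEN %s AND %s"

for start, end in chunk_ranges(1, 25000, 10000):
    # In a real job, each chunk would be committed as its own transaction,
    # with a short pause between chunks so the replica can keep up.
    print(UPDATE_TEMPLATE, (start, end))
```

Each chunk then reaches the replica as a separate, small transaction instead of one event group that occupies the applier for minutes.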
3.2 Security Hardening
Set super_read_only=ON on replicas.
Grant the replication account only the minimal privileges and rotate passwords regularly.
Validate topology changes on a replica before rolling out to production.
3.3 High‑Availability Considerations
Do not bind critical read traffic to a single replica; enable automatic fail‑over when lag exceeds thresholds.
Link replication monitoring with business read‑latency alerts.
Back up SHOW REPLICA STATUS and performance_schema snapshots regularly.
4. Common Pitfalls & Error Table
Symptom: High lag seconds but all threads appear running – Cause: SQL thread stuck on a large transaction. Solution: Inspect long transactions, DDL, hot tables.
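Symptom: Low lag seconds yet reads return stale data – Cause: Sampling too coarse to catch short spikes, or the SQL thread has caught up with a slow IO thread, in which case the metric can read 0 while the replica is still behind the source. Solution: Increase sampling granularity and check the IO thread state alongside the lag value.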
Symptom: Relay log continuously growing – Cause: IO thread pulling fine, SQL thread cannot apply fast enough. Solution: Check apply capacity, lock waits, resource limits.
5. Fault Diagnosis & Monitoring
5.1 Fault Diagnosis
Log inspection:
grep -Ei 'replica|slave|relay|error' /var/log/mysqld.log | tail -50
Common questions:
Lag high but Relay_Log_Space small – first check IO thread, network, and permissions.
Relay log large and lag rising – focus on SQL thread apply speed, lock contention, large transactions.
Parallel replication enabled but lag unchanged – verify that the workload can be parallelised; hotspot tables or single huge transactions limit benefit.
Debug mode: run SHOW REPLICA STATUS\G and SHOW ENGINE INNODB STATUS\G for deeper insight.
5.2 Performance Monitoring
Key metrics (exposed by mysqld_exporter and scraped by Prometheus):
Replication lag seconds – normal < 1s, alert if > 10s for >5 min.
IO thread state – Replica_IO_Running should be Yes; alert on No or Connecting.
SQL thread state – Replica_SQL_Running should be Yes; alert on No.
Relay log space – stable is OK; continuous growth for >15 min triggers alert.
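The relay-log criterion above ("continuous growth for more than 15 minutes") can be checked outside the alerting system with a few lines of code. The sketch below is one possible reading of that rule, assuming the caller supplies `(unix_timestamp, Relay_Log_Space)` samples in chronological order; the function name and the strict "every sample larger than the previous one" test are assumptions.

```python
def relay_log_growing(samples, window_seconds=900):
    """Return True if Relay_Log_Space rose at every sample across a full window.

    samples: list of (unix_timestamp, relay_log_space_bytes), oldest first.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    window = [s for s in samples if latest_ts - s[0] <= window_seconds]
    # Require data that actually spans the whole window before alerting.
    if len(window) < 2 or window[-1][0] - window[0][0] < window_seconds:
        return False
    return all(later[1] > earlier[1] for earlier, later in zip(window, window[1:]))


# Hypothetical samples taken every 5 minutes.
samples = [(0, 100), (300, 200), (600, 300), (900, 400)]
print(relay_log_growing(samples))
```

A strict comparison is sensitive to a single flat sample; a production check might tolerate small dips caused by relay log purging.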
Alert rules example (Prometheus‑style):
groups:
  - name: mysql-replication
    rules:
      - alert: MySQLReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 10
        for: 5m
      - alert: MySQLReplicationSQLThreadDown
        expr: mysql_slave_status_slave_sql_running == 0
        for: 1m
      - alert: MySQLReplicationIOThreadDown
        expr: mysql_slave_status_slave_io_running == 0
        for: 1m
5.3 Backup & Recovery
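Backup script captures replica and master status snapshots:
#!/usr/bin/env bash
set -euo pipefail
# -p prompts for the password on each call; for unattended runs, keep credentials in an option file instead.
# Run on the replica:
mysql -uroot -p -e "SHOW REPLICA STATUS\G" > "/backup/replica-status-$(date +%F).txt"
# Run on the source (MySQL 8.4 replaces SHOW MASTER STATUS with SHOW BINARY LOG STATUS):
mysql -uroot -p -e "SHOW MASTER STATUS\G" > "/backup/master-status-$(date +%F).txt"
Recovery steps: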
Collect current state with the helper script.
Stop replication: STOP REPLICA;
Fix the root cause (e.g., kill blocking query, adjust configuration).
Start replication: START REPLICA;
Verify catch-up using SHOW REPLICA STATUS\G.
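The final verification step can be automated with a simple polling loop. The sketch below assumes a caller-supplied `fetch_lag` function that returns Seconds_Behind_Source as an integer, or None when the server reports NULL; that function, the timeout, and the polling interval are hypothetical, and the demo uses canned values instead of a database connection.

```python
import time


def wait_for_catch_up(fetch_lag, target_seconds=0, timeout=600, interval=5,
                      sleep=time.sleep):
    """Poll fetch_lag() until lag <= target_seconds; return False on timeout."""
    waited = 0
    while waited <= timeout:
        lag = fetch_lag()
        # None means the lag is unknown (for example, a thread is not running).
        if lag is not None and lag <= target_seconds:
            return True
        sleep(interval)
        waited += interval
    return False


# Demo with canned lag readings and a no-op sleep, so it runs instantly.
readings = iter([120, 40, None, 0])
caught_up = wait_for_catch_up(lambda: next(readings), sleep=lambda _seconds: None)
print("caught up" if caught_up else "timed out")
```

Injecting `sleep` and `fetch_lag` keeps the loop testable; in practice `fetch_lag` would query the replica with whichever MySQL driver the team already uses.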
6. Conclusion
Do not rely solely on Seconds_Behind_Master; always examine IO/SQL thread health and relay‑log growth.
Large transactions, DDL, and hot tables are the most common replication lag amplifiers.
Parallel replication helps but cannot split a single massive transaction.
Further Learning
GTID‑based topology management.
Advanced parallel replication and transaction splitting techniques.
Read‑traffic routing based on real‑time replication lag.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
