Why MySQL Replication Lag Happens and How to Fix It: Real-World Cases
This article explains why MySQL master‑slave replication lag occurs, why it matters for data consistency and high‑availability, and provides six real‑world case studies with root‑cause analysis and practical solutions to eliminate the delay.
Importance of Replication Delay
MySQL master‑slave replication is widely used in UDB cloud databases; any delay harms data consistency and can block high‑availability failover, leading to serious business impact.
Production‑Level Analysis and Solutions
We summarize the most common replication‑lag scenarios, describing symptoms, root causes, and remediation steps.
Case 1: Frequent DML on the Primary
Symptoms : Sudden spike in primary write QPS accompanied by rising lag.
Root Cause : Massive binlog generation overwhelms the single‑threaded SQL thread on the replica, causing relay‑log backlog.
Solution : For MySQL 5.7‑, shard the workload; for MySQL 5.7/8.0 enable parallel replication (group commit or write‑set based).
Case 2: Large Transactions on the Primary
Symptoms : Exec_Master_Log_Pos unchanged, Seconds_Behind_Master continuously increasing, and Slave_SQL_Running_State shows “Reading event from the relay log”.
Root Cause : A big transaction takes a long time to replay on the replica, accumulating lag.
Solution : Split large transactions into smaller ones to allow timely commits.
Case 3: DDL on Large Tables
Symptoms : Exec_Master_Log_Pos stagnant while DDL is executing; Slave_SQL_Running_State shows “altering table”.
Root Cause : Replicating DDL statements on large tables is slow, similar to big transactions.
Solution : Identify blocking DDL via SHOW PROCESSLIST or information_schema.innodb_trx and kill it; schedule DDL during low‑traffic periods or run it with sql_log_bin=0 on both primary and replica.
Case 4: Mismatched Primary/Replica Configurations
Symptoms : Significant performance metric differences between primary and replica.
Root Cause : Different hardware (e.g., SSD vs. SAS) or divergent MySQL parameters cause replication lag.
Solution : Align hardware and configuration; sometimes replica should have equal or higher resources than primary.
Case 5: Missing Primary Key or Proper Index
Symptoms : High CPU on replica, SHOW SLAVE STATUS shows “Reading event from the relay log”, and SHOW OPEN TABLES WHERE in_use=1 lists the table.
Root Cause : In ROW format, updates on tables without a primary key cause full‑table scans for each row, slowing replay.
Solution : Ensure every table has an explicit auto‑increment primary key and appropriate indexes.
Case 6: Replica Overload
Symptoms : High CPU or I/O on replica, causing SQL thread to fall behind.
Root Cause : Heavy read traffic or OLAP workloads on the replica exceed its capacity.
Solution : Add more replicas to distribute reads, isolate OLAP workloads to dedicated replicas, and accept a controlled amount of lag for those replicas.
Summary
When using MySQL master‑slave replication, lag is a critical factor affecting consistency and high‑availability failover. Our team distilled practical troubleshooting steps: monitor SHOW SLAVE STATUS and SHOW PROCESSLIST, identify big transactions, DDL, missing keys, configuration mismatches, or replica overload, and apply the corresponding mitigations.
Source: http://blog.ucloud.cn/archives/4132
Architecture Talk
Rooted in the "Dao" of architecture, we provide pragmatic, implementation‑focused architecture content.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
