
How to Reduce MySQL Master‑Slave Replication Lag from 30 seconds to Milliseconds

This article walks through the root causes of MySQL master‑slave replication delay, demonstrates step‑by‑step diagnostics using SHOW SLAVE STATUS, pt‑heartbeat, and binlog comparisons, and provides concrete configuration changes, query rewrites, hardware upgrades, and monitoring scripts that can shrink lag from dozens of seconds to sub‑millisecond levels.

Raymond Ops

Overview

Replication lag in read‑write split architectures can cause stale reads, duplicate orders, and refund disputes. A real incident during a Double‑11 promotion showed lag exceeding 30 seconds, prompting a week‑long investigation.

Replication Basics

The replication pipeline is:

master writes → binlog → IO thread pulls binlog → writes relay log → SQL thread replays relay log → replica data updated

Typical delay sources include:

Slow binlog generation (large transactions)

Network latency

Slow relay‑log write (disk I/O)

SQL thread replay speed – the most common cause (single‑threaded by default)

Delay Detection

Basic Status Check

Run the following command on the replica:

SHOW SLAVE STATUS\G

Key fields to examine:

Slave_IO_Running – whether the IO thread is active

Slave_SQL_Running – whether the SQL thread is active

Seconds_Behind_Master – reported lag (may be inaccurate)

Master_Log_File / Read_Master_Log_Pos – how far the IO thread has read from the master's binlog

Relay_Master_Log_File / Exec_Master_Log_Pos – how far the SQL thread has replayed

Why Seconds_Behind_Master can be misleading:

If the IO thread stalls, the value stays at 0 while actual lag grows.

Long periods without writes also produce inaccurate numbers.

Precise Measurement with pt‑heartbeat

Percona Toolkit’s pt‑heartbeat creates a table on the master that is updated every second; the replica reads the timestamp to compute exact lag.

# Install
yum install -y percona-toolkit   # CentOS
apt install -y percona-toolkit   # Ubuntu
# Create heartbeat table on master
mysql -uroot -p -e "CREATE DATABASE IF NOT EXISTS percona"
pt-heartbeat --database=percona --create-table --update --daemonize -h 127.0.0.1 -u root -p'password'
# Monitor from replica
pt-heartbeat --database=percona --monitor --master-server-id=1 -h replica_ip -u root -p'password'
# Output example: 0.00s [ 1‑min avg, 5‑min avg, 15‑min avg ]

Binlog Position Comparison

Compare master and replica positions to see how many binlog files differ:

# Master
SHOW MASTER STATUS;
# Replica
SHOW SLAVE STATUS\G
# Compare Relay_Master_Log_File and Exec_Master_Log_Pos

Root‑Cause Analysis

Master Write Load

Heavy write traffic can outpace the replica. Check write rates with:

SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';
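The Com_* counters are cumulative since server start, so a single reading tells you little; sample twice and divide the delta by the interval. A minimal sketch with made-up counter values (the helper name `rate` is ours, not a MySQL tool):

```shell
# Com_insert sampled via SHOW GLOBAL STATUS, e.g.:
#   first sample  -> 1043200
#   10 s later    -> 1073200
rate() {  # usage: rate <first_sample> <second_sample> <interval_seconds>
  echo $(( ($2 - $1) / $3 ))
}
rate 1043200 1073200 10   # -> 3000 inserts/sec
```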

Single‑Threaded SQL Replay

In MySQL 5.5 and earlier the replica SQL thread is single-threaded; 5.6 added per-database parallelism, and MySQL 5.7+ supports LOGICAL_CLOCK-based parallel replication that can parallelize within a single database:

SHOW VARIABLES LIKE 'slave_parallel%';
# Typical output: slave_parallel_workers = 0 (single thread)
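On MySQL 8.0.22+/8.0.26+ the same checks use the renamed replica_* terminology (the old slave_* names remain as deprecated aliases):

```sql
SHOW REPLICA STATUS\G                      -- replaces SHOW SLAVE STATUS (8.0.22+)
SHOW VARIABLES LIKE 'replica_parallel%';   -- replaces slave_parallel_* (8.0.26+)
```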

Large Transactions

A massive transaction (e.g., updating millions of rows) is applied atomically on the replica, blocking other work.
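One way to spot such transactions before they reach the binlog is to watch in-flight transactions on the master; this information_schema query is a sketch (available on MySQL 5.7+):

```sql
-- Largest open transactions by rows modified; long trx_started + big
-- trx_rows_modified is the profile that will stall the replica on replay
SELECT trx_id, trx_started, trx_rows_modified, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_rows_modified DESC
LIMIT 5;
```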

Network Latency

Cross‑region deployments may suffer network delays. Verify with ping and iperf.

Replica Hardware Bottlenecks

Disk I/O and CPU saturation on the replica can throttle replay. Use iostat and top to monitor.

Optimization Strategies

Enable Parallel Replication (Most Effective)

Configure the replica:

# /etc/my.cnf on replica
[mysqld]
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = ON
relay_log_recovery = ON
relay_log_info_repository = TABLE
master_info_repository = TABLE

Restart the replica to apply:

STOP SLAVE;
systemctl restart mysqld;
START SLAVE;
SHOW VARIABLES LIKE 'slave_parallel%';
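Both parallel-replication variables are dynamic, so for a quick trial you can avoid a full server restart (the my.cnf entries are still needed so the settings survive the next restart):

```sql
STOP SLAVE SQL_THREAD;   -- slave_parallel_type can only change while the SQL thread is stopped
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 16;
START SLAVE SQL_THREAD;
```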

Split Large Transactions

Break big DML into smaller batches, for example:

-- WHILE is only valid inside a stored program, so wrap the batch loop in a procedure
DELIMITER //
CREATE PROCEDURE close_old_orders()
BEGIN
  DECLARE rows_changed INT DEFAULT 1;
  WHILE rows_changed > 0 DO
    UPDATE orders SET status = 'closed'
    WHERE status <> 'closed' AND create_time < '2024-01-01'
    ORDER BY id LIMIT 10000;
    SET rows_changed = ROW_COUNT();
    DO SLEEP(0.1);   -- brief pause so the replica can keep up
  END WHILE;
END //
DELIMITER ;
CALL close_old_orders();

Adjust Binlog Settings on Master

Optimise binlog generation and group commit:

# Master (MySQL 5.7+)
binlog_format = ROW
max_binlog_size = 100M
binlog_row_image = MINIMAL
binlog_group_commit_sync_delay = 100   # microseconds
binlog_group_commit_sync_no_delay_count = 10

Replica‑Side Tweaks

Relax durability settings to improve replay throughput (acceptable on a replica, since it can always be rebuilt from the master):

innodb_flush_log_at_trx_commit = 2
sync_binlog = 0
innodb_doublewrite = OFF
relay_log_space_limit = 0   # unlimited

Monitoring Script

A Bash script that checks replication status, lag, and sends alerts (example uses a DingTalk webhook):

#!/bin/bash
MYSQL_USER="monitor"
MYSQL_PASS="monitor_password"
SLAVE_HOST="127.0.0.1"
ALERT_THRESHOLD=10   # seconds
CRITICAL_THRESHOLD=60
WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=xxx"

check_repl_status() {
  result=$(mysql -h $SLAVE_HOST -u $MYSQL_USER -p$MYSQL_PASS -e "SHOW SLAVE STATUS\G" 2>/dev/null)
  io=$(echo "$result" | grep "Slave_IO_Running:" | awk '{print $2}')
  sql=$(echo "$result" | grep "Slave_SQL_Running:" | awk '{print $2}')
  lag=$(echo "$result" | grep "Seconds_Behind_Master:" | awk '{print $2}')

  if [[ "$io" != "Yes" || "$sql" != "Yes" ]]; then
    msg="[CRITICAL] Replication stopped – IO=$io SQL=$sql"
    echo "$msg"
    curl -s -H "Content-Type: application/json" -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$msg\"}}" "$WEBHOOK_URL" >/dev/null
    exit 2
  fi

  if [[ "$lag" == "NULL" ]]; then
    echo "[WARNING] Unable to obtain lag"
    exit 1
  elif (( lag > CRITICAL_THRESHOLD )); then
    msg="[CRITICAL] Replication lag ${lag}s exceeds $CRITICAL_THRESHOLD s"
    echo "$msg"
    curl -s -H "Content-Type: application/json" -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$msg\"}}" "$WEBHOOK_URL" >/dev/null
    exit 2
  elif (( lag > ALERT_THRESHOLD )); then
    msg="[WARNING] Replication lag ${lag}s exceeds $ALERT_THRESHOLD s"
    echo "$msg"
    curl -s -H "Content-Type: application/json" -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"$msg\"}}" "$WEBHOOK_URL" >/dev/null
    exit 1
  else
    echo "[OK] Replication lag ${lag}s"
    exit 0
  fi
}

check_repl_status
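To run the check continuously, schedule it from cron; the script path below is a placeholder for wherever you install it:

```cron
* * * * * root /usr/local/bin/check_repl.sh >> /var/log/repl_check.log 2>&1
```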

Real‑World Cases

Case 1 – From 30 s to Milliseconds

Problem : An e‑commerce platform with ~3 k TPS on the master saw replica lag of 20‑30 seconds.

Investigation : SHOW SLAVE STATUS reported Seconds_Behind_Master = 28; SHOW VARIABLES LIKE 'slave_parallel%' showed slave_parallel_workers = 0 (single thread).

Solution : Added to /etc/my.cnf on the replica:

slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = ON

Restarted MySQL and verified the variables. Result: Seconds_Behind_Master = 0; latency dropped to the sub-millisecond range.

Case 2 – Large Transaction Spike

Problem : Normal lag <1 s, but a nightly batch DELETE caused a 180 s pause.

Investigation : Slow‑query log showed the long DELETE; mysqlbinlog confirmed many DELETE events in the binlog.

Solution : Rewrote the cleanup script to delete in batches of 10 k rows with a short sleep:

#!/bin/bash
while true; do
  deleted=$(mysql -uroot -p'xxx' -e "DELETE FROM logs WHERE create_time < DATE_SUB(NOW(), INTERVAL 90 DAY) LIMIT 10000; SELECT ROW_COUNT();" -N)
  if [[ "$deleted" -eq 0 ]]; then
    echo "Cleanup complete"
    break
  fi
  echo "Deleted $deleted rows, sleeping 1s..."
  sleep 1
done

Result : The nightly cleanup no longer caused a long‑running transaction; replication latency remained stable.

Best Practices & Caveats

Use GTID replication (gtid_mode=ON, enforce_gtid_consistency=ON) for easier failover.

Enable semi‑synchronous replication only when strong data consistency is required.

Deploy a read‑write splitting proxy (ProxySQL, MaxScale) instead of hard‑coding hosts.

Tune master binlog settings (ROW format, group‑commit delay) and replica parameters (parallel workers, relaxed flushing).

Continuously monitor Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running, and pt‑heartbeat latency.

Ensure each server‑id is unique; avoid changing GTID mode after it is enabled; parallel replication helps only when the master generates parallelizable work.
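The GTID recommendation above can be sketched as a my.cnf fragment, applied to both master and replica (MySQL 5.7+ assumed):

```ini
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_slave_updates = ON   # keeps failover candidates' binlogs complete (required pre-8.0)
```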

Key Takeaways

Seconds_Behind_Master alone is insufficient; use pt‑heartbeat for accurate latency.

Enabling parallel replication (LOGICAL_CLOCK) is the first line of defense against lag.

Large transactions are the biggest latency enemy – split them into smaller batches.

Continuous monitoring and alerting are essential to catch regressions early.

Tags: Monitoring, Latency, MySQL, replication, parallel replication, pt-heartbeat
Written by Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.