Zero Data Loss MySQL Master‑Slave Replication Lag Diagnosis & GTID Failover
This comprehensive guide explains how to diagnose MySQL master‑slave replication lag, enable GTID mode, configure semi‑synchronous replication, optimize multi‑threaded replication, set up monitoring and alerting with Prometheus, and perform zero‑data‑loss failover using tools like Orchestrator and custom scripts.
MySQL Master‑Slave Replication Lag Diagnosis & GTID Failover (Zero Data Loss)
Applicable Scenarios & Prerequisites
Applicable Business : Read/write split architecture, high‑availability clusters, cross‑region disaster recovery, data‑warehouse ETL sync.
Prerequisites :
MySQL ≥ 5.7 (recommended 8.0+ with GTID enhancements)
Master‑slave network latency < 10 ms (same data center) / < 50 ms (cross‑region)
Binary logging enabled on master ( log_bin=ON)
Slave disk I/O performance ≥ master (SSD recommended)
Root or SUPER privileges
Environment & Version Matrix
Component Version Requirement Key Features Typical Replication Lag
MySQL 5.7 5.7.17+ GTID, multi‑threaded, semi‑sync < 1 s (LAN)
MySQL 8.0 8.0.23+ GTID auto‑positioning, writeset parallel < 500 ms
MariaDB 10.5+ 10.5+ GTID compatible, multi‑source < 1 s
MHA 0.58 Automatic failover (traditional) –
Orchestrator 3.2+ GTID failover & topology management –Quick Checklist
Verify replication status ( SHOW SLAVE STATUS)
Diagnose root cause of lag (network/SQL thread/large transactions)
Enable GTID mode (online migration or new setup)
Configure semi‑synchronous replication (prevent data loss)
Optimize multi‑threaded replication (parallel apply)
Monitor replication lag and alerts ( Seconds_Behind_Master)
Test failover (master outage scenario)
Execute GTID failover ( CHANGE MASTER TO … AUTO_POSITION=1)
Validate data consistency ( pt-table-checksum)
Configure automatic failover tools (Orchestrator/MHA)
Implementation Steps
Step 1: Verify Replication Status
Run on the slave: SHOW SLAVE STATUS\G Key fields to check: Slave_IO_Running and Slave_SQL_Running must be
Yes Seconds_Behind_Mastershould be < 5 s (ideal < 1 s) Last_Errno / Last_SQL_Errno should be 0 GTID gap ( Retrieved_Gtid_Set vs Executed_Gtid_Set) < 10 transactions
Quick diagnostic commands:
# Simplified status check
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Running|Behind|Error"
# Show GTID difference
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Retrieved_Gtid_Set|Executed_Gtid_Set"Step 2: Diagnose Lag Root Cause
Scenario 1 – Network Latency
# Test network latency
ping -c 10 192.168.1.10
time mysql -h 192.168.1.10 -u repl -p -e "SELECT 1"Scenario 2 – Large Transactions
# List long‑running transactions on master
SELECT * FROM information_schema.INNODB_TRX WHERE trx_started < NOW() - INTERVAL 30 SECOND ORDER BY trx_started;
# Show large events in binlog
SHOW BINLOG EVENTS IN 'mysql-bin.000023' LIMIT 10;Scenario 3 – Single‑threaded SQL replay
# Check parallel replay configuration
SHOW VARIABLES LIKE 'slave_parallel%';
# Current parallel worker count
SELECT COUNT(*) FROM performance_schema.replication_applier_status_by_worker;Scenario 4 – Slave disk I/O bottleneck
# Monitor I/O
iostat -x 1 10 | grep -E "Device|sd"
# Check MySQL I/O waits
mysql -e "SHOW PROCESSLIST" | grep "Waiting for"Combined diagnostic script:
#!/bin/bash
# /usr/local/bin/diagnose_replication_lag.sh
echo "=== MySQL Replication Lag Diagnosis ==="
# 1. Lag value
DELAY=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master:" | awk '{print $2}')
echo "Current lag: $DELAY seconds"
# 2. IO thread status
IO_STATE=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Slave_IO_Running:" | awk '{print $2}')
echo "IO thread: $IO_STATE"
# 3. SQL thread status
SQL_STATE=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Slave_SQL_Running:" | awk '{print $2}')
echo "SQL thread: $SQL_STATE"
# 4. GTID gap
RETRIEVED=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Retrieved_Gtid_Set:" | awk -F: '{print $NF}' | awk -F- '{print $NF}')
EXECUTED=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Executed_Gtid_Set:" | awk -F: '{print $NF}' | awk -F- '{print $NF}')
GTID_GAP=$((RETRIEVED - EXECUTED))
echo "GTID pending: $GTID_GAP transactions"
# 5. Network latency
MASTER_HOST=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Master_Host:" | awk '{print $2}')
PING_LATENCY=$(ping -c 5 $MASTER_HOST | tail -1 | awk -F'/' '{print $5}')
echo "Network latency: $PING_LATENCY ms"
# 6. Parallel workers
PARALLEL_WORKERS=$(mysql -ss -e "SHOW VARIABLES LIKE 'slave_parallel_workers'" | awk '{print $2}')
echo "Parallel workers: $PARALLEL_WORKERS"Step 3: Enable GTID Mode (Online Migration)
Pre‑checks to ensure compatibility (no MyISAM tables, deterministic routines) and then modify /etc/my.cnf on master and slave:
# Master my.cnf additions
gtid_mode=ON_PERMISSIVE
enforce_gtid_consistency=ON # Restart master
systemctl restart mysqld # Slave my.cnf additions (same settings)
gtid_mode=ON_PERMISSIVE
enforce_gtid_consistency=ON # Restart slave
systemctl restart mysqldSwitch master to pure GTID mode:
SET GLOBAL gtid_mode=ON;On the new master, enable auto‑positioning:
STOP SLAVE;
SET GLOBAL gtid_mode=ON;
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;Step 4: Configure Semi‑Synchronous Replication
Install plugin on master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled=1;
SET GLOBAL rpl_semi_sync_master_timeout=1000; -- 1 s timeoutInstall plugin on slave:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled=1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;Persist settings in /etc/my.cnf on both nodes.
Verify status:
SHOW STATUS LIKE 'Rpl_semi_sync_master_status'; -- should be ON
SHOW STATUS LIKE 'Rpl_semi_sync_slave_status'; -- should be ONStep 5: Optimize Multi‑Threaded Replication
MySQL 5.7 configuration:
# /etc/my.cnf
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=4 -- recommend = CPU cores
slave_preserve_commit_order=1MySQL 8.0 enhanced configuration:
# /etc/my.cnf
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
binlog_transaction_dependency_tracking=WRITESET
transaction_write_set_extraction=XXHASH64Restart slave after changes.
Step 6: Monitoring Replication Lag & Alerts
Expose metrics with mysqld_exporter and create Prometheus alert rules:
# Prometheus rule example
- alert: MySQLReplicationLag
expr: mysql_slave_status_seconds_behind_master > 30
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL replication lag exceeds 30 seconds"Step 7: Test Failover (Simulate Master Crash)
Stop master, promote the most up‑to‑date slave using GTID auto‑positioning, then redirect other slaves:
# On new master
STOP SLAVE;
RESET SLAVE ALL;
SET GLOBAL read_only=0;
SET GLOBAL super_read_only=0;
# On other slaves
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='new-master-ip', MASTER_USER='repl', MASTER_PASSWORD='password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1;
START SLAVE;Validate data consistency with pt-table-checksum and fix differences using pt-table-sync.
Step 8: Deploy Automatic Failover Tool (Orchestrator)
Install Orchestrator (RPM or Docker), configure /etc/orchestrator.conf.json, create the orchestrator database and topology user, then start the service and discover the topology.
{
"MySQLTopologyUser": "orc_user",
"MySQLTopologyPassword": "password",
"MySQLOrchestratorHost": "127.0.0.1",
"MySQLOrchestratorPort": 3306,
"MySQLOrchestratorDatabase": "orchestrator",
"MySQLOrchestratorUser": "orc_server",
"MySQLOrchestratorPassword": "orc_password",
"RecoveryPeriodBlockSeconds": 300,
"RecoverMasterClusterFilters": ["*"],
"RecoverIntermediateMasterClusterFilters": ["*"],
"ApplyMySQLPromotionAfterMasterFailover": true
}Test automatic failover by stopping the master and observing Orchestrator’s log.
Monitoring & Alerting
Key metrics:
Replication lag ( Seconds_Behind_Master)
GTID execution progress ( gtid_executed)
Semi‑sync status ( Rpl_semi_sync_master_status, Rpl_semi_sync_slave_status)
Replication errors ( Last_Errno, Last_Error)
Performance & Capacity Planning
Typical bottlenecks and mitigation:
Network latency – use dedicated links or tune semi‑sync timeout
Single‑threaded replay – enable slave_parallel_workers Large transactions – split or use binlog_row_image=MINIMAL Disk I/O – upgrade to SSD, adjust
innodb_flush_log_at_trx_commitSecurity & Compliance
Use a dedicated replication account with least privileges and enable SSL encryption for replication traffic.
# Create replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'StrongPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
# Enable SSL on slave
STOP SLAVE;
CHANGE MASTER TO MASTER_SSL=1, MASTER_SSL_CA='/var/lib/mysql/ca.pem', MASTER_SSL_CERT='/var/lib/mysql/client-cert.pem', MASTER_SSL_KEY='/var/lib/mysql/client-key.pem';
START SLAVE;Best Practices
Always enable GTID for simplified failover.
Enable semi‑synchronous replication (at least one semi‑sync slave).
Use multi‑threaded replication (e.g., slave_parallel_workers ≥ 4).
Set slaves to read‑only ( read_only=1, super_read_only=1).
Monitor replication lag with alerts < 30 s.
Run weekly data‑consistency checks with pt-table-checksum.
Deploy automatic failover tools (Orchestrator/MHA) in production.
Avoid manual writes on slaves to preserve GTID consistency.
Split large transactions (< 1 GB) to prevent replication stalls.
Quarterly failover drills with data validation.
Appendix – Full Configuration Templates
Master my.cnf
[mysqld]
server_id=1
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
log_slave_updates=ON
rpl_semi_sync_master_enabled=1
rpl_semi_sync_master_timeout=1000
expire_logs_days=7
max_binlog_size=1G
sync_binlog=1
innodb_flush_log_at_trx_commit=1Slave my.cnf
[mysqld]
server_id=2
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
log_slave_updates=ON
read_only=1
super_read_only=1
rpl_semi_sync_slave_enabled=1
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
slave_preserve_commit_order=1
binlog_transaction_dependency_tracking=WRITESET
transaction_write_set_extraction=XXHASH64
relay_log=relay-bin
relay_log_recovery=ONAnsible Playbook for Slave Configuration
---
- name: Configure MySQL master‑slave replication
hosts: mysql_slaves
become: yes
vars:
master_host: "192.168.1.10"
repl_user: "repl"
repl_password: "StrongPassword123!"
tasks:
- name: Stop slave replication
mysql_replication:
mode: stopslave
- name: Set master connection
mysql_replication:
mode: changemaster
master_host: "{{ master_host }}"
master_user: "{{ repl_user }}"
master_password: "{{ repl_password }}"
master_auto_position: yes
- name: Start slave replication
mysql_replication:
mode: startslave
- name: Verify replication status
mysql_replication:
mode: getslave
register: slave_status
failed_when: slave_status.Slave_IO_Running != 'Yes' or slave_status.Slave_SQL_Running != 'Yes'
- name: Show replication lag
debug:
msg: "Replication lag: {{ slave_status.Seconds_Behind_Master }} seconds"Test Environment : MySQL 8.0.35 on RHEL 8.8 / Ubuntu 22.04 LTS. Test Date : 2025‑10‑31. Maintenance Cycle : Quarterly configuration review, weekly consistency checks, quarterly failover drills.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
