Databases 23 min read

Zero Data Loss MySQL Master‑Slave Replication Lag Diagnosis & GTID Failover

This comprehensive guide explains how to diagnose MySQL master‑slave replication lag, enable GTID mode, configure semi‑synchronous replication, optimize multi‑threaded replication, set up monitoring and alerting with Prometheus, and perform zero‑data‑loss failover using tools like Orchestrator and custom scripts.

MaGe Linux Operations
MaGe Linux Operations
MaGe Linux Operations
Zero Data Loss MySQL Master‑Slave Replication Lag Diagnosis & GTID Failover

MySQL Master‑Slave Replication Lag Diagnosis & GTID Failover (Zero Data Loss)

Applicable Scenarios & Prerequisites

Applicable Business : Read/write split architecture, high‑availability clusters, cross‑region disaster recovery, data‑warehouse ETL sync.

Prerequisites :

MySQL ≥ 5.7 (recommended 8.0+ with GTID enhancements)

Master‑slave network latency < 10 ms (same data center) / < 50 ms (cross‑region)

Binary logging enabled on master ( log_bin=ON)

Slave disk I/O performance ≥ master (SSD recommended)

Root or SUPER privileges

Environment & Version Matrix

Component          Version Requirement   Key Features                     Typical Replication Lag
MySQL 5.7         5.7.17+              GTID, multi‑threaded, semi‑sync < 1 s (LAN)
MySQL 8.0         8.0.23+              GTID auto‑positioning, writeset parallel < 500 ms
MariaDB 10.5+     10.5+                GTID compatible, multi‑source < 1 s
MHA                0.58                Automatic failover (traditional) –
Orchestrator      3.2+                GTID failover & topology management –

Quick Checklist

Verify replication status ( SHOW SLAVE STATUS)

Diagnose root cause of lag (network/SQL thread/large transactions)

Enable GTID mode (online migration or new setup)

Configure semi‑synchronous replication (prevent data loss)

Optimize multi‑threaded replication (parallel apply)

Monitor replication lag and alerts ( Seconds_Behind_Master)

Test failover (master outage scenario)

Execute GTID failover ( CHANGE MASTER TO … AUTO_POSITION=1)

Validate data consistency ( pt-table-checksum)

Configure automatic failover tools (Orchestrator/MHA)

Implementation Steps

Step 1: Verify Replication Status

Run on the slave: SHOW SLAVE STATUS\G Key fields to check: Slave_IO_Running and Slave_SQL_Running must be

Yes
Seconds_Behind_Master

should be < 5 s (ideal < 1 s) Last_Errno / Last_SQL_Errno should be 0 GTID gap ( Retrieved_Gtid_Set vs Executed_Gtid_Set) < 10 transactions

Quick diagnostic commands:

# Simplified status check
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Running|Behind|Error"

# Show GTID difference
mysql -e "SHOW SLAVE STATUS\G" | grep -E "Retrieved_Gtid_Set|Executed_Gtid_Set"

Step 2: Diagnose Lag Root Cause

Scenario 1 – Network Latency

# Test network latency
ping -c 10 192.168.1.10

time mysql -h 192.168.1.10 -u repl -p -e "SELECT 1"

Scenario 2 – Large Transactions

# List long‑running transactions on master
SELECT * FROM information_schema.INNODB_TRX WHERE trx_started < NOW() - INTERVAL 30 SECOND ORDER BY trx_started;

# Show large events in binlog
SHOW BINLOG EVENTS IN 'mysql-bin.000023' LIMIT 10;

Scenario 3 – Single‑threaded SQL replay

# Check parallel replay configuration
SHOW VARIABLES LIKE 'slave_parallel%';

# Current parallel worker count
SELECT COUNT(*) FROM performance_schema.replication_applier_status_by_worker;

Scenario 4 – Slave disk I/O bottleneck

# Monitor I/O
iostat -x 1 10 | grep -E "Device|sd"

# Check MySQL I/O waits
mysql -e "SHOW PROCESSLIST" | grep "Waiting for"

Combined diagnostic script:

#!/bin/bash
# /usr/local/bin/diagnose_replication_lag.sh

echo "=== MySQL Replication Lag Diagnosis ==="

# 1. Lag value
DELAY=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Seconds_Behind_Master:" | awk '{print $2}')
echo "Current lag: $DELAY seconds"

# 2. IO thread status
IO_STATE=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Slave_IO_Running:" | awk '{print $2}')
echo "IO thread: $IO_STATE"

# 3. SQL thread status
SQL_STATE=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Slave_SQL_Running:" | awk '{print $2}')
echo "SQL thread: $SQL_STATE"

# 4. GTID gap
RETRIEVED=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Retrieved_Gtid_Set:" | awk -F: '{print $NF}' | awk -F- '{print $NF}')
EXECUTED=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Executed_Gtid_Set:" | awk -F: '{print $NF}' | awk -F- '{print $NF}')
GTID_GAP=$((RETRIEVED - EXECUTED))
echo "GTID pending: $GTID_GAP transactions"

# 5. Network latency
MASTER_HOST=$(mysql -ss -e "SHOW SLAVE STATUS\G" | grep "Master_Host:" | awk '{print $2}')
PING_LATENCY=$(ping -c 5 $MASTER_HOST | tail -1 | awk -F'/' '{print $5}')
echo "Network latency: $PING_LATENCY ms"

# 6. Parallel workers
PARALLEL_WORKERS=$(mysql -ss -e "SHOW VARIABLES LIKE 'slave_parallel_workers'" | awk '{print $2}')
echo "Parallel workers: $PARALLEL_WORKERS"

Step 3: Enable GTID Mode (Online Migration)

Pre‑checks to ensure compatibility (no MyISAM tables, deterministic routines) and then modify /etc/my.cnf on master and slave:

# Master my.cnf additions
gtid_mode=ON_PERMISSIVE
enforce_gtid_consistency=ON
# Restart master
systemctl restart mysqld
# Slave my.cnf additions (same settings)
gtid_mode=ON_PERMISSIVE
enforce_gtid_consistency=ON
# Restart slave
systemctl restart mysqld

Switch master to pure GTID mode:

SET GLOBAL gtid_mode=ON;

On the new master, enable auto‑positioning:

STOP SLAVE;
SET GLOBAL gtid_mode=ON;
CHANGE MASTER TO MASTER_AUTO_POSITION=1;
START SLAVE;

Step 4: Configure Semi‑Synchronous Replication

Install plugin on master:

INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled=1;
SET GLOBAL rpl_semi_sync_master_timeout=1000;  -- 1 s timeout

Install plugin on slave:

INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled=1;
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

Persist settings in /etc/my.cnf on both nodes.

Verify status:

SHOW STATUS LIKE 'Rpl_semi_sync_master_status';   -- should be ON
SHOW STATUS LIKE 'Rpl_semi_sync_slave_status';    -- should be ON

Step 5: Optimize Multi‑Threaded Replication

MySQL 5.7 configuration:

# /etc/my.cnf
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=4   -- recommend = CPU cores
slave_preserve_commit_order=1

MySQL 8.0 enhanced configuration:

# /etc/my.cnf
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
binlog_transaction_dependency_tracking=WRITESET
transaction_write_set_extraction=XXHASH64

Restart slave after changes.

Step 6: Monitoring Replication Lag & Alerts

Expose metrics with mysqld_exporter and create Prometheus alert rules:

# Prometheus rule example
- alert: MySQLReplicationLag
  expr: mysql_slave_status_seconds_behind_master > 30
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "MySQL replication lag exceeds 30 seconds"

Step 7: Test Failover (Simulate Master Crash)

Stop master, promote the most up‑to‑date slave using GTID auto‑positioning, then redirect other slaves:

# On new master
STOP SLAVE;
RESET SLAVE ALL;
SET GLOBAL read_only=0;
SET GLOBAL super_read_only=0;

# On other slaves
STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='new-master-ip', MASTER_USER='repl', MASTER_PASSWORD='password', MASTER_PORT=3306, MASTER_AUTO_POSITION=1;
START SLAVE;

Validate data consistency with pt-table-checksum and fix differences using pt-table-sync.

Step 8: Deploy Automatic Failover Tool (Orchestrator)

Install Orchestrator (RPM or Docker), configure /etc/orchestrator.conf.json, create the orchestrator database and topology user, then start the service and discover the topology.

{
  "MySQLTopologyUser": "orc_user",
  "MySQLTopologyPassword": "password",
  "MySQLOrchestratorHost": "127.0.0.1",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_server",
  "MySQLOrchestratorPassword": "orc_password",
  "RecoveryPeriodBlockSeconds": 300,
  "RecoverMasterClusterFilters": ["*"],
  "RecoverIntermediateMasterClusterFilters": ["*"],
  "ApplyMySQLPromotionAfterMasterFailover": true
}

Test automatic failover by stopping the master and observing Orchestrator’s log.

Monitoring & Alerting

Key metrics:

Replication lag ( Seconds_Behind_Master)

GTID execution progress ( gtid_executed)

Semi‑sync status ( Rpl_semi_sync_master_status, Rpl_semi_sync_slave_status)

Replication errors ( Last_Errno, Last_Error)

Performance & Capacity Planning

Typical bottlenecks and mitigation:

Network latency – use dedicated links or tune semi‑sync timeout

Single‑threaded replay – enable slave_parallel_workers Large transactions – split or use binlog_row_image=MINIMAL Disk I/O – upgrade to SSD, adjust

innodb_flush_log_at_trx_commit

Security & Compliance

Use a dedicated replication account with least privileges and enable SSL encryption for replication traffic.

# Create replication user
CREATE USER 'repl'@'%' IDENTIFIED BY 'StrongPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

# Enable SSL on slave
STOP SLAVE;
CHANGE MASTER TO MASTER_SSL=1, MASTER_SSL_CA='/var/lib/mysql/ca.pem', MASTER_SSL_CERT='/var/lib/mysql/client-cert.pem', MASTER_SSL_KEY='/var/lib/mysql/client-key.pem';
START SLAVE;

Best Practices

Always enable GTID for simplified failover.

Enable semi‑synchronous replication (at least one semi‑sync slave).

Use multi‑threaded replication (e.g., slave_parallel_workers ≥ 4).

Set slaves to read‑only ( read_only=1, super_read_only=1).

Monitor replication lag with alerts < 30 s.

Run weekly data‑consistency checks with pt-table-checksum.

Deploy automatic failover tools (Orchestrator/MHA) in production.

Avoid manual writes on slaves to preserve GTID consistency.

Split large transactions (< 1 GB) to prevent replication stalls.

Quarterly failover drills with data validation.

Appendix – Full Configuration Templates

Master my.cnf

[mysqld]
server_id=1
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
log_slave_updates=ON
rpl_semi_sync_master_enabled=1
rpl_semi_sync_master_timeout=1000
expire_logs_days=7
max_binlog_size=1G
sync_binlog=1
innodb_flush_log_at_trx_commit=1

Slave my.cnf

[mysqld]
server_id=2
log_bin=mysql-bin
binlog_format=ROW
gtid_mode=ON
enforce_gtid_consistency=ON
log_slave_updates=ON
read_only=1
super_read_only=1
rpl_semi_sync_slave_enabled=1
slave_parallel_type=LOGICAL_CLOCK
slave_parallel_workers=8
slave_preserve_commit_order=1
binlog_transaction_dependency_tracking=WRITESET
transaction_write_set_extraction=XXHASH64
relay_log=relay-bin
relay_log_recovery=ON

Ansible Playbook for Slave Configuration

---
- name: Configure MySQL master‑slave replication
  hosts: mysql_slaves
  become: yes
  vars:
    master_host: "192.168.1.10"
    repl_user: "repl"
    repl_password: "StrongPassword123!"
  tasks:
    - name: Stop slave replication
      mysql_replication:
        mode: stopslave

    - name: Set master connection
      mysql_replication:
        mode: changemaster
        master_host: "{{ master_host }}"
        master_user: "{{ repl_user }}"
        master_password: "{{ repl_password }}"
        master_auto_position: yes

    - name: Start slave replication
      mysql_replication:
        mode: startslave

    - name: Verify replication status
      mysql_replication:
        mode: getslave
      register: slave_status
      failed_when: slave_status.Slave_IO_Running != 'Yes' or slave_status.Slave_SQL_Running != 'Yes'

    - name: Show replication lag
      debug:
        msg: "Replication lag: {{ slave_status.Seconds_Behind_Master }} seconds"

Test Environment : MySQL 8.0.35 on RHEL 8.8 / Ubuntu 22.04 LTS. Test Date : 2025‑10‑31. Maintenance Cycle : Quarterly configuration review, weekly consistency checks, quarterly failover drills.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

automationmysqlReplicationfailoverGTID
MaGe Linux Operations
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.