Mastering MySQL 9.0 Replication: Deep Dive into GTID, Semi‑Sync, and Troubleshooting
This guide explores MySQL 9.0's replication architecture, covering core principles, GTID and semi‑synchronous modes, common replication topologies, step‑by‑step troubleshooting, monitoring scripts, and best‑practice configurations, with hands‑on examples and scripts to help engineers build and maintain robust, highly available database systems.
Background
MySQL replication is a core component for building high‑availability database architectures. MySQL 9.0 (GA in 2024) ships enhanced GTID support, more efficient parallel replication, and richer monitoring interfaces. Note that the legacy SLAVE/MASTER SQL syntax, deprecated since 8.0.22, was removed in MySQL 8.4; MySQL 9.0 accepts only the REPLICA/SOURCE forms, which the commands in this guide use even where the prose keeps the familiar master/slave terms.
Prerequisites
Linux command line and shell scripting
Basic MySQL administration
Understanding of ACID transaction properties
Fundamentals of TCP/IP networking
Experimental Environment
MySQL 9.0 Community Edition on Linux (Rocky Linux 9 or Ubuntu 24.04 LTS)
GTID replication mode
Semi‑synchronous replication using the plugin architecture
Core Replication Architecture
┌─────────────────┐ ┌─────────────────┐
│ Master │ │ Slave │
│ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Client │ │ │ │ SQL │ │
│ │ Write │ │ │ │ Thread │ │
│ └─────┬───┘ │ │ └─────┬───┘ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Binlog │─┼──────┼─►│ Relaylog │ │
│ │ (write) │ │ │ │ (read) │ │
│ └───────────┘ │ │ └─────┬─────┘ │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌───────────┐ │
│ │ │ │ Data │ │
│ │ │ │ (execute) │ │
│ │ │ └───────────┘ │
└─────────────────┘ └─────────────────┘
Key Components
Binary Log (Binlog) – records all changes on the master.
IO Thread – pulls the binlog from the master to the slave.
Relay Log – local copy of the received binlog on the slave.
SQL Thread – replays events from the relay log.
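A quick way to internalize how these four components fit together is to model the pipeline with plain files. This is a toy sketch, not MySQL code; the file names and event format are made up for illustration:

```shell
#!/bin/bash
# Toy model of the replication pipeline (illustrative only; names are made up).
binlog=$(mktemp); relaylog=$(mktemp); datafile=$(mktemp)

# "Master": client writes are appended to the binlog.
echo "INSERT id=1" >> "$binlog"
echo "INSERT id=2" >> "$binlog"

# "IO thread": copies binlog events into the replica's relay log.
cat "$binlog" >> "$relaylog"

# "SQL thread": replays each relay-log event against the data.
while IFS= read -r event; do
  echo "applied: $event" >> "$datafile"
done < "$relaylog"

result=$(cat "$datafile")
echo "$result"
rm -f "$binlog" "$relaylog" "$datafile"
```

The key property the toy preserves: the applier only ever sees what the IO thread has already copied, which is why relay-log position and applied position are tracked separately.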
Replication Modes
Asynchronous – the master returns to the client immediately after committing.
Semi‑synchronous – the master waits for at least one slave to acknowledge receipt before confirming the transaction.
Fully synchronous – every replica must acknowledge before the commit returns; MySQL approximates this with Group Replication, which uses majority‑based certification rather than waiting for all members to apply.
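The three modes above differ only in what a successful commit acknowledgement tells the client. A small helper makes that contract explicit (the function name and wording are ours, for illustration only):

```shell
#!/bin/bash
# mode_guarantee: what a client-visible commit implies under each replication mode
# (illustrative summary of the list above; not a MySQL API).
mode_guarantee() {
  case "$1" in
    async)    echo "durable on the master only; replicas may lag" ;;
    semisync) echo "durable on the master and received by at least one replica" ;;
    group)    echo "accepted by a majority of group members" ;;
    *)        echo "unknown mode" ;;
  esac
}

mode_guarantee async
mode_guarantee semisync
```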
Binlog Formats
Controlled by binlog_format (deprecated since 8.0.34; ROW is the default and the recommended choice):
STATEMENT – stores the original SQL statements (small log size, but nondeterministic functions can cause inconsistencies).
ROW – stores the actual row changes (guarantees consistency, larger log).
MIXED – logs statements by default and automatically switches to ROW for statements that are unsafe (nondeterministic).
-- Set STATEMENT format
SET binlog_format = 'STATEMENT';
-- Set ROW format
SET binlog_format = 'ROW';
-- Set MIXED format
SET binlog_format = 'MIXED';
GTID Replication
GTID (Global Transaction Identifier) assigns a unique identifier to each transaction, simplifying replication management.
Source ID – the server UUID of the master.
Transaction ID – a sequential number for the transaction.
Example GTID:
3E11FA47-71CA-11E1-9E33-C80AA9429562:23
GTID Advantages
Automatic position tracking – no need to specify binlog file names.
Simplifies master‑to‑slave switchover and failover.
Facilitates consistency checks.
Supports skipping a failed transaction explicitly, by committing an empty transaction under its GTID.
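Since a GTID is just `source_uuid:transaction_id` (and a GTID set adds ranges such as `:1-23`), scripts can pull it apart with plain shell expansion. A minimal sketch using the example GTID from above:

```shell
#!/bin/bash
# Split the example GTID into its two parts using shell parameter expansion.
gtid="3E11FA47-71CA-11E1-9E33-C80AA9429562:23"
source_uuid="${gtid%:*}"   # text before the last colon (server UUID)
txn_id="${gtid##*:}"       # text after the last colon (transaction number)
echo "source: $source_uuid"
echo "txn:    $txn_id"

# A GTID *set* uses ranges such as uuid:1-23; count the transactions in one range.
range="1-23"
lo="${range%-*}"; hi="${range#*-}"
echo "transactions in range: $(( hi - lo + 1 ))"
```

This kind of parsing is handy when comparing a replica's Retrieved_Gtid_Set against the source's gtid_executed during troubleshooting.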
GTID Configuration (my.cnf)
# Enable GTID mode
gtid_mode = ON
enforce_gtid_consistency = ON
# Replication settings (MySQL 8.4+ removed the slave_* spellings and the
# replica_parallel_type variable; LOGICAL_CLOCK is the only applier mode)
log_replica_updates = ON
replica_parallel_workers = 8
replica_preserve_commit_order = ON
# Binlog settings
binlog_format = ROW
sync_binlog = 1
Replication Threads
IO Thread Workflow
1. Slave connects to master
2. Slave requests a start position (binlog file/position, or its GTID set when SOURCE_AUTO_POSITION=1)
3. Master streams binlog data
4. Slave writes data to Relay Log
5. Slave acknowledges received position
SQL Thread Workflow
1. Read next event from Relay Log
2. Parse event content
3. Execute corresponding SQL on slave
4. Update Exec_Source_Log_Pos
5. Repeat until the Relay Log is exhausted
Parallel Replication (MySQL 9.0)
# View current parallel applier settings
SHOW VARIABLES LIKE 'replica_parallel%';
# Scale up to 16 applier workers (replica_parallel_type was removed in 8.4;
# LOGICAL_CLOCK scheduling is always used)
SET GLOBAL replica_parallel_workers = 16;
Semi‑Synchronous Replication
Installation and Configuration
# Install plugins (the old *_master/*_slave plugin names were removed in 8.4;
# run on source and replica respectively)
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
# Source (master) configuration
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 10000; # 10 seconds
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
# Replica configuration
SET GLOBAL rpl_semi_sync_replica_enabled = 1;
# Restart the replica IO thread to apply changes
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;
# Persist settings in my.cnf
[mysqld]
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 10000
rpl_semi_sync_source_wait_for_replica_count = 1
rpl_semi_sync_replica_enabled = 1
Monitoring Semi‑Sync
# Master status
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%';
# Slave status
SHOW REPLICA STATUS\G
Common Fault Scenarios and Troubleshooting
IO Thread Connection Failure
Symptom : Replica_IO_Running: Connecting with error connecting to source in Last_IO_Error.
Diagnostic Steps :
# Verify network connectivity
ping master.example.com
nc -zv master.example.com 3306
# Test MySQL connection from slave
mysql -h master.example.com -u repl -p -e "SELECT 1";
# Check replication user privileges on master
mysql -u root -p -e "SHOW GRANTS FOR 'repl'@'%';"
Resolution :
# Re‑create replication user with proper privileges (on the master)
CREATE USER 'repl'@'slave.example.com' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'slave.example.com';
# Update source connection info on the replica
STOP REPLICA;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master.example.com',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_PORT=3306,
SOURCE_AUTO_POSITION=1;
START REPLICA;
SQL Thread Error
Symptom :
Last_SQL_Error: Could not execute Update_rows event on table test.t1
Diagnostic Steps :
# Inspect error details
SHOW REPLICA STATUS\G
# Skip the problematic transaction by committing an empty one under its GTID
STOP REPLICA;
SET GTID_NEXT='source_id:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
START REPLICA;
Resolution :
# Repair data manually or re‑initialize the slave if inconsistencies are severe
pt-table-sync h=master,u=root,p=pass h=slave,u=root,p=pass \
--databases=test --tables=t1 --execute
Excessive Replication Lag
Symptom : Seconds_Behind_Source: 3600 (1 hour lag).
Diagnostic Steps :
# Check slave load
SHOW PROCESSLIST;
# Inspect binlog and relay log activity
SHOW GLOBAL STATUS LIKE 'binlog%';
SHOW GLOBAL STATUS LIKE 'relay%';
# Check semi‑sync wait times (these are status counters, not @@ system variables)
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_net_avg_wait_time';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_tx_avg_wait_time';
# Review parallel replication status
SHOW VARIABLES LIKE 'replica_parallel%';
SELECT * FROM performance_schema.replication_applier_status_by_worker\G
Resolution :
# Increase parallel applier workers
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 16;
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA;
# Find slow queries on the replica
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 2;
# Relax binlog flushing on the replica if fsync is the bottleneck
# (trades crash durability for throughput; keep sync_binlog = 1 on the master)
SET GLOBAL sync_binlog = 1000;
# Consider multi‑source replication to distribute load
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master1.example.com', SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1 FOR CHANNEL 'master1';
GTID Purge Error
Symptom : Fatal error 1236 – master has purged binary logs required by the slave.
Resolution :
# Obtain the current GTID set from the master
# (SHOW MASTER STATUS was removed in 8.4)
SHOW BINARY LOG STATUS;
# Reset the replica and re‑configure with GTID auto‑positioning
STOP REPLICA;
RESET REPLICA ALL;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master.example.com',
SOURCE_USER='repl', SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1;
START REPLICA;
# If necessary, take a fresh backup that includes GTID information
mysqldump -u root -p --single-transaction --source-data=2 --all-databases > backup_with_gtid.sql
mysql -u root -p < backup_with_gtid.sql
Monitoring Scripts Collection
Comprehensive Replication Monitor
#!/bin/bash
# mysql_replication_monitor.sh – consolidated replication health check
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
ALERT_EMAIL="[email protected]"
LOG_FILE="/var/log/mysql/replication_monitor.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Replication check" >> $LOG_FILE
SLAVE_STATUS=$(mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -e "SHOW REPLICA STATUS\G" 2>/dev/null)
if [ -z "$SLAVE_STATUS" ]; then
echo "This server is not a slave" >> $LOG_FILE
exit 0
fi
IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Replica_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Replica_SQL_Running:" | awk '{print $2}')
SECONDS_BEHIND=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Source:" | awk '{print $2}')
LAST_ERROR=$(echo "$SLAVE_STATUS" | grep "Last_SQL_Error:" | cut -d: -f2-)
LAST_IO_ERROR=$(echo "$SLAVE_STATUS" | grep "Last_IO_Error:" | cut -d: -f2-)
ALERT_MSG=""
if [ "$IO_RUNNING" != "Yes" ]; then
ALERT_MSG+="[CRITICAL] IO thread not running: $LAST_IO_ERROR
"
fi
if [ "$SQL_RUNNING" != "Yes" ]; then
ALERT_MSG+="[CRITICAL] SQL thread not running: $LAST_ERROR
"
fi
if [ -n "$SECONDS_BEHIND" ] && [ "$SECONDS_BEHIND" != "NULL" ]; then
if [ "$SECONDS_BEHIND" -gt 300 ]; then
ALERT_MSG+="[CRITICAL] Replication lag > 5 minutes ($SECONDS_BEHIND s)
"
elif [ "$SECONDS_BEHIND" -gt 60 ]; then
ALERT_MSG+="[WARNING] Replication lag > 1 minute ($SECONDS_BEHIND s)
"
fi
fi
if [ -n "$ALERT_MSG" ]; then
echo -e "$ALERT_MSG" >> $LOG_FILE
# Uncomment to send email alerts
# echo -e "$ALERT_MSG" | mail -s "MySQL Replication Alert" $ALERT_EMAIL
fi
echo "Check completed" >> $LOG_FILE
Failover (Master Promotion) Script
#!/bin/bash
# failover_master.sh – promote a slave to master
NEW_MASTER="slave1.example.com"
OLD_MASTER="master.example.com"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
echo "=== MySQL Master Failover ==="
# 1. Verify all slaves have caught up (placeholder)
for SLAVE in slave1 slave2 slave3; do
echo "Checking $SLAVE..."
# In practice, compare Source_Log_File and Read_Source_Log_Pos (or GTID sets) via SSH
done
# 2. Stop replication on all slaves
for SLAVE in slave1 slave2 slave3; do
echo "Stopping replication on $SLAVE..."
# ssh $SLAVE "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e 'STOP REPLICA;'"
done
# 3. Promote the new master
mysql -h $NEW_MASTER -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
"
# 4. Re‑configure remaining slaves to follow the new master
for SLAVE in slave2 slave3; do
echo "Re‑configuring $SLAVE to replicate from $NEW_MASTER..."
# ssh $SLAVE "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e \"
# CHANGE REPLICATION SOURCE TO SOURCE_HOST='$NEW_MASTER', SOURCE_USER='repl', SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1;
# START REPLICA;\""
done
echo "Failover complete – update application connection strings accordingly."
Best‑Practice Recommendations
Configuration Guidelines
# Master configuration
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog_format = ROW
sync_binlog = 1
gtid_mode = ON
enforce_gtid_consistency = ON
max_connections = 2000
innodb_flush_log_at_trx_commit = 1
# Slave configuration
[mysqld]
server-id = 2
gtid_mode = ON
enforce_gtid_consistency = ON
log_replica_updates = ON
replica_parallel_workers = 8
replica_preserve_commit_order = ON
read_only = ON
super_read_only = ON
replica_skip_errors = ddl_exist_errors
Operational Checklist
Daily: SHOW REPLICA STATUS\G, SHOW PROCESSLIST, monitor Rpl_semi_sync% status variables.
Weekly: Review performance_schema.replication_applier_status_by_worker, InnoDB log wait statistics.
Monthly: Run pt-table-checksum for data consistency, purge old binary logs, back up current binlog positions.
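The monthly binlog purge from the checklist above is easy to script. This sketch only builds and prints the statement (the 7‑day retention is our assumption), leaving execution commented out:

```shell
#!/bin/bash
# Build the monthly PURGE BINARY LOGS statement; 7-day retention is an assumption,
# not a recommendation from this guide.
RETENTION_DAYS=7
SQL="PURGE BINARY LOGS BEFORE NOW() - INTERVAL ${RETENTION_DAYS} DAY;"
echo "$SQL"
# mysql -u root -p -e "$SQL"   # uncomment to execute on the master
```

Before purging, confirm every replica has consumed the logs in question, or a replica can hit the error 1236 scenario described earlier.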
Key Parameters
sync_binlog = 1 – flush binlog to disk after each transaction.
gtid_mode = ON – enable GTID replication.
enforce_gtid_consistency = ON – ensure only GTID‑compatible statements.
replica_parallel_workers – set to 50‑80 % of CPU cores.
replica_preserve_commit_order = ON – maintain transaction commit order.
read_only / super_read_only = ON – enforce read‑only mode on slaves.
relay_log_purge = ON – automatically purge old relay logs.
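The 50‑80 % sizing guidance is simple to encode. A small sketch (the helper name is ours, and 60 % is our chosen midpoint of that range) derives a worker count from the core count:

```shell
#!/bin/bash
# suggest_workers: size replica_parallel_workers at ~60% of CPU cores
# (60% is our midpoint of the 50-80% guidance above; clamp to at least 1).
suggest_workers() {
  local cores=$1
  local workers=$(( cores * 60 / 100 ))
  [ "$workers" -lt 1 ] && workers=1
  echo "$workers"
}

echo "replica_parallel_workers = $(suggest_workers "$(nproc)")"
```

Treat the result as a starting point; the right value also depends on workload dependency structure, which is visible in performance_schema.replication_applier_status_by_worker.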
Data Consistency Verification
Using pt‑table‑checksum
#!/bin/bash
# checksum_verify.sh – verify master‑slave data consistency
PT_TABLE_CHECKSUM="/usr/bin/pt-table-checksum"
MYSQL_OPTS="--user=root --password=your_password"
# 1. Create checksum table on all databases
$PT_TABLE_CHECKSUM $MYSQL_OPTS \
--replicate=test.checksums \
--create-replicate-table
# 2. Run checksum on the master
$PT_TABLE_CHECKSUM $MYSQL_OPTS \
--replicate=test.checksums \
--databases=test \
--tables='orders,customers,products' \
--no-check-binlog-format
# 3. Show differences (the checksums table has no DIFFS column; compare crc/cnt per chunk)
mysql -u root -p -e "
SELECT db, tbl, chunk, this_crc, master_crc, this_cnt, master_cnt
FROM test.checksums
WHERE this_crc <> master_crc OR this_cnt <> master_cnt OR master_crc IS NULL;"
Manual Row‑Count Comparison
#!/bin/bash
# manual_data_compare.sh – simple row‑count and checksum comparison
MASTER_HOST="master.example.com"
SLAVE_HOST="slave.example.com"
USER="root"
PASS="your_password"
DB="test"
TABLE="orders"
MASTER_DATA=$(mysql -h $MASTER_HOST -u $USER -p$PASS -N -e "SELECT COUNT(*), SUM(id), SUM(amount) FROM $DB.$TABLE;")
SLAVE_DATA=$(mysql -h $SLAVE_HOST -u $USER -p$PASS -N -e "SELECT COUNT(*), SUM(id), SUM(amount) FROM $DB.$TABLE;")
echo "Master: $MASTER_DATA"
echo "Slave: $SLAVE_DATA"
if [ "$MASTER_DATA" = "$SLAVE_DATA" ]; then
echo "[CONSISTENT] Data matches"
exit 0
else
echo "[INCONSISTENT] Data differs"
exit 1
fi
Failover and Recovery
Master Failover Procedure
#!/bin/bash
# failover_master.sh – step‑by‑step master switchover
NEW_MASTER="slave1.example.com"
OLD_MASTER="master.example.com"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# 1. Ensure all slaves have caught up (implementation omitted)
# 2. Stop replication on all slaves
for SLAVE in slave1 slave2 slave3; do
mysql -h $SLAVE -u$MYSQL_USER -p$MYSQL_PASSWORD -e "STOP REPLICA;"
done
# 3. Promote the chosen slave
mysql -h $NEW_MASTER -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
"
# 4. Point remaining slaves to the new master
for SLAVE in slave2 slave3; do
mysql -h $SLAVE -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
CHANGE REPLICATION SOURCE TO SOURCE_HOST='$NEW_MASTER',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_AUTO_POSITION=1;
START REPLICA;"
done
echo "Failover complete – update application connection strings."
Slave Recovery Procedure
#!/bin/bash
# recover_slave.sh – restore a failed slave from backup
SLAVE_HOST="slave2.example.com"
MASTER_HOST="master.example.com"
BACKUP_DIR="/backup/mysql"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# 1. Stop replication on the failed slave
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "STOP REPLICA;"
# 2. Take a fresh logical backup from the master
BACKUP_FILE="$BACKUP_DIR/full_backup_$(date +%Y%m%d_%H%M%S).sql"
mysqldump -h $MASTER_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD \
--single-transaction --source-data=2 --routines --triggers --events \
--all-databases > "$BACKUP_FILE"
# 3. Restore the backup on the slave (the file just created, not a stale "latest")
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD < "$BACKUP_FILE"
# 4. Re‑configure replication
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
RESET REPLICA ALL;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='$MASTER_HOST',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_PORT=3306,
SOURCE_AUTO_POSITION=1;
START REPLICA;"
# 5. Verify replication status
sleep 5
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "SHOW REPLICA STATUS\G" | grep -E "Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source"
Additional Monitoring Scripts
Replication Lag Monitor
#!/bin/bash
# check_replication_lag.sh – alert on excessive lag
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# Parse the \G output by field name; positional awk columns break across versions
LAG=$(mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -e "SHOW REPLICA STATUS\G" 2>/dev/null | awk '/Seconds_Behind_Source:/ {print $2}')
if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
echo "[ERROR] Unable to retrieve lag or not a slave"
exit 2
fi
echo "Current lag: $LAG seconds"
if [ "$LAG" -gt 300 ]; then
echo "[CRITICAL] Lag > 5 minutes"
exit 2
elif [ "$LAG" -gt 60 ]; then
echo "[WARNING] Lag > 1 minute"
exit 1
else
echo "[OK] Lag within acceptable range"
exit 0
fi
Summary of Best Practices
Use ROW binlog format for production to avoid nondeterministic statement issues.
Enable GTID mode and enforce GTID consistency for easier failover.
Rely on the LOGICAL_CLOCK parallel applier (the only mode in MySQL 9.0) and allocate replica_parallel_workers to 50‑80 % of CPU cores.
Set sync_binlog = 1 on the master for durability; adjust on the slave if latency becomes a bottleneck.
Keep read_only and super_read_only enabled on all slaves.
Monitor Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source, and the Rpl_semi_sync% status variables.
Run pt-table-checksum regularly and fix differences with pt-table-sync.
Maintain a daily operational checklist (status, processlist, replication variables) and a weekly review of parallel worker health.
Test failover and recovery procedures in a staging environment before production use.
Key Takeaways
Replication consists of binlog generation on the master, IO thread fetching, and SQL thread applying.
GTID simplifies position tracking and automatic failover.
Select the replication mode that matches your consistency vs. performance requirements.
IO thread failures are usually network or authentication problems; SQL thread failures often stem from data inconsistencies.
Effective monitoring of thread status, lag, and error messages is essential for early detection.
Regular consistency checks and well‑documented recovery scripts reduce MTTR during incidents.
Further Learning
MySQL 9.0 Official Documentation – Replication chapter.
"High Performance MySQL" (4th Edition) – in‑depth coverage of replication internals.
MySQL Server Blog – latest feature announcements and best‑practice articles.
Practicing the scripts, performing controlled failure simulations, and integrating the monitoring checks into your observability stack will build confidence in operating robust MySQL replication topologies.