Mastering MySQL 9.0 Replication: Deep Dive into GTID, Semi‑Sync, and Troubleshooting
This guide explores MySQL 9.0's replication architecture, covering core principles, GTID and semi‑synchronous modes, common replication topologies, step‑by‑step troubleshooting, monitoring scripts, and best‑practice configurations, with hands‑on examples and scripts to help engineers build and maintain robust, highly available database systems.
Background
MySQL replication is a core component for building high‑availability database architectures. MySQL 9.0 (GA in 2024) ships enhanced GTID support, more efficient parallel replication, and richer monitoring interfaces. Note that the legacy SLAVE/MASTER SQL syntax, deprecated since 8.0.22, was removed in MySQL 8.4; MySQL 9.0 accepts only the REPLICA/SOURCE forms, which the commands in this guide use even where the prose keeps the familiar master/slave terms.
Prerequisites
Linux command line and shell scripting
Basic MySQL administration
Understanding of ACID transaction properties
Fundamentals of TCP/IP networking
Experimental Environment
MySQL 9.0 Community Edition on Linux (Rocky Linux 9 or Ubuntu 24.04 LTS)
GTID replication mode
Semi‑synchronous replication using the plugin architecture
Core Replication Architecture
┌─────────────────┐ ┌─────────────────┐
│ Master │ │ Slave │
│ │ │ │
│ ┌─────────┐ │ │ ┌─────────┐ │
│ │ Client │ │ │ │ SQL │ │
│ │ Write │ │ │ │ Thread │ │
│ └─────┬───┘ │ │ └─────┬───┘ │
│ │ │ │ │ │
│ ▼ │ │ ▼ │
│ ┌───────────┐ │ │ ┌───────────┐ │
│ │ Binlog │─┼──────┼─►│ Relaylog │ │
│ │ (write) │ │ │ │ (read) │ │
│ └───────────┘ │ │ └─────┬─────┘ │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌───────────┐ │
│ │ │ │ Data │ │
│ │ │ │ (execute) │ │
│ │ │ └───────────┘ │
└─────────────────┘ └─────────────────┘
Key Components
Binary Log (Binlog) – records all changes on the master.
IO Thread – pulls the binlog from the master to the slave.
Relay Log – local copy of the received binlog on the slave.
SQL Thread – replays events from the relay log.
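A quick way to internalize how these four components fit together is to model the pipeline with plain files. This is a toy sketch, not MySQL code; the file names and event format are made up for illustration:

```shell
#!/bin/bash
# Toy model of the replication pipeline (illustrative only; names are made up).
binlog=$(mktemp); relaylog=$(mktemp); datafile=$(mktemp)

# "Master": client writes are appended to the binlog.
echo "INSERT id=1" >> "$binlog"
echo "INSERT id=2" >> "$binlog"

# "IO thread": copies binlog events into the replica's relay log.
cat "$binlog" >> "$relaylog"

# "SQL thread": replays each relay-log event against the data.
while IFS= read -r event; do
  echo "applied: $event" >> "$datafile"
done < "$relaylog"

result=$(cat "$datafile")
echo "$result"
rm -f "$binlog" "$relaylog" "$datafile"
```

The key property the toy preserves: the applier only ever sees what the IO thread has already copied, which is why relay-log position and applied position are tracked separately.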
Replication Modes
Asynchronous – the master returns to the client immediately after committing.
Semi‑synchronous – the master waits for at least one slave to acknowledge receipt before confirming the transaction.
Fully synchronous – every replica must acknowledge before the commit returns; MySQL approximates this with Group Replication, which uses majority‑based certification rather than waiting for all members to apply.
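The three modes above differ only in what a successful commit acknowledgement tells the client. A small helper makes that contract explicit (the function name and wording are ours, for illustration only):

```shell
#!/bin/bash
# mode_guarantee: what a client-visible commit implies under each replication mode
# (illustrative summary of the list above; not a MySQL API).
mode_guarantee() {
  case "$1" in
    async)    echo "durable on the master only; replicas may lag" ;;
    semisync) echo "durable on the master and received by at least one replica" ;;
    group)    echo "accepted by a majority of group members" ;;
    *)        echo "unknown mode" ;;
  esac
}

mode_guarantee async
mode_guarantee semisync
```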
Binlog Formats
Controlled by binlog_format (deprecated since 8.0.34; ROW is the default and the recommended choice):
STATEMENT – stores the original SQL statements (small log size, but nondeterministic functions can cause inconsistencies).
ROW – stores the actual row changes (guarantees consistency, larger log).
MIXED – logs statements by default and automatically switches to ROW for statements that are unsafe (nondeterministic).
-- Set STATEMENT format
SET binlog_format = 'STATEMENT';
-- Set ROW format
SET binlog_format = 'ROW';
-- Set MIXED format
SET binlog_format = 'MIXED';
GTID Replication
GTID (Global Transaction Identifier) assigns a unique identifier to each transaction, simplifying replication management.
Source ID – the server UUID of the master.
Transaction ID – a sequential number for the transaction.
Example GTID:
3E11FA47-71CA-11E1-9E33-C80AA9429562:23
GTID Advantages
Automatic position tracking – no need to specify binlog file names.
Simplifies master‑to‑slave switchover and failover.
Facilitates consistency checks.
Supports skipping a failed transaction explicitly, by committing an empty transaction under its GTID.
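Since a GTID is just `source_uuid:transaction_id` (and a GTID set adds ranges such as `:1-23`), scripts can pull it apart with plain shell expansion. A minimal sketch using the example GTID from above:

```shell
#!/bin/bash
# Split the example GTID into its two parts using shell parameter expansion.
gtid="3E11FA47-71CA-11E1-9E33-C80AA9429562:23"
source_uuid="${gtid%:*}"   # text before the last colon (server UUID)
txn_id="${gtid##*:}"       # text after the last colon (transaction number)
echo "source: $source_uuid"
echo "txn:    $txn_id"

# A GTID *set* uses ranges such as uuid:1-23; count the transactions in one range.
range="1-23"
lo="${range%-*}"; hi="${range#*-}"
echo "transactions in range: $(( hi - lo + 1 ))"
```

This kind of parsing is handy when comparing a replica's Retrieved_Gtid_Set against the source's gtid_executed during troubleshooting.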
GTID Configuration (my.cnf)
# Enable GTID mode
gtid_mode = ON
enforce_gtid_consistency = ON
# Replication settings (MySQL 8.4+ removed the slave_* spellings and the
# replica_parallel_type variable; LOGICAL_CLOCK is the only applier mode)
log_replica_updates = ON
replica_parallel_workers = 8
replica_preserve_commit_order = ON
# Binlog settings
binlog_format = ROW
sync_binlog = 1
Replication Threads
IO Thread Workflow
1. Slave connects to master
2. Slave requests a start position (binlog file/position, or its GTID set when SOURCE_AUTO_POSITION=1)
3. Master streams binlog data
4. Slave writes data to Relay Log
5. Slave acknowledges received position
SQL Thread Workflow
1. Read next event from Relay Log
2. Parse event content
3. Execute corresponding SQL on slave
4. Update Exec_Source_Log_Pos
5. Repeat until the Relay Log is exhausted
Parallel Replication (MySQL 9.0)
# View current parallel applier settings
SHOW VARIABLES LIKE 'replica_parallel%';
# Scale up to 16 applier workers (replica_parallel_type was removed in 8.4;
# LOGICAL_CLOCK scheduling is always used)
SET GLOBAL replica_parallel_workers = 16;
Semi‑Synchronous Replication
Installation and Configuration
# Install plugins (the old *_master/*_slave plugin names were removed in 8.4;
# run on source and replica respectively)
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
# Source (master) configuration
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 10000; # 10 seconds
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
# Replica configuration
SET GLOBAL rpl_semi_sync_replica_enabled = 1;
# Restart the replica IO thread to apply changes
STOP REPLICA IO_THREAD;
START REPLICA IO_THREAD;
# Persist settings in my.cnf
[mysqld]
rpl_semi_sync_source_enabled = 1
rpl_semi_sync_source_timeout = 10000
rpl_semi_sync_source_wait_for_replica_count = 1
rpl_semi_sync_replica_enabled = 1
Monitoring Semi‑Sync
# Master status
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%';
# Slave status
SHOW REPLICA STATUS\G
Common Fault Scenarios and Troubleshooting
IO Thread Connection Failure
Symptom : Replica_IO_Running: Connecting with error connecting to source in Last_IO_Error.
Diagnostic Steps :
# Verify network connectivity
ping master.example.com
nc -zv master.example.com 3306
# Test MySQL connection from slave
mysql -h master.example.com -u repl -p -e "SELECT 1";
# Check replication user privileges on master
mysql -u root -p -e "SHOW GRANTS FOR 'repl'@'%';"
Resolution :
# Re‑create replication user with proper privileges (on the master)
CREATE USER 'repl'@'slave.example.com' IDENTIFIED BY 'repl_password';
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'slave.example.com';
# Update source connection info on the replica
STOP REPLICA;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master.example.com',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_PORT=3306,
SOURCE_AUTO_POSITION=1;
START REPLICA;
SQL Thread Error
Symptom :
Last_SQL_Error: Could not execute Update_rows event on table test.t1
Diagnostic Steps :
# Inspect error details
SHOW REPLICA STATUS\G
# Skip the problematic transaction by committing an empty one under its GTID
STOP REPLICA;
SET GTID_NEXT='source_id:transaction_id';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
START REPLICA;
Resolution :
# Repair data manually or re‑initialize the slave if inconsistencies are severe
pt-table-sync h=master,u=root,p=pass h=slave,u=root,p=pass \
--databases=test --tables=t1 --execute
Excessive Replication Lag
Symptom : Seconds_Behind_Source: 3600 (1 hour lag).
Diagnostic Steps :
# Check slave load
SHOW PROCESSLIST;
# Inspect binlog and relay log activity
SHOW GLOBAL STATUS LIKE 'binlog%';
SHOW GLOBAL STATUS LIKE 'relay%';
# Check semi‑sync wait times (these are status counters, not @@ system variables)
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_net_avg_wait_time';
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_source_tx_avg_wait_time';
# Review parallel replication status
SHOW VARIABLES LIKE 'replica_parallel%';
SELECT * FROM performance_schema.replication_applier_status_by_worker\G
Resolution :
# Increase parallel applier workers
STOP REPLICA;
SET GLOBAL replica_parallel_workers = 16;
SET GLOBAL replica_preserve_commit_order = ON;
START REPLICA;
# Find slow queries on the replica
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 2;
# Relax binlog flushing on the replica if fsync is the bottleneck
# (trades crash durability for throughput; keep sync_binlog = 1 on the master)
SET GLOBAL sync_binlog = 1000;
# Consider multi‑source replication to distribute load
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master1.example.com', SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1 FOR CHANNEL 'master1';
GTID Purge Error
Symptom : Fatal error 1236 – master has purged binary logs required by the slave.
Resolution :
# Obtain the current GTID set from the master
# (SHOW MASTER STATUS was removed in 8.4)
SHOW BINARY LOG STATUS;
# Reset the replica and re‑configure with GTID auto‑positioning
STOP REPLICA;
RESET REPLICA ALL;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='master.example.com',
SOURCE_USER='repl', SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1;
START REPLICA;
# If necessary, take a fresh backup that includes GTID information
mysqldump -u root -p --single-transaction --source-data=2 --all-databases > backup_with_gtid.sql
mysql -u root -p < backup_with_gtid.sql
Monitoring Scripts Collection
Comprehensive Replication Monitor
#!/bin/bash
# mysql_replication_monitor.sh – consolidated replication health check
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
ALERT_EMAIL="[email protected]"
LOG_FILE="/var/log/mysql/replication_monitor.log"
echo "$(date '+%Y-%m-%d %H:%M:%S') - Replication check" >> $LOG_FILE
SLAVE_STATUS=$(mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -e "SHOW REPLICA STATUS\G" 2>/dev/null)
if [ -z "$SLAVE_STATUS" ]; then
echo "This server is not a slave" >> $LOG_FILE
exit 0
fi
IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Replica_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Replica_SQL_Running:" | awk '{print $2}')
SECONDS_BEHIND=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Source:" | awk '{print $2}')
LAST_ERROR=$(echo "$SLAVE_STATUS" | grep "Last_SQL_Error:" | cut -d: -f2-)
LAST_IO_ERROR=$(echo "$SLAVE_STATUS" | grep "Last_IO_Error:" | cut -d: -f2-)
ALERT_MSG=""
if [ "$IO_RUNNING" != "Yes" ]; then
ALERT_MSG+="[CRITICAL] IO thread not running: $LAST_IO_ERROR
"
fi
if [ "$SQL_RUNNING" != "Yes" ]; then
ALERT_MSG+="[CRITICAL] SQL thread not running: $LAST_ERROR
"
fi
if [ -n "$SECONDS_BEHIND" ] && [ "$SECONDS_BEHIND" != "NULL" ]; then
if [ "$SECONDS_BEHIND" -gt 300 ]; then
ALERT_MSG+="[CRITICAL] Replication lag > 5 minutes ($SECONDS_BEHIND s)
"
elif [ "$SECONDS_BEHIND" -gt 60 ]; then
ALERT_MSG+="[WARNING] Replication lag > 1 minute ($SECONDS_BEHIND s)
"
fi
fi
if [ -n "$ALERT_MSG" ]; then
echo -e "$ALERT_MSG" >> $LOG_FILE
# Uncomment to send email alerts
# echo -e "$ALERT_MSG" | mail -s "MySQL Replication Alert" $ALERT_EMAIL
fi
echo "Check completed" >> $LOG_FILE
Failover (Master Promotion) Script
#!/bin/bash
# failover_master.sh – promote a slave to master
NEW_MASTER="slave1.example.com"
OLD_MASTER="master.example.com"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
echo "=== MySQL Master Failover ==="
# 1. Verify all slaves have caught up (placeholder)
for SLAVE in slave1 slave2 slave3; do
echo "Checking $SLAVE..."
# In practice, compare Source_Log_File and Read_Source_Log_Pos (or GTID sets) via SSH
done
# 2. Stop replication on all slaves
for SLAVE in slave1 slave2 slave3; do
echo "Stopping replication on $SLAVE..."
# ssh $SLAVE "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e 'STOP REPLICA;'"
done
# 3. Promote the new master
mysql -h $NEW_MASTER -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
"
# 4. Re‑configure remaining slaves to follow the new master
for SLAVE in slave2 slave3; do
echo "Re‑configuring $SLAVE to replicate from $NEW_MASTER..."
# ssh $SLAVE "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e \"
# CHANGE REPLICATION SOURCE TO SOURCE_HOST='$NEW_MASTER', SOURCE_USER='repl', SOURCE_PASSWORD='repl_password', SOURCE_AUTO_POSITION=1;
# START REPLICA;\""
done
echo "Failover complete – update application connection strings accordingly."
Best‑Practice Recommendations
Configuration Guidelines
# Master configuration
[mysqld]
server-id = 1
log-bin = mysql-bin
binlog_format = ROW
sync_binlog = 1
gtid_mode = ON
enforce_gtid_consistency = ON
max_connections = 2000
innodb_flush_log_at_trx_commit = 1
# Slave configuration
[mysqld]
server-id = 2
gtid_mode = ON
enforce_gtid_consistency = ON
log_replica_updates = ON
replica_parallel_workers = 8
replica_preserve_commit_order = ON
read_only = ON
super_read_only = ON
replica_skip_errors = ddl_exist_errors
Operational Checklist
Daily: SHOW REPLICA STATUS\G, SHOW PROCESSLIST, monitor Rpl_semi_sync% status variables.
Weekly: Review performance_schema.replication_applier_status_by_worker, InnoDB log wait statistics.
Monthly: Run pt-table-checksum for data consistency, purge old binary logs, back up current binlog positions.
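The monthly binlog purge from the checklist above is easy to script. This sketch only builds and prints the statement (the 7‑day retention is our assumption), leaving execution commented out:

```shell
#!/bin/bash
# Build the monthly PURGE BINARY LOGS statement; 7-day retention is an assumption,
# not a recommendation from this guide.
RETENTION_DAYS=7
SQL="PURGE BINARY LOGS BEFORE NOW() - INTERVAL ${RETENTION_DAYS} DAY;"
echo "$SQL"
# mysql -u root -p -e "$SQL"   # uncomment to execute on the master
```

Before purging, confirm every replica has consumed the logs in question, or a replica can hit the error 1236 scenario described earlier.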
Key Parameters
sync_binlog = 1 – flush binlog to disk after each transaction.
gtid_mode = ON – enable GTID replication.
enforce_gtid_consistency = ON – ensure only GTID‑compatible statements.
replica_parallel_workers – set to 50‑80 % of CPU cores.
replica_preserve_commit_order = ON – maintain transaction commit order.
read_only / super_read_only = ON – enforce read‑only mode on slaves.
relay_log_purge = ON – automatically purge old relay logs.
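The 50‑80 % sizing guidance is simple to encode. A small sketch (the helper name is ours, and 60 % is our chosen midpoint of that range) derives a worker count from the core count:

```shell
#!/bin/bash
# suggest_workers: size replica_parallel_workers at ~60% of CPU cores
# (60% is our midpoint of the 50-80% guidance above; clamp to at least 1).
suggest_workers() {
  local cores=$1
  local workers=$(( cores * 60 / 100 ))
  [ "$workers" -lt 1 ] && workers=1
  echo "$workers"
}

echo "replica_parallel_workers = $(suggest_workers "$(nproc)")"
```

Treat the result as a starting point; the right value also depends on workload dependency structure, which is visible in performance_schema.replication_applier_status_by_worker.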
Data Consistency Verification
Using pt‑table‑checksum
#!/bin/bash
# checksum_verify.sh – verify master‑slave data consistency
PT_TABLE_CHECKSUM="/usr/bin/pt-table-checksum"
MYSQL_OPTS="--user=root --password=your_password"
# 1. Create checksum table on all databases
$PT_TABLE_CHECKSUM $MYSQL_OPTS \
--replicate=test.checksums \
--create-replicate-table
# 2. Run checksum on the master
$PT_TABLE_CHECKSUM $MYSQL_OPTS \
--replicate=test.checksums \
--databases=test \
--tables='orders,customers,products' \
--no-check-binlog-format
# 3. Show differences (the checksums table has no DIFFS column; compare crc/cnt per chunk)
mysql -u root -p -e "
SELECT db, tbl, chunk, this_crc, master_crc, this_cnt, master_cnt
FROM test.checksums
WHERE this_crc <> master_crc OR this_cnt <> master_cnt OR master_crc IS NULL;"
Manual Row‑Count Comparison
#!/bin/bash
# manual_data_compare.sh – simple row‑count and checksum comparison
MASTER_HOST="master.example.com"
SLAVE_HOST="slave.example.com"
USER="root"
PASS="your_password"
DB="test"
TABLE="orders"
MASTER_DATA=$(mysql -h $MASTER_HOST -u $USER -p$PASS -N -e "SELECT COUNT(*), SUM(id), SUM(amount) FROM $DB.$TABLE;")
SLAVE_DATA=$(mysql -h $SLAVE_HOST -u $USER -p$PASS -N -e "SELECT COUNT(*), SUM(id), SUM(amount) FROM $DB.$TABLE;")
echo "Master: $MASTER_DATA"
echo "Slave: $SLAVE_DATA"
if [ "$MASTER_DATA" = "$SLAVE_DATA" ]; then
echo "[CONSISTENT] Data matches"
exit 0
else
echo "[INCONSISTENT] Data differs"
exit 1
fi
Failover and Recovery
Master Failover Procedure
#!/bin/bash
# failover_master.sh – step‑by‑step master switchover
NEW_MASTER="slave1.example.com"
OLD_MASTER="master.example.com"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# 1. Ensure all slaves have caught up (implementation omitted)
# 2. Stop replication on all slaves
for SLAVE in slave1 slave2 slave3; do
mysql -h $SLAVE -u$MYSQL_USER -p$MYSQL_PASSWORD -e "STOP REPLICA;"
done
# 3. Promote the chosen slave
mysql -h $NEW_MASTER -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
"
# 4. Point remaining slaves to the new master
for SLAVE in slave2 slave3; do
mysql -h $SLAVE -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
CHANGE REPLICATION SOURCE TO SOURCE_HOST='$NEW_MASTER',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_AUTO_POSITION=1;
START REPLICA;"
done
echo "Failover complete – update application connection strings."
Slave Recovery Procedure
#!/bin/bash
# recover_slave.sh – restore a failed slave from backup
SLAVE_HOST="slave2.example.com"
MASTER_HOST="master.example.com"
BACKUP_DIR="/backup/mysql"
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# 1. Stop replication on the failed slave
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "STOP REPLICA;"
# 2. Take a fresh logical backup from the master
BACKUP_FILE="$BACKUP_DIR/full_backup_$(date +%Y%m%d_%H%M%S).sql"
mysqldump -h $MASTER_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD \
--single-transaction --source-data=2 --routines --triggers --events \
--all-databases > "$BACKUP_FILE"
# 3. Restore the backup on the slave (the file just created, not a stale "latest")
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD < "$BACKUP_FILE"
# 4. Re‑configure replication
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "
RESET REPLICA ALL;
CHANGE REPLICATION SOURCE TO SOURCE_HOST='$MASTER_HOST',
SOURCE_USER='repl',
SOURCE_PASSWORD='repl_password',
SOURCE_PORT=3306,
SOURCE_AUTO_POSITION=1;
START REPLICA;"
# 5. Verify replication status
sleep 5
mysql -h $SLAVE_HOST -u$MYSQL_USER -p$MYSQL_PASSWORD -e "SHOW REPLICA STATUS\G" | grep -E "Replica_IO_Running|Replica_SQL_Running|Seconds_Behind_Source"
Additional Monitoring Scripts
Replication Lag Monitor
#!/bin/bash
# check_replication_lag.sh – alert on excessive lag
MYSQL_USER="root"
MYSQL_PASSWORD="your_password"
# Parse the \G output by field name; positional awk columns break across versions
LAG=$(mysql -u${MYSQL_USER} -p${MYSQL_PASSWORD} -e "SHOW REPLICA STATUS\G" 2>/dev/null | awk '/Seconds_Behind_Source:/ {print $2}')
if [ -z "$LAG" ] || [ "$LAG" = "NULL" ]; then
echo "[ERROR] Unable to retrieve lag or not a slave"
exit 2
fi
echo "Current lag: $LAG seconds"
if [ "$LAG" -gt 300 ]; then
echo "[CRITICAL] Lag > 5 minutes"
exit 2
elif [ "$LAG" -gt 60 ]; then
echo "[WARNING] Lag > 1 minute"
exit 1
else
echo "[OK] Lag within acceptable range"
exit 0
fi
Summary of Best Practices
Use ROW binlog format for production to avoid nondeterministic statement issues.
Enable GTID mode and enforce GTID consistency for easier failover.
Rely on the LOGICAL_CLOCK parallel applier (the only mode in MySQL 9.0) and allocate replica_parallel_workers to 50‑80 % of CPU cores.
Set sync_binlog = 1 on the master for durability; adjust on the slave if latency becomes a bottleneck.
Keep read_only and super_read_only enabled on all slaves.
Monitor Replica_IO_Running, Replica_SQL_Running, Seconds_Behind_Source, and the Rpl_semi_sync% status variables.
Run pt-table-checksum regularly and fix differences with pt-table-sync.
Maintain a daily operational checklist (status, processlist, replication variables) and a weekly review of parallel worker health.
Test failover and recovery procedures in a staging environment before production use.
Key Takeaways
Replication consists of binlog generation on the master, IO thread fetching, and SQL thread applying.
GTID simplifies position tracking and automatic failover.
Select the replication mode that matches your consistency vs. performance requirements.
IO thread failures are usually network or authentication problems; SQL thread failures often stem from data inconsistencies.
Effective monitoring of thread status, lag, and error messages is essential for early detection.
Regular consistency checks and well‑documented recovery scripts reduce MTTR during incidents.
Further Learning
MySQL 9.0 Official Documentation – Replication chapter.
"High Performance MySQL" (4th Edition) – in‑depth coverage of replication internals.
MySQL Server Blog – latest feature announcements and best‑practice articles.
Practicing the scripts, performing controlled failure simulations, and integrating the monitoring checks into your observability stack will build confidence in operating robust MySQL replication topologies.