How to Build a High‑Availability MySQL Master‑Slave Cluster and Automate Failover
This guide walks through the reasons for MySQL master‑slave replication, explains its core mechanisms, details step‑by‑step environment planning, configuration, data initialization, replication setup, monitoring, failover with MHA, read‑write splitting using ProxySQL, performance tuning, troubleshooting, and best‑practice recommendations for enterprise‑grade high availability.
MySQL Master‑Slave Replication Architecture Setup and Failover
Introduction: A production outage at 3 am exposed the lack of a master‑slave replication architecture: the database was a single point of failure. This article shares lessons learned, common pitfalls, and a complete, production‑validated solution.
Why MySQL Needs Master‑Slave Replication
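Replication removes the database as a single point of failure, supports high availability, enables read‑write separation, keeps a continuously updated copy of the data, and lets read capacity scale out.
High availability: a slave can be promoted when the master fails; with automated failover this takes seconds instead of a manual rebuild.
Read‑write separation: query load is distributed across slaves, which can reduce response time by up to 60%, depending on the workload.
Data safety: slaves hold a continuously synchronized copy of the data. Because replication is asynchronous by default and also replicates mistakes such as an accidental DROP TABLE, slaves complement regular backups and do not replace them.
Scalability: additional slaves can be added without taking the master offline.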
Replication Principles
Three threads are involved:
Binlog Dump thread on the master reads the binary log and streams it.
I/O thread on the slave receives the binlog and writes it to a relay log.
SQL thread on the slave reads the relay log and executes the statements.
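As a rough illustration of how the two slave-side threads show up in monitoring, the sketch below classifies one row of SHOW SLAVE STATUS output. Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master are real status columns; the function name, the returned labels, and the 30-second lag threshold are assumptions made for this example.

```python
from typing import Mapping, Optional


def replication_health(status: Mapping[str, Optional[str]],
                       max_lag_seconds: int = 30) -> str:
    """Classify one SHOW SLAVE STATUS row (hypothetical labels)."""
    # The I/O thread pulls binlog events from the master into the relay log.
    if status.get("Slave_IO_Running") != "Yes":
        return "io_thread_stopped"
    # The SQL thread replays the relay log on the slave.
    if status.get("Slave_SQL_Running") != "Yes":
        return "sql_thread_stopped"
    lag = status.get("Seconds_Behind_Master")
    # MySQL reports NULL here when the lag cannot be determined.
    if lag is None:
        return "lag_unknown"
    if int(lag) > max_lag_seconds:
        return "lagging"
    return "healthy"


if __name__ == "__main__":
    row = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Seconds_Behind_Master": "2"}
    print(replication_health(row))
```

Checking the I/O thread before the SQL thread mirrors the data flow: if events never arrive, the state of the applier says little about the real problem.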
Binlog Format Choices
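Statement: small log size, but non‑deterministic functions such as UUID() or SYSDATE() can produce different results on the slave.
Row: best consistency, larger log size; this is the default in MySQL 8.0.
Mixed: MySQL logs statements and switches to row logging automatically for statements it considers unsafe. The sample configuration in this guide uses it, although Row is the safer choice when in doubt.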
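The difference between statement and row logging can be shown with a toy model. This is not MySQL code: the counter-based generator stands in for a non-deterministic function such as UUID(), and every name below is hypothetical.

```python
import itertools


def make_uuid_source(start):
    """Stand-in for a non-deterministic function: each server has its own sequence."""
    counter = itertools.count(start)
    return lambda: f"uuid-{next(counter)}"


def insert_with_uuid(uuid_fn):
    """The 'statement': INSERT a row whose token comes from the function."""
    return {"id": 1, "token": uuid_fn()}


def replicate_statement(statement, replica_uuid_fn):
    # Statement-based: the slave re-executes the statement and re-evaluates the function.
    return statement(replica_uuid_fn)


def replicate_row(row):
    # Row-based: the slave applies the row image that the master produced.
    return dict(row)


master_row = insert_with_uuid(make_uuid_source(100))
statement_copy = replicate_statement(insert_with_uuid, make_uuid_source(900))
row_copy = replicate_row(master_row)

if __name__ == "__main__":
    print("master:         ", master_row)
    print("statement-based:", statement_copy)
    print("row-based:      ", row_copy)
```

The statement-based copy diverges because the function is evaluated a second time on the slave, while the row-based copy carries the value the master already computed.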
GTID Overview
GTID provides a globally unique transaction ID, simplifying failover positioning and ensuring data consistency across complex topologies.
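To make the idea of GTID-based positioning concrete, the sketch below parses GTID sets in the server's textual form (server_uuid:first-last, with multiple intervals separated by colons) and lists the transactions a slave has not yet applied. The function names are hypothetical, and expanding intervals into Python sets is only reasonable for small examples; on a real server the built-in GTID_SUBTRACT() function does this comparison.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        # UUIDs contain hyphens but no colons, so ':' separates uuid and intervals.
        uuid, *intervals = part.split(":")
        txns = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            txns.update(range(int(first), int(last or first) + 1))
    return result


def missing_on_slave(master_set, slave_set):
    """Return {uuid: sorted transaction numbers} executed on the master only."""
    master = parse_gtid_set(master_set)
    slave = parse_gtid_set(slave_set)
    missing = {}
    for uuid, txns in master.items():
        gap = txns - slave.get(uuid, set())
        if gap:
            missing[uuid] = sorted(gap)
    return missing


if __name__ == "__main__":
    uuid = "3E11FA47-71CA-11E1-9E33-C80AA9429562"
    print(missing_on_slave(f"{uuid}:1-10", f"{uuid}:1-7"))
```

Because each transaction carries its own identifier, a promoted slave only has to compare sets like these; no binlog file names or offsets are involved.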
Production‑Ready Architecture Build
Environment Planning
CPU: master ≥ 8 cores, slaves ≥ 70 % of master.
Memory: master 32 GB+, slaves ≥ 16 GB.
Disk: SSD with ≥ 10 000 IOPS.
Network: 1 Gbps, latency < 1 ms between nodes.
OS: CentOS 7.9 or Ubuntu 20.04 LTS.
MySQL 8.0.30+ and monitoring tools (Prometheus + Grafana).
Master Configuration
[mysqld]
server-id = 1
port = 3306
basedir = /usr/local/mysql
datadir = /data/mysql/data
socket = /tmp/mysql.sock
pid-file = /data/mysql/mysql.pid
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
log-bin = /data/mysql/binlog/mysql-bin
binlog_format = mixed
binlog_row_image = minimal
max_binlog_size = 1G
binlog_expire_logs_seconds = 604800
sync_binlog = 1
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_gtid_simple_recovery = 1
binlog_cache_size = 4M
max_binlog_cache_size = 2G
binlog_stmt_cache_size = 4M
binlog_transaction_dependency_tracking = WRITESET
transaction_write_set_extraction = XXHASH64
innodb_buffer_pool_size = 20G
innodb_log_file_size = 2G
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity = 10000
innodb_io_capacity_max = 20000
innodb_buffer_pool_instances = 8
max_connections = 3000
max_connect_errors = 1000000
thread_cache_size = 128
thread_stack = 256K
slow_query_log = 1
slow_query_log_file = /data/mysql/logs/slow.log
long_query_time = 1
log_queries_not_using_indexes = 1
log_error = /data/mysql/logs/error.log
log_error_verbosity = 2
performance_schema = ON
performance_schema_instrument = '%=ON'
Slave Configuration
[mysqld]
server-id = 2
port = 3306
basedir = /usr/local/mysql
datadir = /data/mysql/data
socket = /tmp/mysql.sock
relay_log = /data/mysql/relaylog/relay-bin
relay_log_index = /data/mysql/relaylog/relay-bin.index
relay_log_info_repository = TABLE
master_info_repository = TABLE
relay_log_recovery = ON
relay_log_purge = ON
read_only = ON
super_read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8
slave_preserve_commit_order = ON
slave_pending_jobs_size_max = 134217728
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 16G
Step‑by‑Step Setup
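Before working through the steps, the master and slave configuration files above can be sanity-checked for the settings that GTID replication depends on. The following is a minimal sketch using Python's standard configparser module; the function names and the particular checks are assumptions for illustration, not an exhaustive validation.

```python
import configparser


def load_mysqld_options(cnf_text):
    """Return the [mysqld] options of a my.cnf-style text as a dict."""
    # interpolation=None keeps values such as '%=ON' from being treated as templates.
    parser = configparser.ConfigParser(
        interpolation=None, allow_no_value=True, strict=False
    )
    parser.read_string(cnf_text)
    # MySQL accepts both server-id and server_id, so normalise to underscores.
    return {
        key.replace("-", "_"): (value or "").strip()
        for key, value in parser["mysqld"].items()
    }


def check_replication_pair(master_cnf, slave_cnf):
    """Return a list of human-readable problems (empty when none are found)."""
    master = load_mysqld_options(master_cnf)
    slave = load_mysqld_options(slave_cnf)
    problems = []
    if master.get("server_id") == slave.get("server_id"):
        problems.append("server-id must be different on master and slave")
    for role, options in (("master", master), ("slave", slave)):
        for name in ("gtid_mode", "enforce_gtid_consistency"):
            if options.get(name, "").upper() != "ON":
                problems.append(f"{role}: {name} is not ON")
    if slave.get("read_only", "").upper() not in ("ON", "1"):
        problems.append("slave: read_only is not enabled")
    return problems


if __name__ == "__main__":
    master = "[mysqld]\nserver-id = 1\ngtid_mode = ON\nenforce_gtid_consistency = ON\n"
    slave = "[mysqld]\nserver-id = 2\ngtid_mode = ON\nenforce_gtid_consistency = ON\nread_only = ON\n"
    print(check_replication_pair(master, slave) or "no problems found")
```

A duplicate server-id is worth catching early, because a slave that shares the master's ID cannot replicate from it.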
Step 1: Create Replication User
CREATE USER 'replicator'@'%' IDENTIFIED WITH mysql_native_password BY 'Repl@2024Strong';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
FLUSH PRIVILEGES;
SHOW GRANTS FOR 'replicator'@'%';
Step 2: Backup Master Data (Percona XtraBackup)
# Install XtraBackup
yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ps80
yum install -y percona-xtrabackup-80
# Perform backup
xtrabackup --defaults-file=/etc/my.cnf \
--user=root --password='YourRootPassword' \
--backup --target-dir=/backup/full \
--parallel=4 --compress --compress-threads=4
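# Decompress the backup (required because it was taken with --compress), then prepare it
xtrabackup --decompress --remove-original --target-dir=/backup/full
xtrabackup --prepare --target-dir=/backup/full
# Get the binlog position and GTID set recorded at backup time
cat /backup/full/xtrabackup_binlog_info
Step 3: Restore to Slave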
# Stop MySQL on slave
systemctl stop mysqld
# Clean data directory
rm -rf /data/mysql/data/*
# Copy backup data (transfer the prepared /backup/full directory from the master to this host first)
xtrabackup --copy-back --target-dir=/backup/full --datadir=/data/mysql/data
chown -R mysql:mysql /data/mysql/data
# Start MySQL
systemctl start mysqld
Step 4: Configure Replication
CHANGE MASTER TO MASTER_HOST='192.168.1.100',
MASTER_USER='replicator',
MASTER_PASSWORD='Repl@2024Strong',
MASTER_PORT=3306,
MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G
High‑Availability and Failover with MHA
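MHA can typically detect a master failure and promote a slave within about 30 seconds. Data loss is minimized but not guaranteed to be zero: MHA applies any binary log events it can still retrieve from the failed master, so transactions that never left an unreachable master can be lost unless semi‑synchronous replication is used.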
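The core idea behind choosing which slave to promote can be sketched as follows: prefer the slave that has received the most binlog data from the old master, and among equals prefer hosts marked candidate_master=1. This is a deliberately simplified illustration, not MHA's actual implementation; the function names are hypothetical, while Master_Log_File and Read_Master_Log_Pos are real SHOW SLAVE STATUS columns.

```python
def binlog_coordinate(status):
    """Turn ('mysql-bin.000012', 4567) into a comparable tuple (12, 4567)."""
    sequence = int(status["Master_Log_File"].rsplit(".", 1)[1])
    return (sequence, int(status["Read_Master_Log_Pos"]))


def choose_new_master(slaves, candidates):
    """slaves: {host: status row}; candidates: hosts with candidate_master=1."""
    latest = max(binlog_coordinate(status) for status in slaves.values())
    up_to_date = [host for host, status in slaves.items()
                  if binlog_coordinate(status) == latest]
    preferred = [host for host in up_to_date if host in candidates]
    # Sorting makes the choice deterministic when several hosts qualify.
    return sorted(preferred or up_to_date)[0]


if __name__ == "__main__":
    slaves = {
        "192.168.1.101": {"Master_Log_File": "mysql-bin.000012",
                          "Read_Master_Log_Pos": "900"},
        "192.168.1.102": {"Master_Log_File": "mysql-bin.000012",
                          "Read_Master_Log_Pos": "1500"},
    }
    print(choose_new_master(slaves, {"192.168.1.101", "192.168.1.102"}))
```

In a real failover the remaining slaves must also be brought up to the same point before they are repointed at the new master, which this sketch does not cover.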
# Install MHA Manager
yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch \
perl-Parallel-ForkManager perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58.tar.gz
tar -xzf mha4mysql-manager-0.58.tar.gz
cd mha4mysql-manager-0.58
perl Makefile.PL
make && make install
# Sample /etc/mha/app1.cnf (trimmed for brevity)
[server1]
hostname=192.168.1.100
port=3306
candidate_master=1
[server2]
hostname=192.168.1.101
port=3306
candidate_master=1
[server3]
hostname=192.168.1.102
port=3306
candidate_master=1
# Start MHA manager
nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf \
--ignore_last_failover < /dev/null > /var/log/mha/app1/manager.log 2>&1 &
# Simulate master failure
systemctl stop mysqld # on original master
# Observe failover
tail -f /var/log/mha/app1/manager.log
Read‑Write Splitting with ProxySQL
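The query rules configured in this section send SELECT ... FOR UPDATE and all writes to hostgroup 10 (the master) and plain SELECT and SHOW statements to hostgroup 20 (the slaves). The sketch below models that first-match evaluation in Python so the routing can be reasoned about before loading the rules. It is a simplified model, not ProxySQL code: it assumes rules are tried in rule_id order, every rule has apply=1, and matching is case-insensitive.

```python
import re

# (rule_id, match_pattern, destination_hostgroup), copied from the rules in this section.
RULES = [
    (1, r"^SELECT.*FOR UPDATE$", 10),
    (2, r"^SELECT", 20),
    (3, r"^SHOW", 20),
    (4, r".*", 10),
]


def route(query, rules=RULES, default_hostgroup=10):
    """Return the hostgroup of the first rule whose pattern matches the query."""
    text = query.strip()
    for _rule_id, pattern, hostgroup in sorted(rules):
        if re.search(pattern, text, re.IGNORECASE):
            return hostgroup
    # No rule matched: fall back to the user's default hostgroup.
    return default_hostgroup


if __name__ == "__main__":
    for sql in ("SELECT * FROM orders WHERE id = 1",
                "SELECT * FROM orders WHERE id = 1 FOR UPDATE",
                "UPDATE orders SET status = 'paid' WHERE id = 1",
                "show tables"):
        print(route(sql), sql)
```

Rule order matters here: if the generic ^SELECT rule came first, locking reads would be sent to a read-only slave.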
# Install ProxySQL
cat <<'EOF' | tee /etc/yum.repos.d/proxysql.repo
[proxysql_repo]
name=ProxySQL YUM repository
baseurl=https://repo.proxysql.com/ProxySQL/proxysql-2.5.x/centos/$releasever
gpgcheck=1
gpgkey=https://repo.proxysql.com/ProxySQL/proxysql-2.5.x/repo_pub_key
EOF
yum install -y proxysql
systemctl start proxysql
systemctl enable proxysql
# Configure servers and query rules
mysql -uadmin -padmin -h127.0.0.1 -P6032 <<'SQL'
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight) VALUES
(10,'192.168.1.100',3306,1000),
(20,'192.168.1.101',3306,900),
(20,'192.168.1.102',3306,900);
UPDATE global_variables SET variable_value='monitor' WHERE variable_name='mysql-monitor_username';
UPDATE global_variables SET variable_value='Monitor@2024' WHERE variable_name='mysql-monitor_password';
INSERT INTO mysql_query_rules(rule_id,active,match_pattern,destination_hostgroup,apply) VALUES
(1,1,'^SELECT.*FOR UPDATE$',10,1),
(2,1,'^SELECT',20,1),
(3,1,'^SHOW',20,1),
(4,1,'.*',10,1);
INSERT INTO mysql_users(username,password,default_hostgroup) VALUES('app_user','App@2024Pass',10);
LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
SAVE MYSQL USERS TO DISK;
SQL
Performance Tuning and Monitoring
Enable parallel replication, tune InnoDB buffers, and monitor key metrics with Prometheus.
# Enable parallel replication (MySQL 5.7+); the SQL thread must be stopped while these settings are changed
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type='LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers=16;
SET GLOBAL slave_pending_jobs_size_max=536870912;
SET GLOBAL slave_preserve_commit_order=ON;
START SLAVE SQL_THREAD;
# Example Prometheus scrape config (trimmed)
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'mysql'
    static_configs:
      - targets: ['192.168.1.100:9104','192.168.1.101:9104','192.168.1.102:9104']
  - job_name: 'proxysql'
    static_configs:
      - targets: ['192.168.1.200:42004']
Common Issues and Troubleshooting
Replication Stopped – Duplicate Key
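# Valid only when gtid_mode=OFF. With GTID enabled, as in this guide, MySQL rejects
# SQL_SLAVE_SKIP_COUNTER; use the empty-transaction method under "GTID Errors" instead.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
START SLAVE;
Root cause analysis: compare row counts, identify conflicting rows, and delete or reconcile them.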
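The comparison step can be sketched as a small helper that takes the same range of rows fetched from the master and from the slave and reports which primary keys are missing, extra, or different. The function name, the row format (one dict per row), and the id key column are assumptions for this example; fetching the rows is left out.

```python
def diff_rows(master_rows, slave_rows, key="id"):
    """Compare two lists of row dicts by primary key."""
    master = {row[key]: row for row in master_rows}
    slave = {row[key]: row for row in slave_rows}
    shared = master.keys() & slave.keys()
    return {
        "missing_on_slave": sorted(master.keys() - slave.keys()),
        "extra_on_slave": sorted(slave.keys() - master.keys()),
        "different": sorted(k for k in shared if master[k] != slave[k]),
    }


if __name__ == "__main__":
    master_rows = [{"id": 1, "email": "a@example.com"},
                   {"id": 2, "email": "b@example.com"}]
    slave_rows = [{"id": 2, "email": "b-old@example.com"},
                  {"id": 3, "email": "c@example.com"}]
    print(diff_rows(master_rows, slave_rows))
```

Rows reported as extra on the slave are the usual cause of a duplicate-key stop: they were written directly to the slave and later collide with the same key arriving from the master.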
GTID Errors
STOP SLAVE;
SET GTID_NEXT='3E11FA47-71CA-11E1-9E33-C80AA9429562:5';
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
START SLAVE;
Data Inconsistency Check
DELIMITER $$
CREATE PROCEDURE check_data_consistency(IN p_table_name VARCHAR(64), IN p_database VARCHAR(64))
BEGIN
-- Run this procedure on the master and on each slave, then compare the result rows.
-- Build an explicit column list, because CONCAT_WS() cannot take '*' as an argument.
SELECT GROUP_CONCAT(CONCAT('`', COLUMN_NAME, '`') ORDER BY ORDINAL_POSITION)
INTO @cols FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = p_database AND TABLE_NAME = p_table_name;
SET @sql = CONCAT('SELECT COUNT(*) INTO @row_count FROM `', p_database, '`.`', p_table_name, '`');
PREPARE stmt FROM @sql; EXECUTE stmt; DEALLOCATE PREPARE stmt;
-- XOR of per-row CRC32 values: independent of row order and of group_concat_max_len.
-- This replaces the original MD5(GROUP_CONCAT(...)) expression, which depends on both.
SET @sql = CONCAT('SELECT BIT_XOR(CRC32(CONCAT_WS(''-'', ', @cols, '))) INTO @checksum FROM `', p_database, '`.`', p_table_name, '`');
PREPARE stmt FROM @sql; EXECUTE stmt; DEALLOCATE PREPARE stmt;
SELECT p_table_name AS table_name, @row_count AS row_count, @checksum AS checksum;
END$$
DELIMITER ;
Enterprise Best‑Practice Summary
Never rely on a single point; always have standby nodes and automated failover.
Comprehensive monitoring with alerts at multiple severity levels.
Data safety: enable sync_binlog=1, use GTID, and perform regular backup verification.
Capacity planning: model QPS growth, required slaves, and storage needs.
Security hardening: principle of least privilege, audit logging, SSL encryption.
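For the capacity-planning item, a back-of-the-envelope model is often enough to start with. The sketch below projects read QPS with compound monthly growth and derives how many slaves are needed. The function name, the headroom and spare parameters, and all numbers in the usage example are hypothetical and should be replaced with measured values.

```python
import math


def slaves_needed(current_read_qps, monthly_growth, months,
                  qps_per_slave, headroom=0.5, spare=1):
    """Estimate the slave count for a projected read load.

    headroom is the fraction of each slave's capacity kept free for spikes;
    spare is the number of extra slaves kept for failover and maintenance.
    """
    projected_qps = current_read_qps * (1 + monthly_growth) ** months
    usable_per_slave = qps_per_slave * (1 - headroom)
    return math.ceil(projected_qps / usable_per_slave) + spare


if __name__ == "__main__":
    # Hypothetical inputs: 10,000 read QPS today, 5% monthly growth, 12-month horizon,
    # and 5,000 QPS measured per slave in a load test.
    print(slaves_needed(10_000, 0.05, 12, 5_000))
```

Keeping the spare count explicit makes it clear that a promoted slave or one taken offline for backup should not push the remaining slaves past their usable capacity.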
Future Directions
MySQL is moving toward native multi‑master solutions such as Group Replication and InnoDB Cluster, as well as cloud‑native deployments on Kubernetes. Automation and AI‑driven operations will become the new standard.
MaGe Linux Operations
