
How to Build a High‑Availability MySQL Master‑Slave Cluster and Automate Failover

This guide walks through the reasons for MySQL master‑slave replication, explains its core mechanisms, details step‑by‑step environment planning, configuration, data initialization, replication setup, monitoring, failover with MHA, read‑write splitting using ProxySQL, performance tuning, troubleshooting, and best‑practice recommendations for enterprise‑grade high availability.

MaGe Linux Operations

MySQL Master‑Slave Replication Architecture Setup and Failover

Introduction: A production outage at 3 a.m. exposed the absence of a master‑slave replication architecture: the database was a single point of failure. This article shares the lessons learned, common pitfalls, and a complete, production‑validated solution.

Why MySQL Needs Master‑Slave Replication

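Replication removes the database as a single point of failure, improves availability, enables read‑write separation, keeps live copies of the data, and lets read capacity scale out.

High availability: a slave can be promoted quickly when the master fails, either manually or through an automated tool such as MHA.

Read‑write separation: query load is distributed across slaves, which can reduce response time by up to 60% on read‑heavy workloads.

Data safety: slaves that replicate in near real time act as live copies of the data. They complement regular backups rather than replace them, because an accidental DELETE or DROP replicates too.

Scalability: additional slaves can be added without downtime.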

Replication Principles

Three threads are involved:

Binlog Dump thread on the master reads the binary log and streams it.

I/O thread on the slave receives the binlog and writes it to a relay log.

SQL thread on the slave reads the relay log and executes the statements.
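
The hand-off between the three threads can be pictured with a toy model. The sketch below is not how MySQL is implemented; the `Master` and `Replica` classes and their method names are invented for illustration, and events are plain key/value pairs rather than real binlog events.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    data: dict = field(default_factory=dict)
    binlog: list = field(default_factory=list)  # ordered list of (key, value) events

    def write(self, key, value):
        """Apply a change locally and record it in the binary log."""
        self.data[key] = value
        self.binlog.append((key, value))

    def dump(self, from_pos):
        """Binlog Dump thread: stream events starting at a given position."""
        return self.binlog[from_pos:]


@dataclass
class Replica:
    data: dict = field(default_factory=dict)
    relay_log: list = field(default_factory=list)
    applied: int = 0  # number of relay-log events already applied

    def io_thread(self, master):
        """I/O thread: fetch new events and append them to the relay log."""
        self.relay_log.extend(master.dump(len(self.relay_log)))

    def sql_thread(self):
        """SQL thread: apply relay-log events that have not been applied yet."""
        for key, value in self.relay_log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.relay_log)


master, replica = Master(), Replica()
master.write("user:1", "alice")
master.write("user:2", "bob")
replica.io_thread(master)   # events are received but not yet applied
lag_events = len(replica.relay_log) - replica.applied
replica.sql_thread()        # the replica catches up
print(lag_events, replica.data == master.data)
```

The gap between what the I/O thread has received and what the SQL thread has applied is the replication lag that later sections monitor.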

Binlog Format Choices

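Statement: small log size, but nondeterministic functions such as UUID() or SYSDATE() can cause inconsistencies.

Row: best consistency, larger log size.

Mixed: MySQL logs statements by default and switches to row format automatically for statements it detects as unsafe; the configuration below uses this format. MySQL 8.0 itself defaults to ROW, which is the safer choice when consistency matters most.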
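
To see why the format matters, consider a write whose value is generated at execution time. The following is a minimal sketch rather than MySQL code: `run_statement` and `apply_row_image` are hypothetical helpers, and `uuid.uuid4()` stands in for a nondeterministic SQL function such as UUID().

```python
import uuid


def run_statement(table):
    """Statement-style event: re-run the expression on whichever server applies it."""
    table["token"] = str(uuid.uuid4())  # stands in for a nondeterministic call such as UUID()


def apply_row_image(table, row_image):
    """Row-style event: copy the values the master actually wrote."""
    table.update(row_image)


master_table = {}
run_statement(master_table)

# Statement format: the replica re-evaluates the expression and gets its own value.
replica_stmt = {}
run_statement(replica_stmt)

# Row format: the replica receives the master's resulting row.
replica_row = {}
apply_row_image(replica_row, dict(master_table))

print("statement format consistent:", replica_stmt == master_table)
print("row format consistent:", replica_row == master_table)
```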

GTID Overview

GTID provides a globally unique transaction ID, simplifying failover positioning and ensuring data consistency across complex topologies.
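
Because every transaction carries an identifier of the form `source_uuid:number`, "what is this replica missing?" becomes set arithmetic. The sketch below parses the textual GTID set format and subtracts one set from another, similar in spirit to MySQL's GTID_SUBTRACT() function. The function names are illustrative, and expanding ranges into Python sets is only sensible for small examples.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:1-3' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def missing_transactions(source_executed, replica_executed):
    """Return the transactions present on the source but not yet on the replica."""
    source, replica = parse_gtid_set(source_executed), parse_gtid_set(replica_executed)
    missing = {}
    for uuid, numbers in source.items():
        gap = numbers - replica.get(uuid, set())
        if gap:
            missing[uuid] = sorted(gap)
    return missing


SOURCE = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-10"
REPLICA = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-7:9"
print(missing_transactions(SOURCE, REPLICA))
```

With file-and-position replication the same question requires knowing which binlog file and offset each server is at; with GTIDs the comparison is independent of file layout.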

Production‑Ready Architecture Build

Environment Planning

CPU: master ≥ 8 cores, slaves ≥ 70 % of master.

Memory: master 32 GB+, slaves ≥ 16 GB.

Disk: SSD with ≥ 10 000 IOPS.

Network: 1 Gbps, latency < 1 ms between nodes.

OS: CentOS 7.9 or Ubuntu 20.04 LTS.

MySQL 8.0.30+ and monitoring tools (Prometheus + Grafana).

Master Configuration

[mysqld]
server-id = 1
port = 3306
basedir = /usr/local/mysql
datadir = /data/mysql/data
socket = /tmp/mysql.sock
pid-file = /data/mysql/mysql.pid
character-set-server = utf8mb4
collation-server = utf8mb4_general_ci
log-bin = /data/mysql/binlog/mysql-bin
binlog_format = mixed
binlog_row_image = minimal
max_binlog_size = 1G
binlog_expire_logs_seconds = 604800
sync_binlog = 1

gtid_mode = ON
enforce_gtid_consistency = ON
binlog_gtid_simple_recovery = 1

binlog_cache_size = 4M
max_binlog_cache_size = 2G
binlog_stmt_cache_size = 4M
binlog_transaction_dependency_tracking = WRITESET
transaction_write_set_extraction = XXHASH64

innodb_buffer_pool_size = 20G
innodb_log_file_size = 2G
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1
innodb_flush_method = O_DIRECT
innodb_io_capacity = 10000
innodb_io_capacity_max = 20000
innodb_buffer_pool_instances = 8

max_connections = 3000
max_connect_errors = 1000000
thread_cache_size = 128
thread_stack = 256K

slow_query_log = 1
slow_query_log_file = /data/mysql/logs/slow.log
long_query_time = 1
log_queries_not_using_indexes = 1

log_error = /data/mysql/logs/error.log
log_error_verbosity = 2

performance_schema = ON
performance_schema_instrument = '%=ON'

Slave Configuration

[mysqld]
server-id = 2
port = 3306
basedir = /usr/local/mysql
datadir = /data/mysql/data
socket = /tmp/mysql.sock

relay_log = /data/mysql/relaylog/relay-bin
relay_log_index = /data/mysql/relaylog/relay-bin.index
relay_log_info_repository = TABLE
master_info_repository = TABLE
relay_log_recovery = ON
relay_log_purge = ON
read_only = ON
super_read_only = ON

gtid_mode = ON
enforce_gtid_consistency = ON

slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8
slave_preserve_commit_order = ON
slave_pending_jobs_size_max = 134217728

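# 2 trades durability for speed on the slave; set it back to 1 on any slave that is promoted to master.
innodb_flush_log_at_trx_commit = 2
# 16G assumes a slave with roughly 24 GB of RAM or more; keep the pool near 60-70% of RAM (about 10G on a 16 GB host).
innodb_buffer_pool_size = 16G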
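
A few of the settings in the two files must agree or differ in specific ways: `server-id` must be unique, GTID must be enabled on both sides, and the slave should be read-only. A small pre-flight check can catch mistakes before the servers start. This is a sketch under assumptions: `check_pair` and `load_mysqld` are hypothetical helpers, the inline strings are trimmed copies of the configurations above, and Python's `configparser` does not handle every my.cnf feature (for example `!includedir` lines, duplicate keys, or treating `-` and `_` as equivalent).

```python
import configparser

MASTER_CNF = """
[mysqld]
server-id = 1
gtid_mode = ON
enforce_gtid_consistency = ON
"""

REPLICA_CNF = """
[mysqld]
server-id = 2
gtid_mode = ON
enforce_gtid_consistency = ON
read_only = ON
super_read_only = ON
"""


def load_mysqld(text):
    """Return the [mysqld] section of a my.cnf-style string as a dict."""
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(text)
    return dict(parser["mysqld"])


def check_pair(master_text, replica_text):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    master, replica = load_mysqld(master_text), load_mysqld(replica_text)
    problems = []
    if master.get("server-id") == replica.get("server-id"):
        problems.append("server-id must differ between master and replica")
    for name, cnf in (("master", master), ("replica", replica)):
        for key in ("gtid_mode", "enforce_gtid_consistency"):
            if (cnf.get(key) or "").upper() != "ON":
                problems.append(f"{name}: {key} should be ON")
    if (replica.get("read_only") or "").upper() != "ON":
        problems.append("replica: read_only should be ON")
    return problems


print(check_pair(MASTER_CNF, REPLICA_CNF))
```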

Step‑by‑Step Setup

Step 1: Create Replication User

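-- '%' accepts connections from any host; in production, restrict it to the replica subnet (for example '192.168.1.%').
CREATE USER 'replicator'@'%' IDENTIFIED WITH mysql_native_password BY 'Repl@2024Strong';
GRANT REPLICATION SLAVE ON *.* TO 'replicator'@'%';
-- FLUSH PRIVILEGES is not required after CREATE USER or GRANT; it is harmless and kept for completeness.
FLUSH PRIVILEGES;
SHOW GRANTS FOR 'replicator'@'%';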

Step 2: Backup Master Data (Percona XtraBackup)

# Install XtraBackup
yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ps80
yum install -y percona-xtrabackup-80

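# Perform backup
xtrabackup --defaults-file=/etc/my.cnf \
  --user=root --password='YourRootPassword' \
  --backup --target-dir=/backup/full \
  --parallel=4 --compress --compress-threads=4

# Decompress first: a backup taken with --compress cannot be prepared as-is
# (requires qpress or zstd, depending on the XtraBackup version)
xtrabackup --decompress --remove-original --target-dir=/backup/full

# Prepare backup
xtrabackup --prepare --target-dir=/backup/full

# Get binlog file, position, and GTID set
cat /backup/full/xtrabackup_binlog_info

# Copy the prepared backup to the slave before Step 3 (example: slave at 192.168.1.101)
rsync -a /backup/full/ 192.168.1.101:/backup/full/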
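
The contents of `xtrabackup_binlog_info` are easy to pick apart in a script when the restore is automated. The sketch below assumes the usual layout of a binlog file name, a position, and an optional GTID set separated by whitespace, with long GTID sets possibly wrapped over several lines. The sample string is made up for illustration and `parse_binlog_info` is a hypothetical helper.

```python
def parse_binlog_info(text):
    """Split xtrabackup_binlog_info content into (binlog file, position, GTID set)."""
    fields = text.split(None, 2)  # whitespace-separated: file, position, optional GTID set
    if len(fields) < 2:
        raise ValueError("expected at least a binlog file name and a position")
    gtid_set = "".join(fields[2].split()) if len(fields) == 3 else ""
    return fields[0], int(fields[1]), gtid_set


# Illustrative sample only; real values come from /backup/full/xtrabackup_binlog_info.
SAMPLE = "mysql-bin.000003\t197\t3e11fa47-71ca-11e1-9e33-c80aa9429562:1-25\n"
binlog_file, position, gtid_set = parse_binlog_info(SAMPLE)
print(binlog_file, position, gtid_set)
```

With GTID auto-positioning, the GTID set is the part that matters; the file name and position are only needed for classic position-based replication.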

Step 3: Restore to Slave

# Stop MySQL on slave
systemctl stop mysqld

# Clean data directory
rm -rf /data/mysql/data/*

# Copy backup data
xtrabackup --copy-back --target-dir=/backup/full --datadir=/data/mysql/data
chown -R mysql:mysql /data/mysql/data

# Start MySQL
systemctl start mysqld

Step 4: Configure Replication

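-- If the restored slave's gtid_executed does not already match the backup, seed it
-- from the GTID set in xtrabackup_binlog_info before starting replication:
--   RESET MASTER;
--   SET GLOBAL gtid_purged='<GTID set from xtrabackup_binlog_info>';
-- MySQL 8.0.23+ also offers CHANGE REPLICATION SOURCE TO / START REPLICA; the older
-- syntax below still works in 8.0 but is deprecated.
CHANGE MASTER TO MASTER_HOST='192.168.1.100',
  MASTER_USER='replicator',
  MASTER_PASSWORD='Repl@2024Strong',
  MASTER_PORT=3306,
  MASTER_AUTO_POSITION=1;
START SLAVE;
-- Healthy output shows Slave_IO_Running: Yes and Slave_SQL_Running: Yes.
SHOW SLAVE STATUS\G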
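
The three fields worth checking first in that output are Slave_IO_Running, Slave_SQL_Running, and Seconds_Behind_Master. A script can read them from the vertical (`\G`) output as sketched below. The sample text is abbreviated and illustrative, the helper names are hypothetical, and the 30-second lag threshold is an arbitrary example.

```python
def parse_vertical_output(text):
    """Turn `SHOW SLAVE STATUS\\G` style output into a dict of field -> value."""
    status = {}
    for line in text.splitlines():
        if ":" not in line or line.lstrip().startswith("*"):
            continue  # skip the '*** 1. row ***' banner and blank lines
        key, _, value = line.partition(":")
        status[key.strip()] = value.strip()
    return status


def replication_health(status, max_lag_seconds=30):
    """Classify replica health; the 30-second default is an arbitrary example threshold."""
    if status.get("Slave_IO_Running") != "Yes" or status.get("Slave_SQL_Running") != "Yes":
        return "broken"
    lag = status.get("Seconds_Behind_Master")
    if lag in (None, "", "NULL"):
        return "unknown"
    return "lagging" if int(lag) > max_lag_seconds else "healthy"


SAMPLE = """
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
"""
print(replication_health(parse_vertical_output(SAMPLE)))
```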

High‑Availability and Failover with MHA

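MHA can typically promote a slave within about 30 seconds. It minimizes data loss by applying any binlog events it can still retrieve from the failed master; with asynchronous replication, zero data loss is only achievable when those binlogs remain reachable or semi‑synchronous replication is in use.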

# Install MHA Manager (the mha4mysql-node package must also be installed on the manager and on every MySQL host)
yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch \
  perl-Parallel-ForkManager perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58.tar.gz
tar -xzf mha4mysql-manager-0.58.tar.gz
cd mha4mysql-manager-0.58
perl Makefile.PL
make && make install

# Sample /etc/mha/app1.cnf (trimmed for brevity; a [server default] section with the MySQL, SSH, and replication credentials is also required)
[server1]
hostname=192.168.1.100
port=3306
candidate_master=1

[server2]
hostname=192.168.1.101
port=3306
candidate_master=1

[server3]
hostname=192.168.1.102
port=3306
candidate_master=1

# Start MHA manager
nohup masterha_manager --conf=/etc/mha/app1.cnf --remove_dead_master_conf \
  --ignore_last_failover > /var/log/mha/app1/manager.log 2>&1 &

# Simulate master failure
systemctl stop mysqld   # on original master

# Observe failover
tail -f /var/log/mha/app1/manager.log
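
Conceptually, a failover tool has to answer one question before anything else: which surviving slave should become the new master? The sketch below is a deliberately simplified illustration of that decision, not MHA's actual algorithm. `ReplicaState`, `pick_new_master`, and the `applied_transactions` counter are hypothetical, and the hosts are the ones from the sample configuration above.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    host: str
    candidate_master: bool
    alive: bool
    applied_transactions: int  # hypothetical progress counter, e.g. derived from the executed GTID set


def pick_new_master(replicas):
    """Pick the most up-to-date live candidate; ties keep the earlier entry in the list."""
    eligible = [r for r in replicas if r.alive and r.candidate_master]
    if not eligible:
        raise RuntimeError("no live candidate master available")
    return max(eligible, key=lambda r: r.applied_transactions)


replicas = [
    ReplicaState("192.168.1.101", True, True, 1040),
    ReplicaState("192.168.1.102", True, True, 1042),
]
print(pick_new_master(replicas).host)
```

Choosing the most advanced slave keeps the number of transactions that must be recovered from elsewhere as small as possible.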

Read‑Write Splitting with ProxySQL

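# Install ProxySQL
# The heredoc delimiter is quoted so the shell does not expand $releasever; yum must see it literally.
cat <<'EOF' | tee /etc/yum.repos.d/proxysql.repo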
[proxysql_repo]
name=ProxySQL YUM repository
baseurl=https://repo.proxysql.com/ProxySQL/proxysql-2.5.x/centos/$releasever
gpgcheck=1
gpgkey=https://repo.proxysql.com/ProxySQL/proxysql-2.5.x/repo_pub_key
EOF

yum install -y proxysql
systemctl start proxysql
systemctl enable proxysql

# Configure servers and query rules
# Notes: the heredoc delimiter is quoted so the shell leaves "$" in the patterns alone;
# the monitor account must also exist on the MySQL servers; query rules need active=1
# because they are inactive by default; variable changes need their own LOAD/SAVE.
mysql -uadmin -padmin -h127.0.0.1 -P6032 <<'SQL'
INSERT INTO mysql_servers(hostgroup_id,hostname,port,weight) VALUES
(10,'192.168.1.100',3306,1000),
(20,'192.168.1.101',3306,900),
(20,'192.168.1.102',3306,900);

UPDATE global_variables SET variable_value='monitor' WHERE variable_name='mysql-monitor_username';
UPDATE global_variables SET variable_value='Monitor@2024' WHERE variable_name='mysql-monitor_password';

INSERT INTO mysql_query_rules(rule_id,active,match_pattern,destination_hostgroup,apply) VALUES
(1,1,'^SELECT.*FOR UPDATE$',10,1),
(2,1,'^SELECT',20,1),
(3,1,'^SHOW',20,1),
(4,1,'.*',10,1);

INSERT INTO mysql_users(username,password,default_hostgroup) VALUES('app_user','App@2024Pass',10);

LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL VARIABLES TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
SAVE MYSQL USERS TO DISK;
SQL
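
The order of the query rules matters: rules are evaluated by ascending rule_id, and a matching rule with apply=1 ends the search. The sketch below mimics that first-match behaviour with Python's `re` module so the routing can be reasoned about offline. It is an approximation rather than ProxySQL's matcher: the `route` function is hypothetical, and case-insensitive matching is assumed to mirror ProxySQL's default.

```python
import re

# (rule_id, match_pattern, destination_hostgroup), mirroring the rules in this section.
QUERY_RULES = [
    (1, r"^SELECT.*FOR UPDATE$", 10),
    (2, r"^SELECT", 20),
    (3, r"^SHOW", 20),
    (4, r".*", 10),
]
DEFAULT_HOSTGROUP = 10


def route(query, rules=QUERY_RULES):
    """Return the hostgroup of the first matching rule, evaluated in rule_id order."""
    for _rule_id, pattern, hostgroup in sorted(rules):
        # Assumption: case-insensitive matching, as ProxySQL does by default.
        if re.search(pattern, query, re.IGNORECASE):
            return hostgroup
    return DEFAULT_HOSTGROUP


for sql in ("SELECT * FROM orders WHERE id = 1",
            "SELECT * FROM orders WHERE id = 1 FOR UPDATE",
            "UPDATE orders SET status = 'paid' WHERE id = 1"):
    print(route(sql), sql)
```

In this sketch a locking read that ends with a semicolon no longer matches rule 1 and is routed to the read hostgroup by rule 2, which is a reminder to test rule patterns against the exact query text the application sends.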

Performance Tuning and Monitoring

Enable parallel replication, tune InnoDB buffers, and monitor key metrics with Prometheus.

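# Enable parallel replication (MySQL 5.7+)
# slave_parallel_type and slave_preserve_commit_order can only be changed while the
# SQL thread is stopped; the new worker count also takes effect only after it restarts.
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type='LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers=16;
SET GLOBAL slave_pending_jobs_size_max=536870912;
SET GLOBAL slave_preserve_commit_order=ON;
START SLAVE SQL_THREAD;

# Example Prometheus scrape config (trimmed)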
global:
  scrape_interval: 15s
scrape_configs:
- job_name: 'mysql'
  static_configs:
  - targets: ['192.168.1.100:9104','192.168.1.101:9104','192.168.1.102:9104']
- job_name: 'proxysql'
  static_configs:
  - targets: ['192.168.1.200:42004']
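
Once the exporters are scraped, replication lag is usually the first metric to alert on. The sketch below parses Prometheus text-format output and maps lag to a severity level. Everything specific in it is an assumption to verify against your own setup: the metric name `mysql_slave_status_seconds_behind_master`, the sample output, and the 10 and 60 second thresholds. It also assumes sample lines carry no trailing timestamp.

```python
def parse_metric(exposition_text, metric_name):
    """Collect samples of one metric from Prometheus text-format output."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments such as # HELP / # TYPE
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{", 1)[0] == metric_name:
            samples.append(float(value))
    return samples


def severity(lag_seconds, warn=10, critical=60):
    """Map replication lag to an alert level; both thresholds are example values."""
    if lag_seconds >= critical:
        return "critical"
    return "warning" if lag_seconds >= warn else "ok"


# Hypothetical exporter output; the metric name is an assumption to verify against your exporter.
SAMPLE = """
# TYPE mysql_slave_status_seconds_behind_master untyped
mysql_slave_status_seconds_behind_master{master_host="192.168.1.100"} 42
"""
lag = max(parse_metric(SAMPLE, "mysql_slave_status_seconds_behind_master"))
print(severity(lag))
```

In practice the same thresholds would normally live in Prometheus alerting rules; the script form is only meant to make the logic explicit.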

Common Issues and Troubleshooting

Replication Stopped – Duplicate Key

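# Applies only when gtid_mode = OFF. With GTID enabled, as in this guide, MySQL rejects
# sql_slave_skip_counter; use the empty-transaction method under "GTID Errors" instead.
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;
START SLAVE;

Root cause analysis: skipping an event only unblocks replication and leaves the slave inconsistent. Compare row counts, identify the conflicting rows, and delete or reconcile them.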

GTID Errors

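STOP SLAVE;
# Replace the example GTID with the GTID of the failing transaction, taken from the replication error in SHOW SLAVE STATUS.
SET GTID_NEXT='3E11FA47-71CA-11E1-9E33-C80AA9429562:5';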
BEGIN; COMMIT;
SET GTID_NEXT='AUTOMATIC';
START SLAVE;
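
When several transactions have to be skipped, typing this sequence by hand is error-prone. A small generator can build the statements and refuse anything that does not look like a single-transaction GTID. This is a sketch: `skip_statements` is a hypothetical helper that only produces text, and the output should be reviewed before it is run against a slave.

```python
import re

GTID_PATTERN = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}:\d+$")


def skip_statements(gtids):
    """Build the statement sequence that commits one empty transaction per GTID."""
    statements = ["STOP SLAVE;"]
    for gtid in gtids:
        if not GTID_PATTERN.match(gtid):
            raise ValueError(f"not a single-transaction GTID: {gtid!r}")
        statements += [f"SET GTID_NEXT='{gtid}';", "BEGIN; COMMIT;"]
    statements += ["SET GTID_NEXT='AUTOMATIC';", "START SLAVE;"]
    return statements


print("\n".join(skip_statements(["3E11FA47-71CA-11E1-9E33-C80AA9429562:5"])))
```

Each skipped GTID is a transaction the slave never applied, so the affected rows still need to be reconciled afterwards.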

Data Inconsistency Check

DELIMITER //
CREATE PROCEDURE check_data_consistency(IN p_table_name VARCHAR(64), IN p_database VARCHAR(64))
BEGIN
  -- Build the column list explicitly: CONCAT_WS() cannot take '*'.
  SET SESSION group_concat_max_len = 1000000;
  SELECT GROUP_CONCAT(CONCAT('IFNULL(`', column_name, '`,''NULL'')') ORDER BY ordinal_position)
    INTO @cols
    FROM information_schema.columns
   WHERE table_schema = p_database AND table_name = p_table_name;
  -- Row count plus an order-independent checksum (XOR of per-row CRC32 values).
  SET @sql = CONCAT('SELECT COUNT(*), BIT_XOR(CRC32(CONCAT_WS(''#'',', @cols, '))) ',
                    'INTO @row_count, @checksum FROM `', p_database, '`.`', p_table_name, '`');
  PREPARE stmt FROM @sql; EXECUTE stmt; DEALLOCATE PREPARE stmt;
  -- A procedure cannot query another server, so run it on the master and on each slave
  -- and compare the rows: different counts mean missing rows, different checksums mean differing data.
  SELECT p_table_name AS table_name, @row_count AS row_count, @checksum AS checksum;
END //
DELIMITER ;
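
Once row counts and checksums have been collected from both servers, the comparison itself is simple. The sketch below reproduces the three outcomes (row count mismatch, data mismatch, consistent) on in-memory rows, using an order-independent XOR of per-row CRC32 values. The function names and sample rows are illustrative, and XOR-based fingerprints have a known weakness: pairs of identical rows cancel each other out.

```python
import zlib
from functools import reduce


def table_fingerprint(rows):
    """Return (row count, XOR of per-row CRC32 values); row order does not matter."""
    checksums = (zlib.crc32("#".join(map(str, row)).encode("utf-8")) for row in rows)
    return len(rows), reduce(lambda acc, crc: acc ^ crc, checksums, 0)


def compare(master_rows, replica_rows):
    """Classify the difference between two copies of a table."""
    (m_count, m_sum), (r_count, r_sum) = table_fingerprint(master_rows), table_fingerprint(replica_rows)
    if m_count != r_count:
        return "Row count mismatch"
    return "Consistent" if m_sum == r_sum else "Data mismatch"


master_rows = [(1, "alice"), (2, "bob")]
print(compare(master_rows, [(2, "bob"), (1, "alice")]))  # same rows, different order
print(compare(master_rows, [(1, "alice"), (2, "bot")]))  # one value differs
print(compare(master_rows, [(1, "alice")]))              # a row is missing
```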

Enterprise Best‑Practice Summary

Never rely on a single point; always have standby nodes and automated failover.

Comprehensive monitoring with alerts at multiple severity levels.

Data safety: enable sync_binlog=1, use GTID, and perform regular backup verification.

Capacity planning: model QPS growth, required slaves, and storage needs.

Security hardening: principle of least privilege, audit logging, SSL encryption.

Future Directions

MySQL is moving toward native multi‑master solutions such as Group Replication and InnoDB Cluster, as well as cloud‑native deployments on Kubernetes. Automation and AI‑driven operations will become the new standard.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
