Databases 9 min read

MySQL High‑Availability Incident Review and Recovery Steps

The article recounts a production‑like MySQL dual‑master HA setup using Keepalived, describes how a missing binary‑log index caused replication failure, and details step‑by‑step troubleshooting commands, configuration fixes, and preventive measures to restore reliable database synchronization.

Wukong Talks Architecture

Jun 14, 2022

MySQL High‑Availability Incident Review and Recovery Steps

I'm Wukong. Previously we deployed a MySQL high‑availability architecture with dual‑master mode and Keepalived to ensure automatic failover. This post revisits a recent incident in our test environment where the database became unavailable and walks through the entire troubleshooting process.

Incident details : At 10:30 am the testing team reported a severe issue, suspecting a database problem. Since I had set up the cluster, I began investigating.

System deployment : Two MySQL instances run on nodes node55 and node56 in a dual‑master configuration. Each node also runs a Keepalived instance that monitors the MySQL container.

Initial diagnosis : I first expected Keepalived to restart MySQL automatically. Running docker ps showed that neither MySQL container was listed, meaning they were not running. Checking Keepalived with systemctl status keepalived confirmed the service was healthy. A subsequent docker ps -a revealed that the MySQL containers kept restarting and then exiting.

Root cause discovery : Container logs indicated that the file mysql-bin.index was missing. This file is required for binary‑log replication. To recreate it, I created the missing log directory and gave it full permissions:

mkdir log
chmod 777 log -R

After the directory existed, Keepalived restarted the MySQL containers, but the slave still reported:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

The replication status showed Slave_IO_Running: NO and Master_Log_File: mysql-bin.000014, which did not match the index file (which still pointed to mysql-bin.000001). This mismatch caused the error.

Replication mechanism recap : The master generates a dump thread that streams binlog files to the slave. The slave’s I/O thread writes these to a relay‑log, and the SQL thread executes the statements. Because the index was wrong, the I/O thread could not start.

Solution : On the master ( node55) I captured the current binlog file and position:

FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
UNLOCK TABLES;

Assuming the output was File=mysql-bin.000001, Position=117748, I re‑configured the slave ( node56) as follows:

# Stop replication
STOP SLAVE;

# Set correct master log file and position
CHANGE MASTER TO MASTER_HOST='10.2.1.55',
MASTER_PORT=3306,
MASTER_USER='vagrant',
MASTER_PASSWORD='vagrant',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=117748;

# Restart replication
START SLAVE;

After applying these commands the I/O thread resumed, replication became healthy, and both nodes reported normal status.

Additional findings : The /var/lib/mysql/log directory had been removed, likely by a colleague who cleaned up old databases via Navicat. To avoid this, I recommend separating the binary‑log directory from the data directory, e.g.,

datadir = /var/lib/mysql/data
log_bin = /var/lib/mysql/log

Lessons learned : The HA setup did not provide timely alerts; the failure was only noticed after the testing team reported it. Future improvements include configuring Keepalived to send email notifications or integrating log‑based monitoring to detect MySQL anomalies automatically.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

mysql Databases high-availability Keepalived

Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.