How to Diagnose and Recover MySQL InnoDB Cluster Failures: Real‑World Scenarios
This article walks MySQL DBAs through common MySQL InnoDB Cluster fault scenarios—node restarts, crashes, network partitions, and full‑cluster reboots—providing step‑by‑step commands, status outputs, recovery actions, and impact analysis to ensure high availability and data safety.
Background
MySQL InnoDB Cluster (MIC) is the official high‑availability solution built on MySQL Group Replication and MySQL Shell. It provides automatic fault detection, failover, and data‑consistency guarantees.
Test Environment
OS: Red Hat Enterprise Linux 8.10 (Ootpa)
MySQL Server: mysql-community-server-8.4.5-1.el8.x86_64
MySQL Shell: mysql-shell-8.4.5-1.el8.x86_64
MySQL Router: mysql-router-community-8.4.5-1.el8.x86_64
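For readers who want to reproduce the tests, a three-node cluster on this stack can be assembled with the MySQL Shell AdminAPI roughly as follows. This is a minimal sketch: the cluster name testCluster is a placeholder, and the clusteruser account (used throughout the commands in this article) is assumed to already exist on all three nodes with the required privileges.
# In MySQL Shell (\js) on node1; run configureInstance once per node first
dba.configureInstance("clusteruser@node1:3306");
var cluster = dba.createCluster("testCluster");
# addInstance defaults to clone-based provisioning for the new members
cluster.addInstance("clusteruser@node2:3306");
cluster.addInstance("clusteruser@node3:3306");
cluster.status();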
Scenario 1 – Restart a Non‑Primary Instance
Fault simulation
# On node2
systemctl restart mysqld
# On node1
mysqlsh --uri [email protected]:3306
\js
var cluster = dba.getCluster();
cluster.status();
Recovery process
# Check initial status
mysqlsh --uri [email protected]:3306
\js
var cluster = dba.getCluster();
cluster.status();
# While node2 restarts, status shows:
{..."status":"OK_NO_TOLERANCE_PARTIAL","statusText":"Cluster is NOT tolerant to any failures. 1 member is not active.",...}
# After node2 is up again:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}Summary
Fault symptoms
node2 appears as (MISSING)
Cluster status changes to OK_NO_TOLERANCE_PARTIAL
Clients see “Lost connection to MySQL server” errors
Recovery key points
The restarted replica is automatically detected and rejoins the cluster
No manual steps are required; the node returns to ONLINE
Read traffic continues while the replica is down
Impact scope
Fault‑tolerance drops to zero during the restart
Write operations are unaffected
Read load balancing may be briefly impacted
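Beyond cluster.status(), the rejoin can also be watched from plain SQL on any surviving member: the restarting node should move from absent through RECOVERING to ONLINE in the standard Group Replication membership table (a sketch, not from the original article):
# On node1 or node3
SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;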
Scenario 2 – Crash a Non‑Primary Instance
Fault simulation
# On node3
systemctl stop mysqld
# On node1
mysqlsh --uri [email protected]:3306
\js
var cluster = dba.getCluster();
cluster.status();
# On node3
systemctl start mysqld
Recovery process
# After stopping node3, status shows:
{..."status":"OK_NO_TOLERANCE_PARTIAL",...}
# After node3 restarts, status returns to:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}Summary
Fault symptoms
node3 shows (MISSING)
Cluster status becomes OK_NO_TOLERANCE_PARTIAL
Connection errors appear
Recovery key points
The stopped replica automatically rejoins once it starts
Data is resynchronized through Group Replication’s distributed recovery; no manual intervention is required
Impact scope
Fault‑tolerance is lost while the node is down
Read load may drop slightly
Primary writes remain unaffected
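For more per-member detail while a node is catching up, MySQL Shell’s extended status output can be consulted (field names vary slightly across Shell versions, so treat this as a sketch):
var cluster = dba.getCluster();
cluster.status({extended: 1});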
Scenario 3 – Network Partition Between Primary and Replicas
Fault simulation
# On node1 (simulate partition)
./stop_net.sh # iptables rules drop traffic to node2 and node3
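# stop_net.sh is not listed in the article; a minimal sketch of what such a
# script might contain (the node IPs are placeholders, not from the article):
#   iptables -A INPUT  -s <node2_ip> -j DROP
#   iptables -A OUTPUT -d <node2_ip> -j DROP
#   iptables -A INPUT  -s <node3_ip> -j DROP
#   iptables -A OUTPUT -d <node3_ip> -j DROP
# Restoring the network then amounts to deleting these rules (iptables -D ...)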
# Check status from node1
mysqlsh --uri [email protected]:3306
\js
var cluster = dba.getCluster();
cluster.status();
Recovery process
# While network is cut, node1 sees:
{..."status":"NO_QUORUM","statusText":"Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions.",...}
# Nodes 2 and 3 remain ONLINE from their own view
# After restoring the network, node1 status returns to OK:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}Summary
Fault symptoms
Primary reports NO_QUORUM
Replicas are reported as UNREACHABLE
Potential split‑brain warnings appear
Recovery key points
When the network is restored, the primary automatically reconnects
Data‑consistency checks run automatically
Primary/replica roles are renegotiated without manual steps
Impact scope
Writes are blocked on the partitioned primary while it lacks quorum
A majority of nodes must be reachable to resume writes
After reconnection, check for errant transactions before trusting full consistency
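In this scenario nodes 2 and 3 keep a majority, so the group heals itself once the network returns. If a partition ever leaves no side with quorum, MySQL Shell does provide an override, forceQuorumUsingPartitionOf(), which rebuilds quorum from the partition containing the named instance. It is destructive if misused (the members on the other side must genuinely be down), so the following is only a sketch:
# In MySQL Shell, connected to a reachable member of the partition to keep
var cluster = dba.getCluster();
cluster.forceQuorumUsingPartitionOf("clusteruser@node2:3306");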
Scenario 4 – Restart the Primary Instance
Fault simulation
# On node1 (primary)
systemctl restart mysqld
# After restart, check status
mysqlsh --uri [email protected]:3306
\js
var cluster = dba.getCluster();
cluster.status();
Recovery process
# While primary restarts, node2 is promoted to primary and status becomes:
{..."primary":"node2:3306","status":"OK_NO_TOLERANCE_PARTIAL","node1:3306":{"status":"(MISSING)"}...}
# After node1 comes back, it may remain (MISSING). If it does not auto‑join, run:
cluster.rejoinInstance("clusteruser@node1:3306");
Summary
Fault symptoms
Primary fails over to node2
node1 appears as (MISSING) during the restart and rejoins as a SECONDARY
Recovery key points
Cluster automatically promotes a new primary
The original primary normally rejoins automatically once it is back online; if it stays (MISSING), rejoin it manually with rejoinInstance
Applications need to handle the brief primary change
Impact scope
Writes are briefly interrupted during failover
Clients may need to reconnect to the new primary
Fault‑tolerance is reduced until the original primary rejoins
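Applications that connect through MySQL Router are redirected to the new primary on their next connection attempt. Assuming Router was bootstrapped with its default ports (6446 for read‑write, 6447 for read‑only; the router host below is a placeholder), the failover can be confirmed from the client side:
# Ask Router's read-write port which backend is now the primary
mysql -u clusteruser -h <router_host> -P 6446 -e "SELECT @@hostname;"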
Scenario 5 – Simultaneous Restart of All Nodes (Quorum Loss)
Fault simulation
# On all nodes
./restart_node.sh # stops mysqld, sleeps 30 s, starts mysqld
# After restart, any status query fails with:
Cluster.status: The cluster object is disconnected. Please use dba.getCluster() to obtain a fresh cluster handle.
Recovery process
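# First, identify the node with the most complete executed GTID set by
# comparing this value on every member (standard SQL):
SELECT @@GLOBAL.gtid_executed;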
# Bootstrap Group Replication on the chosen node (e.g., node1), in SQL:
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
# Back in MySQL Shell, fetch a fresh cluster handle and rejoin the other nodes
var cluster = dba.getCluster();
cluster.rejoinInstance("clusteruser@node2:3306");
cluster.rejoinInstance("clusteruser@node3:3306");
# Verify final status
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}Summary
Fault symptoms
All nodes restart and the cluster becomes disconnected
Group Replication does not start automatically
Manual bootstrap and rejoin steps are required
Recovery key points
Select the node with the most complete GTID set as the bootstrap node
Enable group_replication_bootstrap_group to start the group
Use rejoinInstance to add the remaining nodes back into the cluster
Impact scope
The entire cluster service is interrupted
Manual intervention is needed to restore availability
Recovery time is longer than for single‑node failures
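Worth noting: for exactly this complete‑outage case, MySQL Shell also documents a one‑step AdminAPI call, dba.rebootClusterFromCompleteOutage(), which wraps the bootstrap‑and‑rejoin sequence shown above. A minimal sketch, run while connected to the most up‑to‑date node:
# In MySQL Shell (\js) on the node with the most complete GTID set
var cluster = dba.rebootClusterFromCompleteOutage();
cluster.status();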
Conclusion
The tests demonstrate that MySQL InnoDB Cluster automatically recovers from single‑node restarts or crashes, and from network partitions once connectivity is restored. Primary failures trigger automatic failover. However, simultaneous restarts that cause quorum loss require manual bootstrap and re‑joining, highlighting scenarios where human intervention is essential to maintain high availability and data integrity.
References
MySQL InnoDB Cluster documentation: https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html
MySQL Group Replication documentation: https://dev.mysql.com/doc/en/group-replication.html
MySQL Shell rejoin‑cluster documentation: https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade open‑source tools and services for MySQL, releases a premium open‑source component each year on 10/24 (Programmer’s Day), and operates and maintains them on an ongoing basis.