
How to Diagnose and Recover MySQL InnoDB Cluster Failures: Real‑World Scenarios

This article walks MySQL DBAs through common MySQL InnoDB Cluster fault scenarios—node restarts, crashes, network partitions, and full‑cluster reboots—providing step‑by‑step commands, status outputs, recovery actions, and impact analysis to ensure high availability and data safety.

Aikesheng Open Source Community

Background

MySQL InnoDB Cluster (MIC) is the official high‑availability solution built on MySQL Group Replication, managed through MySQL Shell, and typically fronted by MySQL Router. It provides automatic fault detection, failover, and data‑consistency guarantees.
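Before walking through the failure scenarios, it can help to confirm that the Group Replication plugin the cluster is built on is loaded and active on each node. The query below is not part of the scenarios that follow, just a quick sanity check you can run against any member:

# Confirm the Group Replication plugin is loaded and ACTIVE on a member
SELECT PLUGIN_NAME, PLUGIN_STATUS
FROM information_schema.PLUGINS
WHERE PLUGIN_NAME = 'group_replication';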

Test Environment

OS: Red Hat Enterprise Linux 8.10 (Ootpa)

MySQL Server: mysql-community-server-8.4.5-1.el8.x86_64

MySQL Shell: mysql-shell-8.4.5-1.el8.x86_64

MySQL Router: mysql-router-community-8.4.5-1.el8.x86_64

Scenario 1 – Restart a Non‑Primary Instance

Fault simulation

# On node2
systemctl restart mysqld

# On node1
mysqlsh --uri clusteruser@node1:3306
\js
var cluster = dba.getCluster();
cluster.status();

Recovery process

# Check initial status
mysqlsh --uri clusteruser@node1:3306
\js
var cluster = dba.getCluster();
cluster.status();

# While node2 restarts, status shows:
{..."status":"OK_NO_TOLERANCE_PARTIAL","statusText":"Cluster is NOT tolerant to any failures. 1 member is not active.",...}

# After node2 is up again:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}

Summary

Fault symptoms

node2 appears as (MISSING)

Cluster status changes to OK_NO_TOLERANCE_PARTIAL

Clients see “Lost connection to MySQL server” errors

Recovery key points

The restarted replica is automatically detected and rejoins the cluster

No manual steps are required; the node returns to ONLINE

Read traffic continues while the replica is down

Impact scope

Fault‑tolerance drops to zero during the restart

Write operations are unaffected

Read load balancing may be briefly impacted

Scenario 2 – Crash a Non‑Primary Instance

Fault simulation

# On node3
systemctl stop mysqld

# On node1
mysqlsh --uri clusteruser@node1:3306
\js
var cluster = dba.getCluster();
cluster.status();

# On node3
systemctl start mysqld

Recovery process

# After stopping node3, status shows:
{..."status":"OK_NO_TOLERANCE_PARTIAL",...}

# After node3 restarts, status returns to:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}

Summary

Fault symptoms

node3 shows (MISSING)

Cluster status becomes OK_NO_TOLERANCE_PARTIAL

Connection errors appear

Recovery key points

The stopped replica automatically rejoins once it starts

Data is synchronized without manual intervention

No human action is required

Impact scope

Fault‑tolerance is lost while the node is down

Read load may drop slightly

Primary writes remain unaffected

Scenario 3 – Network Partition Between Primary and Replicas

Fault simulation

# On node1 (simulate partition)
./stop_net.sh   # iptables rules drop traffic to node2 and node3

# Check status from node1
mysqlsh --uri clusteruser@node1:3306
\js
var cluster = dba.getCluster();
cluster.status();
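The stop_net.sh script itself is not included in the article; a minimal sketch of what such a partition script might look like, assuming node2 and node3 resolve as hostnames on node1, is:

#!/bin/bash
# Hypothetical stop_net.sh: drop all traffic between node1 and the other members
iptables -A INPUT  -s node2 -j DROP
iptables -A OUTPUT -d node2 -j DROP
iptables -A INPUT  -s node3 -j DROP
iptables -A OUTPUT -d node3 -j DROP
# Restore connectivity later by deleting these rules (iptables -D ...) or flushing them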

Recovery process

# While network is cut, node1 sees:
{..."status":"NO_QUORUM","statusText":"Cluster has no quorum as visible from 'node1:3306' and cannot process write transactions.",...}

# Nodes 2 and 3 remain ONLINE from their own view

# After restoring the network, node1 status returns to OK:
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}

Summary

Fault symptoms

Primary reports NO_QUORUM

Replicas are reported as UNREACHABLE

Potential split‑brain warnings appear

Recovery key points

When the network is restored, the primary automatically reconnects

Data‑consistency checks run automatically

Primary/replica roles are renegotiated without manual steps

Impact scope

Writes are blocked while the primary lacks quorum

A majority of nodes must be reachable to resume writes

Post‑reconnection data conflicts may need resolution

Scenario 4 – Restart the Primary Instance

Fault simulation

# On node1 (primary)
systemctl restart mysqld

# After the restart, check status from another member (node2)
mysqlsh --uri clusteruser@node2:3306
\js
var cluster = dba.getCluster();
cluster.status();

Recovery process

# While primary restarts, node2 is promoted to primary and status becomes:
{..."primary":"node2:3306","status":"OK_NO_TOLERANCE_PARTIAL","node1:3306":{"status":"(MISSING)"}...}

# After node1 comes back, it may remain (MISSING). If it does not auto‑join, run:
cluster.rejoinInstance("clusteruser@node1:3306");
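If you prefer node1 to resume its old role once it is ONLINE again, MySQL Shell can switch the primary back. This is an optional step, not part of the original recovery:

# Optional: after node1 has rejoined and is ONLINE, promote it back to primary
cluster.setPrimaryInstance("clusteruser@node1:3306");
cluster.status();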

Summary

Fault symptoms

Primary fails over to node2

node1 appears as (MISSING) while it is down and, once it rejoins, is demoted to SECONDARY

Recovery key points

Cluster automatically promotes a new primary

The original primary must be rejoined manually (or automatically) after it is back online

Applications need to handle the brief primary change

Impact scope

Writes are briefly interrupted during failover

Clients may need to reconnect to the new primary

Fault‑tolerance is reduced until the original primary rejoins
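The application-side impact is smallest when clients connect through MySQL Router (included in the test environment above) instead of pointing at node1 directly, since Router re-routes the read‑write port to whichever member is currently primary. A minimal illustration, where router-host and appuser are placeholders and the ports are Router's defaults:

# Read-write traffic follows the current primary (Router's default classic RW port is 6446)
mysql -h router-host -P 6446 -u appuser -p

# Read-only traffic is balanced across the secondaries (default classic RO port is 6447)
mysql -h router-host -P 6447 -u appuser -p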

Scenario 5 – Simultaneous Restart of All Nodes (Quorum Loss)

Fault simulation

# On all nodes
./restart_node.sh   # stop mysql, sleep 30, start mysql

# After restart, any status query fails with:
Cluster.status: The cluster object is disconnected. Please use dba.getCluster() to obtain a fresh cluster handle.

Recovery process

# Choose the node with the most complete GTID set (e.g., node1) and bootstrap Group Replication
# In a SQL session on node1
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

# Then, from MySQL Shell (JS mode) on node1, rejoin the other nodes
var cluster = dba.getCluster();
cluster.rejoinInstance("clusteruser@node2:3306");
cluster.rejoinInstance("clusteruser@node3:3306");

# Verify final status
cluster.status();
{..."status":"OK","statusText":"Cluster is ONLINE and can tolerate up to ONE failure.",...}

Summary

Fault symptoms

All nodes restart and the cluster becomes disconnected

Group Replication does not start automatically

Manual bootstrap and rejoin steps are required

Recovery key points

Select the node with the most complete GTID set as the bootstrap node

Enable group_replication_bootstrap_group to start the group

Use rejoinInstance to add the remaining nodes back into the cluster

Impact scope

The entire cluster service is interrupted

Manual intervention is needed to restore availability

Recovery time is longer than for single‑node failures

Conclusion

The tests demonstrate that MySQL InnoDB Cluster automatically recovers from single‑node restarts or crashes, and from network partitions once connectivity is restored. Primary failures trigger automatic failover. However, simultaneous restarts that cause quorum loss require manual bootstrap and re‑joining, highlighting scenarios where human intervention is essential to maintain high availability and data integrity.

References

MySQL InnoDB Cluster documentation: https://dev.mysql.com/doc/mysql-shell/8.4/en/monitoring-innodb-cluster.html

MySQL Group Replication documentation: https://dev.mysql.com/doc/en/group-replication.html

MySQL Shell rejoin‑cluster documentation: https://dev.mysql.com/doc/mysql-shell/8.4/en/rejoin-cluster.html

Tags: high availability · MySQL · Database Operations · InnoDB Cluster · fault-recovery
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component every year on 1024, and continuously operates and maintains those releases.
