How to Detect and Recover from RabbitMQ Network Partitions
This article explains why RabbitMQ clusters struggle with network partitions, how to detect partition events via logs and rabbitmqctl, the impact on queues and bindings, and step‑by‑step methods—including manual recovery commands and automatic handling modes—to restore a healthy cluster.
1. Cluster and Network Partition
RabbitMQ clusters do not tolerate network partitions well. When deploying across a WAN, consider using federation or the shovel plugin instead of clustering. This article describes how to detect partitions, the adverse effects they cause, and how to recover.
2. Detecting Network Partitions
If a node cannot contact another node for about a minute (see net_ticktime), Mnesia marks the node as down. If both nodes think the other is down, a partition is detected and logged, e.g.:
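The detection window is governed by the Erlang kernel's net_ticktime setting, which defaults to 60 seconds; a peer is typically declared down roughly 60 to 75 seconds after contact is lost. A sketch of raising it in the classic Erlang-term configuration format (the file name and location vary by installation, so treat the path as an assumption):

```erlang
%% advanced.config (or rabbitmq.config on older releases):
%% tolerate longer network stalls before Mnesia declares a
%% peer node down. Default net_ticktime is 60 seconds.
[
  {kernel, [
    {net_ticktime, 120}
  ]}
].
```

Raising net_ticktime makes the cluster slower to notice genuine failures, so only increase it on networks with known transient stalls.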
=ERROR REPORT==== 15-Oct-2012::18:02:30 ===
Mnesia(rabbit@smacmullen): ** ERROR ** mnesia_event got
{inconsistent_database, running_partitioned_network, hare@smacmullen}
After a node restarts, RabbitMQ records the event and displays it via rabbitmqctl cluster_status or the management plugin. A normal status shows an empty partitions list:
# rabbitmqctl cluster_status
Cluster status of node rabbit@smacmullen ...
[{nodes, [{disc, [hare@smacmullen, rabbit@smacmullen]}]},
{running_nodes, [rabbit@smacmullen, hare@smacmullen]},
{partitions, []}]
... done.
If a partition exists, the partitions list is populated, and the management UI shows a prominent red warning.
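For monitoring, you can check for a non-empty partitions list programmatically. A minimal sketch in Python; the parsing is deliberately naive and tied to the classic Erlang-term output format shown above (newer RabbitMQ releases also offer machine-readable `rabbitmqctl` output, which would be more robust):

```python
import re
import subprocess


def partitions_present(status_text: str) -> bool:
    """Return True if cluster_status output reports any partitions.

    Looks for the {partitions, [...]} term and checks whether the
    list between the brackets is non-empty. Naive text matching,
    assuming the classic Erlang-term output format.
    """
    match = re.search(r"\{partitions,\s*\[(.*?)\]\}", status_text, re.DOTALL)
    if match is None:
        return False  # no partitions term found at all
    return match.group(1).strip() != ""


def check_cluster() -> bool:
    """Run rabbitmqctl and report whether a partition is active."""
    out = subprocess.run(["rabbitmqctl", "cluster_status"],
                         capture_output=True, text=True).stdout
    return partitions_present(out)
```

Such a check can feed an alerting system, since a partition persists silently until an operator intervenes.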
3. Effects of a Network Partition
During a partition, two or more nodes consider the others crashed. Queues, bindings, and exchanges may be created or deleted independently in each partition. Mirror queues split, each side electing its own master. Other undefined or odd behaviors can also appear.
When the network heals, the split state persists until you take corrective action.
4. Recovering from a Partition
Choose the partition you trust most; Mnesia will treat it as the reliable source, discarding changes from other partitions. Stop all nodes in the other partitions, then restart them so they re‑join from the trusted partition. Finally, restart the trusted nodes to clear warnings.
One simple approach is to stop the entire cluster and restart it, ensuring the first node started belongs to the trusted partition.
Typical recovery commands (note that removing the Mnesia directory wipes the node's local state, so the node rejoins the cluster as a fresh member):
# kill <rabbitmq-pid>
# rm -rf /var/lib/rabbitmq/mnesia
# ./rabbitmq-server -detached
# ./rabbitmqctl stop_app
# ./rabbitmqctl join_cluster rabbit@<hostname>
# ./rabbitmqctl start_app
5. Automatic Partition Handling
RabbitMQ offers two automatic handling modes, pause_minority and autoheal (the default, ignore, does nothing).
In pause_minority mode, the cluster pauses the nodes in the minority partition, favoring partition tolerance over availability. Paused nodes keep the Erlang VM running but stop listening on their ports.
In autoheal mode, RabbitMQ automatically selects a winning partition (the one with the most client connections) and restarts the nodes that are not in it.
Configure the mode via the cluster_partition_handling parameter in the RabbitMQ configuration file.
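A sketch of setting the mode, shown in both the modern ini-style rabbitmq.conf and the classic Erlang-term format (use one or the other; file locations vary by platform):

```ini
# rabbitmq.conf (new-style format)
cluster_partition_handling = pause_minority
```

```erlang
%% rabbitmq.config (classic Erlang-term format)
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
```

A node restart is required for the setting to take effect.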
Guidelines for choosing a mode:
ignore: reliable network, all nodes on the same rack or switch.
pause_minority: very unreliable network.
autoheal: network may be unreliable.
Note that in pause_minority mode a node that has been restarted will not pause again on its own, and in a two-node cluster a partition can leave both nodes paused, so the mode is unsuitable for small clusters.
MaGe Linux Operations