Backend Development 25 min read

How to Build a Million‑Message‑Per‑Second RabbitMQ Cluster: Lessons from Google

This article explains how to design and scale a RabbitMQ cluster capable of handling millions of messages per second, covering core concepts, Google’s large‑scale test setup, sharding and federation plugins, mirror queues, reliability mechanisms, and practical tips for high‑availability and performance optimization.

ITFLY8 Architecture Home

Apr 26, 2020

How to Build a Million‑Message‑Per‑Second RabbitMQ Cluster: Lessons from Google

Background

Using RabbitMQ cluster horizontal scaling to balance traffic pressure, the message cluster can achieve million‑level per‑second service capacity; Google performed similar experiments, and we summarize our practice experience and pitfalls.

RabbitMQ Overview

RabbitMQ is a message middleware based on the AMQP protocol, commonly used to store and forward messages in distributed systems. It is easy to use, scalable, and highly available, originally popular in financial systems. The main purpose of a message broker is to decouple service components, allowing producers to send messages without needing to know consumers.

Key components:

Message: produced by a producer, routed by an exchange to a queue, then consumed.

Queue: stores messages until a consumer retrieves them.

Binding: maps a queue to an exchange according to routing rules.

Exchange: routes messages based on routing keys; types include topic, direct, fanout.

Broker: the server process that implements AMQP routing.

Virtual host: isolates a set of exchanges, queues, and bindings.

Connection: TCP connection between client and broker.

Channel: lightweight multiplexed stream over a connection.

Cluster Modes

RabbitMQ brokers can be grouped into a cluster of Erlang nodes. Two modes exist: the default mode where a queue’s messages reside on a single node (metadata is replicated across nodes), and the mirrored‑queue (HA) mode where queues are replicated across multiple nodes. The default mode can cause bottlenecks and message loss if the node holding the queue fails; mirrored queues solve this by synchronizing message entities across nodes, at the cost of higher network bandwidth and reduced performance.

How to Build a Million‑Message Service

Google’s experiment used 32 virtual machines (8 CPU, 30 GB RAM each). The roles were:

30 RAM nodes (metadata kept in RAM only)

1 disc node (metadata persisted to disk)

1 stats node (runs the management plugin, no queues)

Under a production load of 1,345,531 msgs/s and a consumption load of 1,413,840 msgs/s, RabbitMQ only accumulated 2,343 messages in the waiting queue, showing no memory pressure or flow‑control triggers.

RabbitMQ Sharding Plugin

Enable the plugin with rabbitmq-plugins enable rabbitmq_sharding. Sharding creates multiple shard queues per node, automatically expanding when new nodes join the cluster. It uses a new exchange type x-modulus-hash that hashes the routing key and mods it by the number of bound queues, ensuring a message is routed to a single queue.

Consistent‑sharding Exchange

The consistent‑hash exchange distributes messages uniformly across queues based on a hash of the routing key. Queues are bound with a numeric string representing their weight in the hash space. This allows proportional load distribution (e.g., queue A receiving twice the traffic of queue B).

Reliability and Availability Discussion

To guarantee delivery, use publisher confirms and consumer acknowledgments. Publishers send confirm.select to enable confirm mode; the broker replies with confirm.select‑ok. After a message is routed to all intended queues (including mirrored queues), the broker sends basic.ack. Negative acknowledgments use basic.nack or basic.reject. Heartbeat frames detect TCP connection loss, which is used by ELB‑RabbitMQ setups.

Scenario 1: Ensuring Message Delivery Reliability

Enable confirm mode on the channel, and have consumers send explicit acks after successful processing. This ensures the broker knows when a message is safely persisted (for durable queues) and when it can be removed.

Scenario 2: Retry Mechanism After Failure

Configure a dead‑letter exchange (DLX) and TTL on a retry queue. When a consumer fails, the message is dead‑lettered to the DLX, routed to a retry queue, waits for the TTL, then is re‑published to the original work queue.

Scenario 3: Delayed Tasks

Implement delay by setting a TTL on a holding queue and a dead‑letter exchange that forwards expired messages to the work queue.

Scenario 4: Cross‑Center Message Sharing

The Federation plugin allows brokers in different data centers to share messages without forming a cluster. It works over AMQP, supports different users/virtual hosts, and can federate exchanges or queues.

Enable the plugin with rabbitmq-plugins enable rabbitmq_federation and the management UI with rabbitmq-plugins enable rabbitmq_federation_management. Define policies that match queue names to apply federation.

Scenario 5: High‑Availability Queue Configuration

Mirror queues across nodes so that each queue has a master and one or more slaves. If the master fails, a slave is promoted. Mirroring improves availability but adds network and CPU overhead.

Performance tests show mirrored queues have noticeable latency compared to normal clusters; increasing prefetch and adding resources can mitigate the impact.

Spring AMQP Integration

Spring AMQP provides the AmqpTemplate abstraction for interacting with RabbitMQ, allowing easy switching between AMQP brokers without code changes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Clustering Sharding Message Queue RabbitMQ scaling

Written by

ITFLY8 Architecture Home

ITFLY8 Architecture Home - focused on architecture knowledge sharing and exchange, covering project management and product design. Includes large-scale distributed website architecture (high performance, high availability, caching, message queues...), design patterns, architecture patterns, big data, project management (SCRUM, PMP, Prince2), product design, and more.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.