How Vivo Scaled RabbitMQ to Ten‑Fold Traffic with High‑Availability Architecture
This article details how Vivo evolved RabbitMQ from a single shared cluster into a multi‑cluster, high‑availability service. It covers the MQ‑Portal request workflow, the features added to the client SDK, the stateless MQ‑NameServer, and the same‑city dual‑active deployment that supported a tenfold traffic increase while the clusters remained stable.
Background
Vivo introduced RabbitMQ in 2016, extending the open‑source broker to provide a messaging middleware service. Between 2016 and 2018 all services shared a single cluster, which became overloaded and suffered frequent failures as business grew.
In 2019 Vivo built a high‑availability layer on top of RabbitMQ, adding an MQ name service and same‑city dual‑active clusters. Clusters were physically isolated from one another, workloads were assigned to clusters according to load and traffic, and the assignments were adjusted dynamically. Since then traffic has grown tenfold while the clusters have remained stable.
Overall Architecture
Vivo's architecture consists of three core components:
MQ‑Portal: a web‑based application request platform that records metadata such as producer, consumer, exchange, queue, and traffic.
MQ‑SDK: a client library built on spring‑message and spring‑rabbit that adds authentication, cluster addressing, rate limiting, production‑consumer reset, and blockage transfer capabilities.
MQ‑NameServer: a stateless service that provides SDK authentication, cluster location, metric reporting, and fast failover.
The diagram below illustrates the high‑level flow (image omitted for brevity).
MQ‑Portal – Application Request Process
Previously, application usage of RabbitMQ was tracked in scattered spreadsheets, leading to outdated information. MQ‑Portal introduces a visual, platform‑wide request workflow:
Developers submit a request specifying the producer, consumer, exchange, queue, and expected traffic.
The request enters Vivo's internal ticket system for approval.
Upon approval, the ticket callback allocates a specific cluster and automatically creates the required exchange/queue bindings.
Because multiple isolated clusters are used, each exchange/queue is linked to a unique rmq.topic.key and rmq.secret.key. These keys are distributed via the ticket callback, enabling the SDK to locate the correct cluster at startup.
Client SDK Capabilities
1) Application‑Level Authorization
Open‑source RabbitMQ authenticates a connection by username and password, but it does not check whether a given application is permitted to use a particular exchange or queue. The SDK works with MQ‑NameServer to verify that the rmq.topic.key supplied by an application belongs to an application authorized for it, and rejects unauthorized sends.
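The check described above might look roughly like the following on the name‑service side. This is a minimal sketch, not Vivo's implementation: the class `TopicAuthorizer`, its methods, and the idea of storing a set of authorized application names per topic key are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of topic keys; all names here are illustrative only. */
class TopicAuthorizer {
    private static final class Grant {
        final byte[] secret;
        final Set<String> apps;

        Grant(String secret, Set<String> apps) {
            this.secret = secret.getBytes(StandardCharsets.UTF_8);
            this.apps = Set.copyOf(apps);
        }
    }

    private final Map<String, Grant> grants = new ConcurrentHashMap<>();

    /** Records which applications may use a topic key (assumed to happen at approval time). */
    void register(String topicKey, String secretKey, Set<String> authorizedApps) {
        grants.put(topicKey, new Grant(secretKey, authorizedApps));
    }

    /** True only if the key exists, the secret matches, and the application is on the list. */
    boolean isAuthorized(String appName, String topicKey, String secretKey) {
        if (appName == null || topicKey == null || secretKey == null) {
            return false;
        }
        Grant grant = grants.get(topicKey);
        if (grant == null) {
            return false;
        }
        // MessageDigest.isEqual compares the secrets in constant time.
        boolean secretMatches = MessageDigest.isEqual(
                grant.secret, secretKey.getBytes(StandardCharsets.UTF_8));
        return secretMatches && grant.apps.contains(appName);
    }
}
```

An SDK built this way would call `isAuthorized` once at startup and refuse to create producers or consumers when it returns false.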
2) Cluster Addressing
Each application may be assigned to a different cluster based on load. The SDK reads the rmq.topic.key and automatically resolves the correct cluster address, abstracting the multi‑cluster complexity from developers.
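As a rough illustration of this lookup, the sketch below maps a topic key to a broker address list. The class `ClusterResolver` and its methods are hypothetical; the article does not describe how the mapping is stored or queried.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical lookup table: topic key -> comma-separated broker addresses. */
class ClusterResolver {
    private final Map<String, String> addressesByTopicKey = new ConcurrentHashMap<>();

    /** Assigns (or reassigns) a topic key to a cluster. */
    void assign(String topicKey, String addresses) {
        addressesByTopicKey.put(topicKey, addresses);
    }

    /** Returns the addresses for a topic key, or empty if the key is unknown. */
    Optional<String> resolve(String topicKey) {
        if (topicKey == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(addressesByTopicKey.get(topicKey));
    }
}
```

The resolved string uses the comma‑separated host:port form that `CachingConnectionFactory.setAddresses` accepts, so application code never needs to name a cluster itself.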
3) Client Rate Limiting
Without limits, a misbehaving application can flood a cluster and affect all tenants. The SDK provides configurable rate‑limiting to protect cluster stability.
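The article does not say which algorithm the SDK uses. One common choice for client‑side limiting is a token bucket, sketched here under that assumption; the class name, constructor parameters, and injectable clock are illustrative.

```java
import java.util.function.LongSupplier;

/** Token-bucket limiter; one possible technique, not necessarily the one the SDK uses. */
class TokenBucket {
    private final long capacity;
    private final double refillPerNano;
    private final LongSupplier nanoClock;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long permitsPerSecond, long capacity, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = permitsPerSecond / 1_000_000_000.0;
        this.nanoClock = nanoClock;
        this.tokens = capacity;               // start full so short bursts are allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the caller may publish one message now, false if it must wait or drop. */
    synchronized boolean tryAcquire() {
        long now = nanoClock.getAsLong();
        // Add the tokens earned since the last call, never exceeding the bucket capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

In production the clock would be `System::nanoTime`; passing it in keeps the limiter easy to test. A producer wrapper would call `tryAcquire()` before each send and either delay or reject the message when it returns false.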
4) Production‑Consumer Reset
When a cluster is split or a consumer disconnects, the SDK can reset connections, recreate factories, and restart producers/consumers. The reset flow includes:
Reset connection factory parameters.
Close existing connections.
Establish new connections.
Restart production and consumption.
Example code used for resetting:
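// Point the factory at the target cluster (comma-separated host:port list).
CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
connectionFactory.setAddresses(address);
// Close cached connections; new ones are created on demand against the new addresses.
connectionFactory.resetConnection();
// Rebuild the admin and template on top of the refreshed factory.
rabbitAdmin = new RabbitAdmin(connectionFactory);
rabbitTemplate = new RabbitTemplate(connectionFactory);
The SDK also implements an exception‑resend strategy to avoid message loss during resets.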
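The article gives no detail on the resend strategy. A minimal sketch, assuming a bounded retry with linear backoff, is shown below; `ResendingSender` and its parameters are hypothetical, and the `delegate` stands in for whatever call actually publishes (for example a lambda around `rabbitTemplate.convertAndSend`).

```java
import java.util.function.Consumer;

/** Retries a failed send a bounded number of times; an assumed strategy, for illustration. */
class ResendingSender<T> {
    private final Consumer<T> delegate;
    private final int maxAttempts;
    private final long backoffMillis;

    ResendingSender(Consumer<T> delegate, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
        this.backoffMillis = backoffMillis;
    }

    /** Sends the message, retrying on runtime exceptions; rethrows the last failure. */
    void send(T message) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delegate.accept(message);
                return;
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    pause(backoffMillis * attempt); // wait a little longer after each failure
                }
            }
        }
        throw lastFailure;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

Catching `RuntimeException` covers Spring AMQP's `AmqpException`, which is unchecked. Note that resending after an ambiguous failure can deliver a message twice, so consumers in such a design need to tolerate duplicates.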
5) Blockage Transfer
When a node exceeds memory or disk thresholds, RabbitMQ blocks publishing. Vivo's dual‑active clusters allow automatic transfer of blocked traffic to the standby cluster via the production‑consumer reset mechanism.
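In the RabbitMQ Java client, blocked notifications arrive through a `BlockedListener` registered on the connection, whose callbacks are `handleBlocked(String reason)` and `handleUnblocked()`. The sketch below shows one way such a callback could drive a switch to the peer cluster. The class `BlockageFailover` and the `resetAction` hook are assumptions; the hook stands in for the production‑consumer reset described earlier.

```java
import java.util.function.Consumer;

/** Switches between two clusters when the active one reports a publish block (illustrative). */
class BlockageFailover {
    private final String primaryAddresses;
    private final String standbyAddresses;
    private final Consumer<String> resetAction;
    private String activeAddresses;

    BlockageFailover(String primaryAddresses, String standbyAddresses,
                     Consumer<String> resetAction) {
        this.primaryAddresses = primaryAddresses;
        this.standbyAddresses = standbyAddresses;
        this.resetAction = resetAction;
        this.activeAddresses = primaryAddresses;
    }

    /** Intended to be called from a blocked-connection callback for the given cluster. */
    synchronized void onBlocked(String blockedAddresses, String reason) {
        if (!blockedAddresses.equals(activeAddresses)) {
            return; // stale notification from a cluster we already left
        }
        activeAddresses = activeAddresses.equals(primaryAddresses)
                ? standbyAddresses
                : primaryAddresses;
        resetAction.accept(activeAddresses); // reconnect producers/consumers to the peer
    }

    synchronized String activeAddresses() {
        return activeAddresses;
    }
}
```

Ignoring notifications for a cluster that is no longer active keeps a late callback from bouncing traffic back to the blocked side.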
6) Multi‑Cluster Scheduling
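As traffic grows, a single cluster cannot simply be scaled out by adding nodes, because its queues are mirrored and every mirror carries a copy of the queue's messages. The SDK can therefore spread an application's traffic across multiple clusters, sustaining high throughput without overloading any single cluster.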
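The scheduling policy itself is not described in the article. As one simple possibility, the sketch below rotates sends across clusters in round‑robin order; `RoundRobinClusters` is a hypothetical name, and a real scheduler might weight clusters by load instead.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Round-robin choice among clusters; an assumed policy, shown only for illustration. */
class RoundRobinClusters {
    private final List<String> clusterAddresses;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinClusters(List<String> clusterAddresses) {
        if (clusterAddresses.isEmpty()) {
            throw new IllegalArgumentException("at least one cluster is required");
        }
        this.clusterAddresses = List.copyOf(clusterAddresses);
    }

    /** Returns the cluster to use for the next send. */
    String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), clusterAddresses.size());
        return clusterAddresses.get(index);
    }
}
```

Spreading one queue's traffic over several clusters gives up cross‑cluster message ordering, so this kind of policy suits workloads that do not depend on strict order.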
MQ‑NameServer – Fast Failover Support
MQ‑NameServer is a stateless, highly available service that:
Authenticates SDK startups and resolves the target cluster.
Collects and reports SDK metrics (sent/consumed message counts) and returns the current healthy cluster address.
Triggers production‑consumer reset commands when needed.
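The heartbeat exchange implied by these responsibilities can be sketched as follows. Everything here is an assumption for illustration: the class `NameServerSketch`, the method names, and the in‑memory maps. A stateless service would keep this data in a shared store rather than in process memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative heartbeat handler: accumulates SDK metrics and returns the current cluster. */
class NameServerSketch {
    private final Map<String, String> addressesByTopicKey = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> sentByTopicKey = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> consumedByTopicKey = new ConcurrentHashMap<>();

    /** Points a topic key at a cluster; calling it again models a failover decision. */
    void assign(String topicKey, String addresses) {
        addressesByTopicKey.put(topicKey, addresses);
    }

    /**
     * Records the message counts since the previous heartbeat and returns the addresses
     * the SDK should currently be using, or null if the topic key is unknown.
     */
    String heartbeat(String topicKey, long sentDelta, long consumedDelta) {
        sentByTopicKey.computeIfAbsent(topicKey, k -> new LongAdder()).add(sentDelta);
        consumedByTopicKey.computeIfAbsent(topicKey, k -> new LongAdder()).add(consumedDelta);
        return addressesByTopicKey.get(topicKey);
    }

    long totalSent(String topicKey) {
        LongAdder adder = sentByTopicKey.get(topicKey);
        return adder == null ? 0L : adder.sum();
    }

    long totalConsumed(String topicKey) {
        LongAdder adder = consumedByTopicKey.get(topicKey);
        return adder == null ? 0L : adder.sum();
    }
}
```

With this shape, an SDK compares the returned addresses with the ones it is connected to and runs its production‑consumer reset when they differ, which is how a reassignment on the server side would reach clients within one heartbeat interval.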
MQ‑Server High‑Availability Deployment
1) Split‑Brain Handling
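RabbitMQ offers three automatic strategies for recovering from a network partition (split brain). Vivo selected pause_minority, which pauses the nodes on the minority side of a partition until they can reach the majority again. This avoids the data divergence and loss that a split brain can cause, while the majority side keeps serving traffic.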
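In rabbitmq.conf this mode is selected with the `cluster_partition_handling` setting. RabbitMQ treats a node as being in the minority when it can see half or fewer of the cluster's nodes, itself included. The arithmetic below is a small sketch of that rule (the class and method names are illustrative) and shows why odd cluster sizes are preferred.

```java
/** Majority arithmetic behind pause_minority; helper names are illustrative. */
class PartitionMath {
    /** A node pauses when it sees half or fewer of the cluster's nodes, itself included. */
    static boolean isMinority(int visibleNodes, int clusterSize) {
        return visibleNodes * 2 <= clusterSize;
    }

    /** Largest number of nodes that can be lost while the rest still form a majority. */
    static int tolerableFailures(int clusterSize) {
        return (clusterSize - 1) / 2;
    }
}
```

A three‑node cluster tolerates one lost node, five nodes tolerate two, and seven tolerate three. A four‑node cluster still tolerates only one, because an even two‑two split pauses both sides.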
2) HA Cluster Design
Each cluster runs at least three nodes; Vivo recommends five‑ or seven‑node deployments. All queues are mirrored, exchanges and messages are durable, and queues are configured as lazy to reduce memory pressure.
3) Same‑City Dual‑Active Architecture
Two data centers host identical clusters linked via the Federation plugin. Applications preferentially connect to the local data center, and the NameServer provides heartbeat‑based cluster health, allowing automatic reconnection to the standby site during failures.
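The local‑first selection with heartbeat‑based health can be sketched as below. The class `DualActiveSelector`, the timeout parameter, and the decision to stay on the local site when neither site looks healthy are all assumptions, not details from the article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Prefers the local data center and falls back to the standby when heartbeats stop. */
class DualActiveSelector {
    private final long heartbeatTimeoutMillis;
    private final Map<String, Long> lastHeartbeatMillis = new ConcurrentHashMap<>();

    DualActiveSelector(long heartbeatTimeoutMillis) {
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
    }

    /** Records that a cluster was seen alive at the given time. */
    void recordHeartbeat(String cluster, long nowMillis) {
        lastHeartbeatMillis.put(cluster, nowMillis);
    }

    /** A cluster is healthy if its last heartbeat is recent enough. */
    boolean isHealthy(String cluster, long nowMillis) {
        Long last = lastHeartbeatMillis.get(cluster);
        return last != null && nowMillis - last <= heartbeatTimeoutMillis;
    }

    /** Chooses the cluster an application should connect to. */
    String select(String localCluster, String standbyCluster, long nowMillis) {
        if (isHealthy(localCluster, nowMillis)) {
            return localCluster;
        }
        if (isHealthy(standbyCluster, nowMillis)) {
            return standbyCluster;
        }
        return localCluster; // neither looks healthy: stay local and keep retrying
    }
}
```

Timestamps are passed in rather than read from the system clock so the selection logic can be exercised deterministically.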
Future Challenges and Outlook
Current enhancements focus on the SDK and NameServer. Future work aims to introduce a middleware proxy layer that abstracts the SDK complexity and offers finer‑grained traffic management across clusters.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
