
Design and Implementation of a High‑Availability RabbitMQ Middleware Platform at vivo

vivo built a high‑availability RabbitMQ middleware platform with three parts: an MQ‑Portal for request‑driven provisioning, an SDK that adds application‑level authentication, automatic cluster discovery, rate limiting, producer/consumer reset, and blockage transfer, and a stateless MQ‑NameServer for name resolution and health‑based failover. The platform has absorbed ten‑fold traffic growth without serious incidents.

vivo Internet Technology

Background

vivo introduced RabbitMQ in 2016 and extended the open‑source product to provide a messaging middleware service for its business. From 2016 to 2018 a single cluster served all services, leading to heavy load and frequent failures. In 2019 a high‑availability overhaul was completed, adding an MQ name service and same‑city active‑active clusters. Clusters were physically partitioned by load and traffic, and traffic has since increased tenfold without serious incidents.

RabbitMQ implements the AMQP protocol and originated from financial systems.

Key Features of RabbitMQ

Message reliability through publisher confirms, clustering, persistence, mirrored queues, and consumer acknowledgments.

Multi‑language client libraries.

Various exchange types for routing messages to queues.

Comprehensive management UI and API for integration with monitoring systems.

Problems Discovered in Practice

Multiple isolated clusters lack a unified management platform.

Native clients must handle cluster addresses directly, causing confusion.

Only simple username/password authentication; no application‑level authorization, leading to exchange/queue misuse.

Missing platform to maintain relationships between producers and consumers across versions.

Unbounded client traffic can overwhelm clusters.

No built‑in retry strategy for abnormal messages.

When a cluster blocks publishing due to memory or disk alarms, traffic cannot automatically fail over to another cluster.

Mirrored queues can cause uneven node load when many queues exist.

RabbitMQ lacks automatic queue balancing.

Overall Architecture

The solution consists of three main components:

MQ‑Portal: a web portal for application‑level MQ usage requests. It records metadata such as producer/consumer applications, exchange/queue names, and traffic estimates. After approval via an internal ticket workflow, the portal assigns a specific cluster and creates the required exchange/queue bindings.

MQ‑SDK: a client SDK built on spring‑message and spring‑rabbit that adds application‑level authentication, cluster addressing, rate limiting, producer/consumer reset, and blockage transfer capabilities.

MQ‑NameServer: a stateless service deployed in a cluster to provide high‑availability name resolution, health reporting, and fast failover for the SDK.

MQ‑Portal Workflow

Users submit a request through the portal, which captures the intended exchange/queue, traffic, and application identifiers. The request is routed to an internal ticket system for approval. Once approved, the ticket callback allocates a specific cluster and creates the necessary resources.

Each exchange/queue is linked to a cluster via a unique rmq.topic.key and rmq.secret.key. These keys are delivered to the SDK during startup, enabling automatic cluster discovery.
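The article does not show how these keys are delivered; a minimal sketch of an SDK startup configuration (property names follow the article, file layout and placeholder values are assumptions) might look like:

```properties
# Hypothetical SDK startup configuration. The rmq.topic.key / rmq.secret.key
# names come from the article; the values are placeholders that MQ-Portal
# would issue after the usage request is approved.
rmq.topic.key=<assigned-topic-key>
rmq.secret.key=<assigned-secret-key>
```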

Client SDK Capabilities

2.1 Application Authentication

The SDK collaborates with MQ‑NameServer to verify that a client is authorized to use a particular exchange/queue. During startup the SDK reports its rmq.topic.key to the name server, which validates the mapping. The SDK also performs a second check before each send.

/**
  * Pre-send check that also resolves the real producer factory.
  * Multiple producer beans may exist; each message is routed to the
  * bean bound to its target exchange.
  * @param exchange the exchange to validate
  * @return the producer factory bound to the exchange
  */
public AbstractMessageProducerFactory beforeSend(String exchange) {
    if (closed || stopped) {
        // Context closed: throw to prevent further sends.
        throw new RmqRuntimeException(String.format("producer sending message to exchange %s has closed, can't send message", this.getExchange()));
    }
    if (exchange.equals(this.exchange)) {
        // This bean is already bound to the target exchange.
        return this;
    }
    if (!VIVO_RMQ_AUTH.isAuth(exchange)) {
        throw new VivoRmqUnAuthException(String.format("Topic validation failed: not authorized to send to exchange %s, send aborted", exchange));
    }
    // Retrieve the real producer bean to avoid sending errors.
    return PRODUCERS.get(exchange);
}

2.2 Cluster Addressing

Based on the rmq.topic.key, the SDK automatically resolves the appropriate cluster, abstracting multiple physical clusters from the application.

2.3 Client Rate Limiting

The native RabbitMQ client has no flow control. The SDK adds a rate‑limiting layer to protect clusters from abusive traffic spikes.
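The article does not show the SDK's limiter implementation; a minimal token‑bucket sketch in plain Java (class and method names hypothetical) illustrates the idea of bounding send throughput per client:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal token bucket: refills ratePerSecond permits each second. */
public class SendRateLimiter {
    private final long ratePerSecond;
    private final AtomicLong tokens;
    private long lastRefillNanos;

    public SendRateLimiter(long ratePerSecond) {
        this.ratePerSecond = ratePerSecond;
        this.tokens = new AtomicLong(ratePerSecond); // bucket starts full
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if the send may proceed, false if it should be rejected. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Refill tokens proportionally to elapsed time, capped at the rate.
        long refill = (now - lastRefillNanos) * ratePerSecond / 1_000_000_000L;
        if (refill > 0) {
            tokens.set(Math.min(ratePerSecond, tokens.get() + refill));
            lastRefillNanos = now;
        }
        if (tokens.get() > 0) {
            tokens.decrementAndGet();
            return true;
        }
        return false;
    }
}
```

A caller would invoke tryAcquire() before each publish and either block, delay, or drop when it returns false.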

2.4 Production‑Consumer Reset

When a cluster is split or experiences failures, the SDK can reset connections and restart producers/consumers without restarting the application.

Steps:

Reset connection factory parameters.

Reset the connection.

Establish a new connection.

Restart production and consumption.

// Rebuild the connection factory with the new cluster address, then
// recreate the admin and template on top of it (spring-rabbit classes).
CachingConnectionFactory connectionFactory = new CachingConnectionFactory();
connectionFactory.setAddresses(address);
connectionFactory.resetConnection(); // drop any cached connections and channels
rabbitAdmin = new RabbitAdmin(connectionFactory);
rabbitTemplate = new RabbitTemplate(connectionFactory);

The SDK also includes an exception‑message retry strategy to avoid loss during reset.
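The retry strategy itself is not shown in the article; a bounded‑attempt sketch in plain Java (names hypothetical) shows one way to keep a message from being lost while the connection is being reset:

```java
import java.util.function.Supplier;

/** Retries a send a bounded number of times; returns true if any attempt succeeds. */
public class RetrySender {
    public static boolean sendWithRetry(Supplier<Boolean> sendOnce,
                                        int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (sendOnce.get()) {
                    return true; // send succeeded, no further attempts
                }
            } catch (RuntimeException e) {
                // Broker unreachable mid-reset; fall through and retry.
            }
            try {
                Thread.sleep(backoffMillis * attempt); // linear backoff
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // exhausted all attempts; caller should persist or alert
    }
}
```

In practice the failed message would be parked (e.g. in memory or local storage) and replayed once the reset completes, rather than simply dropped.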

2.5 Blockage Transfer

If a node exceeds memory or disk thresholds, RabbitMQ blocks publishing. With same‑city active‑active clusters, the SDK can shift traffic to the healthy cluster after a reset.

2.6 Multi‑Cluster Scheduling

When a single cluster cannot handle traffic, the SDK can distribute load across multiple clusters, leveraging mirrored queues and the name server for routing.
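The routing logic behind blockage transfer and multi‑cluster scheduling is not detailed in the article; a minimal healthy‑cluster selector in plain Java (class names hypothetical) sketches the failover decision:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Picks the first healthy cluster; falls back to a same-city peer when one is blocked. */
public class ClusterSelector {
    // Insertion order gives a stable preference: primary first, peer second.
    private final Map<String, Boolean> healthByAddress = new LinkedHashMap<>();

    /** Records a health report for a cluster (e.g. from the name server). */
    public void report(String address, boolean healthy) {
        healthByAddress.put(address, healthy);
    }

    /** Returns the address of the first healthy cluster, or null if none are healthy. */
    public String select() {
        for (Map.Entry<String, Boolean> e : healthByAddress.entrySet()) {
            if (e.getValue()) {
                return e.getKey();
            }
        }
        return null;
    }
}
```

When the primary cluster raises a memory or disk alarm, its health flips to false, select() returns the peer, and the SDK performs the reset sequence above against the new address.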

MQ‑NameServer Functions

Authenticate SDK startup and locate the appropriate cluster.

Collect periodic metrics (send/consume counts) and return the current healthy cluster address.

Trigger production‑consumer reset commands.
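The name server's internals are not shown; a minimal lookup sketch in plain Java (class names hypothetical) illustrates how an rmq.topic.key could both authenticate a client and resolve its cluster in one step:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal name-server registry: maps an rmq.topic.key to its assigned cluster. */
public class NameServerRegistry {
    private final Map<String, String> clusterByTopicKey = new HashMap<>();

    /** Called when MQ-Portal approves a request and assigns a cluster. */
    public void register(String topicKey, String clusterAddress) {
        clusterByTopicKey.put(topicKey, clusterAddress);
    }

    /** Resolves the cluster for a key; unknown keys fail authentication. */
    public String resolve(String topicKey) {
        String address = clusterByTopicKey.get(topicKey);
        if (address == null) {
            throw new IllegalStateException("unauthorized topic key: " + topicKey);
        }
        return address;
    }
}
```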

High‑Availability Deployment Practices

RabbitMQ clusters are deployed in same‑city active‑active mode with at least three nodes per cluster (recommended 5‑7 nodes). Strategies include:

Using the pause_minority split‑brain recovery policy to automatically pause minority partitions.

Deploying mirrored queues with lazy mode, persistence, and durable exchanges to ensure data safety.
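A mirrored‑queue policy combining these settings can be applied with rabbitmqctl; the policy name and pattern below are illustrative, and classic queue mirroring applies to RabbitMQ versions that still support it (removed in 4.0):

```
rabbitmqctl set_policy ha-lazy "^" \
  '{"ha-mode":"all","ha-sync-mode":"automatic","queue-mode":"lazy"}' \
  --apply-to queues
```

Durability and persistence are set separately when declaring exchanges/queues (durable=true) and publishing (persistent delivery mode).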

Future Challenges

The current solution focuses on SDK and NameServer enhancements. Future work aims to build a middleware proxy layer to simplify SDK usage and provide finer‑grained traffic management.
