Why We Chose RocketMQ for Our Online Messaging System and How We Built It

This article explains why the team built a dedicated online messaging system, where Kafka fell short, and why RocketMQ was chosen. It then covers the deployment strategy, SDK design, load balancing and disaster recovery, distributed reconciliation monitoring, and broker performance tuning.

21CTO

Why Build an Online Messaging System

Before introducing RocketMQ, Kuaishou already used Kafka heavily, but Kafka does not fit every scenario. Online services often need to retry a single failed message without blocking the messages behind it, deliver messages after a delay, keep database operations and message sending transactionally consistent, and query individual messages when troubleshooting.

Choosing RocketMQ

To address these scenarios we needed a messaging system focused on online services as a complement to Kafka. Among the middleware we evaluated, RocketMQ matched our requirements best, has a simple deployment architecture, and is widely adopted, so we chose it.

Deployment Modes and Strategy

There are two ways to introduce an open‑source component into an existing ecosystem:

Deeply modify the open‑source code to add custom features, which makes future upgrades difficult.

Keep the community version unchanged (or with minimal, backward-compatible changes) and wrap it externally to provide the required custom functionality.

We chose the second approach. Initially we used version 4.5.2; after the community released 4.7 with significantly reduced synchronous replication latency, we upgraded smoothly to the 4.7 series.

When deploying clusters we faced several choices: big vs. small clusters, replica count, sync vs. async flush, sync vs. async replication, and SSD vs. HDD. Small clusters need no cross-AZ deployment and offer better isolation and robustness, while big clusters provide more elastic performance headroom. Prioritising stability, we selected small clusters with synchronous replication, asynchronous flush, and SSD storage.
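As a rough illustration of the replication and flush choice above, a master broker's configuration file could contain the two entries below. This is a minimal fragment, assuming the standard RocketMQ 4.x property names `brokerRole` and `flushDiskType`; everything else in the file (cluster name, broker name, NameServer address, storage paths) is omitted and would depend on the environment.

```properties
# Fragment of a master broker's config file (other settings omitted).
# Synchronous replication: a send is acknowledged only after the slave has the message.
brokerRole=SYNC_MASTER
# Asynchronous flush: the broker does not wait for the disk flush before acknowledging.
flushDiskType=ASYNC_FLUSH
```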

Client Encapsulation Strategy

Because we do not deeply modify RocketMQ, we provide an SDK that offers the essential API for internal services. The SDK exposes only Topic (globally unique) and Group, abstracting away environment details and NameServer addresses. It resolves the appropriate cluster based on the Topic, enabling seamless environment isolation.

The architecture consists of three layers: a generic upper layer, a middle layer that handles routing, and a lower layer that interacts with the specific MQ implementation, allowing the client to switch to another middleware without code changes.
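The layering can be sketched roughly as follows. This is a minimal illustration, not the actual SDK: the names `MqDriver`, `InMemoryDriver`, `TopicRouter`, and `MessageClient` are hypothetical, and the lower layer is an in-memory stand-in rather than a real RocketMQ client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Lower layer: one implementation per concrete MQ (hypothetical interface). */
interface MqDriver {
    String send(String topic, String body);
}

/** Stand-in driver that records messages in memory instead of calling a real MQ client. */
final class InMemoryDriver implements MqDriver {
    private final String clusterName;
    final List<String> sent = new ArrayList<>();

    InMemoryDriver(String clusterName) {
        this.clusterName = clusterName;
    }

    @Override
    public String send(String topic, String body) {
        sent.add(topic + ":" + body);
        return clusterName + "#" + sent.size(); // fake message id
    }
}

/** Middle layer: resolves the driver (and therefore the cluster) from the Topic name. */
final class TopicRouter {
    private final Map<String, MqDriver> routes = new ConcurrentHashMap<>();

    void register(String topic, MqDriver driver) {
        routes.put(topic, driver);
    }

    MqDriver resolve(String topic) {
        MqDriver driver = routes.get(topic);
        if (driver == null) {
            throw new IllegalArgumentException("No route for topic: " + topic);
        }
        return driver;
    }
}

/** Upper layer: the only API business code sees; no cluster or NameServer details. */
final class MessageClient {
    private final TopicRouter router;

    MessageClient(TopicRouter router) {
        this.router = router;
    }

    String send(String topic, String body) {
        return router.resolve(topic).send(topic, body);
    }
}
```

Because business code only calls `MessageClient.send(topic, body)`, re-registering a Topic against a different driver moves its traffic without touching the caller.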

The SDK includes a hot‑change mechanism that can update routing, thread counts, timeouts, etc., without restarting the client, and uses Maven’s forced‑update to keep the SDK up‑to‑date.
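One common way to apply settings without a restart is to keep them in an immutable snapshot that is swapped atomically, with listeners that react to the change. The sketch below assumes that approach; `ClientSettings` and `HotConfig` are hypothetical names, and how updates reach the client (for example from a configuration centre) is not described in the article and is left out.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Immutable snapshot of tunable client settings (hypothetical fields). */
final class ClientSettings {
    final int consumeThreads;
    final long sendTimeoutMillis;

    ClientSettings(int consumeThreads, long sendTimeoutMillis) {
        if (consumeThreads <= 0 || sendTimeoutMillis <= 0) {
            throw new IllegalArgumentException("settings must be positive");
        }
        this.consumeThreads = consumeThreads;
        this.sendTimeoutMillis = sendTimeoutMillis;
    }
}

/** Holds the current settings and notifies listeners when they are replaced. */
final class HotConfig {
    private final AtomicReference<ClientSettings> current;
    private final List<Consumer<ClientSettings>> listeners = new CopyOnWriteArrayList<>();

    HotConfig(ClientSettings initial) {
        this.current = new AtomicReference<>(initial);
    }

    /** Readers always see a complete snapshot, never a half-applied update. */
    ClientSettings get() {
        return current.get();
    }

    void onChange(Consumer<ClientSettings> listener) {
        listeners.add(listener);
    }

    /** Swaps in the new snapshot, then lets listeners resize pools, adjust timeouts, etc. */
    void update(ClientSettings next) {
        current.set(next);
        for (Consumer<ClientSettings> listener : listeners) {
            listener.accept(next);
        }
    }
}
```

A listener would typically resize a thread pool or rebuild a routing table; code that only needs the latest value can call `get()` on each use.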

Cluster Load Balancing & Disaster Recovery

Each Topic is replicated across two availability zones, and producers connect to at least two independent clusters. If one zone fails, traffic automatically switches to the other cluster. A lightweight failover library (simple-failover-java) handles this at million-OPS scale and provides flexible weight adjustment, health checks, concurrency control, resource prioritisation, automatic priority management, and incremental hot changes.
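The core idea of weight-based selection with health checks can be sketched as below. This is a self-contained illustration and does not use or reproduce the simple-failover-java API; `WeightedFailover` and its methods are hypothetical, and features such as priorities, concurrency control, and gradual weight recovery are left out.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;
import java.util.Set;

/** Picks a healthy cluster at random, in proportion to its weight. */
final class WeightedFailover {
    private final Map<String, Integer> weights = new LinkedHashMap<>();
    private final Set<String> down = new HashSet<>();
    private final Random random;

    WeightedFailover(Random random) {
        this.random = random;
    }

    synchronized void add(String cluster, int weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        weights.put(cluster, weight);
    }

    /** Called by a health check (or after a failed send) to take a cluster out of rotation. */
    synchronized void markDown(String cluster) {
        down.add(cluster);
    }

    synchronized void markUp(String cluster) {
        down.remove(cluster);
    }

    synchronized String pick() {
        int total = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            if (!down.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        if (total == 0) {
            throw new IllegalStateException("No healthy cluster available");
        }
        int r = random.nextInt(total); // uniform in [0, total)
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            if (down.contains(e.getKey())) {
                continue;
            }
            r -= e.getValue();
            if (r < 0) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable: weights changed during pick");
    }
}
```

With two clusters of equal weight, traffic splits roughly evenly; once one is marked down, every pick returns the other until it is marked up again.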

Various Message Features

Delayed Messages

RocketMQ’s built‑in delayed messages support only a few fixed levels, so we built a separate Delay Server to schedule delayed messages. By switching the Topic, delayed messages are stored in RocketMQ, reusing the existing send interface while adding a delay field.
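The article does not describe how the Delay Server is implemented, so the following is only a sketch of the scheduling idea under simple assumptions: messages are kept in an in-memory priority queue ordered by due time, and a caller polls for the ones that are due and forwards them to their target Topic. `DelayedMessage` and `DelayScheduler` are hypothetical names; persistence, recovery after restart, and the actual forwarding are omitted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** A message waiting for its delivery time (hypothetical structure). */
final class DelayedMessage {
    final String targetTopic;
    final String body;
    final long deliverAtMillis;

    DelayedMessage(String targetTopic, String body, long deliverAtMillis) {
        this.targetTopic = targetTopic;
        this.body = body;
        this.deliverAtMillis = deliverAtMillis;
    }
}

/** Orders messages by due time and hands back those whose time has come. */
final class DelayScheduler {
    private final PriorityQueue<DelayedMessage> queue =
            new PriorityQueue<>(Comparator.comparingLong((DelayedMessage m) -> m.deliverAtMillis));

    /** Accepts an arbitrary delay in milliseconds rather than a fixed set of levels. */
    synchronized void schedule(String targetTopic, String body, long nowMillis, long delayMillis) {
        if (delayMillis < 0) {
            throw new IllegalArgumentException("delay must not be negative");
        }
        queue.add(new DelayedMessage(targetTopic, body, nowMillis + delayMillis));
    }

    /** Removes and returns every message due at or before nowMillis, earliest first. */
    synchronized List<DelayedMessage> pollDue(long nowMillis) {
        List<DelayedMessage> due = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().deliverAtMillis <= nowMillis) {
            due.add(queue.poll());
        }
        return due;
    }
}
```

Passing the current time in as a parameter keeps the sketch easy to test; a real service would read a clock and run `pollDue` on a timer.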

Transactional Messages

Since RocketMQ 4.3, transactional messages ensure that local DB transactions and message sending succeed or fail together. Use version 4.6.1 or later for stability. Transactional messages generate three internal messages, so throughput is roughly one‑third of normal messages; plan capacity accordingly. Important broker parameters include transientStorePoolEnable (must stay false), thread pool sizes for the two phases, lock settings, and transaction timeout (increase from 6 s to about 60 s).
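For the two parameters with concrete values above, a broker configuration fragment might look like this. It is a sketch assuming the RocketMQ 4.x property names `transientStorePoolEnable` and `transactionTimeOut` (in milliseconds); the thread pool and lock settings are not shown because the article gives no values for them.

```properties
# Fragment of a broker config file for transactional messages (other settings omitted).
# Keep the transient store pool disabled, as required above.
transientStorePoolEnable=false
# Transaction timeout in milliseconds: raised from the 6 s default to about 60 s.
transactionTimeOut=60000
```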

Distributed Reconciliation Monitoring

We created a monitoring program that establishes a dedicated Topic on each Broker. The monitor sends a small number of messages using the same SDK as business producers, then checks send results (success, flush timeout, slave timeout, slave unavailable) and records detailed metrics. Consumers ACK via TCP, and the producer records ACK success, message loss, duplicates, latency, and ACK failures. The system also supports sampled reconciliation to reduce memory pressure during load tests.
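The producer-side bookkeeping for such reconciliation could look roughly like the sketch below. It assumes each probe message carries a unique id and that ACKs report that id back; `ReconciliationTracker` and its method names are hypothetical, and the TCP ACK channel, metric reporting, and sampling are not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Matches sent probe messages against consumer ACKs to detect loss and duplicates. */
final class ReconciliationTracker {
    private final Map<String, Long> pending = new HashMap<>(); // message id -> send time
    private final Set<String> acked = new HashSet<>();         // grows unbounded in this sketch
    private int duplicates;
    private int ackedCount;
    private long totalLatencyMillis;

    synchronized void recordSend(String messageId, long sentAtMillis) {
        pending.put(messageId, sentAtMillis);
    }

    synchronized void recordAck(String messageId, long ackAtMillis) {
        Long sentAt = pending.remove(messageId);
        if (sentAt != null) {
            acked.add(messageId);
            ackedCount++;
            totalLatencyMillis += ackAtMillis - sentAt;
        } else if (acked.contains(messageId)) {
            duplicates++; // the same message was consumed more than once
        }
        // ACKs for ids that were never sent are ignored here.
    }

    /** Messages still unacknowledged after the timeout are treated as lost. */
    synchronized List<String> lostOlderThan(long nowMillis, long timeoutMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : pending.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                lost.add(e.getKey());
            }
        }
        Collections.sort(lost);
        return lost;
    }

    synchronized int duplicateCount() {
        return duplicates;
    }

    synchronized double averageLatencyMillis() {
        return ackedCount == 0 ? 0.0 : (double) totalLatencyMillis / ackedCount;
    }
}
```

Tracking only a sample of message ids, as the sampled reconciliation mode suggests, would bound the size of the `pending` and `acked` collections under load.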

Performance Optimisation

Default broker parameters are not optimal for our SSD, synchronous-replication, async-flush scenario. Key tunable parameters include:

flushCommitLogTimed: set to true for async flush.

sendMessageThreadPoolNums: increase to 2-4.

useReentrantLockWhenPutMessage: enable when the thread pool is large.

sendThreadPoolQueueCapacity: raise to handle high TPS.

brokerFastFailureEnable: consider disabling for short client timeouts.

waitTimeMillsInSendQueue: increase from 200 ms to around 1000 ms.

osPageCacheBusyTimeOutMills: increase for large-memory machines.
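Put together, these settings might appear in a broker configuration file as shown below. The property names are the ones listed above; the values for `sendThreadPoolQueueCapacity` and `osPageCacheBusyTimeOutMills` are illustrative assumptions, since the article gives no figures for them, and all values should be validated under load testing.

```properties
# Fragment of a broker config file (other settings omitted).
flushCommitLogTimed=true
# Article suggests 2-4.
sendMessageThreadPoolNums=4
# Enabled because the send thread pool is larger than one.
useReentrantLockWhenPutMessage=true
# Illustrative value only; size it for the expected peak TPS.
sendThreadPoolQueueCapacity=100000
# Optional: consider this when client timeouts are short.
brokerFastFailureEnable=false
# Raised from the 200 ms default.
waitTimeMillsInSendQueue=1000
# Illustrative value only; increase on large-memory machines.
osPageCacheBusyTimeOutMills=2000
```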

Conclusion

Thanks to a simple, near-zero-dependency deployment model, we can run low-cost small clusters, avoid heavy modifications to the community version, and upgrade promptly. A unified SDK simplifies cluster maintenance and feature upgrades. Combining small-cluster deployment with automatic load balancing achieves multi-AZ active-active availability. RocketMQ features such as transactional messages, together with our enhanced delayed messages, meet diverse business needs, while automated distributed reconciliation verifies the correctness of every Broker and SDK instance.

Author: Huang Li, 10+ years of software development and architecture experience, passionate about code and performance optimisation, former business architect at Taobao, currently responsible for online messaging system construction at Kuaishou.
