RocketMQ’s Cloud‑Native Operator: 30% Faster Filtering and POP Consumption
This article describes how Alibaba Cloud re-architected RocketMQ around a Kubernetes-based operator, cut message-filtering CPU cost by up to 30% by indexing the MessageType field, and introduced a POP consumption model that removes rebalance delays, keeping performance stable through the 2020 Double-11 peak.
Background
RocketMQ has supported Alibaba Group’s Double‑11 shopping festivals for seven consecutive years with zero failures, handling peak transaction rates of 583,000 messages per second in 2020. However, the existing deployment relied on a custom middleware platform that required manual operational steps, made scaling painful, and lacked true cloud‑native automation.
Cloud‑Native Transformation
The team built a Kubernetes‑based operator to manage RocketMQ clusters. By defining a custom CRD that abstracts the broker model, the operator handles pod creation, configuration, scaling, migration, and metadata synchronization, removing the need for manual IaaS‑level operations. This shift also eliminated the traditional master‑slave deployment pattern, allowing all brokers to run as identical stateless pods that can self‑heal.
Performance Optimization of Message Filtering
During large‑scale promotions, the transaction message filtering logic became a major CPU cost because thousands of subscription expressions (mostly MessageType == xxx) were evaluated using Aviator scripts, which ultimately called String.compareTo(). To accelerate this, the team indexed the MessageType field:
1. Extract MessageType from each Aviator expression by hooking into the recursive‑descent parser.
2. Store the extracted expressions in a HashMap<MessageType, List<Expression>> so that a single hash lookup filters out the majority of non‑matching rules.
Two cases were handled:
If messageType == '200-trade-paid-done', the expression reduces to the remaining conditions (e.g., buyerId==123456).
If messageType != '200-trade-paid-done', the expression short‑circuits to false.
Complex logical combinations (e.g., multiple OR branches) were also supported by preserving the “not‑equal” path.
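The indexing idea above can be sketched in a few lines of Java. This is a minimal illustration, not RocketMQ’s actual filter code: the names FilterIndex and Rule are invented for the example, and Rule stands in for the residual Aviator expression left after the MessageType condition is stripped out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a MessageType index. A rule such as
// "messageType == '200-trade-paid-done' && buyerId == '123456'"
// is registered under its MessageType value, with only the
// remaining conditions kept as a Rule to evaluate.
class FilterIndex {
    // Residual conditions evaluated only after the hash key matches.
    interface Rule { boolean matches(Map<String, String> props); }

    // MessageType value -> rules pinned to that value
    private final Map<String, List<Rule>> index = new HashMap<>();

    void put(String messageType, Rule remainder) {
        index.computeIfAbsent(messageType, k -> new ArrayList<>()).add(remainder);
    }

    // One hash lookup discards every rule pinned to a different
    // MessageType; rules for other types short-circuit to false
    // without any string comparison or script evaluation.
    boolean anyMatch(Map<String, String> props) {
        List<Rule> bucket = index.get(props.get("messageType"));
        if (bucket == null) return false;
        for (Rule r : bucket) {
            if (r.matches(props)) return true;
        }
        return false;
    }
}
```

Compared with evaluating every Aviator script (each ending in String.compareTo()), the lookup visits only the bucket whose MessageType actually matches the incoming message.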
New POP Consumption Model
The traditional Pull model suffered from consumer hang‑ups: a stalled client retained its queue assignment, causing message backlog. POP consumption replaces rebalance with a request‑based approach where each client issues POP requests to all brokers, and brokers distribute messages based on an internal algorithm. If a client hangs, other clients continue to consume its pending messages.
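The client side of this model can be sketched as follows. PopClient and its Broker interface are illustrative assumptions for this article, not RocketMQ’s client API; the point is that there is no queue-ownership step, so a hung client never strands a queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a POP-style client: instead of holding queue
// assignments from a rebalance, every client asks every broker
// for messages, and the broker decides what to hand out.
class PopClient {
    interface Broker { List<String> pop(String group, int max); }

    private final List<Broker> brokers;
    PopClient(List<Broker> brokers) { this.brokers = brokers; }

    // One polling round: issue a POP request to each broker and
    // collect whatever messages the brokers chose to distribute.
    List<String> popOnce(String group, int maxPerBroker) {
        List<String> batch = new ArrayList<>();
        for (Broker b : brokers) {
            batch.addAll(b.pop(group, maxPerBroker));
        }
        return batch;
    }
}
```

Because distribution happens on the broker per request, a stalled client simply stops asking, and its share of messages flows to the clients that are still polling.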
POP workflow:
1. Broker locks the target queue and reads messages from the store.
2. Writes a CK (checkpoint) message recording the POP position.
3. Commits the offset and releases the lock.
CK messages enable retries: if a client does not acknowledge within a timeout, the broker re‑processes the CK entry and moves the message to a retry queue.
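The CK-based retry above can be condensed into a small sketch. PopBroker, Checkpoint, and the 30-second ack timeout are assumptions made for this example, not RocketMQ’s actual broker implementation; real CK entries live in the message store, not in an in-memory map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the POP checkpoint/retry flow.
class PopBroker {
    static final long INVISIBLE_MS = 30_000; // assumed ack timeout

    static class Checkpoint {
        final String msg; final long popTime;
        Checkpoint(String msg, long popTime) { this.msg = msg; this.popTime = popTime; }
    }

    final Deque<String> queue = new ArrayDeque<>();      // message store
    final Deque<String> retryQueue = new ArrayDeque<>(); // retry topic
    final Map<Long, Checkpoint> checkpoints = new HashMap<>();
    private long nextCkId = 0;

    // POP: lock the queue (synchronized here), read a message, and
    // write a CK entry recording the POP position before releasing.
    synchronized Long pop(long now) {
        String msg = queue.poll();
        if (msg == null) return null;
        long ckId = nextCkId++;
        checkpoints.put(ckId, new Checkpoint(msg, now));
        return ckId;
    }

    // An ACK inside the timeout retires the checkpoint.
    synchronized void ack(long ckId) { checkpoints.remove(ckId); }

    // Checkpoints left unacknowledged past the timeout are
    // re-processed: the message moves to the retry queue.
    synchronized void reviveExpired(long now) {
        checkpoints.entrySet().removeIf(e -> {
            if (now - e.getValue().popTime >= INVISIBLE_MS) {
                retryQueue.add(e.getValue().msg);
                return true;
            }
            return false;
        });
    }
}
```

The CK entry is what makes the model safe: the broker commits the offset immediately, yet no message is lost, because every unacked POP leaves a checkpoint that eventually drives a retry.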
Results
After deploying the operator and the POP model, the Double‑11 promotion showed stable send‑RT metrics. The MessageType indexing reduced CPU usage by up to 32% for complex subscription expressions, significantly lowering the cost of the transaction clusters. POP consumption eliminated rebalance‑induced latency and prevented message pile‑up caused by hung consumers.
Conclusion
The cloud‑native operator and POP consumption together modernized RocketMQ’s architecture, achieving zero‑failure operation, improved performance, and simplified operations on Kubernetes.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
