How RocketMQ Achieved Zero‑Failure Double‑11 with Cloud‑Native Operator and POP Consumption

This article details RocketMQ's 2020 Double‑11 transformation, covering its migration to a Kubernetes‑based operator, a MessageType‑indexed filter optimization that cut CPU usage by up to 32%, and the introduction of a POP consumption model that eliminates rebalance‑induced latency and improves reliability.

Alibaba Cloud Native

Apr 3, 2021

Background

In the 2020 Double‑11 promotion, the transaction peak reached 583,000 transactions per second (58.3 wan, where one wan is 10,000), and RocketMQ supported the entire workload with zero failures. The article outlines three major changes made during that promotion: cloud‑native migration, performance optimization, and a new consumption model.

Cloud‑Native Practice

The goal was to move RocketMQ operations onto Kubernetes to achieve automated, unattended management and reduce operational costs. Previously, a middleware deployment platform handled container creation but required custom code for each middleware, resulting in non‑cloud‑native processes. An operator was built to encapsulate the broker lifecycle (creation, scaling, migration) and replace manual steps such as traffic cut‑over and data‑backlog verification.

The operator abstracts the broker model via a custom CRD, runs on the internal Kubernetes cluster, and hides all IaaS details. It automatically generates broker names and config files, synchronizes metadata, and integrates previously manual operations such as traffic observation and message‑backlog checks.
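
The reconcile idea behind such an operator can be illustrated with a minimal Java sketch. It is not the actual operator code: the class name, the `<cluster>-broker-<index>` naming scheme, and the plan structure are all assumptions made for illustration. It only shows how a desired broker count taken from a custom resource could be compared with the brokers that currently exist to derive create and delete actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of one reconcile pass; not the real RocketMQ operator. */
class BrokerReconcileSketch {

    /** Actions the operator would still have to carry out. */
    static final class Plan {
        final List<String> toCreate = new ArrayList<>();
        final List<String> toDelete = new ArrayList<>();
    }

    /** Assumed naming scheme: brokers are named deterministically from the cluster name. */
    static String brokerName(String cluster, int index) {
        return cluster + "-broker-" + index;
    }

    /** Compare the desired replica count with the existing brokers and plan the difference. */
    static Plan reconcile(String cluster, int desiredReplicas, Set<String> existing) {
        Set<String> desired = new TreeSet<>();
        for (int i = 0; i < desiredReplicas; i++) {
            desired.add(brokerName(cluster, i));
        }
        Plan plan = new Plan();
        for (String name : desired) {
            if (!existing.contains(name)) {
                plan.toCreate.add(name);
            }
        }
        for (String name : new TreeSet<>(existing)) {
            if (!desired.contains(name)) {
                // A real operator would first cut traffic over and confirm
                // that the broker's backlog is drained before removing it.
                plan.toDelete.add(name);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Set<String> existing = new TreeSet<>();
        existing.add("trade-broker-0");
        existing.add("trade-broker-3");
        Plan plan = reconcile("trade", 3, existing);
        System.out.println("create: " + plan.toCreate);
        System.out.println("delete: " + plan.toDelete);
    }
}
```

Because the names are derived from the spec rather than chosen by hand, running the same pass twice yields the same plan, which is what makes unattended operation practical.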

After migration, the architecture no longer distinguishes master‑slave pairs; all brokers are identical, benefiting from high‑performance cloud disks and synchronous flushing, which guarantees no message loss and enables automatic self‑healing.

Performance Optimization

RocketMQ’s transaction‑message filtering relied on Aviator expressions that compared MessageType strings using String.compareTo(), causing high CPU usage as the number of MessageType values grew. The optimization introduced an index‑like approach: extract MessageType from each expression, store sub‑expressions in a HashMap keyed by MessageType, and evaluate the remaining conditions only for matching keys.

Technical steps:

Hook into Aviator’s recursive‑descent compiler to detect patterns like messageType == 'xxx', replace them with true/false, and apply short‑circuit evaluation.

Build a HashMap where the key is MessageType and the value is the remaining sub‑expression.

Example transformation:

Expression: messageType=='200-trade-paid-done' && buyerId==123456
Extracted sub‑expressions:
1) (messageType=='200-trade-paid-done'): buyerId==123456
2) (messageType!='200-trade-paid-done'): false
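
The indexing idea can be sketched in Java as follows. This is a simplified illustration under stated assumptions, not the production code: it does not hook into Aviator's compiler, the class and method names are hypothetical, and the residual condition is written by hand as a `Predicate` instead of being extracted from a compiled expression. It shows only the lookup structure: one hash lookup on MessageType selects the candidate subscriptions, and only their remaining conditions are evaluated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch of a MessageType-indexed subscription filter. */
class MessageTypeIndexSketch {

    /** A subscription after extraction: what is left once the messageType test is removed. */
    static final class Subscription {
        final String group;
        final Predicate<Map<String, Object>> residual;

        Subscription(String group, Predicate<Map<String, Object>> residual) {
            this.group = group;
            this.residual = residual;
        }
    }

    // Key: MessageType value. Value: subscriptions whose expression requires that type.
    private final Map<String, List<Subscription>> index = new HashMap<>();

    void register(String messageType, String group, Predicate<Map<String, Object>> residual) {
        index.computeIfAbsent(messageType, k -> new ArrayList<>())
             .add(new Subscription(group, residual));
    }

    /** Return the groups whose filter accepts the message described by props. */
    List<String> match(Map<String, Object> props) {
        // One hash lookup replaces a string comparison per subscription expression.
        List<Subscription> candidates =
                index.getOrDefault(props.get("messageType"), Collections.emptyList());
        List<String> matched = new ArrayList<>();
        for (Subscription s : candidates) {
            if (s.residual.test(props)) {
                matched.add(s.group);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        MessageTypeIndexSketch filter = new MessageTypeIndexSketch();
        // messageType=='200-trade-paid-done' && buyerId==123456
        filter.register("200-trade-paid-done", "groupA",
                p -> Long.valueOf(123456L).equals(p.get("buyerId")));

        Map<String, Object> message = new HashMap<>();
        message.put("messageType", "200-trade-paid-done");
        message.put("buyerId", 123456L);
        System.out.println(filter.match(message));
    }
}
```

Messages whose MessageType has no entry in the map are rejected without evaluating any expression, which corresponds to the `false` branch in the example above.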

Benchmarks showed up to 32% CPU reduction for complex subscription filters, significantly lowering the machine cost for the 2020 promotion.

New Consumption Model – POP

The traditional PULL model suffered when a client hung: queues stayed assigned to the unresponsive client, so their messages backed up. POP removes the rebalance step; each client requests messages directly from all brokers, and the broker decides which messages to hand out using an internal algorithm. If a client hangs, other clients continue consuming the same queues, preventing accumulation.

POP workflow:

Lock the target queue and fetch messages from the store.

Write a CK (checkpoint) message indicating the POP request.

Commit the offset and release the lock.

CK messages act as timers; if the client does not ACK within the timeout, the broker re‑queues the message for retry. This design eliminates rebalance‑induced latency and improves stability.
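
The workflow above can be sketched as a small in-memory Java model. It is a toy under stated assumptions, not the broker implementation: the checkpoint is kept in a map instead of being written as a CK message, timeouts are driven by an explicit `nowMillis` argument instead of a timer, expired messages are appended back to the same queue, and every class and method name is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical in-memory model of POP on a single queue. */
class PopQueueSketch {

    static final class PopResult {
        final int handle;              // identifies the checkpoint to ACK
        final List<String> messages;
        PopResult(int handle, List<String> messages) {
            this.handle = handle;
            this.messages = messages;
        }
    }

    static final class Checkpoint {
        final List<String> messages;
        final long deadlineMillis;
        Checkpoint(List<String> messages, long deadlineMillis) {
            this.messages = messages;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final List<String> store = new ArrayList<>();
    private final Map<Integer, Checkpoint> checkpoints = new HashMap<>();
    private final ReentrantLock queueLock = new ReentrantLock();
    private int consumeOffset = 0;

    void append(String message) {
        queueLock.lock();
        try { store.add(message); } finally { queueLock.unlock(); }
    }

    PopResult pop(int maxMessages, long nowMillis, long invisibleMillis) {
        queueLock.lock();                                    // lock the queue
        try {
            int start = consumeOffset;
            int end = Math.min(store.size(), start + maxMessages);
            List<String> batch = new ArrayList<>(store.subList(start, end)); // fetch
            if (!batch.isEmpty()) {
                // record the checkpoint (stands in for the CK message)
                checkpoints.put(start, new Checkpoint(batch, nowMillis + invisibleMillis));
                consumeOffset = end;                         // commit the offset
            }
            return new PopResult(start, batch);
        } finally {
            queueLock.unlock();                              // release the lock
        }
    }

    void ack(int handle) {
        queueLock.lock();
        try { checkpoints.remove(handle); } finally { queueLock.unlock(); }
    }

    /** Re-queue messages whose checkpoint timed out without an ACK; returns how many. */
    int reviveExpired(long nowMillis) {
        queueLock.lock();
        try {
            int revived = 0;
            Iterator<Checkpoint> it = checkpoints.values().iterator();
            while (it.hasNext()) {
                Checkpoint ck = it.next();
                if (ck.deadlineMillis <= nowMillis) {
                    store.addAll(ck.messages);
                    revived += ck.messages.size();
                    it.remove();
                }
            }
            return revived;
        } finally {
            queueLock.unlock();
        }
    }

    public static void main(String[] args) {
        PopQueueSketch queue = new PopQueueSketch();
        queue.append("m1"); queue.append("m2"); queue.append("m3");
        PopResult hung = queue.pop(2, 0L, 1000L);      // client A pops, then hangs
        PopResult healthy = queue.pop(2, 10L, 1000L);  // client B keeps consuming
        queue.ack(healthy.handle);
        queue.reviveExpired(2000L);                    // A's checkpoint has expired
        System.out.println(queue.pop(2, 2001L, 1000L).messages);
    }
}
```

In this model the hung client never blocks the queue: client B pops the next message immediately, and the messages held by client A become available again once their checkpoint expires.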

Promotion Verification

During Double‑11, the RT (response time) of message sending remained stable, confirming that the cloud‑native migration, filter optimization, and POP model met the expected performance and reliability goals.
