Why RocketMQ Throws TIMEOUT_CLEAN_QUEUE: Deep Dive into Broker Fast-Failure Bug
This article examines a recurring RocketMQ error, TIMEOUT_CLEAN_QUEUE, which is raised by the broker's fast-failure mechanism. It explains how the broker's send queue and thread pool operate, shows that the client's retry logic omits the SYSTEM_BUSY response code (a bug), and proposes both a temporary configuration tweak and a permanent code fix via a PR.
Problem Phenomenon
A project team reported the following RocketMQ error: MQBrokerException: CODE:2 DESC:[TIMEOUT_CLEAN_QUEUE] broker busy, start flow control for a while, period in queue: 205ms, size of queue: 880. The project had no compensation logic for failed sends, so the affected messages were lost.
Problem Analysis
Searching the keyword TIMEOUT_CLEAN_QUEUE in RocketMQ source points to the BrokerFastFailure class, which implements a fast‑failure mechanism on the broker side.
The broker receives message write requests and places them into SendThreadPoolQueue (default capacity 10,000). A dedicated thread pool SendMessageExecutor (default single thread) processes these tasks to preserve order.
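The queue and thread pool arrangement described above can be approximated with the JDK's own executor classes. This is a minimal sketch, not RocketMQ's code: the class name SendPathSketch and its methods are hypothetical, and only the capacity (10,000) and thread count (1) come from the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Simplified stand-in for the broker's send path; hypothetical, not RocketMQ code. */
final class SendPathSketch {
    static final int QUEUE_CAPACITY = 10_000; // default capacity cited in the text
    static final int WORKER_THREADS = 1;      // default single thread cited in the text

    /** Builds a pool whose tasks wait in a bounded FIFO queue until the lone worker is free. */
    static ThreadPoolExecutor newSendExecutor() {
        BlockingQueue<Runnable> sendThreadPoolQueue = new LinkedBlockingQueue<>(QUEUE_CAPACITY);
        return new ThreadPoolExecutor(
                WORKER_THREADS, WORKER_THREADS,
                60, TimeUnit.SECONDS,
                sendThreadPoolQueue);
    }

    /** Submits `count` numbered tasks and returns the order in which they ran. */
    static List<Integer> processInOrder(int count) {
        ThreadPoolExecutor executor = newSendExecutor();
        List<Integer> processed = Collections.synchronizedList(new ArrayList<>());
        for (int i = 0; i < count; i++) {
            final int id = i;
            executor.execute(() -> processed.add(id));
        }
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining the queue", e);
        }
        return new ArrayList<>(processed);
    }

    public static void main(String[] args) {
        System.out.println(processInOrder(5)); // prints [0, 1, 2, 3, 4]
    }
}
```

With a single worker, tasks leave the queue in the order they arrived, which is the ordering property the text refers to. The cost is that one slow write delays every request behind it.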
When GC or other factors cause write latency spikes, the queue can back up, extending client send times.
If a broker takes 500 ms to 1 s to write each message and 5,000 requests are already queued, most of those requests will wait far longer than the client's default 3 s send timeout, so many of them time out.
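The arithmetic behind this scenario is worth making explicit. The sketch below assumes strictly serial processing (one worker thread) and that all requests were enqueued at roughly the same moment; the class and method names are hypothetical.

```java
/** Back-of-the-envelope estimate of queue wait under serial processing; hypothetical helper. */
final class QueueWaitEstimate {

    /** Time until the request at the given 1-based queue position has been processed. */
    static long completionMillis(int position, long perMessageMillis) {
        return position * perMessageMillis;
    }

    /** How many queued requests finish before the client's timeout expires. */
    static long requestsServedBeforeTimeout(long timeoutMillis, long perMessageMillis) {
        return timeoutMillis / perMessageMillis;
    }

    public static void main(String[] args) {
        long timeoutMillis = 3_000; // client default send timeout cited in the text

        System.out.println(requestsServedBeforeTimeout(timeoutMillis, 500));   // prints 6
        System.out.println(requestsServedBeforeTimeout(timeoutMillis, 1_000)); // prints 3
        System.out.println(completionMillis(5_000, 500));                      // prints 2500000
    }
}
```

Under these assumptions only the first 3 to 6 queued requests complete within 3 s. The remaining requests would time out on the client while still occupying the broker queue.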
To mitigate this, RocketMQ introduces a fast‑failure thread that checks the head of the queue every 10 ms; if a request has waited over 200 ms, all such requests are cancelled and a failure is returned immediately, allowing the client to retry on another broker.
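The head-of-queue scan can be sketched as follows. This is an illustration of the technique rather than the BrokerFastFailure source: QueuedRequest and cleanExpired are hypothetical names, and the current time is passed in as a parameter so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Sketch of a fast-failure scan over a FIFO request queue; hypothetical, not RocketMQ code. */
final class FastFailureSketch {
    static final long MAX_WAIT_MILLIS = 200; // wait threshold cited in the text

    /** A queued request stamped with its enqueue time (hypothetical type). */
    static final class QueuedRequest {
        final String id;
        final long enqueuedAtMillis;

        QueuedRequest(String id, long enqueuedAtMillis) {
            this.id = id;
            this.enqueuedAtMillis = enqueuedAtMillis;
        }
    }

    /**
     * Removes requests from the head of the queue while they have waited longer than
     * MAX_WAIT_MILLIS and returns their ids, so the caller can answer each with a failure.
     * The scan stops at the first request that is still within the limit.
     */
    static List<String> cleanExpired(BlockingQueue<QueuedRequest> queue, long nowMillis) {
        List<String> rejectedIds = new ArrayList<>();
        while (true) {
            QueuedRequest head = queue.peek();
            if (head == null || nowMillis - head.enqueuedAtMillis <= MAX_WAIT_MILLIS) {
                break;
            }
            if (queue.remove(head)) {
                rejectedIds.add(head.id);
            }
        }
        return rejectedIds;
    }

    public static void main(String[] args) {
        BlockingQueue<QueuedRequest> queue = new LinkedBlockingQueue<>();
        queue.add(new QueuedRequest("A", 0));
        queue.add(new QueuedRequest("B", 100));
        queue.add(new QueuedRequest("C", 240));

        System.out.println(cleanExpired(queue, 250)); // prints [A]
        System.out.println(cleanExpired(queue, 350)); // prints [B]
    }
}
```

In a running broker, a scan like this would be driven by a scheduled task every 10 ms using the system clock. Because the queue is FIFO, checking only the head is enough: if the oldest request is still within the limit, every request behind it is too.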
Despite this design, users still see the TIMEOUT_CLEAN_QUEUE error surface in their applications, which suggests that the client is not retrying on another broker as intended.
Source Code Investigation
When the client's processSendResponse method receives the response code SYSTEM_BUSY (the CODE:2 in the error above), it throws an MQBrokerException that carries the broker's description, including the TIMEOUT_CLEAN_QUEUE text.
Tracing the call chain leads to DefaultMQProducerImpl.sendKernelImpl, which catches exceptions, executes any registered hook, and re‑throws the exception.
The higher‑level sendDefaultImpl wraps sendKernelImpl in a for loop with try‑catch to implement retries. However, the retry logic only triggers for a specific set of error codes; SYSTEM_BUSY is omitted.
This omission means that when the broker returns SYSTEM_BUSY, the client does not retry, contradicting the fast‑failure design. The author identifies this as a bug in RocketMQ.
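The effect of the omission can be reproduced with a small model of the retry loop. Everything here is a simplified stand-in: the Code enum lists only an illustrative subset of response codes, and Broker, BrokerException and sendWithRetry are hypothetical names, not the DefaultMQProducerImpl source.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Simplified model of a retry loop driven by a set of retryable codes; not RocketMQ code. */
final class RetrySketch {
    /** Illustrative subset of response codes; the real list lives in the RocketMQ source. */
    enum Code { TOPIC_NOT_EXIST, SERVICE_NOT_AVAILABLE, SYSTEM_ERROR, SYSTEM_BUSY }

    static final class BrokerException extends RuntimeException {
        final Code code;

        BrokerException(Code code) {
            super(code.name());
            this.code = code;
        }
    }

    /** Hypothetical broker handle: returns a status string or throws BrokerException. */
    interface Broker {
        String send(String message);
    }

    /** Tries each broker in turn, moving on only when the failure code is retryable. */
    static String sendWithRetry(List<Broker> brokers, String message, Set<Code> retryable) {
        BrokerException last = null;
        for (Broker broker : brokers) {
            try {
                return broker.send(message);
            } catch (BrokerException e) {
                last = e;
                if (!retryable.contains(e.code)) {
                    throw e; // not retryable: give up without trying the next broker
                }
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalStateException("no brokers configured");
    }

    /** Sends to one busy broker followed by one healthy broker and reports the outcome. */
    static String attempt(Set<Code> retryable) {
        Broker busy = m -> { throw new BrokerException(Code.SYSTEM_BUSY); };
        Broker healthy = m -> "SEND_OK";
        try {
            return sendWithRetry(Arrays.asList(busy, healthy), "hello", retryable);
        } catch (BrokerException e) {
            return "FAILED:" + e.code;
        }
    }

    public static void main(String[] args) {
        Set<Code> withoutBusy = EnumSet.of(Code.TOPIC_NOT_EXIST, Code.SERVICE_NOT_AVAILABLE, Code.SYSTEM_ERROR);
        Set<Code> withBusy = EnumSet.allOf(Code.class);

        System.out.println(attempt(withoutBusy)); // prints FAILED:SYSTEM_BUSY
        System.out.println(attempt(withBusy));    // prints SEND_OK
    }
}
```

With SYSTEM_BUSY missing from the set, the first busy broker ends the send even though a healthy broker is available. Adding it lets the loop reach the second broker, which is the behavior the fast-failure design expects.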
Solution
Many online suggestions recommend increasing waitTimeMillsInSendQueue (default 200 ms) to a larger value such as 1000 ms, which can alleviate the symptom but does not fix the root cause.
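As a sketch of that workaround, and assuming the broker is started with a properties-style configuration file (for example via the -c option), the setting might look like this. The file name is illustrative.

```properties
# broker.conf (illustrative file name)
# Maximum time a send request may wait in the queue before fast failure rejects it.
# Default is 200 ms; 1000 ms is the larger value suggested in the text.
waitTimeMillsInSendQueue=1000
```

A larger value only lets requests wait longer before being rejected, so under sustained load it trades fewer SYSTEM_BUSY responses for higher send latency.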
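The proper fix is to submit a pull request that adds SYSTEM_BUSY to the set of retryable response codes in the client's retry logic (sendDefaultImpl), so that a fast-failure rejection from one broker triggers a retry on another.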
In the meantime, developers should implement their own retry mechanism: catch send exceptions, persist failed messages to a database, and use a scheduled task to retry, ensuring higher reliability independent of RocketMQ’s built‑in retries.
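A minimal sketch of such a compensation mechanism follows. It makes several assumptions: Sender is a hypothetical wrapper around the real producer call, an in-memory queue stands in for the database table, and messages are plain strings.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of application-level send compensation; hypothetical, independent of RocketMQ. */
final class SendCompensationSketch {
    /** Hypothetical wrapper around the real producer's send call. */
    interface Sender {
        void send(String message) throws Exception;
    }

    private final Sender sender;
    // In-memory stand-in for a database table of failed messages (assumption).
    private final Queue<String> failedStore = new ConcurrentLinkedQueue<>();

    SendCompensationSketch(Sender sender) {
        this.sender = sender;
    }

    /** Tries to send; on any failure the message is stored for a later retry. */
    void sendOrPersist(String message) {
        try {
            sender.send(message);
        } catch (Exception e) {
            failedStore.add(message);
        }
    }

    /** One pass of the retry job: re-sends stored messages and keeps those that fail again. */
    int retryFailed() {
        int pendingAtStart = failedStore.size();
        int resent = 0;
        for (int i = 0; i < pendingAtStart; i++) {
            String message = failedStore.poll();
            if (message == null) {
                break;
            }
            try {
                sender.send(message);
                resent++;
            } catch (Exception e) {
                failedStore.add(message);
            }
        }
        return resent;
    }

    int pending() {
        return failedStore.size();
    }

    /** Runs retryFailed on a fixed schedule; the caller shuts the scheduler down when done. */
    ScheduledExecutorService startRetryJob(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::retryFailed, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

A production version would need a durable store, a retry limit, and consumers that tolerate duplicates, since a message may be delivered more than once when a send succeeds on the broker but the response is lost.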
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
