Root Cause Analysis and Resolution of RocketMQ Timeout Issues in a Docker-Deployed Cluster
This article details the investigation of frequent RocketMQ timeout errors in a company's procurement service, identifies misconfigured brokerIP2 and unstable master‑slave network as the root causes, and provides step‑by‑step remediation procedures to restore reliable message delivery.
Problem
RocketMQ is used as the company's message middleware across many business scenarios, but the procurement service recently experienced frequent message‑sending timeouts, causing transaction rollbacks and severe business impact.
Investigation
Business timeout exception location
The timeouts trace back to RocketMQ's default 3‑second send timeout: the messages did reach the broker, but the broker's response exceeded that window, which points to either disk‑write (flush) latency or master‑slave synchronization as the likely culprit.
MQ service GC analysis and log tracing
Broker operational logs were collected and showed many write‑timeout warnings, while the GC logs recorded no Full GC events during the affected period, ruling out GC pauses as the cause.
Check master‑slave configuration & replication
The cluster runs synchronous master‑slave replication (SYNC_MASTER) with asynchronous flush (ASYNC_FLUSH), which rules out disk‑flush latency. Log analysis then revealed timeout entries during master‑slave synchronization along with heartbeat expirations, confirming that master‑slave replication latency was causing the business timeouts.
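The log tracing above can be scripted with a small helper. The log file name (`storeerror.log` under RocketMQ's default `~/logs/rocketmqlogs/` directory) and the plain "timeout" search pattern are assumptions — adjust both to match your deployment's actual log configuration and messages:

```shell
# count_timeouts FILE — count lines mentioning "timeout" (case-insensitive)
# in a RocketMQ log file such as ~/logs/rocketmqlogs/storeerror.log.
# The file path and search pattern are illustrative assumptions.
count_timeouts() {
  [ -f "$1" ] || { echo 0; return; }
  # grep -c exits non-zero when the count is 0; mask that so callers
  # can rely on the printed count alone.
  grep -ci "timeout" "$1" || true
}
```

Running this periodically against both master and slave logs makes it easy to see whether timeout entries cluster around replication activity.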
Master‑slave network check & brokerIP2 misconfiguration
The slave was connecting to a container‑internal IP instead of the master's actual host IP, making the replication link unstable. Further inspection showed that the master's brokerIP2 had defaulted to the container's IP, so synchronization traffic was going over the wrong address.
BrokerIP2 configuration error
Because the Docker‑deployed cluster never set brokerIP2 explicitly, RocketMQ auto‑detected the container's internal IP for it; the slave therefore could not hold a stable HA connection to the master, and the resulting replication lag surfaced as send timeouts.
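In broker.conf terms, the fix is to pin both addresses to the host explicitly rather than letting the broker auto‑detect them inside the container. A minimal sketch, with placeholder IPs and broker names:

```properties
# broker.conf on the master (placeholder values)
brokerClusterName=DefaultCluster
brokerName=broker-a
brokerId=0
# brokerIP1: the address clients and the NameServer use to reach this broker
brokerIP1=xx.xx.xx.xx
# brokerIP2: the address the slave uses for HA replication; in a Docker
# deployment this must be the host IP, not the auto-detected container IP
brokerIP2=xx.xx.xx.xx
```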
Resolution
1. Set the master node to read‑only to pause new writes (in RocketMQ's permission bits, 4 = read and 2 = write, so 4 means read‑only):

   ```shell
   sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerPermission -v 4
   ```

2. Correct the master's brokerIP2 to the proper host IP:

   ```shell
   sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerIP2 -v xx.xx.xx.xx
   ```

3. Mount a new slave node, or clear the old slave's data and restart it.
4. Wait for the new slave to finish synchronizing from the master.
5. Restore the master's permissions to read‑write to resume normal operation.
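Restoring read‑write in step 5 uses the same mqadmin switch as step 1; 6 (read 4 + write 2) is RocketMQ's combined read‑write permission value. The broker address is a placeholder:

```shell
sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerPermission -v 6
```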
Alternative: Rebuild the master‑slave cluster
a. Deploy new master and slave nodes registered with the same NameServer.
b. Import existing topics and consumer groups into the new cluster and verify functionality.
c. Set the old cluster to read‑only, wait for pending consumption to finish, and prevent new writes.
d. Decommission the old MQ cluster.
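Step b is typically scripted with mqadmin. The NameServer address, cluster name, topic, group, and queue counts below are hypothetical placeholders — queue counts in particular should mirror the old cluster's settings:

```shell
# List topics known to the shared NameServer (placeholder address)
sh mqadmin topicList -n xx.xx.xx.xx:9876

# Recreate a topic on the new cluster with matching read/write queue counts
sh mqadmin updateTopic -n xx.xx.xx.xx:9876 -c NewCluster -t ORDER_TOPIC -r 8 -w 8

# Recreate a consumer group on the new cluster
sh mqadmin updateSubGroup -n xx.xx.xx.xx:9876 -c NewCluster -g ORDER_CONSUMER_GROUP
```

Verifying a few end‑to‑end sends and consumes on the new cluster before flipping the old one to read‑only keeps the cutover reversible.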
Author: Zhang Xingjun (Jian Yong) Reviewer: Wu Youqiang (Ji Dian) Editor: Zhou Xulong (Ai Di Sheng)
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
YunZhu Net Technology Team
