Root Cause Analysis and Resolution of RocketMQ Timeout Issues in a Docker-Deployed Cluster
This article details the investigation of frequent RocketMQ timeout errors in a company's procurement service, identifies misconfigured brokerIP2 and unstable master‑slave network as the root causes, and provides step‑by‑step remediation procedures to restore reliable message delivery.
Problem
RocketMQ is used as the company's message middleware across many business scenarios, but the procurement service recently experienced frequent message‑sending timeouts, causing transaction rollbacks and severe business impact.
Investigation
Business timeout exception location
The timeouts trace back to RocketMQ's default 3‑second send timeout: the messages did reach the broker, but the broker's response exceeded that window, which points to either disk‑write (flush) latency or master‑slave synchronization as the likely culprit.
MQ service GC analysis and log tracing
Broker operational logs were collected and showed many write‑timeout warnings, while the GC logs recorded no Full GC events during the affected period, ruling out GC pauses as the cause.
Check master‑slave configuration & replication
The cluster runs synchronous master‑slave replication (SYNC_MASTER) with asynchronous flush (ASYNC_FLUSH), which rules out disk‑flush latency. Log analysis then revealed timeout entries during master‑slave synchronization along with heartbeat expirations, confirming that master‑slave replication latency was causing the business timeouts.
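The log tracing above can be scripted with a small helper. The log file name (`storeerror.log` under RocketMQ's default `~/logs/rocketmqlogs/` directory) and the plain "timeout" search pattern are assumptions — adjust both to match your deployment's actual log configuration and messages:

```shell
# count_timeouts FILE — count lines mentioning "timeout" (case-insensitive)
# in a RocketMQ log file such as ~/logs/rocketmqlogs/storeerror.log.
# The file path and search pattern are illustrative assumptions.
count_timeouts() {
  [ -f "$1" ] || { echo 0; return; }
  # grep -c exits non-zero when the count is 0; mask that so callers
  # can rely on the printed count alone.
  grep -ci "timeout" "$1" || true
}
```

Running this periodically against both master and slave logs makes it easy to see whether timeout entries cluster around replication activity.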
Master‑slave network check & brokerIP2 misconfiguration
The slave was connecting to a container‑internal IP instead of the master's actual host IP, making the replication link unstable. Further inspection showed that the master's brokerIP2 had defaulted to the container's IP, so synchronization traffic was going over the wrong address.
BrokerIP2 configuration error
Because the Docker‑deployed cluster never set brokerIP2 explicitly, RocketMQ auto‑detected the container's internal IP for it; the slave therefore could not hold a stable HA connection to the master, and the resulting replication lag surfaced as send timeouts.
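In broker.conf terms, the fix is to pin both addresses to the host explicitly rather than letting the broker auto‑detect them inside the container. A minimal sketch, with placeholder IPs and broker names:

```properties
# broker.conf on the master (placeholder values)
brokerClusterName=DefaultCluster
brokerName=broker-a
brokerId=0
# brokerIP1: the address clients and the NameServer use to reach this broker
brokerIP1=xx.xx.xx.xx
# brokerIP2: the address the slave uses for HA replication; in a Docker
# deployment this must be the host IP, not the auto-detected container IP
brokerIP2=xx.xx.xx.xx
```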
Resolution
1. Set the master node to read‑only to pause new writes (in RocketMQ's permission bits, 4 = read and 2 = write, so 4 means read‑only):

   ```shell
   sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerPermission -v 4
   ```

2. Correct the master's brokerIP2 to the proper host IP:

   ```shell
   sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerIP2 -v xx.xx.xx.xx
   ```

3. Mount a new slave node, or clear the old slave's data and restart it.
4. Wait for the new slave to finish synchronizing from the master.
5. Restore the master's permissions to read‑write to resume normal operation.
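Restoring read‑write in step 5 uses the same mqadmin switch as step 1; 6 (read 4 + write 2) is RocketMQ's combined read‑write permission value. The broker address is a placeholder:

```shell
sh mqadmin updateBrokerConfig -b xx.xx.xx.xx:10911 -k brokerPermission -v 6
```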
Alternative: Rebuild the master‑slave cluster
a. Deploy new master and slave nodes registered with the same NameServer.
b. Import existing topics and consumer groups into the new cluster and verify functionality.
c. Set the old cluster to read‑only, wait for pending consumption to finish, and prevent new writes.
d. Decommission the old MQ cluster.
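Step b is typically scripted with mqadmin. The NameServer address, cluster name, topic, group, and queue counts below are hypothetical placeholders — queue counts in particular should mirror the old cluster's settings:

```shell
# List topics known to the shared NameServer (placeholder address)
sh mqadmin topicList -n xx.xx.xx.xx:9876

# Recreate a topic on the new cluster with matching read/write queue counts
sh mqadmin updateTopic -n xx.xx.xx.xx:9876 -c NewCluster -t ORDER_TOPIC -r 8 -w 8

# Recreate a consumer group on the new cluster
sh mqadmin updateSubGroup -n xx.xx.xx.xx:9876 -c NewCluster -g ORDER_CONSUMER_GROUP
```

Verifying a few end‑to‑end sends and consumes on the new cluster before flipping the old one to read‑only keeps the cutover reversible.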
Author: Zhang Xingjun (Jian Yong) Reviewer: Wu Youqiang (Ji Dian) Editor: Zhou Xulong (Ai Di Sheng)
This article has been distilled and summarized from source material and republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
YunZhu Net Technology Team
