Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage
A production RocketMQ cluster suffered a 10‑minute message‑send timeout after a broker’s memory failure caused a server reboot, and the post analyzes the routing registration, name‑server removal logic, client‑side timeout handling, root cause of name‑server “dead‑lock”, and proposes deployment and code‑level fixes to prevent recurrence.
1. Fault Description
The author’s production RocketMQ cluster (2 masters, 2 slaves) experienced a severe incident when a physical machine (192.168.3.100) suffered a memory fault, triggering a Linux reboot that lasted nearly ten minutes. During this period, all client applications reported message‑send timeouts, classifying the incident as S1.
2. Fault Analysis
RocketMQ’s routing registration and removal mechanisms work as follows:
Every 30 seconds each broker sends a heartbeat to all NameServers, registering its topic routes.
NameServers update the routing table upon receiving heartbeats and record the receipt time.
A scheduled task runs every 10 seconds; if a NameServer has not received a broker’s heartbeat for 120 seconds, it marks the broker as offline and removes it from the routing table.
If the TCP connection between a NameServer and a broker is broken, the NameServer immediately detects the broker’s offline status.
Clients maintain a single active connection to one NameServer and refresh their local routing information every 30 seconds; failed queries are ignored, and the client keeps using its last known routes.
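The registration and removal rules above can be sketched as a small liveness table. This is a simplified model, not RocketMQ's actual NameServer code: the class name, method names, and the choice to pass the current time as a parameter are all assumptions made for illustration. Only the 120-second expiry and the two removal paths (expired heartbeat, closed connection) come from the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of a NameServer's broker liveness table (hypothetical, not RocketMQ's classes). */
final class BrokerLivenessSketch {
    // From the text: a broker is removed after 120 seconds without a heartbeat.
    static final long HEARTBEAT_EXPIRY_MS = 120_000L;

    // Broker address -> timestamp (ms) of the last heartbeat received.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Called when a heartbeat arrives (every 30 seconds per broker, per the text). */
    void onHeartbeat(String brokerAddr, long nowMs) {
        lastHeartbeat.put(brokerAddr, nowMs);
    }

    /** Called when the TCP connection to a broker closes: the broker is removed right away. */
    void onChannelClosed(String brokerAddr) {
        lastHeartbeat.remove(brokerAddr);
    }

    /** The periodic scan (every 10 seconds per the text): drop brokers whose heartbeat is too old. */
    void scanInactiveBrokers(long nowMs) {
        lastHeartbeat.entrySet().removeIf(e -> nowMs - e.getValue() > HEARTBEAT_EXPIRY_MS);
    }

    /** True while the broker is still present in the routing table. */
    boolean isRoutable(String brokerAddr) {
        return lastHeartbeat.containsKey(brokerAddr);
    }
}
```

In this model a hung broker whose connection never closes stays routable until a scan runs more than 120 seconds after its last heartbeat, while a broker whose connection drops disappears at once.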
Two scenarios arise when a broker becomes unavailable:
If the TCP connection breaks, the NameServer removes the broker immediately, and the client perceives the change at its next route refresh, within roughly 30 seconds.
If the broker “hangs” without breaking the TCP connection, the NameServer needs the full 120 seconds before it removes the broker, and the client may need up to 150 seconds to notice the routing change.
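The two detection windows follow directly from the intervals listed earlier. The arithmetic below is a sketch under those stated intervals; the class and method names are hypothetical. The last method adds the NameServer's 10-second scan interval, which the 150-second figure in the text does not count, so treat that bound as an assumption rather than a number from the original analysis.

```java
/** Worst-case route staleness derived from the intervals in the text (hypothetical helper). */
final class RouteStalenessSketch {
    static final int BROKER_EXPIRY_S = 120;   // heartbeat expiry on the NameServer
    static final int NAMESERVER_SCAN_S = 10;  // interval of the NameServer's scan task
    static final int CLIENT_REFRESH_S = 30;   // interval of the client's route refresh

    /** TCP connection broke: removal is immediate, so only the client refresh delay remains. */
    static int worstCaseAfterDisconnect() {
        return CLIENT_REFRESH_S;
    }

    /** Broker hung with the connection intact: expiry plus one client refresh (the text's figure). */
    static int worstCaseAfterHang() {
        return BROKER_EXPIRY_S + CLIENT_REFRESH_S;
    }

    /** Same as above, also assuming the expiry is noticed only at the next scan. */
    static int worstCaseAfterHangWithScanDelay() {
        return BROKER_EXPIRY_S + NAMESERVER_SCAN_S + CLIENT_REFRESH_S;
    }
}
```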
In the incident, the broker’s host rebooted but the TCP connection remained alive, so the NameServer only detected the failure after 120 seconds. However, the client did not update its routing until 150 seconds later (14:53:46), because it kept trying to fetch routing info from the now‑unresponsive NameServer, receiving timeout errors.
3. Root Cause
The client’s RPC layer keeps a cached channel as long as isActive() returns true. When a non‑timeout exception occurs, the channel is closed via closeChannel(). For timeout exceptions, RocketMQ uses the clientCloseSocketIfTimeout flag, which defaults to false and cannot be changed via configuration. Consequently, the client never closed the stale TCP connection to the failed NameServer, preventing automatic NameServer failover.
Thus, the fundamental issue was the NameServer’s “dead‑lock” state combined with the client’s inability to close the timed‑out connection, causing a prolonged outage.
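The effect of the flag can be illustrated with a toy model of a client that caches one NameServer connection. Everything here is hypothetical: `FakeChannel`, `NameServerFailoverSketch`, and the round-robin selection are stand-ins, not RocketMQ's real remoting classes. The only behavior taken from the text is that a cached channel is reused while it reports active, and that a timeout closes it only when the close-on-timeout flag is true.

```java
import java.util.List;

/** Toy model of a client that caches one NameServer connection (hypothetical, not RocketMQ code). */
final class NameServerFailoverSketch {
    /** Stand-in for a network channel; a hung peer keeps it "active" but never replies. */
    static final class FakeChannel {
        final String addr;
        final boolean hung;
        private boolean open = true;

        FakeChannel(String addr, boolean hung) {
            this.addr = addr;
            this.hung = hung;
        }

        boolean isActive() { return open; }
        void close() { open = false; }
    }

    private final List<String> nameServers;
    private final List<String> hungServers;
    private final boolean closeSocketIfTimeout; // mirrors the role of clientCloseSocketIfTimeout
    private FakeChannel cached;
    private int next = 0;

    NameServerFailoverSketch(List<String> nameServers, List<String> hungServers,
                             boolean closeSocketIfTimeout) {
        this.nameServers = nameServers;
        this.hungServers = hungServers;
        this.closeSocketIfTimeout = closeSocketIfTimeout;
    }

    /** Returns the address of the NameServer that answered, or null if the request timed out. */
    String fetchRoute() {
        // Reuse the cached channel for as long as it reports itself active.
        if (cached == null || !cached.isActive()) {
            String addr = nameServers.get(next % nameServers.size());
            next++;
            cached = new FakeChannel(addr, hungServers.contains(addr));
        }
        if (cached.hung) {
            // Simulated timeout: the channel is closed only when the flag allows it.
            if (closeSocketIfTimeout) {
                cached.close(); // the next call will pick a different NameServer
            }
            return null;
        }
        return cached.addr;
    }
}
```

With the flag set to false, every call keeps hitting the hung NameServer and returns null indefinitely. With it set to true, the first timeout closes the channel and the next call reaches a healthy NameServer.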
4. Best Practices
To avoid similar incidents, the author recommends:
Deploy NameServers and Brokers on separate physical or virtual machines to isolate failures.
Modify the client code so that a timeout triggers closeChannel(), forcing the client to switch to another healthy NameServer.
With this isolated deployment, if a broker's host dies, the NameServers on other machines detect the failure within two minutes, and clients receive updated routing information promptly, preventing prolonged message‑send timeouts.
dbaplus Community
