
Why RocketMQ Took 10 Minutes to Recover: Deep Dive into a Production Outage

A production RocketMQ cluster suffered roughly ten minutes of message-send timeouts after a memory fault forced a broker's host to reboot. This post walks through route registration, the NameServer's broker-removal logic, and client-side timeout handling, identifies the root cause (a hung NameServer whose connection the client never closed), and proposes deployment and code-level fixes to prevent a recurrence.

dbaplus Community

1. Fault Description

The author's production RocketMQ cluster (2 masters, 2 slaves) suffered a severe incident when a physical machine (192.168.3.100) developed a memory fault, triggering a Linux reboot that lasted nearly ten minutes. Throughout that period every client application reported message-send timeouts, and the incident was classified as S1.

2. Fault Analysis

RocketMQ’s routing registration and removal mechanisms work as follows:

Every 30 seconds each broker sends a heartbeat to all NameServers, registering its topic routes.

NameServers update the routing table upon receiving heartbeats and record the receipt time.

A scheduled task runs every 10 seconds; if a NameServer has not received a broker’s heartbeat for 120 seconds, it marks the broker as offline and removes it from the routing table.

If the TCP connection between a NameServer and a broker is broken, the NameServer immediately detects the broker’s offline status.

Each client keeps a single active connection to one NameServer and refreshes its local routing information every 30 seconds. If a refresh query fails, the failure is ignored and the client keeps using its cached routes until the next attempt.
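The NameServer side of these rules can be sketched as a small liveness table. This is a simplified model for illustration, not RocketMQ source code: the class name `BrokerLivenessTable` and its methods are hypothetical, and the broker port 10911 in the demo is an assumption. Only the 120-second expiry and the immediate removal on connection close come from the description above.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of a NameServer's broker liveness table (hypothetical names). */
class BrokerLivenessTable {
    // A broker is evicted once its last heartbeat is older than 120 s.
    static final long BROKER_EXPIRE_MILLIS = 120_000;

    // broker address -> timestamp (ms) of the last heartbeat received
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** Called whenever a broker heartbeat arrives (brokers send one every 30 s). */
    void onHeartbeat(String brokerAddr, long nowMillis) {
        lastHeartbeat.put(brokerAddr, nowMillis);
    }

    /** Called when the TCP connection to a broker closes: removal is immediate. */
    void onChannelClosed(String brokerAddr) {
        lastHeartbeat.remove(brokerAddr);
    }

    /** Periodic scan (every 10 s in the article): evict brokers with stale heartbeats. */
    List<String> scanNotActive(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > BROKER_EXPIRE_MILLIS) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        return removed;
    }

    boolean isRegistered(String brokerAddr) {
        return lastHeartbeat.containsKey(brokerAddr);
    }

    public static void main(String[] args) {
        BrokerLivenessTable table = new BrokerLivenessTable();
        table.onHeartbeat("192.168.3.100:10911", 0);
        System.out.println(table.scanNotActive(110_000)); // prints []
        System.out.println(table.scanNotActive(130_000)); // prints [192.168.3.100:10911]
    }
}
```

The sketch makes the asymmetry visible: a closed connection removes the broker at once, while a silent broker stays in the table until a scan finds its heartbeat expired.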

Two scenarios arise when a broker becomes unavailable:

If the TCP connection breaks, the NameServer removes the broker immediately, and the client picks up the change at its next route refresh, that is, within about 30 seconds.

If the broker "hangs" without breaking the TCP connection, the NameServer has to wait out the full 120-second heartbeat expiry before removing the broker, and the client may need up to 150 seconds (120 seconds plus one 30-second refresh interval) to notice the routing change.
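The worst-case figures for the two scenarios follow directly from the intervals listed earlier. The arithmetic is spelled out below as a small sketch; the class and method names are hypothetical, and the last method adds the 10-second scan period as an extra assumption that the article's 150-second figure does not include.

```java
/** Worst-case detection latencies implied by the timings described in the article. */
class DetectionLatency {
    static final int CLIENT_ROUTE_REFRESH_SEC = 30; // client polls its NameServer
    static final int BROKER_EXPIRE_SEC = 120;       // NameServer heartbeat expiry
    static final int NAMESERVER_SCAN_SEC = 10;      // NameServer scan period

    /** TCP connection breaks: the NameServer reacts at once, the client at its next refresh. */
    static int clientWorstCaseWhenTcpBreaks() {
        return CLIENT_ROUTE_REFRESH_SEC;
    }

    /** Broker hangs: heartbeat expiry first, then the client's next refresh. */
    static int clientWorstCaseWhenBrokerHangs() {
        return BROKER_EXPIRE_SEC + CLIENT_ROUTE_REFRESH_SEC;
    }

    /** Assumption: the expiry may be noticed up to one scan period late. */
    static int clientWorstCaseWhenBrokerHangsWithScanDelay() {
        return clientWorstCaseWhenBrokerHangs() + NAMESERVER_SCAN_SEC;
    }

    public static void main(String[] args) {
        System.out.println(clientWorstCaseWhenTcpBreaks());                // prints 30
        System.out.println(clientWorstCaseWhenBrokerHangs());              // prints 150
        System.out.println(clientWorstCaseWhenBrokerHangsWithScanDelay()); // prints 160
    }
}
```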

In this incident the broker's host rebooted while the TCP connection stayed alive, so the NameServer could only detect the failure after the 120-second heartbeat expiry. The client, however, did not update its routing until 150 seconds later (14:53:46), because it kept requesting routing information from a NameServer that had itself become unresponsive, and every request ended in a timeout error.

3. Root Cause

The client’s RPC layer keeps a cached channel as long as isActive() returns true. When a non‑timeout exception occurs, the channel is closed via closeChannel(). For timeout exceptions, RocketMQ uses the clientCloseSocketIfTimeout flag, which defaults to false and cannot be changed via configuration. Consequently, the client never closed the stale TCP connection to the failed NameServer, preventing automatic NameServer failover.
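The effect of that flag can be illustrated with a toy channel cache. Everything here is a stand-in written for this explanation: `ChannelCacheSketch`, `FakeChannel`, and `invokeAgainstHungPeer` are hypothetical names, and the NameServer port 9876 in the demo is an assumption. Only `isActive()`, `closeChannel()`, and the meaning of the timeout flag are taken from the paragraph above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a client-side channel cache (hypothetical names, not RocketMQ classes). */
class ChannelCacheSketch {
    /** Stand-in for a network channel: "active" only means the socket still looks open. */
    static class FakeChannel {
        private boolean open = true;
        boolean isActive() { return open; }
        void close() { open = false; }
    }

    private final Map<String, FakeChannel> channels = new HashMap<>();
    private final boolean closeSocketIfTimeout;
    int connectCount = 0; // how many new connections were created

    ChannelCacheSketch(boolean closeSocketIfTimeout) {
        this.closeSocketIfTimeout = closeSocketIfTimeout;
    }

    /** Reuses the cached channel for as long as isActive() returns true. */
    FakeChannel getOrCreate(String addr) {
        FakeChannel cached = channels.get(addr);
        if (cached != null && cached.isActive()) {
            return cached;
        }
        FakeChannel fresh = new FakeChannel();
        channels.put(addr, fresh);
        connectCount++;
        return fresh;
    }

    void closeChannel(String addr) {
        FakeChannel ch = channels.remove(addr);
        if (ch != null) {
            ch.close();
        }
    }

    /** Simulates a request to a hung peer: the socket is open but no reply ever arrives. */
    boolean invokeAgainstHungPeer(String addr) {
        getOrCreate(addr);
        // The request times out. Only the flag decides whether the channel is dropped.
        if (closeSocketIfTimeout) {
            closeChannel(addr);
        }
        return false; // the request never succeeds against a hung peer
    }

    public static void main(String[] args) {
        ChannelCacheSketch defaults = new ChannelCacheSketch(false);
        ChannelCacheSketch patched = new ChannelCacheSketch(true);
        for (int i = 0; i < 3; i++) {
            defaults.invokeAgainstHungPeer("192.168.3.100:9876");
            patched.invokeAgainstHungPeer("192.168.3.100:9876");
        }
        System.out.println(defaults.connectCount); // prints 1: the stale channel is reused
        System.out.println(patched.connectCount);  // prints 3: each timeout drops the channel
    }
}
```

With the flag off, the same stale channel serves every retry, which matches the behavior described in the incident: the socket looks alive, so nothing ever prompts the client to reconnect.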

Thus, the fundamental issue was a NameServer that had hung (unresponsive, yet with its TCP connection still open) combined with the client's inability to close the timed-out connection. Together they turned a single host failure into a prolonged outage.

4. Best Practices

To avoid similar incidents, the author recommends:

Deploy NameServers and Brokers on separate physical or virtual machines to isolate failures.

Modify the client code so that a timeout triggers closeChannel(), forcing the client to switch to another healthy NameServer.
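The second recommendation can be sketched as follows. This is a minimal simulation under stated assumptions, not a patch to RocketMQ: the class `NameServerFailoverSketch`, its round-robin selection, the `hung` set used to simulate an unresponsive server, and the second address 192.168.3.101 are all hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Simulated client that drops its NameServer choice on timeout (hypothetical names). */
class NameServerFailoverSketch {
    private final List<String> nameServers;
    private final Set<String> hung; // simulation only: servers that never answer
    private int nextIndex = 0;
    private String chosen; // currently selected NameServer; null means "pick one"

    NameServerFailoverSketch(List<String> nameServers, Set<String> hung) {
        this.nameServers = nameServers;
        this.hung = hung;
    }

    /** Returns routing info, or null if the request timed out. */
    String fetchRoute(String topic) {
        if (chosen == null) {
            // Round-robin over the configured NameServer list.
            chosen = nameServers.get(nextIndex % nameServers.size());
            nextIndex++;
        }
        if (hung.contains(chosen)) {
            // Proposed fix: treat a timeout like any other failure and close the
            // channel, so the next call selects a different NameServer.
            chosen = null;
            return null;
        }
        return "route(" + topic + ")@" + chosen;
    }

    public static void main(String[] args) {
        NameServerFailoverSketch client = new NameServerFailoverSketch(
                Arrays.asList("192.168.3.100:9876", "192.168.3.101:9876"),
                Collections.singleton("192.168.3.100:9876"));
        System.out.println(client.fetchRoute("TopicTest")); // prints null (timed out)
        System.out.println(client.fetchRoute("TopicTest")); // prints route(TopicTest)@192.168.3.101:9876
    }
}
```

In this model a single timed-out refresh is enough to move the client to a healthy NameServer, so the outage window shrinks to roughly one refresh interval plus the request timeout.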

With NameServers isolated from brokers, a broker host that dies no longer takes a NameServer down with it: the surviving NameServers detect the failure within two minutes, and clients receive updated routing information promptly, preventing prolonged message-send timeouts.
