Why One Proxy Node’s MQ Queue Stalled: A Deep Dive into HTTP Client Bugs and Rate Limiting

An in‑depth case study explains how a single Proxy machine’s message‑queue backlog was traced to a stuck HTTP download thread caused by an HttpClient bug, detailing the step‑by‑step investigation, root‑cause analysis, and the final fix that eliminated the issue.

dbaplus Community

Background

The service consists of three components:

Proxy – request proxy with fine‑grained flow control.

Latu SDK – downloads images for downstream AI models.

Estimation Engine – runs various machine‑learning / deep‑learning models.

Problem Description

An alert indicated a large MQ backlog at the Proxy entry. Among 500 Proxy instances, only one machine showed severe message accumulation while the others were normal.

One‑Sentence Root Cause

A consumer thread on the problematic machine was stuck during an HTTP image download.

The HTTP client version had a bug that prevented HTTPS connection timeouts from taking effect, causing the thread to block indefinitely.

Investigation Process

1. Consumption speed check

TPS and processing latency of the affected machine matched healthy nodes. A flame‑graph generated by Arthas showed narrow, evenly distributed flames, indicating no business‑logic bottleneck.

2. System metrics

CPU, memory, load average and GC activity were within normal ranges and identical to healthy machines.

3. Rate limiting

Proxy applies a RateLimiter when downstream capacity is exceeded. Monitoring showed only mild blocking and the other 499 machines were unaffected, making rate limiting an unlikely cause.

4. MQ data skew

RocketMQ uses the default selectOneMessageQueue strategy, which rotates through the queues with index % queue_size, so messages are spread evenly across them. This effectively rules out data skew toward a single queue.
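To make the index % queue_size rotation concrete, here is a minimal sketch of round-robin selection. RoundRobinSelector and its methods are hypothetical names chosen for illustration and are not RocketMQ's implementation; the sketch only shows why a rotating index yields a near-uniform spread.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of index % queue_size selection (not RocketMQ code).
class RoundRobinSelector {
    private final AtomicInteger index = new AtomicInteger();
    private final int queueSize;

    RoundRobinSelector(int queueSize) {
        if (queueSize <= 0) {
            throw new IllegalArgumentException("queueSize must be positive");
        }
        this.queueSize = queueSize;
    }

    // Returns the queue id for the next message.
    int selectQueue() {
        // floorMod keeps the result non-negative even after the counter overflows.
        return Math.floorMod(index.getAndIncrement(), queueSize);
    }

    // Counts how many of `messages` sends land on each queue.
    static int[] distribute(int queueSize, int messages) {
        RoundRobinSelector selector = new RoundRobinSelector(queueSize);
        int[] counts = new int[queueSize];
        for (int i = 0; i < messages; i++) {
            counts[selector.selectQueue()]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints [250, 250, 250, 250]
        System.out.println(Arrays.toString(distribute(4, 1000)));
    }
}
```

With this scheme the per-queue counts differ by at most one message, so a backlog on a single queue has to come from the consumer side rather than from the producer's distribution.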

5. CPU steal

CPU steal time (time the hypervisor spends running other virtual machines while this one is waiting for a CPU) was normal, so it was ruled out.
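The article does not say how steal time was measured. As one possible approach on Linux, the sketch below computes the steal share from two samples of the aggregate cpu line in /proc/stat, where steal is the eighth counter. CpuSteal and its methods are hypothetical names, and the one-second sampling interval is an arbitrary assumption.

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: steal share between two /proc/stat samples (Linux only).
class CpuSteal {
    // Parses the aggregate "cpu" line into its first eight counters:
    // user, nice, system, idle, iowait, irq, softirq, steal.
    static long[] parse(String cpuLine) {
        String[] fields = cpuLine.trim().split("\\s+");
        if (fields.length < 9 || !fields[0].equals("cpu")) {
            throw new IllegalArgumentException("not an aggregate cpu line: " + cpuLine);
        }
        long[] counters = new long[8];
        for (int i = 0; i < 8; i++) {
            counters[i] = Long.parseLong(fields[i + 1]);
        }
        return counters;
    }

    // Returns steal time as a fraction of all CPU time elapsed between the samples.
    static double stealFraction(String before, String after) {
        long[] a = parse(before);
        long[] b = parse(after);
        long total = 0;
        for (int i = 0; i < 8; i++) {
            total += b[i] - a[i];
        }
        long steal = b[7] - a[7];
        return total <= 0 ? 0.0 : (double) steal / total;
    }

    public static void main(String[] args) throws Exception {
        String before = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        Thread.sleep(1000); // assumed sampling interval
        String after = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        System.out.printf("steal: %.2f%%%n", 100 * stealFraction(before, after));
    }
}
```

The counters in /proc/stat are cumulative since boot, which is why the sketch takes the difference between two samples instead of reading a single line.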

6. Consumption offset not advancing

MQ logs for the problematic queueId showed that the consumption offset had not moved for an extended period, confirming that the queue was not being drained. MQ caches up to 1000 messages in memory; when the cache is full, no further pulls occur, matching the observed stagnant offset.
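The interaction between a stuck message, the stagnant offset, and the paused pulls can be sketched with a small model. This is a deliberate simplification and not RocketMQ's code: OffsetTracker is a hypothetical class, and it treats the "cache" as the window between the oldest unacknowledged offset and the next offset to be pulled, which is an assumption made for illustration. The limit of 1000 is taken from the figure above.

```java
import java.util.TreeSet;

// Hypothetical, simplified model of offset tracking for one queue.
class OffsetTracker {
    private final TreeSet<Long> inFlight = new TreeSet<>();
    private final long windowLimit;
    private long nextOffset = 0; // offset the next pulled message will receive

    OffsetTracker(long windowLimit) {
        this.windowLimit = windowLimit;
    }

    // Pulls one more message, or returns false if pulling is paused because
    // the oldest unacknowledged message is too far behind.
    boolean tryPull() {
        if (!inFlight.isEmpty() && nextOffset - inFlight.first() >= windowLimit) {
            return false;
        }
        inFlight.add(nextOffset++);
        return true;
    }

    // Marks a message as processed and returns the offset that can be committed.
    long ack(long offset) {
        inFlight.remove(offset);
        return committableOffset();
    }

    // The committed offset can never pass the oldest unacknowledged message.
    long committableOffset() {
        return inFlight.isEmpty() ? nextOffset : inFlight.first();
    }

    public static void main(String[] args) {
        OffsetTracker tracker = new OffsetTracker(1000);
        long stuckOffset = 5; // this message is never acknowledged
        long pulled = 0;
        for (long offset = 0; offset < 10_000; offset++) {
            if (!tracker.tryPull()) {
                break; // pulling has stopped
            }
            pulled++;
            if (offset != stuckOffset) {
                tracker.ack(offset);
            }
        }
        System.out.println("pulled=" + pulled
                + " committable=" + tracker.committableOffset());
    }
}
```

In this model every message except the stuck one is processed normally, yet the committable offset stays at the stuck message and pulling stops once the window is exhausted. Acknowledging the stuck message releases both at once.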

7. Stuck consumer thread

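Running jstack revealed that thread 251 remained in the RUNNABLE state while the other threads were waiting or processing. The JVM reports a thread blocked in native socket I/O as RUNNABLE, so this state is consistent with a hung network call. The stack trace pointed to getImageDetail, which performs an HTTP image download before invoking the deep‑learning model.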

After the thread hung, no further logs were produced, confirming an indefinite block.
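jstack inspects a JVM from outside the process. As a rough in-process analogue, the sketch below lists threads that are in a given state and have a given method on their stack, using only the JDK's Thread.getAllStackTraces(). StuckThreadFinder is a hypothetical helper; the method name getImageDetail comes from the stack trace described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: find threads in a given state with a given method on the stack.
class StuckThreadFinder {
    static List<String> findThreads(Thread.State state, String methodName) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (thread.getState() != state) {
                continue;
            }
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getMethodName().equals(methodName)) {
                    matches.add(thread.getName());
                    break; // one match per thread is enough
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // In a consumer process this would list threads currently inside getImageDetail.
        System.out.println(findThreads(Thread.State.RUNNABLE, "getImageDetail"));
    }
}
```

A single snapshot only shows where a thread is at that instant. Comparing several snapshots taken some seconds apart is what distinguishes a thread that is merely busy from one that is stuck in the same frame.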

8. Why HTTP hung?

The code set only a socket timeout (5 s) and omitted a connection timeout. Without a connection timeout, the connection phase of an HTTPS request can block indefinitely if the remote host never completes the connection.
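The article's code uses Apache HttpClient and is not shown. To illustrate setting both timeouts explicitly without extra dependencies, the sketch below swaps in the JDK's HttpURLConnection. ImageDownloader is a hypothetical class, the 3 s connect timeout is an assumed value, and the 5 s read timeout mirrors the socket timeout mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical downloader using the JDK client (not the article's Apache HttpClient code).
class ImageDownloader {
    static final int CONNECT_TIMEOUT_MS = 3_000; // assumed value, not from the article
    static final int READ_TIMEOUT_MS = 5_000;    // the 5 s socket timeout from the article

    // Builds a connection with both timeouts set. openConnection() does not
    // contact the server yet; the network is used only when the body is read.
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // limit on establishing the connection
            conn.setReadTimeout(READ_TIMEOUT_MS);       // limit on each blocking read
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Downloads the response body, failing with an exception instead of hanging
    // when either timeout expires.
    static byte[] download(String url) throws IOException {
        HttpURLConnection conn = open(url);
        try (InputStream in = conn.getInputStream()) {
            return in.readAllBytes();
        } finally {
            conn.disconnect();
        }
    }
}
```

Both setters default to 0 in the JDK, which means an infinite timeout, so leaving either one unset reintroduces the possibility of an unbounded wait.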

9. Root cause of the HTTP hang

A known bug in the HttpClient version in use (Apache issue HTTPCLIENT‑1478) applies the configured timeout only after the SSL connection attempt, so the timeout never covers the HTTPS connection phase itself. All problematic URLs were HTTPS, which matches the conditions of the bug.

Resolution

Upgrading to a newer HttpClient version removed the bug. After redeployment, the stuck thread disappeared, the consumption offset advanced normally, and the MQ backlog was cleared.

Key Technical Points

Use diagnostic tools such as jstack, Arthas, and JProfiler to locate abnormal threads.

Understand RocketMQ’s offset semantics (At‑Least‑Once): the offset advances only when the earliest un‑acknowledged message is successfully processed.

When a consumer thread blocks, the entire queue’s offset stalls, causing the cache to fill and further pulls to stop, which leads to backlog.

HTTPS connection timeout must be configured separately from socket timeout; otherwise, a connection‑phase hang can persist.

Check third‑party library issue trackers for known bugs that affect timeout handling.

Reference URLs

https://issues.apache.org/jira/browse/HTTPCLIENT-1478
https://stackoverflow.com/questions/7360520/connectiontimeout-versus-sockettimeout