How Arthas Saved a Double‑11 Sale: Debugging a Thread‑Pool Nightmare

During a Double‑11 promotion, a massive request timeout caused order success rates to plunge, but by using Arthas, jstack, and targeted code analysis, the team identified a non‑thread‑safe HashBiMap in a global cache, halted the outage, and implemented fixes to prevent future failures.

Rare Earth Juejin Tech Community
"A terrifying incident"

Event: Core user‑service cluster timed out during the Double‑11 peak.

Impact: Order success rate dropped sharply, causing tens of millions in ad‑spend loss; classified as P1.

Duration: 15 minutes to stop the bleeding, ~30 minutes to locate the root cause.

Root cause: A globally shared singleton misused the non‑thread‑safe HashBiMap, producing a cyclic linked list under high concurrency and exhausting the thread pool.

Hello, I’m Lao A. After a poll, many wanted to see how I used Arthas to locate an online issue, so I’m sharing the full story of how I quickly pinpointed the problem and avoided massive ad‑spend loss.

Act 1: The Eye of the Storm – A Silent Avalanche

On 2023‑11‑11 at 13:59, the team was monitoring core metrics during the promotion. Starting at 14:01, the order success rate fell from 99.9% to below 70% within five minutes, triggering a P1 alarm.

Detailed Timeline and Investigation

14:02 – First round: Conflicting information

System metrics: CPU, memory, network, QPS looked normal overall.

Drilling down to individual nodes revealed a few pods with 100% CPU, exhausting the Tomcat worker thread pool while the inbound QPS remained non‑zero.

Why was the thread pool full while QPS stayed non‑zero? Tomcat’s acceptor threads still handle health checks and simple requests, so QPS can appear stable even when the worker pool is saturated.

14:10 – Second round: jstack misdiagnosis

Collected three thread dumps with jstack. No obvious GC or APM anomalies; all clues pointed inside the application.

The dumps showed many threads in the RUNNABLE state, with stack traces pointing to HashBiMap.seekByKey and no sign of deadlock.

jstack limitation: it captures static snapshots, so it can detect deadlocks but not infinite loops or livelocks; these threads were busy‑spinning, not blocked.
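To see why a snapshot can mislead here, the minimal sketch below (thread names are illustrative, not from the incident) spins a thread in a tight loop; a dump taken at that moment reports it as RUNNABLE, indistinguishable from a thread doing legitimate work.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class BusySpinDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean stop = new AtomicBoolean(false);
        // Simulates a thread trapped in a tight loop, like seekByKey
        // walking a cyclic bucket chain
        Thread spinner = new Thread(() -> {
            while (!stop.get()) {
                // busy spin: no blocking, no lock, just burning CPU
            }
        }, "biz-thread-1");
        spinner.start();
        Thread.sleep(200);
        // A thread dump taken now would show this thread as RUNNABLE,
        // exactly like a thread doing real work
        System.out.println(spinner.getName() + " state=" + spinner.getState());
        stop.set(true);
        spinner.join();
    }
}
```

Running it prints `biz-thread-1 state=RUNNABLE`, which is why the jstack round produced no smoking gun.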

14:15 – Bleeding control: Business downgrade

With the deadline approaching, the team disabled the new‑user first‑order red‑packet feature, the only recently changed variable, and the order success rate gradually recovered.

Why was recovery gradual? Even after the feature was disabled, threads already stuck in infinite loops could not free themselves; requests had to time out and the affected pods be replaced via health checks before resources were reclaimed.

14:20 – Arthas appears

Suspecting infinite loops, I used Arthas to inspect the real‑time CPU usage of the suspect threads.

Act 2: The Sword of Arthas – Three Powerful Commands

After obtaining SRE approval, I attached Arthas to the problematic server.

First command: thread -n 3

Identified three RUNNABLE threads with ~100% CPU, all stuck in HashBiMap.seekByKey calls.

// Example output
"biz-thread-1" prio=5 tid=0x00007f8c9a0b8000 state=RUNNABLE cpu_usage=99.99%
    at com.google.common.collect.HashBiMap.seekByKey(HashBiMap.java:159)
    at com.google.common.collect.HashBiMap.put(HashBiMap.java:109)
    ...
"biz-thread-2" prio=5 tid=0x00007f8c9a0b9800 state=RUNNABLE cpu_usage=99.98%
    at com.google.common.collect.HashBiMap.seekByKey(HashBiMap.java:159)
    ...
"main" prio=5 tid=0x00007f8d1c009000 state=TIMED_WAITING cpu_usage=0.01%
    ...

Second command: jad & stack

Decompiled HashBiMap and inspected stack traces, confirming the issue stemmed from a non‑thread‑safe cache used in high concurrency.

The threads were RUNNABLE, not BLOCKED, which ruled out lock contention and pointed to busy‑spinning loops rather than waiting.

Third command: tt & ognl

Captured a snapshot of the xxxManager.syncUserCache method with tt -t, then used an OGNL expression to traverse the internal HashBiMap structure and reveal a cyclic reference.

# tt -t com.xxx.service.xxxManager syncUserCache -n 1
# tt -i 1001 -w '#context=@com.xxx.util.SpringContextHolder@getApplicationContext(), #xxxManager=#context.getBean("xxxManager"), #biMap=#xxxManager.xxCache, #table=#biMap.table, #entry=#table[15], {#entry.key, #entry.next.key, #entry.next.next.key, #entry.next.next.next.key}' -x 4

The OGNL output displayed a looped list, confirming the hash bucket’s chain was cyclic.
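The manual OGNL walk, following next pointers until a key repeats, is essentially cycle detection. A minimal sketch using Floyd's tortoise‑and‑hare on a simplified bucket node (Guava's real internal entry type differs; this Entry class is only for illustration) shows how a corrupted chain turns any traversal into an endless loop:

```java
public class CycleCheck {
    // Simplified stand-in for a hash-bucket node; not Guava's actual
    // internal entry type
    static final class Entry {
        final long key;
        Entry next;
        Entry(long key) { this.key = key; }
    }

    // Floyd's tortoise-and-hare: returns true if the chain loops back on itself
    static boolean hasCycle(Entry head) {
        Entry slow = head, fast = head;
        while (fast != null && fast.next != null) {
            slow = slow.next;
            fast = fast.next.next;
            if (slow == fast) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Entry a = new Entry(1), b = new Entry(2), c = new Entry(3);
        a.next = b; b.next = c;
        System.out.println(hasCycle(a));  // false: a healthy chain terminates
        c.next = b;                       // corruption: the chain loops back
        System.out.println(hasCycle(a));  // true: a linear scan would never return
    }
}
```

A seekByKey-style linear scan over the corrupted chain never hits null, which is exactly the 100%-CPU RUNNABLE state Arthas surfaced.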

Ironclad evidence: the chain of reasoning was now closed.

Act 3: Sheathing the Sword – From Fire‑fighting to Fire‑prevention

Root cause

The culprit was a static global cache using HashBiMap without synchronization, a three‑year‑old technical debt that broke under the promotion’s massive concurrent writes.

public class XxxManager {
    // Non‑thread‑safe: HashBiMap gives no guarantees under concurrent writes
    private static final BiMap<Long, UserInfo> xxCache = HashBiMap.create();

    public void syncUserCache(UserInfo newUser) {
        // forcePut mutates both the forward and inverse hash tables
        // with no synchronization
        xxCache.forcePut(newUser.getUserId(), newUser);
    }
}

Why load testing failed

Pre‑promotion tests used only existing users, never simulating the burst of new‑user cache writes that triggered the bug.

Long‑term improvements

Code fix: Replace HashBiMap with ConcurrentHashMap and refactor related logic.
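As a rough sketch of the fix (class, field, and record names here are placeholders, not the team's actual code), a ConcurrentHashMap‑backed cache survives the same concurrent write burst with its internal structure intact. If the bidirectional lookup of BiMap is genuinely needed, Guava's Maps.synchronizedBiMap is a coarser but safe stopgap.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SafeUserCache {
    // Placeholder user type; the real UserInfo is richer
    record UserInfo(long userId, String name) {}

    // ConcurrentHashMap tolerates concurrent writes without corrupting its buckets
    private final ConcurrentMap<Long, UserInfo> byId = new ConcurrentHashMap<>();

    public void put(UserInfo user) { byId.put(user.userId(), user); }
    public UserInfo get(long id)   { return byId.get(id); }
    public int size()              { return byId.size(); }

    public static void main(String[] args) throws Exception {
        SafeUserCache cache = new SafeUserCache();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Hammer the cache from 8 threads, mimicking the new-user write burst
        for (int t = 0; t < 8; t++) {
            final int base = t * 10_000;
            pool.submit(() -> {
                for (int i = 0; i < 10_000; i++) {
                    cache.put(new UserInfo(base + i, "user-" + (base + i)));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        System.out.println(cache.size()); // 80000: no lost entries, no looping buckets
    }
}
```

The same workload against an unsynchronized HashBiMap is what produced the cyclic bucket chain in production.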

Process improvement: Add thread‑safety checks to code‑review checklists and static‑analysis rules flagging unsafe collection usage in singletons.

Lesson: an expert's value lies in knowing each tool's limits: when jstack misleads, and when Arthas shines.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: debugging, Java, concurrency, incident response, thread pool, Arthas, HashBiMap
Written by

Rare Earth Juejin Tech Community

Juejin, a tech community that helps developers grow.
