Midnight TODO That Nearly Crashed the Whole Department: A JVM Performance Tuning Case Study

During a midnight promotion launch, a forgotten TODO caused thread‑pool exhaustion and frequent Full GC, bringing down an e‑commerce service; the article presents a five‑step end‑to‑end JVM tuning methodology, from data collection through root‑cause verification to the code fix, showing how to diagnose and resolve such incidents.


Incident Overview

A major e‑commerce platform prepared an "S‑level" flash‑sale event scheduled to start at midnight. Within seconds of launch, alerts flooded the monitoring channel: service availability dropped below 10%, thread‑pool active threads exceeded 95%, and CPU load surged above 8.0. The promotion service became effectively unavailable.

Five‑Step End‑to‑End JVM Tuning Methodology

The author proposes a systematic workflow (illustrated in the diagram below) that emphasizes data‑driven analysis to avoid blind guesses.

Figure: JVM tuning workflow

Step 1: Data Collection – Capture the Full Symptom

GC logs: Enable with -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps. Focus on Full GC frequency, pause time, and post‑GC memory reclamation.

Heap dump: Generate via jmap -dump:format=b,file=heap.hprof <pid> or automatically on OOM with -XX:+HeapDumpOnOutOfMemoryError (a programmatic alternative is sketched at the end of this step).

Thread dump: Capture with jstack <pid> > thread.txt, preferably every 5‑10 seconds to observe state changes.

Live runtime metrics: Use jstat -gcutil <pid> 1000 to print memory region usage and GC activity each second.
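
The heap dump described above can also be triggered from inside the application through the JVM's HotSpotDiagnosticMXBean, which is useful when shell access to the host is restricted. A minimal sketch (the helper class is illustrative, not part of the incident code):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Writes a heap dump to filePath; live = true keeps only reachable objects,
    // matching jmap -dump:live. The target file must not already exist.
    public static void dump(String filePath, boolean live) throws IOException {
        HotSpotDiagnosticMXBean diagnostic =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostic.dumpHeap(filePath, live);
    }

    public static void main(String[] args) throws IOException {
        dump("heap.hprof", true);
    }
}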

Step 2: Thread Analysis – Locate Concurrency Bottlenecks

Count thread states; pay special attention to BLOCKED and WAITING threads.

Detect deadlocks with jstack <pid> | grep -i deadlock or automated tools; a programmatic check is sketched after the snippet below.

Identify hot threads via top -H -p <pid>, convert the thread ID to hexadecimal, and locate the corresponding stack in the thread dump.

# Count thread states
$ grep "java.lang.Thread.State" thread_dump.txt | sort | uniq -c
 15 java.lang.Thread.State: BLOCKED (on object monitor)
 32 java.lang.Thread.State: RUNNABLE
  8 java.lang.Thread.State: WAITING (on object monitor)
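
When a shell is unavailable, the same state counts and the deadlock check can be obtained in‑process through the standard ThreadMXBean. A minimal sketch (class name and output format are illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        // Count threads per state, the in-process equivalent of grepping a jstack dump
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threadMXBean.dumpAllThreads(false, false)) {
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        counts.forEach((state, count) -> System.out.println(state + ": " + count));

        // Deadlock check, the equivalent of the deadlock section in a jstack dump
        long[] deadlocked = threadMXBean.findDeadlockedThreads();
        if (deadlocked != null) {
            System.out.println("Deadlocked threads: " + deadlocked.length);
        }
    }
}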

Step 3: Log Parsing – Extract Effective Clues

Analyze business logs to filter out irrelevant noise.

Examine GC logs for Full GC frequency and pause duration; check trigger reasons such as Allocation Failure or Metadata GC Threshold.

Search thread dumps for the keyword "deadlock" to quickly rule out deadlocks.

Use jmap -histo:live <pid> | head -n 20 to spot the most numerous or largest object types.

Step 4: Heap Dump Analysis – Reveal Memory Truth

When a memory leak is suspected, open the heap dump with MAT or JVisualVM.

Inspect the Dominator Tree to list objects occupying the most space and view their reference chains.

Check the Leak Suspects Report for global collections (e.g., static Map), unclosed resources, or lingering ThreadLocal entries.

Step 5: Root‑Cause Verification – Draw Conclusions

Correlate evidence from the previous steps to build a complete chain:

GC clue: Frequent Full GC with minimal reclaimed space in the old generation.

Thread clue: Thousands of threads blocked on a single cache operation.

Heap clue: A static HashMap consumes ~80% of heap; its entries lack expiration.

Conclusion: An unbounded cache without eviction caused a memory leak, leading to GC pressure and thread‑pool exhaustion.
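
The leak pattern behind that conclusion is worth spelling out. The sketch below is illustrative (class and field names are mine, not the platform's actual code): because the map hangs off a static field, every entry stays reachable across collections, so Full GC frees almost nothing and fires more and more often.

import java.util.HashMap;
import java.util.Map;

public class ActivityCacheManager {
    // GC root -> static map -> every cached entry stays strongly reachable forever
    private static final Map<Long, Object> CACHE = new HashMap<>();

    public static void put(Long activityId, Object activityData) {
        // No size bound, no expiration: entries are added but never removed,
        // so the old generation keeps growing until Full GC runs back to back
        CACHE.put(activityId, activityData);
    }
}

Bounding the cache with a maximum size and a TTL, or moving it off the heap entirely, breaks this chain; both measures appear in the resolution below.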

Three‑Stage Resolution

Stage 1 – Emergency Stop‑Bleed (1‑3 hours)

Restart the most loaded instances after isolating them from the load balancer.

Temporarily add 20 new machines to disperse traffic.

Disable non‑essential features or apply rate limiting to protect core pathways.

Stage 2 – Local Optimization (1‑3 days)

Identify the code‑level root cause and fix it.

public void updateActivityXxxCache(Long sellerId, List<XxxDO> xxxDOList) {
    try {
        if (CollectionUtils.isEmpty(xxxDOList)) {
            xxxDOList = new ArrayList<>();
        }
        // Original problematic code – serialization inside the loop
        // for (int i = 0; i < PARTITIONS; i++) {
        //     RedisCache.put(key(i), JSON.toJSONString(xxxDOList), EXPIRE);
        // }
        // Optimized version – serialize once outside the loop
        String json = JSON.toJSONString(xxxDOList);
        for (int i = 0; i < PARTITIONS; i++) {
            RedisCache.put(key(i), json, EXPIRE);
        }
    } catch (Exception e) {
        log.warn("update cache exception occur", e);
    }
}

The fix moves the heavy JSON.toJSONString call out of the loop, reducing serialization from 20 times per request to a single execution.

Adjust JVM heap size with -Xms and -Xmx set to the same value so the heap is not resized under load.

Choose an appropriate GC collector (e.g., G1GC for JDK 9+, ZGC for ultra‑low latency).

Tune young generation size via -Xmn when object allocation is frequent.

Stage 3 – Architecture Upgrade (1‑3 months)

Replace local JVM caches with a robust distributed cache such as Redis to avoid JVM memory limits.
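
As one possible shape of that change (Jedis is used purely as an example client, and the key layout, host, and TTL are illustrative), every write carries an expiration and the payload lives in Redis rather than on the JVM heap:

import redis.clients.jedis.Jedis;

public class ActivityCacheWriter {
    private static final int EXPIRE_SECONDS = 30 * 60; // every entry expires

    public void updateActivityCache(Long sellerId, String activityJson) {
        // Data is stored in Redis instead of a local map; the TTL guarantees eviction
        try (Jedis jedis = new Jedis("redis-host", 6379)) {
            jedis.setex("activity:" + sellerId, EXPIRE_SECONDS, activityJson);
        }
    }
}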

Deploy a Prometheus + Grafana stack to monitor core JVM metrics (GC count, pause time, memory usage) and set alerts.
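
One common way to expose those metrics is through Micrometer's built‑in JVM binders scraped by Prometheus; the sketch below assumes that stack, since the article does not prescribe a particular client library:

import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class JvmMetricsExporter {

    public static PrometheusMeterRegistry createRegistry() {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        new JvmGcMetrics().bindTo(registry);      // GC count and pause time
        new JvmMemoryMetrics().bindTo(registry);  // heap and non-heap usage per pool
        new JvmThreadMetrics().bindTo(registry);  // thread counts and states
        // Serve registry.scrape() on an HTTP endpoint for Prometheus to pull,
        // then alert on Full GC frequency and pause time in Grafana.
        return registry;
    }
}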

Introduce code‑block‑level APM (e.g., SkyWalking) to pinpoint slow methods like XxxxxCacheManager.update.

Incorporate regular performance stress tests and chaos‑engineered fault injection before major releases.

Enforce coding standards and code‑review checks to prevent hidden memory‑leak patterns.

Key Takeaways (Three Laws)

Never optimize without capacity assessment. An “improved” cache design became a disaster because its size and serialization cost were never evaluated.

Monitoring must reach the code‑block level. Knowing that 90% of latency resides in XxxxxCacheManager.update cuts debugging time dramatically.

Technical debt will explode when you least expect it. A single forgotten TODO turned into a CPU‑bound “meat grinder” that crippled the whole system.

By following the systematic end‑to‑end JVM tuning workflow, the team was able to identify the root cause, apply a targeted code fix, and implement longer‑term architectural safeguards, turning a near‑catastrophic outage into a learning opportunity.
