Why Your Java Service Hangs: Uncovering GC, Safepoint, and Log4j2 Bottlenecks
In a high‑concurrency Java service, intermittent timeouts were traced to long JVM safepoint pauses caused by GC, biased‑lock revocation, and Log4j2 synchronization. This investigation shows how to diagnose and resolve such stalls.
GC
In a typical high‑concurrency scenario an interface occasionally timed out; logs showed a large gap (100‑700 ms) between the HTTP client request and JSON parsing, which should take less than 1 ms.
Possible causes considered were application locks (ruled out), JVM GC causing stop‑the‑world (STW), and system overload (ruled out by low load metrics).
jstat showed infrequent full GCs and normal minor GC intervals, so GC frequency alone did not explain the gaps. However, the JVM had been started with -XX:+PrintGCApplicationStoppedTime, which logs every STW pause, not just those caused by GC.
Analysis of the GC log showed frequent, long STW pauses, sometimes occurring back‑to‑back, which could explain the timeouts.
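The scan for long pauses can be sketched as below. On JDK 8, -XX:+PrintGCApplicationStoppedTime emits lines containing "Total time for which application threads were stopped: N seconds". The class name, method name, and the 100 ms threshold are illustrative assumptions, not part of the original investigation.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StoppedTimeScan {
    // Line fragment written by -XX:+PrintGCApplicationStoppedTime (JDK 8 format).
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds");

    /** Returns every pause, in milliseconds, that is at least thresholdMs long. */
    public static List<Double> longPauses(List<String> lines, double thresholdMs) {
        List<Double> pauses = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double ms = Double.parseDouble(m.group(1)) * 1000.0;
                if (ms >= thresholdMs) {
                    pauses.add(ms);
                }
            }
        }
        return pauses;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the GC log; 100 ms is an assumed threshold.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (double ms : longPauses(lines, 100.0)) {
            System.out.printf("STW pause: %.1f ms%n", ms);
        }
    }
}
```

Because the flag covers all safepoint pauses, a pause reported here does not by itself prove that GC was the cause; it only confirms that application threads were stopped.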
Safepoint and Biased Locking
Safepoint Logs
Safepoint logs record the time spent entering and exiting STW and the steps that consume it. Enabling them with
-XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+LogVMOutput -XX:LogFile=./safepoint.log
wrote the safepoint statistics to ./safepoint.log.
The logs indicated that the STW reason was RevokeBias, i.e., revoking a biased lock.
Biased Lock
Biased locking optimizes uncontended locks by biasing them toward the first acquiring thread, which avoids expensive atomic operations on later acquisitions by that thread. The bias is revoked only when another thread contends for the lock, and revocation requires a global safepoint, which adds overhead in highly concurrent workloads.
Disabling biased locking with -XX:-UseBiasedLocking reduced pause frequency by half, but the problem persisted.
Log4j2
Root Cause Identification
By isolating components (HttpClient, Hystrix, Log4j2) and replacing third‑party responses with fixed data, the issue was reproduced only when Log4j2 was active, pointing to its internal locking.
Lock Analysis with BTrace
Three Log4j2 methods contain locks: rollover(), encodeText() (synchronized), and flush(). Using BTrace to instrument these methods showed that encodeText() incurred the longest execution time during the pause.
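The effect that per-method timing exposes can be illustrated in plain Java. The sketch below does not use BTrace or Log4j2: slowWrite is a hypothetical stand-in for a synchronized method that performs I/O while holding its monitor, and Thread.sleep simulates the slow write. It shows how callers queued behind the lock observe much longer call times than the write itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class LockedWriteTiming {
    // Hypothetical stand-in for a synchronized logging method doing I/O under the lock.
    private synchronized void slowWrite(long holdMillis) throws InterruptedException {
        Thread.sleep(holdMillis); // simulates a slow disk write while holding the monitor
    }

    /** Runs concurrent callers and returns the slowest observed call time in ms. */
    public static long slowestCallMillis(int threads, long holdMillis) {
        LockedWriteTiming target = new LockedWriteTiming();
        AtomicLong slowestNanos = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all callers at roughly the same time
                    long t0 = System.nanoTime();
                    target.slowWrite(holdMillis);
                    long elapsed = System.nanoTime() - t0; // includes time spent waiting for the lock
                    slowestNanos.accumulateAndGet(elapsed, Math::max);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return slowestNanos.get() / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("slowest call (ms): " + slowestCallMillis(4, 50));
    }
}
```

With four callers and a 50 ms hold, the last caller waits for the other three before it can write, so its measured time is several times the hold time. The same pattern would make a synchronized logging method look slow in a trace even when the time is spent waiting rather than encoding.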
JMC Investigation
Enabling JFR in Docker and analyzing events revealed a 1063 ms pause in RandomAccessFile.write(), a native call that likely contributed to the STW.
Resolution
Reduce log volume; excessive logging can trigger the pauses.
Switch to asynchronous Log4j2 logging to avoid blocking on I/O.
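The second measure works because request threads hand log events to a queue and a separate thread performs the slow write. Log4j2 provides this through its own async loggers and AsyncAppender; the sketch below is not Log4j2's implementation, only a minimal illustration of the hand-off. The class name, the drop-when-full policy, and the sink parameter are assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class AsyncLogSketch implements AutoCloseable {
    // Sentinel compared by identity to tell the writer thread to stop.
    private static final String STOP = new String("stop");

    private final BlockingQueue<String> queue;
    private final Thread writer;

    public AsyncLogSketch(int capacity, Consumer<String> slowSink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> {
            try {
                while (true) {
                    String msg = queue.take();
                    if (msg == STOP) {
                        return;
                    }
                    slowSink.accept(msg); // the slow I/O happens only on this thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the message is dropped. */
    public boolean log(String msg) {
        return queue.offer(msg);
    }

    /** Drains the messages already queued, then stops the writer thread. */
    @Override
    public void close() {
        try {
            queue.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Dropping on a full queue is one possible policy; blocking the caller instead would preserve every message but reintroduce the stall under sustained overload, which is why reducing log volume remains part of the fix.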
Summary
The investigation highlighted a systematic debugging approach: collect more cases, reproduce in a controlled environment, form hypotheses based on recent changes, use elimination to isolate variables, and finally apply a targeted fix.
