Why Is My Java Service Stalling? Uncovering GC, Safepoint, and Log4j2 Bottlenecks
In a high‑concurrency Java service, occasional request timeouts were traced to long pauses between log entries, leading to an investigation that revealed frequent JVM stop‑the‑world events caused by GC, safepoint‑related biased‑lock revocations, and Log4j2 locking issues.
A high‑concurrency Java service occasionally timed out because the gap between the log after an HTTP client call (A) and the log after JSON parsing (B) was unusually large, ranging from 100 ms to 700 ms.
GC
Three possible causes were considered: application locks, stop‑the‑world (STW) pauses triggered by JVM GC, and system overload. The first and the last were ruled out:
Application lock – excluded because JSON parsing itself is lock‑free.
JVM GC – could trigger STW.
System overload – monitoring showed low load.
jstat showed infrequent full GCs and normal minor GC intervals. -XX:+PrintGCApplicationStoppedTime was then enabled so that every STW pause, not only those caused by GC, was recorded in the GC log.
The GC log revealed frequent, long STW pauses, sometimes less than 1 ms apart, causing cumulative hangs of over 120 ms.
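To quantify pauses like these, the stopped-time lines can be extracted from the GC log and summed. The following is a minimal sketch, assuming the JDK 8 HotSpot line format "Total time for which application threads were stopped: N seconds" with a dot as the decimal separator; the class name StoppedTimeSummary and the log path argument are illustrative and not from the original investigation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: summarizes STW pauses recorded by
// -XX:+PrintGCApplicationStoppedTime (assumed JDK 8 log format).
final class StoppedTimeSummary {
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9.]+) seconds");

    /** Returns every recorded pause in milliseconds, in log order. */
    static List<Double> pausesMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)) * 1000.0);
            }
        }
        return pauses;
    }

    /** Sum of all recorded pauses in milliseconds. */
    static double totalMillis(List<String> logLines) {
        double sum = 0.0;
        for (double p : pausesMillis(logLines)) {
            sum += p;
        }
        return sum;
    }

    /** Longest single pause in milliseconds, or 0 if none was found. */
    static double maxMillis(List<String> logLines) {
        double max = 0.0;
        for (double p : pausesMillis(logLines)) {
            max = Math.max(max, p);
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the GC log (an assumption of this sketch).
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.printf("pauses=%d total=%.3f ms max=%.3f ms%n",
                pausesMillis(lines).size(), totalMillis(lines), maxMillis(lines));
    }
}
```

Running it over a window of the log around a slow request shows whether many short pauses add up to a stall of the size observed between logs A and B.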
Safepoint and Biased Lock
Safepoint Logs
An STW pause begins once all application threads have reached a safepoint; the safepoint log records why each safepoint was requested and how long it took, which helps identify the cause.
Enabling safepoint logging with
-XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+LogVMOutput -XX:LogFile=./safepoint.log
wrote the statistics for each safepoint to ./safepoint.log.
The "vmop" column of the safepoint log (the VM operation that requested the safepoint) showed RevokeBias, indicating a biased‑lock revocation.
Biased Lock
Biased locking optimizes uncontended locks by biasing them toward the first thread that acquires them. The bias is revoked only when another thread tries to take the same lock, and revocation requires a safepoint, which can be costly under high concurrency.
Disabling biased locking with -XX:-UseBiasedLocking reduced pause frequency by half, but some pauses remained.
Log4j2
Investigation
The remaining suspects (HttpClient, Hystrix, Log4j2) were tested one at a time. The issue still reproduced after replacing the third‑party responses and removing Hystrix, which pointed to Log4j2.
Using btrace to probe Log4j2 locks
Three locking points in Log4j2 were identified:
rollover() – locks during log file rotation.
encodeText() – synchronizes character‑set conversion for large logs.
flush() – synchronizes to preserve log order.
Instrumenting these methods with btrace showed that encodeText() incurred the longest execution time during load tests.
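The btrace script itself is not included in the source. As a rough in-process stand-in for the same kind of measurement, a suspect call can be wrapped in a timer that records call count, mean, and worst-case duration. This is a minimal sketch using only the JDK; SectionTimer is a hypothetical helper, not part of btrace or Log4j2.

```java
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical helper: thread-safe timing of a code section.
// The measured time includes any time spent waiting for locks
// inside the wrapped call, which is what matters for stalls.
final class SectionTimer {
    private final LongAdder callCount = new LongAdder();
    private final LongAdder totalNanos = new LongAdder();
    private final LongAccumulator maxNanos = new LongAccumulator(Long::max, 0L);

    /** Runs body, records its wall-clock duration, and returns its result. */
    <T> T time(Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            long elapsed = System.nanoTime() - start;
            callCount.increment();
            totalNanos.add(elapsed);
            maxNanos.accumulate(elapsed);
        }
    }

    long calls() {
        return callCount.sum();
    }

    long maxNanos() {
        return maxNanos.get();
    }

    double meanNanos() {
        long n = callCount.sum();
        return n == 0 ? 0.0 : (double) totalNanos.sum() / n;
    }
}
```

One timer per suspect method, compared after a load test, gives the same kind of ranking the btrace probes produced, at the cost of modifying application code.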
Using JMC analysis
environment:
- JFR=true
- JMX_PORT=port
- JMX_HOST=ip
- JMX_LOGIN=user:pwd
JMC captured a 1063 ms pause in RandomAccessFile.write(), matching the thread ID observed in the STW logs, suggesting a native I/O bottleneck, possibly Docker‑related.
Solution
Reduce log volume; excessive logging amplifies pauses.
Switch to Log4j2 asynchronous logging (accepting possible loss on buffer overflow or restart).
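The trade-off in the second item can be illustrated with a plain-JDK sketch: callers hand messages to a bounded queue and never wait on file I/O, while a background thread does the writing. This is not Log4j2's implementation (its asynchronous loggers are built on the LMAX Disruptor), and dropping on a full queue is an assumption of this sketch chosen to mirror the loss described above. DroppingAsyncLog and its methods are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical sketch of the idea behind asynchronous logging:
// the request thread only enqueues; a writer thread performs the slow I/O.
final class DroppingAsyncLog implements AutoCloseable {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();
    private final Thread writer;
    private volatile boolean running = true;

    DroppingAsyncLog(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> {
            try {
                // Keep draining after close() until the queue is empty.
                while (running || !queue.isEmpty()) {
                    String msg = queue.poll(50, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        sink.accept(msg);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "async-log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Never blocks the caller; returns false if the message was dropped. */
    boolean log(String msg) {
        if (queue.offer(msg)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    long droppedCount() {
        return dropped.get();
    }

    /** Stops accepting new work and waits for queued messages to be written. */
    @Override
    public void close() {
        running = false;
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Messages still queued when the process dies are lost, which is the restart risk noted above. In Log4j2 itself, making all loggers asynchronous is commonly done through configuration, by setting the log4j2.contextSelector system property to org.apache.logging.log4j.core.async.AsyncLoggerContextSelector with the Disruptor library on the classpath; the exact behavior when the buffer fills depends on the Log4j2 version and its queue-full policy.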
Checklist Summary
Collect multiple failure cases to identify common patterns and avoid false leads.
Reproduce the issue in a controlled environment that mirrors production.
Compare recent changes and hypothesize causes.
Use elimination: vary one variable at a time to see its impact.
Apply the fix, which is often a single configuration or code change.
Support findings with quantitative data to convince stakeholders.
