Why Your Java Service Hangs: Uncovering GC, Safepoint, and Log4j2 Bottlenecks

Intermittent timeouts in a high-concurrency Java service were traced to long JVM safepoint pauses involving GC, biased-lock revocation, and Log4j2 synchronization. This walkthrough shows how the stalls were diagnosed and resolved.

Java Interview Crash Guide

Jun 17, 2021

GC

In a typical high-concurrency scenario, an interface occasionally timed out. Logs showed a gap of 100 to 700 ms between the HTTP client request and JSON parsing, a span that should take less than 1 ms.

Possible causes considered were application locks (ruled out), JVM GC causing stop‑the‑world (STW), and system overload (ruled out by low load metrics).

jstat showed infrequent full GCs and normal minor GC intervals, so GC alone did not explain the gaps. However, the JVM had been started with -XX:+PrintGCApplicationStoppedTime, which logs every STW pause, not just those caused by GC.

Analysis of the GC log showed frequent, long STW pauses, sometimes occurring back-to-back, which could explain the timeouts.
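Scanning the GC log for long stops can be automated. The sketch below assumes the JDK 8 line format written by -XX:+PrintGCApplicationStoppedTime ("Total time for which application threads were stopped: ... seconds"). The class name, the 100 ms threshold, and the sample lines are hypothetical and are not taken from the incident.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StoppedTimeScanner {
    // Assumed JDK 8 format; the "Stopping threads took" part is optional here.
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9.]+) seconds"
        + "(?:, Stopping threads took: ([0-9.]+) seconds)?");

    /** Returns every stop duration (in milliseconds) longer than thresholdMs. */
    static List<Double> longPauses(List<String> lines, double thresholdMs) {
        List<Double> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find()) {
                double ms = Double.parseDouble(m.group(1)) * 1000.0;
                if (ms > thresholdMs) {
                    result.add(ms);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative lines only; real logs usually carry a timestamp prefix.
        List<String> sample = Arrays.asList(
            "Total time for which application threads were stopped: 0.0004210 seconds, Stopping threads took: 0.0000910 seconds",
            "Total time for which application threads were stopped: 0.3501234 seconds, Stopping threads took: 0.3400000 seconds");
        for (double ms : longPauses(sample, 100.0)) {
            System.out.printf("long STW pause: %.1f ms%n", ms);
        }
    }
}
```

When "Stopping threads took" accounts for most of the total, the time went into reaching the safepoint rather than into the VM operation itself, which is a hint to look beyond GC.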

Safepoint and Biased Locking

Safepoint Logs

Safepoint logs record the time spent entering and exiting STW and the steps consuming time. Enabling them with

-XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+LogVMOutput -XX:LogFile=./safepoint.log

produced logs in which the STW reason was RevokeBias, i.e., the revocation of a biased lock.

Biased Lock

Biased locking optimizes uncontended locks by biasing them toward the first acquiring thread, which avoids expensive atomic operations on later acquisitions by that thread. The bias is revoked only when another thread tries to take the lock, and revocation requires a global safepoint, which adds overhead in highly concurrent workloads.
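The access pattern that leads to revocation can be illustrated with a lock that is used first by one thread and then by another. This is a minimal sketch with hypothetical names, not code from the service. The program cannot observe revocation itself: on a JDK 8 JVM the effect would only show up in the safepoint log, and only if the lock object was allocated after the biased-locking startup delay. Biased locking is deprecated since JDK 15 and disabled by default there, so newer JVMs are not expected to show RevokeBias for this code.

```java
final class BiasHandoffDemo {
    private static final Object LOCK = new Object();
    private static int counter = 0;

    /** Acquires the same monitor repeatedly from the calling thread. */
    static void increment(int times) {
        for (int i = 0; i < times; i++) {
            synchronized (LOCK) {
                counter++;
            }
        }
    }

    /** Runs two threads one after the other against the same lock. */
    static int run() throws InterruptedException {
        counter = 0;
        // The first thread is the only user of LOCK, so the JVM may bias it.
        Thread first = new Thread(() -> increment(1000), "first-owner");
        first.start();
        first.join();
        // A different thread now takes the lock; an existing bias toward
        // the first thread has to be revoked before this can proceed.
        Thread second = new Thread(() -> increment(1000), "second-owner");
        second.start();
        second.join();
        return counter;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("increments: " + run());
    }
}
```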

Disabling biased locking with -XX:-UseBiasedLocking reduced pause frequency by half, but the problem persisted.

Log4j2

Root Cause Identification

By isolating components (HttpClient, Hystrix, Log4j2) and replacing third‑party responses with fixed data, the issue was reproduced only when Log4j2 was active, pointing to its internal locking.

Lock Analysis with BTrace

Three Log4j2 methods contain locks: rollover(), encodeText() (synchronized), and flush(). Using BTrace to instrument these methods showed that encodeText() incurred the longest execution time during the pause.

JMC Investigation

Enabling JFR in Docker and analyzing the recording in JMC revealed a 1063 ms pause in RandomAccessFile.write(), a native call that likely contributed to the STW.
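Taken together, the BTrace and JFR findings suggest a simple mechanism: when a synchronized method sits in front of a slow write, every concurrent caller queues behind the thread that holds the lock. The sketch below reproduces that effect in isolation. SlowAppender is a hypothetical stand-in, not Log4j2 code, and Thread.sleep simulates the blocking write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class LockConvoyDemo {
    /** Hypothetical appender: a single monitor guards a slow write. */
    static final class SlowAppender {
        private final long writeMillis;

        SlowAppender(long writeMillis) {
            this.writeMillis = writeMillis;
        }

        synchronized void append(String msg) throws InterruptedException {
            Thread.sleep(writeMillis); // simulated blocking disk write, lock held
        }
    }

    /** Returns the worst append() latency in ms seen by concurrent callers. */
    static long worstLatencyMillis(int threads, long writeMillis)
            throws InterruptedException {
        SlowAppender appender = new SlowAppender(writeMillis);
        AtomicLong worst = new AtomicLong();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                long start = System.nanoTime();
                try {
                    appender.append("request handled");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                worst.accumulateAndGet(elapsedMs, Math::max);
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return worst.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Each write takes about 50 ms, but the last caller also waits for
        // everyone ahead of it in the queue.
        System.out.println("worst latency (ms): " + worstLatencyMillis(4, 50));
    }
}
```

The worst latency grows with the number of callers, which matches the symptom of request threads stalling far longer than any single log write should take.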

Resolution

Reduce log volume; excessive logging can trigger the pauses.

Switch to asynchronous Log4j2 logging to avoid blocking on I/O.
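The idea behind asynchronous logging is that request threads only hand the message to a queue, and a single background thread performs the slow write. The following is a minimal sketch of that idea using a BlockingQueue. It is not how Log4j2 implements it (its async loggers are built on the LMAX Disruptor), and the class name, the in-memory sink, and the drop-when-full policy are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class AsyncLogSketch implements AutoCloseable {
    // Unique instance compared by identity, so no real message can match it.
    private static final String POISON = new String("__stop__");

    private final BlockingQueue<String> queue;
    private final List<String> sink =
        Collections.synchronizedList(new ArrayList<>());
    private final Thread writer;

    AsyncLogSketch(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
        writer = new Thread(this::drain, "log-writer");
        writer.setDaemon(true);
        writer.start();
    }

    /** Never blocks the caller; returns false if the queue is full (message dropped). */
    boolean log(String msg) {
        return queue.offer(msg);
    }

    private void drain() {
        try {
            while (true) {
                String msg = queue.take();
                if (msg == POISON) {
                    return;
                }
                sink.add(msg); // a real writer would do the slow file I/O here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Snapshot of what the background thread has written so far. */
    List<String> written() {
        return new ArrayList<>(sink);
    }

    @Override
    public void close() throws InterruptedException {
        queue.put(POISON); // queued behind earlier messages, so they are flushed first
        writer.join();
    }

    public static void main(String[] args) throws Exception {
        try (AsyncLogSketch log = new AsyncLogSketch(1024)) {
            log.log("request handled");
        }
    }
}
```

The trade-off is visible in log(): the caller is never stalled by I/O, but a full queue forces a choice between dropping messages and blocking, so reducing log volume still matters.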

Summary

The investigation highlighted a systematic debugging approach: collect more cases, reproduce in a controlled environment, form hypotheses based on recent changes, use elimination to isolate variables, and finally apply a targeted fix.

