
Why Is My Java Service Stalling? Uncovering GC, Safepoint, and Log4j2 Bottlenecks

After weeks of investigation, intermittent request timeouts in a high‑concurrency Java service were traced to frequent JVM stop‑the‑world pauses caused by GC, safepoint operations, biased‑lock revocation, and Log4j2's synchronous logging. This post walks through the steps taken to diagnose and resolve the issue.


After two weeks of debugging a high‑concurrency Java service that occasionally timed out, I discovered that long pauses between logging the HTTP client response and the subsequent JSON parsing were caused by JVM stop‑the‑world (STW) events.

GC

Three possible causes were considered: application locks, JVM GC pauses, and system overload.

Application locks were excluded because the JSON parsing code itself does not use locks.

System load was low, so OS scheduling issues were dismissed.

That left JVM GC, which is known to trigger STW pauses, as the main suspect.

Using -XX:+PrintGCApplicationStoppedTime, we enabled logging of all STW events, not just those caused by GC. The resulting GC log revealed frequent long STW pauses, sometimes lasting dozens of milliseconds, which could easily account for the observed 100‑700 ms gaps.
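Such pauses can also be observed from inside the process itself. The sketch below, in the spirit of tools like jHiccup (an illustration, not the tooling used in this investigation), sleeps on a fixed interval and reports whenever it oversleeps, which indicates that all threads were stalled, e.g. during an STW pause:

```java
public class PauseDetector {
    // Returns the number of intervals where the thread overslept by more
    // than thresholdMs beyond the requested sleep, a sign that the whole
    // JVM was stalled (e.g. a stop-the-world pause).
    public static int detectPauses(int iterations, long intervalMs,
                                   long thresholdMs) {
        int pauses = 0;
        long last = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            long now = System.nanoTime();
            long overshootMs = (now - last) / 1_000_000 - intervalMs;
            if (overshootMs > thresholdMs) {
                System.out.printf("pause of ~%d ms detected%n", overshootMs);
                pauses++;
            }
            last = now;
        }
        return pauses;
    }

    public static void main(String[] args) {
        // Watch for pauses longer than 50 ms over roughly five seconds.
        detectPauses(500, 10, 50);
    }
}
```

Unlike the JVM flags, this detects any stall (GC, safepoint, or OS scheduling) without saying which, so it complements rather than replaces the GC log.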

Safepoint and Biased Locking

Safepoint Logs

Frequent STW pauses prompted an investigation of safepoint logs. Starting the JVM with -XX:+UnlockDiagnosticVMOptions -XX:+PrintSafepointStatistics -XX:+LogVMOutput -XX:LogFile=./safepoint.log writes safepoint details to safepoint.log.

In the resulting log, the vmop column records the operation that triggered each safepoint; the dominant entry was RevokeBias, i.e., biased‑lock revocation.

Biased Locking

Biased locking is a JVM optimization that assumes a lock will be repeatedly acquired by the same thread, avoiding expensive OS locking. When contention occurs, the lock must be revoked at a global safepoint, which can be costly in highly concurrent workloads.
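The access pattern that forces revocation can be sketched in a few lines (a hypothetical illustration; the class and method names are mine, not from the service in question):

```java
public class BiasRevocationDemo {
    private static final Object lock = new Object();
    private static long counter;

    public static long run(int iterations) {
        counter = 0;
        // Phase 1: only this thread locks the object, so a biased-locking
        // JVM biases the monitor toward it and each acquisition is cheap.
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) { counter++; }
        }
        // Phase 2: a second thread acquires the same monitor. On JVMs with
        // biased locking enabled (the default before JDK 15), the bias must
        // be revoked, which may require a global safepoint -- the source of
        // the RevokeBias entries seen in safepoint.log.
        Thread other = new Thread(() -> {
            synchronized (lock) { counter++; }
        });
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println("total acquisitions: " + run(1_000_000));
    }
}
```

In a workload with many locks touched by many threads, this revocation pattern repeats constantly, which is why it shows up as frequent safepoints.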

Disabling biased locking with -XX:-UseBiasedLocking reduced the pause frequency by about half, but the problem persisted.

Log4j2

Root Cause

We hypothesized three possible culprits: HttpClient, Hystrix, and Log4j2. Replacing the third‑party call with fixed data and removing Hystrix while keeping Log4j2 left the issue in place, confirming Log4j2 as the source of the stalls.

Using BTrace to Profile Log4j2 Locks

Log4j2 contains three lock‑protected methods:

rollover() – locks during log file rotation.

encodeText() – synchronizes conversion of log messages to byte arrays.

flush() – synchronizes to avoid interleaved output.

Injecting timing probes with BTrace showed that encodeText() was the slowest during load, but its code itself showed no obvious defects.
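BTrace attaches such probes to a live JVM without code changes. As a rough illustration of what a probe measures, the same idea can be sketched in plain Java by timing a synchronized encoding method; EncodeTiming and its encodeText() stand‑in are hypothetical, not Log4j2's actual internals:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class EncodeTiming {
    private static final AtomicLong maxNanos = new AtomicLong();

    // Stand-in for a synchronized text-encoding step: converts a log
    // message to bytes while holding the class monitor, so a slow call
    // here stalls every logging thread queued behind the lock.
    static synchronized byte[] encodeText(String msg) {
        long start = System.nanoTime();           // probe: method entry
        byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
        long elapsed = System.nanoTime() - start; // probe: method exit
        maxNanos.accumulateAndGet(elapsed, Math::max);
        return bytes;
    }

    public static long maxObservedNanos() {
        return maxNanos.get();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10_000; i++) {
            encodeText("request finished in " + i + " ms");
        }
        System.out.println("slowest encode: " + maxNanos.get() + " ns");
    }
}
```

The key observation is the same one the BTrace probes gave us: the method body is cheap, so a long elapsed time must come from something outside the code, such as a blocked write further down the call chain.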

Analyzing with JMC

To go deeper, we enabled Java Flight Recorder and remote JMX via environment variables and attached Java Mission Control (JMC), which captured a long‑running native call:

<code>environment:
   - JFR=true
   - JMX_PORT=port
   - JMX_HOST=ip
   - JMX_LOGIN=user:pwd
</code>

The RandomAccessFile.write() call took over one second, and its thread ID matched the thread observed during the pause, suggesting the bottleneck was in synchronous logging.

Resolution

Reducing the amount of synchronous logging and switching to Log4j2’s asynchronous appender eliminated the pauses.
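For reference, a minimal Log4j2 configuration that routes a rolling file appender through an AsyncAppender might look like the following; the appender names, pattern, and file paths are placeholders, not the service's actual configuration:

```xml
<Configuration status="warn">
  <Appenders>
    <RollingFile name="File" fileName="logs/app.log"
                 filePattern="logs/app-%d{yyyy-MM-dd}.log">
      <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
      <Policies>
        <TimeBasedTriggeringPolicy/>
      </Policies>
    </RollingFile>
    <!-- Async wraps the file appender: callers enqueue events and return
         immediately instead of blocking on disk writes. -->
    <Async name="AsyncFile">
      <AppenderRef ref="File"/>
    </Async>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="AsyncFile"/>
    </Root>
  </Loggers>
</Configuration>
```

Fully asynchronous loggers (via the AsyncLoggerContextSelector and the LMAX Disruptor) are a further option when even the enqueue path must stay off the request thread.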

Takeaways

Analyze multiple cases to find common patterns and avoid chasing false leads.

Reproduce the issue in a controlled environment that mirrors production.

Form hypotheses based on recent changes and compare across services.

Use elimination to isolate the variable causing the problem.

Apply a simple fix once the root cause is identified.

Support conclusions with quantitative data.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
