Uncovering Hidden Performance Bottlenecks: From GC Issues to Netty Thread Blocking
This article walks through a real‑world performance investigation, covering CMS GC misconfigurations, biased‑lock revocation delays, Netty worker thread queuing, and log‑framework overhead, while providing concrete JVM flags, code snippets, and diagnostic steps to pinpoint and resolve latency spikes.
Preface
In Case 1, the author described troubleshooting a CMS GC problem that turned out to be caused by a c3p0 configuration error. Beyond that, a number of typical performance issues often require deeper analysis:
Excessive biased‑lock revocation time
Incorrect Netty usage causing blocking in non‑blocking I/O threads
Poor performance of the machine hosting Redis
GC‑induced timeouts that need memory‑snapshot analysis
HTTP/1.1 head‑of‑line blocking
Thread‑pool tasks blocked by queuing (e.g., Hystrix dashboard can help)
Threads unable to obtain CPU time slices
Proxy (e.g., Nginx) performance or configuration issues
This article focuses on a few representative problems from that list.
Problem Discovery: Client Runtime Significantly Higher Than Server
In Case 1 we fixed rapid old-generation growth caused by a wrong c3p0 setting, which had been producing long CMS GC pauses. Yet monitoring still showed the original symptom: the client-side elapsed time remained significantly higher than the server-side processing time.
The communication between A and B is simple: A sends requests with HttpClient, and B serves HTTP with Netty. After A calls B, however, B does not begin processing immediately. Based on the experience from Case 1, the following possible causes were prioritized.
Is B hitting a stop-the-world event? With the GC issue already solved, the focus shifts to other JVM operations that can cause timeouts. GC triggers a stop-the-world by forcing all threads to reach a safepoint, but it is not the only trigger: revoking biased locks or attaching a debugger also requires a safepoint.
Add the following JVM parameters to capture safepoint information:
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=500
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCTimeStamps

Other scenarios such as Netty, HttpClient, or Hystrix queuing may also cause timeouts (see Case 1 for details).
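With these flags in place, the JVM reports each pause on stdout. On JDK 8 the output looks roughly like the following (an illustrative reconstruction of the format, not the actual incident log; values will differ):

Total time for which application threads were stopped: 1.5421578 seconds, Stopping threads took: 0.0001234 seconds
         vmop                [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
123.456: RevokeBias          [      65          1              2    ]        [    0     0     2       0  1540 ]  0

The vmop column names the VM operation that forced the safepoint; a long sync value means threads were slow to reach the safepoint, while a long vmop value means the operation itself was expensive.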
Problem 1: Irregular Log Output Disrupts Debugging
After the safepoint-related flags were added, only some of the expected log lines were visible during troubleshooting. The root cause was that an operational script redirected stdout to a rolling file managed by Logback.
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>server.%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>15</maxHistory>
</rollingPolicy>

When the service restarts (e.g., at 2020-06-10 08:00), Logback renames server.log to server.2020-06-10.log and creates a fresh server.log. The shell's stdout redirection, however, still holds a file descriptor to the original file, and renaming a file does not change which file an open descriptor points to. Subsequent stdout output therefore keeps landing in the renamed archive while Logback writes to the new server.log, splitting the logs across files.
Recommended practice:
Avoid writing logs to stdout in production.
Do not configure any logging framework to output to stdout.
If stdout must be used (e.g., for JVM-internal output such as GC or safepoint logs), give it its own file and consider appending a timestamp to the file name to avoid collisions across restarts (see the example after this list).
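For example (an illustrative setting, not from the original article), HotSpot can send GC logs to a dedicated file whose name embeds the JVM start time:

-Xloggc:/data/logs/gc-%t.log

On JDK 8, %t expands to the JVM start timestamp (producing names like gc-2020-06-10_08-00-00.log), so each restart writes to its own file instead of colliding with an earlier one.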
Problem 2: Biased‑Lock Revocation Causing Timeouts
GC logs showed a 1.5 s stop‑the‑world pause, and safepoint logs displayed RevokeBias taking ~1542 ms.
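For context, the sketch below (illustrative only, not the application's code) shows how a bias revocation arises. Biased locking makes repeated uncontended locking by one thread nearly free, but the bias must be revoked, at a safepoint, as soon as another thread wants the same monitor:

public class BiasedLockDemo {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        // The main thread synchronizes uncontended: HotSpot may bias LOCK
        // toward it, making later acquisitions by the same thread nearly free.
        synchronized (LOCK) { }

        Thread t2 = new Thread(() -> {
            // A different thread now needs the same monitor. Before it can
            // lock, the JVM must revoke the existing bias, and that revocation
            // runs at a safepoint, i.e., with application threads stopped.
            synchronized (LOCK) { }
        });
        t2.start();
        t2.join();
    }
}

A single revocation is cheap; the pauses hurt when revocations are frequent or when many threads must first be brought to the safepoint.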
Since no Java thread was identified as the culprit, the issue was attributed to internal JVM operations. To eliminate the cost, the following flag disables biased locking:
-XX:-UseBiasedLocking

Problem 3: Netty Worker Thread Blocking
Even after disabling biased locking, timeouts persisted. The investigation revealed that Netty’s worker threads sometimes processed blocking business logic.
The worker thread pool size is calculated as:
private int workerThreadSize = Runtime.getRuntime().availableProcessors() * 2 - 1;

On a machine with 48 logical cores this yields 95 worker threads, which reduces queuing but does not eliminate it. The correct approach is to separate business-logic execution into its own thread pool, keeping Netty workers dedicated to non-blocking I/O, as sketched below.
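A minimal sketch of that separation, assuming a hypothetical BusinessHandler that contains the blocking logic (handler names and pool size are illustrative):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class HttpServerInitializer extends ChannelInitializer<SocketChannel> {

    // Dedicated pool for blocking business logic, sized independently of the
    // I/O workers (64 threads here is an illustrative choice).
    private static final EventExecutorGroup BUSINESS_GROUP =
            new DefaultEventExecutorGroup(64);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new HttpServerCodec())              // runs on the I/O worker
          .addLast(new HttpObjectAggregator(1 << 20))  // runs on the I/O worker
          // Passing an EventExecutorGroup here tells Netty to invoke the
          // handler's callbacks on BUSINESS_GROUP instead of the I/O worker,
          // so slow business code no longer stalls the event loop.
          .addLast(BUSINESS_GROUP, "business", new BusinessHandler());
    }

    // Hypothetical handler holding the blocking business logic.
    private static final class BusinessHandler
            extends SimpleChannelInboundHandler<FullHttpRequest> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest req) {
            // Blocking work (DB queries, downstream RPC, ...) is safe here,
            // because this method executes on BUSINESS_GROUP.
            // (Building and writing the HTTP response is elided.)
        }
    }
}

With this split, the worker count no longer needs to scale with business latency: the event loops stay responsive, and the business pool can be sized and monitored on its own.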
Problem 4: Log‑Framework Performance Overhead
At midnight, runtime spikes were observed. Profiling with SkyWalking pointed to Logback’s daily rolling operation. The relevant Logback configuration:
<file>${server_log_home}/${server_name}.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>${server_log_home}/${server_name}-%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>15</maxHistory>
</rollingPolicy>

When the clock hits 00:00, Logback acquires a lock, creates the new log file, renames the previous one, and deletes archives older than maxHistory. Deleting a large log directory (≈100 GB here) can take several seconds, and because the lock is held for the duration, every thread that tries to log blocks behind it, hence the midnight latency spikes. Upgrading to Logback 1.2.3, which deletes old archives asynchronously, resolved the issue.
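The fix shipped inside Logback, but the underlying idea is easy to see in isolation. Below is a minimal sketch (hypothetical class and parameter names, not Logback's code) of moving expired-archive deletion off the logging path onto a background thread:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class LogArchiveCleaner {

    // One daemon thread for cleanup: deleting a ~100 GB archive directory can
    // take seconds, and doing it inline would stall whichever thread
    // triggered the rollover (plus everyone waiting on the appender lock).
    private static final ExecutorService CLEANER =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "log-archive-cleaner");
                t.setDaemon(true);
                return t;
            });

    /** Deletes files in archiveDir older than maxAgeMillis, asynchronously. */
    public static void cleanAsync(Path archiveDir, long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        CLEANER.submit(() -> {
            try (Stream<Path> files = Files.list(archiveDir)) {
                files.filter(p -> {
                    try {
                        return Files.getLastModifiedTime(p).toMillis() < cutoff;
                    } catch (IOException e) {
                        return false; // skip unreadable entries
                    }
                }).forEach(p -> {
                    try {
                        Files.deleteIfExists(p); // best effort; retried next run
                    } catch (IOException ignored) {
                    }
                });
            } catch (IOException ignored) {
                // directory missing or unreadable: nothing to clean
            }
        });
    }
}

Rollover then only renames and creates files, both fast operations, while deletion proceeds in the background.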
Summary
The three representative performance problems covered here (CMS GC pauses, biased-lock revocation, and Netty worker-thread blocking) can coexist, and all manifest the same way: a large gap between the latency the client observes and the time the server believes it spent. They share a common root: the request has already arrived in the OS-level TCP buffer, but application-level blocking, whether a JVM pause or a busy worker thread, delays the moment the server actually starts processing it. Log-framework overhead is a further hidden source of latency, mitigated by asynchronous log rolling and disciplined logging practices.
Sohu Smart Platform Tech Team
The Sohu News app's technical sharing hub, offering deep tech analyses, the latest industry news, and fun developer anecdotes. Follow us to discover the team's daily joys.