Uncovering Hidden Performance Bottlenecks: From GC Issues to Netty Thread Blocking
This article walks through a real‑world performance investigation, covering CMS GC misconfigurations, biased‑lock revocation delays, Netty worker thread queuing, and log‑framework overhead, while providing concrete JVM flags, code snippets, and diagnostic steps to pinpoint and resolve latency spikes.
Preface
In Case 1, the author described troubleshooting a CMS GC problem that turned out to be caused by a c3p0 configuration error. Beyond that, a number of typical performance issues often require deeper analysis:
Excessive biased‑lock revocation time
Incorrect Netty usage causing blocking in non‑blocking I/O threads
Poor performance of the machine hosting Redis
GC‑induced timeouts that need memory‑snapshot analysis
HTTP/1.1 head‑of‑line blocking
Thread‑pool tasks blocked by queuing (e.g., Hystrix dashboard can help)
Threads unable to obtain CPU time slices
Proxy (e.g., Nginx) performance or configuration issues
This article focuses on a few representative problems from that list.
Problem Discovery: Client Runtime Significantly Higher Than Server
In Case 1 we fixed rapid old-generation growth caused by a wrong c3p0 setting, which had been producing long CMS GC pauses. Yet monitoring still showed the original symptom: the client-side elapsed time remained significantly higher than the server-side processing time.
The communication between A and B is simple: A sends requests with HttpClient, and B serves HTTP with Netty. After A calls B, however, B does not begin processing immediately. Based on the experience from Case 1, the following possible causes were prioritized.
Is B hitting a stop-the-world event? With the GC issue already solved, the focus shifts to other JVM operations that can cause timeouts. GC triggers a stop-the-world by forcing all threads to reach a safepoint, but it is not the only trigger: revoking biased locks or attaching a debugger also requires a safepoint.
Add the following JVM parameters to capture safepoint information:
-XX:+SafepointTimeout
-XX:SafepointTimeoutDelay=500
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=1
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCTimeStamps

Other scenarios such as Netty, HttpClient, or Hystrix queuing may also cause timeouts (see Case 1 for details).
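With these flags in place, the JVM reports each pause on stdout. On JDK 8 the output looks roughly like the following (an illustrative reconstruction of the format, not the actual incident log; values will differ):

Total time for which application threads were stopped: 1.5421578 seconds, Stopping threads took: 0.0001234 seconds
         vmop                [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
123.456: RevokeBias          [      65          1              2    ]        [    0     0     2       0  1540 ]  0

The vmop column names the VM operation that forced the safepoint; a long sync value means threads were slow to reach the safepoint, while a long vmop value means the operation itself was expensive.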
Problem 1: Irregular Log Output Disrupts Debugging
After the safepoint-related flags were added, only some of the expected log lines were visible during troubleshooting. The root cause was that an operational script redirected stdout to a rolling file managed by Logback.
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>server.%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>15</maxHistory>
</rollingPolicy>

When the service restarts (e.g., at 2020-06-10 08:00), Logback renames server.log to server.2020-06-10.log and creates a fresh server.log. The shell's stdout redirection, however, still holds a file descriptor to the original file, and renaming a file does not change which file an open descriptor points to. Subsequent stdout output therefore keeps landing in the renamed archive while Logback writes to the new server.log, splitting the logs across files.
Recommended practice:
Avoid writing logs to stdout in production.
Do not configure any logging framework to output to stdout.
If stdout must be used (e.g., for JVM-internal output such as GC or safepoint logs), give it its own file and consider appending a timestamp to the file name to avoid collisions across restarts (see the example after this list).
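For example (an illustrative setting, not from the original article), HotSpot can send GC logs to a dedicated file whose name embeds the JVM start time:

-Xloggc:/data/logs/gc-%t.log

On JDK 8, %t expands to the JVM start timestamp (producing names like gc-2020-06-10_08-00-00.log), so each restart writes to its own file instead of colliding with an earlier one.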
Problem 2: Biased‑Lock Revocation Causing Timeouts
GC logs showed a 1.5 s stop‑the‑world pause, and safepoint logs displayed RevokeBias taking ~1542 ms.
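For context, the sketch below (illustrative only, not the application's code) shows how a bias revocation arises. Biased locking makes repeated uncontended locking by one thread nearly free, but the bias must be revoked, at a safepoint, as soon as another thread wants the same monitor:

public class BiasedLockDemo {
    private static final Object LOCK = new Object();

    public static void main(String[] args) throws InterruptedException {
        // The main thread synchronizes uncontended: HotSpot may bias LOCK
        // toward it, making later acquisitions by the same thread nearly free.
        synchronized (LOCK) { }

        Thread t2 = new Thread(() -> {
            // A different thread now needs the same monitor. Before it can
            // lock, the JVM must revoke the existing bias, and that revocation
            // runs at a safepoint, i.e., with application threads stopped.
            synchronized (LOCK) { }
        });
        t2.start();
        t2.join();
    }
}

A single revocation is cheap; the pauses hurt when revocations are frequent or when many threads must first be brought to the safepoint.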
Since no Java thread was identified as the culprit, the issue was attributed to internal JVM operations. To eliminate the cost, the following flag disables biased locking:
-XX:-UseBiasedLocking

Problem 3: Netty Worker Thread Blocking
Even after disabling biased locking, timeouts persisted. The investigation revealed that Netty’s worker threads sometimes processed blocking business logic.
The worker thread pool size is calculated as:
private int workerThreadSize = Runtime.getRuntime().availableProcessors() * 2 - 1;

On a machine with 48 logical cores this yields 95 worker threads, which reduces queuing but does not eliminate it. The correct approach is to separate business-logic execution into its own thread pool, keeping Netty workers dedicated to non-blocking I/O, as sketched below.
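A minimal sketch of that separation, assuming a hypothetical BusinessHandler that contains the blocking logic (handler names and pool size are illustrative):

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.SimpleChannelInboundHandler;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.FullHttpRequest;
import io.netty.handler.codec.http.HttpObjectAggregator;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class HttpServerInitializer extends ChannelInitializer<SocketChannel> {

    // Dedicated pool for blocking business logic, sized independently of the
    // I/O workers (64 threads here is an illustrative choice).
    private static final EventExecutorGroup BUSINESS_GROUP =
            new DefaultEventExecutorGroup(64);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
          .addLast(new HttpServerCodec())              // runs on the I/O worker
          .addLast(new HttpObjectAggregator(1 << 20))  // runs on the I/O worker
          // Passing an EventExecutorGroup here tells Netty to invoke the
          // handler's callbacks on BUSINESS_GROUP instead of the I/O worker,
          // so slow business code no longer stalls the event loop.
          .addLast(BUSINESS_GROUP, "business", new BusinessHandler());
    }

    // Hypothetical handler holding the blocking business logic.
    private static final class BusinessHandler
            extends SimpleChannelInboundHandler<FullHttpRequest> {
        @Override
        protected void channelRead0(ChannelHandlerContext ctx, FullHttpRequest req) {
            // Blocking work (DB queries, downstream RPC, ...) is safe here,
            // because this method executes on BUSINESS_GROUP.
            // (Building and writing the HTTP response is elided.)
        }
    }
}

With this split, the worker count no longer needs to scale with business latency: the event loops stay responsive, and the business pool can be sized and monitored on its own.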
Problem 4: Log‑Framework Performance Overhead
At midnight, runtime spikes were observed. Profiling with SkyWalking pointed to Logback’s daily rolling operation. The relevant Logback configuration:
<file>${server_log_home}/${server_name}.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<fileNamePattern>${server_log_home}/${server_name}-%d{yyyy-MM-dd}.log</fileNamePattern>
<maxHistory>15</maxHistory>
</rollingPolicy>

When the clock hits 00:00, Logback acquires a lock, creates the new log file, renames the previous one, and deletes archives older than maxHistory. Deleting a large log directory (≈100 GB here) can take several seconds, and because the lock is held for the duration, every thread that tries to log blocks behind it, hence the midnight latency spikes. Upgrading to Logback 1.2.3, which deletes old archives asynchronously, resolved the issue.
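The fix shipped inside Logback, but the underlying idea is easy to see in isolation. Below is a minimal sketch (hypothetical class and parameter names, not Logback's code) of moving expired-archive deletion off the logging path onto a background thread:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class LogArchiveCleaner {

    // One daemon thread for cleanup: deleting a ~100 GB archive directory can
    // take seconds, and doing it inline would stall whichever thread
    // triggered the rollover (plus everyone waiting on the appender lock).
    private static final ExecutorService CLEANER =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "log-archive-cleaner");
                t.setDaemon(true);
                return t;
            });

    /** Deletes files in archiveDir older than maxAgeMillis, asynchronously. */
    public static void cleanAsync(Path archiveDir, long maxAgeMillis) {
        long cutoff = System.currentTimeMillis() - maxAgeMillis;
        CLEANER.submit(() -> {
            try (Stream<Path> files = Files.list(archiveDir)) {
                files.filter(p -> {
                    try {
                        return Files.getLastModifiedTime(p).toMillis() < cutoff;
                    } catch (IOException e) {
                        return false; // skip unreadable entries
                    }
                }).forEach(p -> {
                    try {
                        Files.deleteIfExists(p); // best effort; retried next run
                    } catch (IOException ignored) {
                    }
                });
            } catch (IOException ignored) {
                // directory missing or unreadable: nothing to clean
            }
        });
    }
}

Rollover then only renames and creates files, both fast operations, while deletion proceeds in the background.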
Summary
The three representative performance problems covered here (CMS GC pauses, biased-lock revocation, and Netty worker-thread blocking) can coexist, and all manifest the same way: a large gap between the latency the client observes and the time the server believes it spent. They share a common root: the request has already arrived in the OS-level TCP buffer, but application-level blocking, whether a JVM pause or a busy worker thread, delays the moment the server actually starts processing it. Log-framework overhead is a further hidden source of latency, mitigated by asynchronous log rolling and disciplined logging practices.
Sohu Smart Platform Tech Team
The Sohu News app's technical sharing hub, offering deep tech analyses, the latest industry news, and fun developer anecdotes. Follow us to discover the team's daily joys.