How a Targeted Java Refactor Delivered a 10× Performance Boost

By profiling a three-year-old order service and applying six data-driven optimizations (log reduction, object-allocation cuts, replacing a boxed HashMap with a primitive map, Java 21 virtual threads, JSON caching, and ZGC tuning), the team achieved a 9.5× throughput increase and a more than ten-fold drop in P99 latency.

LuTiao Programming

Baseline

The order‑processing service handled thousands of requests per minute with the following symptoms:

P99 latency ≈ 800 ms

High memory usage

Frequent Minor GC pauses

Occasional thread‑pool saturation during peak load

Bottleneck analysis

Using async-profiler flame graphs and Java Flight Recorder (JFR), the team identified:

≈ 40 % of CPU time spent on string concatenation and logging

Object allocation rate ≈ 2 GB/min

Unnecessary synchronized blocks on several hot paths

Each request deserialized a JSON configuration object

These data‑driven findings guided the subsequent refactor.

Optimization 1 – Remove eager logging

Original code performed eager string concatenation even when DEBUG was disabled:

logger.debug("Processing order: " + order.toString());

Replaced with SLF4J parameterized logging, which formats the message (and calls toString()) only when DEBUG is enabled; non-essential logs were removed:

logger.debug("Processing order: {}", order);

Result: hot‑path CPU usage reduced by ~15 %.
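The difference between eager and deferred message construction can be checked with a small JDK-only experiment. The sketch below uses java.util.logging as a stand-in for SLF4J (an assumption, so that the example runs without dependencies); the Order class and its call counter are hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class LazyLoggingSketch {

    /** Hypothetical order type that counts how often toString() is invoked. */
    static final class Order {
        int toStringCalls = 0;

        @Override
        public String toString() {
            toStringCalls++;
            return "Order#42";
        }
    }

    /** Eager style: the argument string is built before the level check happens. */
    static int eagerToStringCalls() {
        Logger logger = Logger.getLogger("sketch.eager");
        logger.setLevel(Level.INFO); // FINE (debug-like) messages are disabled
        Order order = new Order();
        logger.fine("Processing order: " + order);
        return order.toStringCalls;
    }

    /** Deferred style: the supplier runs only if FINE is enabled. */
    static int lazyToStringCalls() {
        Logger logger = Logger.getLogger("sketch.lazy");
        logger.setLevel(Level.INFO);
        Order order = new Order();
        logger.fine(() -> "Processing order: " + order);
        return order.toStringCalls;
    }

    public static void main(String[] args) {
        System.out.println("eager toString calls: " + eagerToStringCalls()); // 1
        System.out.println("lazy toString calls:  " + lazyToStringCalls());  // 0
    }
}
```

With the level disabled, the eager variant still pays for toString() and the concatenation, while the deferred variant does no formatting work at all.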

Optimization 2 – Reduce object allocation

Original Stream pipeline created many intermediate objects:

List<Result> results = orders.stream()
    .map(this::transform)
    .filter(Objects::nonNull)
    .collect(Collectors.toList());

Rewritten as a hand-written loop that fills a pre-sized list; a lightweight object pool was also introduced for high-frequency objects (not shown here):

List<Result> results = new ArrayList<>(orders.size());
for (Order order : orders) {
    Result r = transform(order);
    if (r != null) {
        results.add(r);
    }
}

JFR showed allocation dropping from ~2 GB/min to ~380 MB/min, GC frequency down 70 %, and noticeable improvement in P99 latency stability.
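The object pool is mentioned but not shown. A minimal sketch of what such a pool could look like follows; SimplePool, its size cap, and the StringBuilder usage are illustrative assumptions, not the team's implementation. It is not thread-safe, so each instance would have to be confined to a single thread.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

/** Hypothetical single-threaded object pool with a bounded number of idle instances. */
final class SimplePool<T> {
    private final ArrayDeque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final int maxIdle;

    SimplePool(Supplier<T> factory, int maxIdle) {
        this.factory = factory;
        this.maxIdle = maxIdle;
    }

    /** Returns an idle instance if one exists, otherwise creates a new one. */
    T acquire() {
        T instance = idle.pollFirst();
        return instance != null ? instance : factory.get();
    }

    /** Hands an instance back; it is dropped if the pool is already full. */
    void release(T instance) {
        if (idle.size() < maxIdle) {
            idle.addFirst(instance);
        }
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool = new SimplePool<>(StringBuilder::new, 8);
        StringBuilder sb = pool.acquire();
        sb.append("order-").append(42);
        System.out.println(sb);
        sb.setLength(0);   // the caller resets state before returning the object
        pool.release(sb);
        System.out.println(pool.acquire() == sb); // true: the instance was reused
    }
}
```

Pooling only pays off for objects that are expensive to create or allocated at very high rates; for short-lived small objects the JVM's allocator is usually fast enough, so a change like this is worth confirming with the profiler.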

Optimization 3 – Primitive map to avoid boxing

Original cache used Map<Integer, Long>, incurring boxing of primitive keys and values:

Map<Integer, Long> cache = new HashMap<>();

Replaced with Eclipse Collections' primitive map:

MutableIntLongMap cache = new IntLongHashMap();

Result: cache-intensive request throughput increased by ~20 %.
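To illustrate why a primitive map avoids boxing, here is a simplified open-addressing int-to-long map built on plain arrays. It is a teaching sketch under stated assumptions (linear probing, load factor capped at 0.5, no removal), not how Eclipse Collections implements IntLongHashMap; the class and method names are hypothetical.

```java
/** Hypothetical int -> long map that stores keys and values in primitive arrays. */
final class IntLongMapSketch {
    private int[] keys;
    private long[] values;
    private boolean[] used;
    private int size;

    IntLongMapSketch(int expectedSize) {
        int capacity = 16;
        while (capacity < expectedSize * 2) {
            capacity <<= 1; // keep capacity a power of two
        }
        keys = new int[capacity];
        values = new long[capacity];
        used = new boolean[capacity];
    }

    /** Finds the slot holding the key, or the first free slot on its probe path. */
    private int slot(int key) {
        int mask = keys.length - 1;
        int h = key * 0x9E3779B9;          // cheap hash mixing
        int i = (h ^ (h >>> 16)) & mask;
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;            // linear probing
        }
        return i;
    }

    void put(int key, long value) {
        if ((size + 1) * 2 > keys.length) {
            resize();                      // load factor stays at or below 0.5
        }
        int i = slot(key);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    long getOrDefault(int key, long fallback) {
        int i = slot(key);
        return used[i] ? values[i] : fallback;
    }

    int size() {
        return size;
    }

    private void resize() {
        int[] oldKeys = keys;
        long[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        values = new long[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) {
                put(oldKeys[j], oldValues[j]);
            }
        }
    }

    public static void main(String[] args) {
        IntLongMapSketch cache = new IntLongMapSketch(4);
        cache.put(7, 1_000L);
        System.out.println(cache.getOrDefault(7, -1L)); // 1000
        System.out.println(cache.getOrDefault(8, -1L)); // -1
    }
}
```

No Integer or Long objects are created on put or lookup, which is the source of the allocation and cache-locality savings compared with Map<Integer, Long>.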

Optimization 4 – Java 21 virtual threads

Upgraded runtime from Java 17 to Java 21 and swapped the fixed thread pool:

ExecutorService executor = Executors.newFixedThreadPool(200);

with a virtual‑thread executor:

ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

In I/O‑bound workloads this removed the concurrency ceiling, yielding a 3× peak‑throughput increase, eliminating thread‑pool exhaustion, and removing the need for pool‑size tuning.
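A small, self-contained sketch of the virtual-thread executor under a blocking workload is shown here. It requires Java 21; the task count and the sleep that stands in for I/O are assumptions chosen for illustration, not figures from the service.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class VirtualThreadSketch {

    /** Runs taskCount blocking tasks, one virtual thread each, and returns how many completed. */
    static int runBlockingTasks(int taskCount, Duration blockTime) {
        List<Future<Integer>> futures = new ArrayList<>(taskCount);
        // ExecutorService is AutoCloseable; close() waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < taskCount; i++) {
                futures.add(executor.submit(() -> {
                    Thread.sleep(blockTime); // stand-in for a blocking I/O call
                    return 1;
                }));
            }
        }
        int completed = 0;
        for (Future<Integer> future : futures) {
            try {
                completed += future.get();
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runBlockingTasks(10_000, Duration.ofMillis(100));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks completed in " + elapsedMs + " ms");
    }
}
```

A fixed pool of 200 platform threads would need at least 50 batches to get through 10,000 such tasks, whereas virtual threads let them all block concurrently. The benefit applies to blocking workloads; CPU-bound tasks remain limited by the number of cores.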

Optimization 5 – Cache JSON deserialization

Added a 60‑second TTL cache with double‑checked locking to avoid deserializing the configuration on every request:

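// Both fields are volatile so the unsynchronized fast path sees fresh values.
private volatile CachedConfig cachedConfig;
private volatile long cacheTimestamp; // epoch millis of the last load; 0 = never loaded

private CachedConfig getConfig() {
    long now = System.currentTimeMillis();
    // First check, without the lock: the common case when the cache is fresh.
    if (now - cacheTimestamp > 60_000) {
        synchronized (this) {
            // Second check, under the lock: another thread may have refreshed already.
            if (now - cacheTimestamp > 60_000) {
                cachedConfig = deserialize(fetchRaw());
                cacheTimestamp = now; // written last, after the config is in place
            }
        }
    }
    return cachedConfig;
}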

Result: average per‑request latency reduced by ~5 ms.
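One possible variant of the same idea is sketched below. It keeps the value and its load time in a single immutable record, so a reader can never observe one without the other, and it uses a ReentrantLock instead of synchronized because on Java 21 a virtual thread that blocks inside a synchronized block pins its carrier thread. TtlCache, the injectable clock, and the loader are hypothetical names introduced for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical single-value cache that reloads after a time-to-live expires. */
final class TtlCache<T> {
    private record Entry<V>(V value, long loadedAtNanos) {}

    private final Supplier<T> loader;
    private final long ttlNanos;
    private final LongSupplier nanoClock;   // injectable so tests can control time
    private final ReentrantLock lock = new ReentrantLock();
    private volatile Entry<T> entry;        // null until the first load

    TtlCache(Supplier<T> loader, long ttlNanos, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    T get() {
        Entry<T> current = entry;
        // Fast path without the lock: the entry exists and is still fresh.
        if (current != null && nanoClock.getAsLong() - current.loadedAtNanos() < ttlNanos) {
            return current.value();
        }
        lock.lock();
        try {
            // Re-check under the lock: another thread may have reloaded already.
            current = entry;
            long now = nanoClock.getAsLong();
            if (current == null || now - current.loadedAtNanos() >= ttlNanos) {
                current = new Entry<>(loader.get(), now);
                entry = current;
            }
            return current.value();
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        TtlCache<String> config =
                new TtlCache<>(() -> "parsed-config", 60_000_000_000L, System::nanoTime);
        System.out.println(config.get()); // loads once
        System.out.println(config.get()); // served from the cache
    }
}
```

Passing System::nanoTime as the clock gives a monotonic time source, so the TTL is unaffected by wall-clock adjustments.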

Optimization 6 – GC tuning (G1 → ZGC)

After lowering allocation pressure, the GC was switched to ZGC with a fixed 4 GB heap:

-Xms4g
-Xmx4g
-XX:+UseZGC

ZGC targets sub-millisecond pauses; the maximum GC pause measured here fell from 340 ms to 8 ms, and latency spikes virtually disappeared.
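After changing GC flags, it can be useful to confirm which collector the JVM is actually running. The sketch below lists the garbage-collector MXBeans exposed by the running JVM; the bean names vary by JVM version and collector, so none are hard-coded, and the GcInfoSketch class is a hypothetical helper.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcInfoSketch {

    /** Returns one summary line per garbage collector registered in this JVM. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalTimeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

getCollectionTime() reports accumulated collection time, not the longest single pause, so maximum-pause figures like the ones above would still come from GC logs or JFR.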

Final metrics

P50 latency: 120 ms → 18 ms

P99 latency: 800 ms → 75 ms

Throughput: 1,200 req/s → 11,400 req/s

Allocation rate: ~2 GB/min → ~380 MB/min

Max GC pause: 340 ms → 8 ms
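The headline factors can be recomputed from the figures above. The short sketch below does the arithmetic; the class and method names are hypothetical, and the allocation rate is left out because the GB-to-MB conversion is not specified.

```java
import java.util.Locale;

final class ImprovementRatios {

    /** Improvement factor for metrics where lower is better (latency, pause time). */
    static double reductionFactor(double before, double after) {
        return before / after;
    }

    /** Improvement factor for metrics where higher is better (throughput). */
    static double increaseFactor(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "P50 latency:  %.1fx lower%n", reductionFactor(120, 18));
        System.out.printf(Locale.ROOT, "P99 latency:  %.1fx lower%n", reductionFactor(800, 75));
        System.out.printf(Locale.ROOT, "Throughput:   %.1fx higher%n", increaseFactor(1_200, 11_400));
        System.out.printf(Locale.ROOT, "Max GC pause: %.1fx lower%n", reductionFactor(340, 8));
    }
}
```

This gives roughly 6.7× for P50, 10.7× for P99, 9.5× for throughput, and 42.5× for the maximum GC pause, matching the 9.5× and ten-fold figures quoted in the summary.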

Key observations

Performance work was driven by profiler data, not intuition.

Many of the gains came from seemingly trivial places: logging, object allocation, caching, and GC strategy.

Upgrading to Java 21 and adopting virtual threads provided the single biggest benefit with minimal code changes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
