
How a Targeted Java Refactor Delivered a 10× Performance Boost

By profiling a three‑year‑old order service and applying six data‑driven optimizations (log reduction, object‑allocation cuts, HashMap replacement, Java 21 virtual threads, JSON caching, and ZGC tuning), the team achieved a 9.5× throughput increase and a roughly ten‑fold drop in P99 latency.

LuTiao Programming

Mar 2, 2026

Baseline

The order‑processing service handled thousands of requests per minute with the following symptoms:

P99 latency ≈ 800 ms

High memory usage

Frequent Minor GC pauses

Occasional thread‑pool saturation during peak load

Bottleneck analysis

Using async-profiler flame graphs and Java Flight Recorder (JFR), the team identified:

≈ 40 % of CPU time spent on string concatenation and logging

Object allocation rate ≈ 2 GB/min

Unnecessary synchronized blocks on several hot paths

Each request deserialized a JSON configuration object

These data‑driven findings guided the subsequent refactor.

Optimization 1 – Remove eager logging

Original code performed eager string concatenation even when DEBUG was disabled:

logger.debug("Processing order: " + order.toString());

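Replaced with SLF4J parameterized logging, which defers the toString() call until the message is actually formatted (that is, only when DEBUG is enabled); non‑essential log statements were removed:

logger.debug("Processing order: {}", order);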

Result: hot‑path CPU usage reduced by ~15 %.
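The cost difference between eager and deferred message construction can be observed directly. The following is a minimal, JDK‑only sketch that uses java.util.logging as a stand‑in for SLF4J so it runs without dependencies; the Order class and the call counter are hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

public class LazyLoggingDemo {
    // Counts how often the (hypothetical) Order.toString() is invoked.
    static final AtomicInteger TO_STRING_CALLS = new AtomicInteger();

    // Hypothetical order type, not the service's real class.
    static final class Order {
        final long id;
        Order(long id) { this.id = id; }
        @Override public String toString() {
            TO_STRING_CALLS.incrementAndGet();
            return "Order{id=" + id + "}";
        }
    }

    /** Returns {toString calls with eager concatenation, calls with a supplier}. */
    static int[] countToStringCalls() {
        Logger logger = Logger.getLogger("LazyLoggingDemo");
        logger.setLevel(Level.INFO); // FINE (the debug level) is disabled
        Order order = new Order(42);

        TO_STRING_CALLS.set(0);
        // Eager: the argument string is built before fine() checks the level.
        logger.fine("Processing order: " + order);
        int eager = TO_STRING_CALLS.get();

        TO_STRING_CALLS.set(0);
        // Deferred: the supplier runs only if FINE is enabled.
        logger.fine(() -> "Processing order: " + order);
        int lazy = TO_STRING_CALLS.get();

        return new int[] {eager, lazy};
    }

    public static void main(String[] args) {
        int[] calls = countToStringCalls();
        System.out.println("eager=" + calls[0] + ", lazy=" + calls[1]);
    }
}
```

With the level set to INFO, the eager variant still pays for one toString() call per log statement, while the deferred variant pays for none. SLF4J's {} placeholders give the same effect for the toString() call itself.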

Optimization 2 – Reduce object allocation

Original Stream pipeline created many intermediate objects:

List<Result> results = orders.stream()
    .map(this::transform)
    .filter(Objects::nonNull)
    .collect(Collectors.toList());

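Rewritten as a hand‑written loop that fills a pre‑sized list, which avoids the Stream pipeline's intermediate objects and the list's incremental resizing. The team also introduced a lightweight object pool for high‑frequency objects (the pool itself is not shown in the snippet):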

List<Result> results = new ArrayList<>(orders.size());
for (Order order : orders) {
    Result r = transform(order);
    if (r != null) {
        results.add(r);
    }
}

JFR showed allocation dropping from ~2 GB/min to ~380 MB/min, GC frequency down 70 %, and noticeable improvement in P99 latency stability.
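The source does not show the object pool, so the following is only a sketch of what a lightweight pool might look like. The class name SimplePool, its bounded size, and the reset callback are assumptions for illustration. This version is not thread‑safe; a real service would likely keep one pool per thread or add synchronization.

```java
import java.util.ArrayDeque;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical single-threaded object pool; not the article's actual implementation. */
public class SimplePool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory; // creates a new instance when the pool is empty
    private final Consumer<T> reset;   // clears an instance's state before reuse
    private final int maxSize;         // upper bound on retained instances
    private int created;

    public SimplePool(Supplier<T> factory, Consumer<T> reset, int maxSize) {
        this.factory = factory;
        this.reset = reset;
        this.maxSize = maxSize;
    }

    /** Hands out a pooled instance, or allocates one if none is free. */
    public T acquire() {
        T obj = free.pollFirst();
        if (obj == null) {
            created++;
            return factory.get();
        }
        return obj;
    }

    /** Resets the instance and keeps it for reuse unless the pool is full. */
    public void release(T obj) {
        reset.accept(obj);
        if (free.size() < maxSize) {
            free.addFirst(obj);
        }
    }

    /** Number of instances allocated so far. */
    public int createdCount() {
        return created;
    }

    public static void main(String[] args) {
        SimplePool<StringBuilder> pool =
                new SimplePool<StringBuilder>(StringBuilder::new, sb -> sb.setLength(0), 16);
        for (int i = 0; i < 1_000; i++) {
            StringBuilder sb = pool.acquire();
            sb.append("order-").append(i);
            pool.release(sb);
        }
        System.out.println("instances allocated: " + pool.createdCount());
    }
}
```

Pooling only pays off for objects that are expensive to create or allocated at very high rates; for small short‑lived objects the JVM's allocator is often faster, so a change like this is worth confirming with JFR before and after.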

Optimization 3 – Primitive map to avoid boxing

Original cache used Map<Integer, Long>, incurring boxing of primitive keys and values:

Map<Integer, Long> cache = new HashMap<>();

Replaced with Eclipse Collections’ primitive map:

MutableIntLongMap cache = new IntLongHashMap();

Result: cache‑intensive request throughput increased by ~20 %.
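To illustrate why a primitive map avoids boxing, here is a simplified, JDK‑only sketch of an int‑to‑long map backed by parallel primitive arrays with linear probing. It is not Eclipse Collections' implementation; the class name IntLongMapSketch and its methods are hypothetical, and removal is omitted for brevity.

```java
/** Simplified open-addressing int -> long map; illustrative only. */
public class IntLongMapSketch {
    private int[] keys;
    private long[] values;
    private boolean[] used; // marks occupied slots
    private int size;

    public IntLongMapSketch(int expectedSize) {
        int cap = 16;
        while (cap < expectedSize * 2) cap <<= 1; // power of two, load factor <= 0.5
        keys = new int[cap];
        values = new long[cap];
        used = new boolean[cap];
    }

    // Spreads the key bits so sequential keys do not cluster.
    private static int mix(int key) {
        int h = key * 0x9E3779B9;
        return h ^ (h >>> 16);
    }

    // Returns the slot holding the key, or the first empty slot on its probe path.
    private int findSlot(int key) {
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (used[i] && keys[i] != key) {
            i = (i + 1) & mask;
        }
        return i;
    }

    public void put(int key, long value) {
        if ((size + 1) * 2 > keys.length) resize();
        int i = findSlot(key);
        if (!used[i]) {
            used[i] = true;
            keys[i] = key;
            size++;
        }
        values[i] = value; // stored as a primitive, no Long object is created
    }

    public long getOrDefault(int key, long defaultValue) {
        int i = findSlot(key);
        return used[i] ? values[i] : defaultValue;
    }

    public int size() {
        return size;
    }

    // Doubles the table and reinserts all entries.
    private void resize() {
        int[] oldKeys = keys;
        long[] oldValues = values;
        boolean[] oldUsed = used;
        keys = new int[oldKeys.length * 2];
        values = new long[oldKeys.length * 2];
        used = new boolean[oldKeys.length * 2];
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldUsed[j]) put(oldKeys[j], oldValues[j]);
        }
    }

    public static void main(String[] args) {
        IntLongMapSketch cache = new IntLongMapSketch(4);
        for (int k = 0; k < 100; k++) cache.put(k, k * 10L);
        System.out.println(cache.getOrDefault(7, -1L) + " " + cache.size());
    }
}
```

A HashMap<Integer, Long> allocates an entry object plus boxed key and value objects for most entries, whereas the array‑backed layout stores everything inline, which reduces both allocation and pointer chasing.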

Optimization 4 – Java 21 virtual threads

Upgraded runtime from Java 17 to Java 21 and swapped the fixed thread pool:

ExecutorService executor = Executors.newFixedThreadPool(200);

with a virtual‑thread executor:

ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor();

In I/O‑bound workloads this removed the concurrency ceiling, yielding a 3× peak‑throughput increase, eliminating thread‑pool exhaustion, and removing the need for pool‑size tuning.
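As a rough illustration of the executor swap, the sketch below submits many blocking tasks to a virtual‑thread executor. It requires Java 21 or later; the class name, task count, and 50 ms sleep are assumptions standing in for the service's real I/O calls.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VirtualThreadDemo {
    /** Runs taskCount blocking tasks on virtual threads and returns how many completed. */
    static int runBlockingTasks(int taskCount) {
        int completed = 0;
        // try-with-resources: close() waits for submitted tasks to finish.
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<Integer>> futures = new ArrayList<>(taskCount);
            for (int i = 0; i < taskCount; i++) {
                futures.add(executor.submit(() -> {
                    // Simulated blocking I/O; the virtual thread is parked
                    // while it sleeps, freeing its carrier thread.
                    Thread.sleep(Duration.ofMillis(50));
                    return 1;
                }));
            }
            for (Future<Integer> f : futures) {
                completed += f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return completed;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int done = runBlockingTasks(10_000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(done + " tasks finished in " + elapsedMs + " ms");
    }
}
```

One caveat worth checking in a real migration: on JDK 21, a virtual thread that blocks inside a synchronized block pins its carrier thread, so hot paths that hold monitors during I/O may limit the benefit.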

Optimization 5 – Cache JSON deserialization

Added a 60‑second TTL cache with double‑checked locking to avoid deserializing the configuration on every request:

private volatile CachedConfig cachedConfig;
private volatile long cacheTimestamp;

private CachedConfig getConfig() {
    long now = System.currentTimeMillis();
    if (now - cacheTimestamp > 60_000) {
        synchronized (this) {
            if (now - cacheTimestamp > 60_000) {
                cachedConfig = deserialize(fetchRaw());
                cacheTimestamp = now;
            }
        }
    }
    return cachedConfig;
}

Result: average per‑request latency reduced by ~5 ms.
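A variation on the same idea keeps the value and its load time in one immutable snapshot behind a single volatile reference, so a reader can never observe a value paired with the wrong timestamp. This is a sketch under assumptions, not the team's code: TtlCache, the injected clock, and the loader are hypothetical names, and ReentrantLock is used instead of synchronized as one way to avoid carrier‑thread pinning with virtual threads on JDK 21.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical TTL cache for a single value; illustrative only. */
public class TtlCache<T> {
    // Immutable snapshot: the value and the time it was loaded.
    private record Entry<V>(V value, long loadedAtNanos) {}

    private final Supplier<T> loader;     // e.g. () -> deserialize(fetchRaw())
    private final LongSupplier nanoClock; // injectable for tests; System::nanoTime in production
    private final long ttlNanos;
    private final ReentrantLock lock = new ReentrantLock();
    private volatile Entry<T> entry;

    public TtlCache(Supplier<T> loader, long ttlNanos, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    public T get() {
        Entry<T> e = entry; // single volatile read on the fast path
        if (e != null && !isExpired(e)) {
            return e.value();
        }
        lock.lock();
        try {
            e = entry; // re-check: another thread may have refreshed already
            if (e == null || isExpired(e)) {
                e = new Entry<>(loader.get(), nanoClock.getAsLong());
                entry = e;
            }
            return e.value();
        } finally {
            lock.unlock();
        }
    }

    private boolean isExpired(Entry<T> e) {
        return nanoClock.getAsLong() - e.loadedAtNanos() > ttlNanos;
    }

    public static void main(String[] args) {
        TtlCache<String> config =
                new TtlCache<>(() -> "config-loaded", 60_000_000_000L, System::nanoTime);
        System.out.println(config.get());
        System.out.println(config.get()); // served from cache, loader not called again
    }
}
```

Using a monotonic clock such as System.nanoTime() also makes expiry independent of wall‑clock adjustments.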

Optimization 6 – GC tuning (G1 → ZGC)

After lowering allocation pressure, the GC was switched to ZGC with a fixed 4 GB heap:

-Xms4g
-Xmx4g
-XX:+UseZGC

ZGC is designed for sub‑millisecond pauses; in this service the maximum observed GC pause fell from 340 ms to 8 ms, and latency spikes virtually disappeared.
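After changing GC flags it is useful to confirm which collector the JVM is actually running. One way to do that from inside the process is the standard management API, sketched below; the class name GcReport is hypothetical, and the bean names it prints vary by JVM version and collector.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class GcReport {
    /** Describes each active collector with its cumulative count and time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", totalTimeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Run with the same flags as the service, e.g. -Xms4g -Xmx4g -XX:+UseZGC
        describeCollectors().forEach(System.out::println);
    }
}
```

These counters are cumulative totals, so they show which collector is active and how busy it is, but not individual pause durations; for maximum pause figures like the ones quoted above, JFR or GC logging remains the appropriate source.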

Final metrics

P50 latency: 120 ms → 18 ms

P99 latency: 800 ms → 75 ms

Throughput: 1,200 req/s → 11,400 req/s

Allocation rate: ~2 GB/min → ~380 MB/min

Max GC pause: 340 ms → 8 ms
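The headline factors can be checked from these figures: 11,400 / 1,200 = 9.5× throughput, 800 / 75 ≈ 10.7× lower P99 latency, 120 / 18 ≈ 6.7× lower P50 latency, and 340 / 8 = 42.5× shorter maximum GC pause. A small sketch of that arithmetic, with hypothetical helper names:

```java
public class SpeedupCheck {
    /** Improvement factor for a metric where lower is better (latency, pause time). */
    static double reductionFactor(double before, double after) {
        return before / after;
    }

    /** Improvement factor for a metric where higher is better (throughput). */
    static double increaseFactor(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        System.out.println("P50 latency:  " + reductionFactor(120, 18));
        System.out.println("P99 latency:  " + reductionFactor(800, 75));
        System.out.println("Throughput:   " + increaseFactor(1_200, 11_400));
        System.out.println("Max GC pause: " + reductionFactor(340, 8));
    }
}
```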

Key observations

Performance work was driven by profiler data, not intuition.

Several large gains came from seemingly trivial places: logging, object allocation, caching, and GC strategy.

Upgrading to Java 21 and adopting virtual threads provided the single biggest benefit with minimal code changes.
