How Java Virtual Threads Cut Latency by 31× and Slash CPU Use in Production

This article explains how Java virtual threads work, compares them with traditional platform threads, and details RedJDK 21's implementation and performance gains in large-scale services at Xiaohongshu, including up to a 31-fold latency reduction and 24% CPU savings. It also covers migration challenges, lock handling, monitoring, and the future roadmap.

Xiaohongshu Tech REDtech

1. Java Virtual Thread Concept

Modern multi‑core processors can theoretically accelerate workloads by launching a thread per core, but in practice most Java threads spend most of their time blocked, leading to low CPU utilization and high context‑switch overhead.

1.1 Virtual Thread vs Platform Thread

Virtual threads were previewed in JDK 19 and became a final, production-ready feature in JDK 21. Unlike platform (OS) threads, a virtual thread is a lightweight object managed and scheduled by the JVM. Platform threads map one-to-one to kernel threads, consuming a kernel thread for each Java thread, while many virtual threads share a small pool of carrier threads.
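A minimal sketch of the two creation paths (the `Thread.ofVirtual()` / `Thread.ofPlatform()` builders and `Executors.newVirtualThreadPerTaskExecutor()` are standard JDK 21 API; the class name is illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadKinds {
    public static boolean runBoth() throws InterruptedException {
        // Platform thread: backed 1:1 by a kernel thread.
        Thread platform = Thread.ofPlatform().name("platform-1").start(() -> {});
        // Virtual thread: a lightweight JVM object mounted on a carrier thread.
        Thread virtual = Thread.ofVirtual().name("virtual-1").start(() -> {});
        platform.join();
        virtual.join();
        // Executor that spawns one new virtual thread per submitted task.
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            exec.submit(() -> {});
        }
        return virtual.isVirtual() && !platform.isVirtual();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runBoth()); // prints "true"
    }
}
```

Because virtual threads are cheap, the idiomatic pattern is one virtual thread per task rather than pooling them.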

1.2 JVM Implementation of Virtual Threads

The JVM must solve three core problems to support virtual threads:

Correctly manage virtual‑thread state transitions.

Provide a scheduler that can mount and unmount virtual threads, preserving their context with minimal overhead.

Keep the mental model familiar to developers so existing code can be run with virtual threads without major rewrites.

1.3 Scheduling and Blocking Management

When a virtual thread blocks (e.g., on I/O, a lock, or JNI), the JVM unmounts it from its carrier thread and mounts another ready virtual thread. This is analogous to a restaurant where chefs (carrier threads) are not tied to a specific dish (virtual thread): while one dish is left to simmer, the chef immediately starts cooking another order.

The scheduler is a dedicated ForkJoinPool (Virtual FJP) whose workers correspond to carrier threads. Unlike the traditional FJP, the Virtual FJP uses a FIFO order and unmounts blocked virtual threads instead of suspending the worker.
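The carrier-sharing behavior is observable directly: in JDK 21, the `toString()` of a mounted virtual thread names its carrier worker. A sketch (the carrier-name parsing relies on the `toString()` format, which is not a stable API, so treat it as diagnostic only):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CarrierDemo {
    public static int distinctCarriers(int tasks) {
        Set<String> carriers = ConcurrentHashMap.newKeySet();
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < tasks; i++) {
                exec.submit(() -> {
                    // A mounted virtual thread prints as, e.g.,
                    // "VirtualThread[#21]/runnable@ForkJoinPool-1-worker-3";
                    // the part after '@' is the carrier (Virtual FJP worker).
                    String s = Thread.currentThread().toString();
                    int at = s.lastIndexOf('@');
                    if (at >= 0) carriers.add(s.substring(at + 1));
                });
            }
        } // close() waits for all submitted tasks to finish
        return carriers.size();
    }

    public static void main(String[] args) {
        // Thousands of virtual threads run on roughly #CPU carrier threads; the
        // pool size can be tuned with -Djdk.virtualThreadScheduler.parallelism=N.
        System.out.println(distinctCarriers(5_000));
    }
}
```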

1.4 Performance Evaluation

Using a PingPong benchmark bound to a single CPU, virtual threads on JDK 21 achieved roughly a 31× speed-up over platform threads on JDK 11. Tail latency (p90/p99) of virtual-thread workloads was dramatically lower, and throughput scaled up to five times higher as concurrency increased.

Memory consumption was also reduced: creating 10 000 platform threads consumed about 349 MB of native memory, whereas 10 000 virtual threads used only ~8.5 MB of native memory plus ~43 MB of Java heap.
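The footprint difference is easy to reproduce: 10,000 parked virtual threads start in milliseconds inside a default-sized heap, whereas 10,000 platform threads would each reserve a full OS stack. A sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class ManyThreads {
    public static int parkMany(int n) throws InterruptedException {
        CountDownLatch release = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // Each virtual thread parks until released; its stack lives as a
            // small heap object, not a reserved native stack.
            threads.add(Thread.ofVirtual().start(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        release.countDown();
        for (Thread t : threads) t.join();
        return threads.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parkMany(10_000)); // prints "10000"
    }
}
```

Swapping `Thread.ofVirtual()` for `Thread.ofPlatform()` in the same sketch is how the native-memory comparison above can be measured.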

2. Practical Adoption at Xiaohongshu

RedJDK 21 was adapted to address real‑world constraints:

Modified the handling of `synchronized` blocks and socket I/O so virtual threads can unmount while blocked.

Added an automatic compensation mechanism that detects when carrier threads are blocked (e.g., in JNI) and spawns extra workers to avoid deadlock.

Extended monitoring tools to capture virtual‑thread stack traces, lock information, and CPU time, aligning them with existing JStack‑based diagnostics.
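On a stock JDK 21 (without RedJDK's modified `synchronized` handling), blocking inside a monitor pins the carrier thread, which is exactly the situation the compensation and monitoring work above targets. A minimal reproduction; the `-Djdk.tracePinnedThreads=full` flag is standard JDK 21:

```java
public class PinningDemo {
    private static final Object LOCK = new Object();

    public static boolean blockWhilePinned() throws InterruptedException {
        Thread vt = Thread.ofVirtual().start(() -> {
            synchronized (LOCK) {      // on stock JDK 21, entering a monitor pins
                try {
                    Thread.sleep(10);  // blocking while pinned: the carrier
                                       // thread cannot be released to other
                                       // virtual threads for these 10 ms
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        vt.join();
        return !vt.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        // Run with -Djdk.tracePinnedThreads=full to print a stack trace each
        // time a virtual thread blocks while pinned.
        System.out.println(blockWhilePinned()); // prints "true"
    }
}
```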

2.1 Lock Compatibility

Traditional lightweight locks store a pointer to a stack‑allocated lock record, which breaks when a virtual thread’s stack is migrated. RedJDK 21 introduced a new lightweight lock (Lightweight_Lock) that stores lock state in a LockStack associated with the virtual thread, making it compatible with stack migration. Heavyweight locks now use a global monitor table, freeing space in the object header.

JEP 491 (targeted for JDK 24) will re-implement `synchronized` so it behaves like java.util.concurrent locks; RedJDK 21 back-ports this behavior so virtual threads can already be used safely on JDK 21.
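Until that lands, the portable workaround on a stock JDK 21 is to replace `synchronized` with a `java.util.concurrent` lock, which a virtual thread can block on without pinning its carrier. A sketch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantLock;

public class LockMigration {
    private final ReentrantLock lock = new ReentrantLock();
    private int counter;

    // Equivalent of a synchronized method, but a virtual thread blocked on
    // lock() unmounts from its carrier instead of pinning it.
    public void increment() {
        lock.lock();
        try {
            counter++;
        } finally {
            lock.unlock();
        }
    }

    public int get() {
        lock.lock();
        try {
            return counter;
        } finally {
            lock.unlock();
        }
    }

    public static int runDemo(int n) throws InterruptedException {
        LockMigration m = new LockMigration();
        try (ExecutorService exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) exec.submit(m::increment);
        } // close() waits for all increments to complete
        return m.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo(1_000)); // prints "1000"
    }
}
```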

2.2 Monitoring and Diagnostics

RedJDK 21 adds:

A full‑stack analysis tool that reports stack frames, lock state, mount status, and CPU time for virtual threads.

Support for ThreadMXBean operations on virtual threads.

Integration with existing thread‑pool monitoring, allowing virtual threads to be observed through the same dashboards.

Metrics for the virtual‑thread ForkJoinPool scheduler (dead‑lock detection, load, scheduling overhead).
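On a stock JDK, one ready-made building block for such diagnostics is JFR's `jdk.VirtualThreadPinned` event, which records when a virtual thread blocks while pinned. A sketch of capturing it programmatically (whether events are emitted depends on JDK version and thresholds, so treat this as illustrative):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordingFile;

public class PinnedJfr {
    public static long recordPinnedEvents() throws Exception {
        Path dump = Files.createTempFile("vt", ".jfr");
        try (Recording rec = new Recording()) {
            // Record every pinned-while-blocked occurrence, however short.
            rec.enable("jdk.VirtualThreadPinned").withThreshold(Duration.ZERO);
            rec.start();
            Object lock = new Object();
            Thread vt = Thread.ofVirtual().start(() -> {
                synchronized (lock) { // pins the carrier on a stock JDK 21
                    try {
                        Thread.sleep(20);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            vt.join();
            rec.stop();
            rec.dump(dump);
        }
        // Count how many pinned events the recording captured.
        return RecordingFile.readAllEvents(dump).stream()
                .filter(e -> e.getEventType().getName()
                        .equals("jdk.VirtualThreadPinned"))
                .count();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(recordPinnedEvents());
    }
}
```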

2.3 Deployment Experience

Upgrading from RedJDK 11 to RedJDK 21 required only swapping the container image, changing a single JVM parameter, and a file copy. In production services (search, recommendation, advertising) the migration yielded:

~10% reduction in P90 response time.

~24% average CPU reduction.

OS thread count drop from ~5 000 to ~300, saving ~3 GB of memory.

Cache‑miss reduction of 30% and IPC increase of 13%.

In live‑stream recommendation workloads, switching from G1GC to the generational ZGC reduced P99 latency by ~200 ms.

3. Future Roadmap (Virtual Thread 2.0)

Planned improvements focus on stability, observability, and flexibility:

Better handling of blocking risks such as pinned virtual threads in class‑loading or JNI.

Mitigation of class‑loading deadlocks by introducing an object‑locker mechanism.

Support for custom virtual‑thread schedulers so users can provide their own ForkJoinPool configuration.

Enhanced monitoring: VT trace timing, blocking analysis, carrier‑thread load balancing, and work‑stealing diagnostics.

These efforts aim to make virtual threads a robust, production‑grade concurrency primitive for large‑scale Java services.

Tags: Java, JVM, performance optimization, concurrency, Virtual Threads, RedJDK21