Why 500 RPS Dropped to 50 RPS: Tracing Spring Bean Creation and Thread‑Pool Bottlenecks

A ToB (business‑facing) Java service that was expected to handle 500 requests per second managed only 50 RPS under load. This case study follows the investigation step by step through CPU usage, lock contention, slow SQL, excessive logging, prototype‑scoped beans, and thread‑pool configuration, and shows how bean creation and async execution affect throughput.

Architect

Background

The system was a low‑traffic ToB service that had never been load‑tested. A new client demanded a minimum of 500 requests/s per node. With 100 Tomcat threads, that target requires each request to finish within 200 ms (100 threads / 0.2 s = 500 requests/s), which seemed trivial because most responses took ~100 ms in normal operation.

When a load test with 100 concurrent users was run, the observed throughput was only 50 RPS and CPU usage hovered around 80 %. A performance chart showed a minimum latency under 100 ms, a maximum of 5 s, and a 95th‑percentile around 4 s.
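The arithmetic behind the target and the observed result can be sketched as follows. This is a back‑of‑the‑envelope model, assuming one request per worker thread at a time; the class and method names are illustrative and not from the original system.

```java
// Capacity math for a thread-per-request server.
// The numbers come from the article; the names are hypothetical.
class CapacityMath {

    // Upper bound on throughput when each worker thread serves one request at a time.
    static double maxRps(int workerThreads, long latencyMillis) {
        return workerThreads * 1000.0 / latencyMillis;
    }

    // Little's law rearranged: average latency = concurrency / throughput.
    static double averageLatencySeconds(int concurrentRequests, double observedRps) {
        return concurrentRequests / observedRps;
    }

    public static void main(String[] args) {
        // Target: 100 Tomcat threads, 200 ms per request.
        System.out.println(maxRps(100, 200));                  // 500.0
        // Observed: 100 concurrent users at 50 RPS.
        System.out.println(averageLatencySeconds(100, 50.0));  // 2.0
    }
}
```

An average of about 2 s per request is consistent with the measured spread (under 100 ms at best, 5 s at worst), and it is roughly ten times the 200 ms budget.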

Initial Analysis

The team initially set aside the high CPU reading and focused on latency. Two classes of potential blockers were identified:

Locks (synchronization, distributed, database)

Time‑consuming operations (network calls, SQL)

Latency‑monitoring hooks were added to log warnings when:

Interface response > 500 ms

Internal remote call > 200 ms

Redis access > 10 ms

SQL execution > 100 ms
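A threshold‑based hook of the kind listed above might look like the sketch below. It is a minimal illustration, not the team's actual instrumentation: the LatencyMonitor class, the timed helper, and the in‑memory warning list are assumptions made to keep the example self‑contained, and real code would write to the application's logger instead.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

// Hypothetical helper: wraps a call and records a warning when it exceeds a threshold.
class LatencyMonitor {
    // Thresholds taken from the article, in milliseconds.
    static final long INTERFACE_MS = 500;
    static final long REMOTE_CALL_MS = 200;
    static final long REDIS_MS = 10;
    static final long SQL_MS = 100;

    // Collected in memory for the sketch; a real service would log these.
    static final List<String> WARNINGS = new CopyOnWriteArrayList<>();

    static <T> T timed(String label, long thresholdMillis, Supplier<T> call) {
        long start = System.nanoTime();
        try {
            return call.get();
        } finally {
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            if (elapsedMillis > thresholdMillis) {
                WARNINGS.add(label + " took " + elapsedMillis
                        + " ms (threshold " + thresholdMillis + " ms)");
            }
        }
    }

    public static void main(String[] args) {
        // The lambda stands in for a real SQL call.
        String row = timed("sql:updateCounter", SQL_MS, () -> "1 row updated");
        System.out.println(row);
        WARNINGS.forEach(System.out::println);
    }
}
```

Recording only threshold violations keeps the log volume low, so the instrumentation itself does not become a source of the logging overhead discussed later.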

Log inspection revealed a slow SQL statement: concurrent requests updated the same row, and the resulting lock waits accounted for more than 80 % of the request latency. The SQL was changed to asynchronous execution, reducing the maximum latency from 5 s to 2 s and the 95th‑percentile from 4 s to 1 s, roughly doubling throughput.
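One way to take such an update off the request path is sketched below. The article does not say how the asynchronous execution was implemented, so everything here is an assumption: the single‑threaded executor, the AsyncRowUpdate class, and the AtomicInteger that stands in for the contended database row.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: request threads hand the row update to a background worker.
class AsyncRowUpdate {
    // Stand-in for the contended row; real code would issue the UPDATE via JDBC or an ORM.
    static final AtomicInteger counterRow = new AtomicInteger();

    // Assumption: one worker, so updates to the row run one at a time
    // instead of competing for the row lock from many request threads.
    static final ExecutorService updater = Executors.newSingleThreadExecutor();

    static String handleRequest() {
        updater.execute(() -> counterRow.incrementAndGet()); // queued, not awaited
        return "ok"; // the response no longer waits for the update
    }

    // Stops accepting work and waits briefly for queued updates to finish.
    static void shutdownAndWait() {
        updater.shutdown();
        try {
            updater.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            handleRequest();
        }
        shutdownAndWait();
        System.out.println(counterRow.get());
    }
}
```

The trade‑off is that the caller no longer learns whether the update succeeded, and the executor's queue here is unbounded, so a production version would need error handling and back‑pressure.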

Further Investigation of the Remaining Bottleneck

Even after fixing the slow SQL, the throughput was still far from the target. Additional log analysis showed frequent thread context switches, massive log output (≈500 MB in five minutes), and occasional stop‑the‑world pauses. The following actions were taken:

Raised the log level to DEBUG to cut log volume, which yielded only a modest 10 % improvement.

Re‑configured @Async thread pools: reduced the total core thread count to 50 and limited the maximum pool size, which added another ~50 RPS.

Increased the JVM heap from 512 MB to 4 GB. GC frequency dropped from 4 to 2 collections per second, but throughput did not improve further.
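The thread‑pool change in the list above can be illustrated with a bounded pool. The original service configured Spring @Async executors; this sketch uses the plain java.util.concurrent.ThreadPoolExecutor to stay self‑contained. The pool sizes, queue capacity, and rejection policy are placeholder assumptions, not the team's values.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical factory for a pool with hard limits on threads and queued work.
class BoundedAsyncPool {

    static ThreadPoolExecutor create(int coreThreads, int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                coreThreads,
                maxThreads,
                60L, TimeUnit.SECONDS,                      // idle time before extra threads exit
                new ArrayBlockingQueue<>(queueCapacity),    // bounded queue: no unbounded backlog
                new ThreadPoolExecutor.CallerRunsPolicy()); // when saturated, the submitter runs the task
    }

    public static void main(String[] args) {
        // Placeholder sizes; the article only states a total of 50 core threads across pools.
        ThreadPoolExecutor pool = create(10, 20, 200);
        pool.execute(() -> System.out.println("async task on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```

With a bounded queue, extra threads beyond the core size are created only once the queue is full, so the maximum pool size acts as a ceiling under bursts rather than the normal operating point.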

These steps showed diminishing returns, indicating that the root cause lay elsewhere.

Root Cause of High CPU Usage

CPU usage remained high despite fewer threads. Thread‑level monitoring showed no single thread exceeding 10 % CPU, suggesting that many lightweight threads collectively consumed CPU cycles. Stack traces revealed a surprising pattern: each request repeatedly called BeanUtils.getBean(RedisMaster.class), which in turn invoked createBean on a prototype‑scoped Redis bean. The bean was prototype‑scoped because the Redis client (Jedis) is not thread‑safe, so every lookup created a new instance. The result was ~200 bean creations per request and extensive lock contention inside Spring’s bean factory.
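Per‑thread CPU figures like those described above can be read from inside the JVM through the standard ThreadMXBean. The sketch below is one possible approach and not necessarily the tool the team used; the ThreadCpuSnapshot class is illustrative. It reports cumulative CPU time per thread, so a percentage requires two snapshots and the elapsed wall‑clock time between them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: cumulative CPU time per live thread, in milliseconds.
class ThreadCpuSnapshot {

    static Map<String, Long> cpuMillisByThread() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<String, Long> result = new LinkedHashMap<>();
        // Not every JVM supports or enables per-thread CPU timing.
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return result;
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread has already exited
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread is not alive
            if (info == null || cpuNanos < 0) {
                continue;
            }
            result.put(id + ":" + info.getThreadName(), cpuNanos / 1_000_000);
        }
        return result;
    }

    public static void main(String[] args) {
        cpuMillisByThread().forEach((name, ms) -> System.out.println(name + " -> " + ms + " ms"));
    }
}
```

When no single thread stands out in such a snapshot, as in this case, stack traces sampled across many threads are usually more informative than the per‑thread totals.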

The fix was to replace the prototype bean lookup with a direct new Redis(...) construction, eliminating the repeated createBean overhead.

Additional Observations

The team also noted that the widespread use of System.currentTimeMillis() and Hutool’s StopWatch (which relies on System.nanoTime()) for timing adds measurable overhead in high‑concurrency scenarios.

Final Results

After applying the async SQL change, thread‑pool tuning, JVM memory increase, and prototype‑bean replacement, the maximum latency dropped from 5 s to 2 s, the 95th‑percentile from 4 s to 1 s, and overall throughput roughly doubled, though still short of the 500 RPS goal.

The experience highlighted three key factors that influence throughput and CPU usage:

Interface response time – shorter responses directly increase possible RPS.

Number of worker threads – more threads raise concurrency but are limited by CPU cores.

Code efficiency – lock contention, blocking I/O, and heavyweight framework operations (e.g., Spring bean creation) can dominate CPU time.

Open Questions & Future Work

Why does prototype‑scoped bean creation impose such a heavy performance penalty, and should it be avoided in high‑throughput services? Further study of JVM tuning, lock‑free data structures, and more precise profiling is needed. Overall, the case study demonstrates a systematic approach to performance debugging: start with observable metrics, instrument critical paths, isolate the most expensive operations, and iteratively apply targeted optimizations.

Figures in the original: performance chart; post‑optimization chart; thread activity snapshot; Spring bean creation stack trace; createBean source code.