Diagnosing HTTP Timeout Issues in Spring Boot Microservices Caused by Caffeine Cache Eviction Lock Contention
This article analyzes a weekend outage of a Spring Boot microservice in which all HTTP requests timed out, traces the root cause to Caffeine's synchronous cache eviction lock being blocked by a long‑running compute operation, and proposes using AsyncCache with a dedicated thread pool to avoid similar contention.
Over a weekend, a single instance of a Spring Boot + Spring Cloud microservice that uses both MVC and reactive WebFlux stopped responding: every HTTP request timed out, Kubernetes health checks failed, and the instance was automatically restarted, while the other instances remained healthy.
Investigation Approach
JFR was enabled with disk=true, dumping events to /tmp/<process-start-time>.<pid>. The configuration capped the recording at maxsize=4096m and maxage=3d, and split it into chunks of at most maxchunksize=128m. These settings allowed continuous collection of all JFR data without exhausting disk space.
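The original post does not show the exact command used; a launch configuration matching the values above could look like the following (the jar name is a placeholder):

```shell
# Continuous JFR recording from JVM startup, bounded by size, age, and chunk size.
# -XX:StartFlightRecording takes disk/maxsize/maxage/filename;
# maxchunksize is set via -XX:FlightRecorderOptions.
java \
  -XX:StartFlightRecording=disk=true,maxsize=4096m,maxage=3d,filename=/tmp/recording.jfr \
  -XX:FlightRecorderOptions=maxchunksize=128m \
  -jar app.jar
```

The same recording can be started on a running process with `jcmd <pid> JFR.start disk=true maxsize=4096m maxage=3d`.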
Thread dumps revealed that most HTTP servlet threads were in a WAITING state inside Caffeine cache loading, indicating they were blocked on the cache lock.
The cache configuration (shown in the original diagrams) uses Caffeine with expiration policies, which internally relies on ConcurrentHashMap operations such as computeIfPresent and compute. These operations acquire a per-node (bin-level) lock before invoking the user-supplied computation, so the lock is held for as long as that computation runs.
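This bin-level blocking is easy to reproduce with plain JDK classes, without Caffeine at all. In the sketch below (all names are illustrative), a slow compute on a key holds the node's lock, so a second thread computing the same key goes into the BLOCKED state until the first finishes:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class NodeLockDemo {
    // Starts a slow compute on "key", then checks the state of a second
    // thread that tries to compute the same key while the first is running.
    public static Thread.State contendedState() throws InterruptedException {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        map.put("key", "initial");
        CountDownLatch started = new CountDownLatch(1);

        Thread slow = new Thread(() -> map.compute("key", (k, v) -> {
            started.countDown();
            try { Thread.sleep(2000); } catch (InterruptedException ignored) { }
            return "slow";
        }));
        slow.start();
        started.await();

        Thread fast = new Thread(() -> map.compute("key", (k, v) -> "fast"));
        fast.start();
        Thread.sleep(200); // give the second thread time to hit the node lock

        Thread.State state = fast.getState(); // BLOCKED on the bin's monitor
        slow.join();
        fast.join();
        return state;
    }
}
```

ConcurrentHashMap synchronizes on the bin's head node inside compute, so the second thread reports BLOCKED rather than WAITING, which matches the monitor-blocked events seen in JFR.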
Further JFR events (Java Monitor Blocked and Method Profiling Sample) identified that a thread from ForkJoinPool.commonPool() held the eviction lock while a long‑running computation for a specific key was in progress. The same common pool is also used by Caffeine's periodic cleanup task (maintenance), creating a lock‑contention scenario.
Root Cause – Caffeine Synchronous Cache Mechanism Defect
The expiration and periodic cleanup both invoke the maintenance method, which is protected by an EvictionLock . When the cleanup thread acquires this lock, it scans keys; if a key is currently being loaded by a slow compute operation, the cleanup thread blocks waiting for the same node lock. Consequently, all other threads that need the eviction lock are blocked, leading to the observed HTTP timeouts.
Caffeine's cache expiration and periodic cleanup share the same maintenance path.
The maintenance method is guarded by the EvictionLock.
If the cleanup thread acquires that lock and then scans a key whose loading thread holds the node lock, the cleanup thread blocks.
Other business threads, even those whose loads finish quickly, must still wait for the eviction lock.
A single slow loading thread (e.g., one taking ~1 minute) can therefore stall the entire cache.
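The dependency chain described above can be modeled with JDK primitives only. This is a simplified sketch, not Caffeine's actual code: a ReentrantLock stands in for the eviction lock, and a slow compute on a pre-existing key stands in for the slow load. The cleanup thread takes the eviction lock and then blocks on the node lock, so a business thread that only needs the eviction lock stalls too:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class EvictionStallDemo {
    public static boolean businessThreadStalls() throws InterruptedException {
        ConcurrentHashMap<String, String> map = new ConcurrentHashMap<>();
        map.put("hotKey", "old");
        ReentrantLock evictionLock = new ReentrantLock(); // stand-in for Caffeine's eviction lock
        CountDownLatch loading = new CountDownLatch(1);

        // 1. A slow loader holds the node lock for "hotKey".
        Thread slowLoader = new Thread(() -> map.compute("hotKey", (k, v) -> {
            loading.countDown();
            try { Thread.sleep(1500); } catch (InterruptedException ignored) { }
            return "loaded";
        }));
        slowLoader.start();
        loading.await();

        // 2. "Cleanup" takes the eviction lock, then blocks behind the slow loader.
        Thread cleanup = new Thread(() -> {
            evictionLock.lock();
            try {
                map.computeIfPresent("hotKey", (k, v) -> null); // blocks on the node lock
            } finally {
                evictionLock.unlock();
            }
        });
        cleanup.start();
        Thread.sleep(200);

        // 3. A business thread that only needs the eviction lock now stalls too.
        Thread business = new Thread(() -> { evictionLock.lock(); evictionLock.unlock(); });
        business.start();
        Thread.sleep(200);
        boolean stalled = business.getState() == Thread.State.WAITING; // parked on the lock

        slowLoader.join();
        cleanup.join();
        business.join();
        return stalled;
    }
}
```

The business thread never touches "hotKey", yet it is parked until the slow load completes, which is exactly the transitive stall observed in the thread dumps.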
Community Discussion
An issue was opened on the Caffeine GitHub repository (https://github.com/ben-manes/caffeine/issues/768). The maintainer confirmed this is a known problem that cannot be mitigated directly in the synchronous cache, and suggested using AsyncCache to decouple the computation from the map operation. The response also mentioned a warning log added in v3.0.6:
The cache is experiencing excessive wait times for acquiring the eviction lock. This may indicate that a long‑running computation has halted eviction when trying to remove the victim entry. Consider using AsyncCache to decouple the computation from the map operation.
Switching to Async Cache and Precautions
With AsyncCache, the underlying ConcurrentHashMap stores CompletableFuture values, so the map operation only installs the future and the expensive computation runs outside the node lock, preventing eviction‑lock contention. Two precautions are recommended: provide a dedicated thread pool for cache loading instead of the default ForkJoinPool.commonPool(), and, when integrating with WebFlux, switch back to the original scheduler after the cache call so that downstream reactive operators do not accidentally run blocking work on the loading pool.
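The decoupling can be shown with a minimal JDK-only model of what an async cache does (this is an illustrative sketch, not Caffeine's implementation; with Caffeine itself one would configure the pool via the builder's executor and call buildAsync). The map only stores the future; the slow load runs on a dedicated pool, so the bin lock is released immediately:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncMapSketch {
    private final ConcurrentHashMap<String, CompletableFuture<String>> cache =
            new ConcurrentHashMap<>();
    // Dedicated pool for loading, instead of ForkJoinPool.commonPool().
    private final ExecutorService loaderPool = Executors.newFixedThreadPool(4);

    public CompletableFuture<String> get(String key) {
        // computeIfAbsent only creates and inserts the future while holding the
        // node lock; the expensive load itself runs later on loaderPool.
        return cache.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> load(k), loaderPool));
    }

    private String load(String key) {
        try { Thread.sleep(1000); } catch (InterruptedException ignored) { } // simulate slow load
        return "value-for-" + key;
    }

    public void shutdown() {
        loaderPool.shutdown();
    }
}
```

Because the map operation returns as soon as the future is installed, a one-minute load no longer holds any ConcurrentHashMap lock, and cleanup of other entries can proceed concurrently.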