Why ThreadLocal Leaks Cause Full GC in Thread Pools and How to Fix It
An online service experienced frequent Full GC due to a ThreadLocal memory leak in a thread‑pool scenario; the article walks through GC log analysis, heap dump inspection, root‑cause discovery in ThreadFactory logic, and presents a robust fix using Spring’s TaskDecorator to ensure proper cleanup.
1. Background
During a routine production‑environment inspection, the GC metrics of the bill service showed a sharp increase in Full GC frequency that did not converge, indicating a potential memory issue.
GC Log Analysis
Using GCeasy, the Old Generation memory usage was observed to climb steadily from ~300 MB to nearly 500 MB, with before‑GC and after‑GC curves almost overlapping, meaning the GC was not reclaiming memory. This pattern is a classic sign of a memory leak: after a Full GC the Old Generation should shrink noticeably, but it stayed flat, implying many objects were still strongly referenced.
The upward trend in the Old Generation confirms that leaked objects accumulate there because they survive repeated Minor GCs and are eventually promoted.
2. Heap Dump Analysis with MAT
A heap dump (.hprof) was taken from the production environment and opened in Eclipse Memory Analyzer (MAT). The Dominator Tree revealed that several thread objects each held a large number of PriceVO$SkuPriceVO instances, all attached to a ThreadLocal.
Expanded reference chain for one thread:
java.lang.Thread
└── threadLocals: java.lang.ThreadLocal$ThreadLocalMap
    └── table: java.lang.ThreadLocal$ThreadLocalMap$Entry[]
        └── [n]: Entry
            └── value: java.util.concurrent.ConcurrentHashMap
                ├── key: String (cacheKey)
                └── value: List<PriceVO.SkuPriceVO> // massive leak objects

The problem became clear: objects cached in ThreadLocal were never cleared, and as requests kept coming, the cache grew until it exhausted the Old Generation, triggering repeated Full GCs.
Why ThreadLocal Leaks Occur
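Each Thread holds a ThreadLocalMap, a hash table keyed by a ThreadLocal instance (weak reference) and valued by the actual stored object (strong reference). In a simple request‑per‑thread model, the thread dies and the entire map is reclaimed. In a thread pool, however, threads are reused; the map lives as long as the thread does, so any value that is not explicitly removed stays strongly reachable through it. The weak key does not help here: a ThreadLocal is usually held in a static field and never becomes unreachable, and even when a key is collected, the stale value is released only when the map later expunges that entry.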
If code writes data into ThreadLocal for each task but never clears it, the values accumulate in the thread’s map, eventually exhausting the Old Generation and causing a Full GC loop.
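The accumulation can be reproduced in a few lines. The following is a minimal sketch, not the service's code: CACHE and the class name are hypothetical stand-ins, and a single-thread pool is used so that every task is guaranteed to land on the same reused thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ThreadLocalAccumulationDemo {
    // Hypothetical per-thread cache, standing in for the service's real ThreadLocal.
    private static final ThreadLocal<List<String>> CACHE = ThreadLocal.withInitial(ArrayList::new);

    /** Runs the given number of tasks on one pooled thread; returns the cache size seen by the last task. */
    static int sizeSeenByLastTask(int tasks, boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            int last = 0;
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                last = pool.submit(() -> {
                    try {
                        CACHE.get().add("result-" + id); // each task writes one entry
                        return CACHE.get().size();
                    } finally {
                        if (cleanUp) {
                            CACHE.remove(); // drop this thread's value before the thread is reused
                        }
                    }
                }).get();
            }
            return last;
        } finally {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without cleanup: " + sizeSeenByLastTask(5, false)); // 5
        System.out.println("with cleanup:    " + sizeSeenByLastTask(5, true));  // 1
    }
}
```

Without the remove() call, the fifth task sees everything the previous four tasks left behind on the thread; with it, every task starts from an empty cache.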
3. Code Review: Finding the Write Point and the Ineffective Cleanup
The offending write occurs in a price‑filter method:
public List<PriceVO.SkuPriceVO> listFilterNoCost(Long customerId, Long warehouseId, Long corpId, List<...> skuUnitList) {
    // dto: request object whose construction is not shown in the source
    // Try to get from ThreadLocal cache first
    String cacheKey = cacheHashCode(dto);
    Collection<PriceVO.SkuPriceVO> avgPriceList = ScopeCacheUtil.get(cacheKey);
    if (CollectionUtils.isEmpty(avgPriceList)) {
        avgPriceList = listSkuAvgPrice(dto);
        ScopeCacheUtil.put(cacheKey, avgPriceList); // write to ThreadLocal
    }
    // ... filtering logic
}

The intention is to cache price results per request, but the cache must be cleared after the task finishes.
The thread‑pool configuration includes a cleanup call in a custom ThreadFactory:
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setThreadFactory(r -> {
        Thread thread = new Thread(new ContextRelatedRunnable() {
            @Override
            public void doRun() {
                r.run(); // r is the pool's Worker loop, not a business task
                ScopeCacheUtil.clearContext(); // intended to clear ThreadLocal
            }
        });
        // ...
        return thread;
    });
    return executor;
}

Debugging revealed that clearContext() was not running after each task, because the r passed to the factory is not the business Runnable but a ThreadPoolExecutor$Worker, whose run() is the pool’s internal task loop. r.run() returns only when the worker exits, so for the lifetime of the thread the cleanup line is effectively dead code.
4. Root Cause: Misunderstanding ThreadFactory vs. Task Execution
The ThreadFactory creates the thread itself; the Runnable it receives is the Worker, which is the thread’s own run loop, not the submitted task. The worker’s run() method calls runWorker(this), which keeps pulling tasks from the queue until the worker exits. Code placed after r.run() in the factory’s doRun() therefore runs at most once, when the thread is about to terminate, rather than after each task.
Thus the cleanup logic was attached to the wrong lifecycle hook.
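This behaviour is easy to confirm with the plain JDK executor. The sketch below is illustrative only (the class and counter names are made up): it wraps the factory's Runnable the same way the original configuration did and counts how often the "cleanup" line is reached.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class FactoryCleanupDemo {
    /** Returns {cleanups counted while the pool was alive, cleanups counted after shutdown}. */
    static int[] countCleanups(int tasks) throws Exception {
        AtomicInteger cleanups = new AtomicInteger();
        AtomicReference<Thread> worker = new AtomicReference<>();
        ExecutorService pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>(),
                r -> {
                    Thread t = new Thread(() -> {
                        r.run();                    // r is the Worker: this call is the whole task loop
                        cleanups.incrementAndGet(); // reached only when the worker exits
                    });
                    worker.set(t);
                    return t;
                });
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> { }).get(); // wait until each task has finished
        }
        int whileAlive = cleanups.get();
        pool.shutdown();
        Thread t = worker.get();
        if (t != null) {
            t.join(5_000); // wait for the code after r.run() to finish
        }
        return new int[] {whileAlive, cleanups.get()};
    }

    public static void main(String[] args) throws Exception {
        int[] counts = countCleanups(3);
        System.out.println("cleanups while pool alive: " + counts[0]); // 0
        System.out.println("cleanups after shutdown:   " + counts[1]); // 1
    }
}
```

Three tasks complete and the counter is still zero; it becomes one only after the pool is shut down and the single worker thread exits.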
5. Solution: Use Spring’s TaskDecorator
Instead of trying to clean up in the thread‑creation phase, decorate each submitted task so that cleanup runs after the task finishes, regardless of success or exception.
public ThreadPoolTaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    // ... basic pool config
    executor.setTaskDecorator(runnable -> {
        return () -> {
            try {
                runnable.run();
            } finally {
                ScopeCacheUtil.clearContext(); // always clean ThreadLocal
            }
        };
    });
    return executor;
}

The decorator wraps the original Runnable in a new lambda that guarantees clearContext() runs in a finally block. After applying this change, debugging shows the runtime type of the wrapped task is CompletableFuture$AsyncRun, the actual business task, and the cleanup executes correctly, stopping the memory leak.
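For pools that are not managed by Spring, a comparable per-task hook exists in the JDK itself: ThreadPoolExecutor.afterExecute, which the worker thread calls after every task. This is a different mechanism from the TaskDecorator used in the fix, shown here only as a sketch; CACHE and CleaningPool are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class AfterExecuteCleanupDemo {
    // Hypothetical per-thread cache standing in for ScopeCacheUtil's storage.
    static final ThreadLocal<List<String>> CACHE = ThreadLocal.withInitial(ArrayList::new);

    /** A pool that clears the cache after every task, whether the task succeeded or threw. */
    static final class CleaningPool extends ThreadPoolExecutor {
        CleaningPool(int threads) {
            super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        }

        @Override
        protected void afterExecute(Runnable task, Throwable thrown) {
            super.afterExecute(task, thrown);
            CACHE.remove(); // runs on the worker thread that just executed the task
        }
    }

    /** Runs a failing task, then a normal one; returns the cache size the second task sees. */
    static int sizeSeenAfterFailedTask() throws Exception {
        CleaningPool pool = new CleaningPool(1);
        try {
            Runnable failing = () -> {
                CACHE.get().add("stale");
                throw new IllegalStateException("simulated business failure");
            };
            try {
                pool.submit(failing).get();
            } catch (ExecutionException expected) {
                // the failure is reported to the caller; the hook still ran on the worker
            }
            return pool.submit(() -> {
                CACHE.get().add("fresh");
                return CACHE.get().size();
            }).get();
        } finally {
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("size after failed task: " + sizeSeenAfterFailedTask()); // 1
    }
}
```

The second task sees only its own entry even though the first task threw before it could clean up after itself.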
6. Extended Thoughts
6.1 ThreadFactory vs. TaskDecorator: Responsibility Boundary
ThreadFactory: invoked when a thread is created; scope is thread‑level; suitable for setting thread name, priority, UncaughtExceptionHandler, etc.
TaskDecorator: decorate() is invoked when a task is submitted, on the submitting thread, and the wrapper it returns runs on the worker thread around that one task; scope is task‑level; suitable for MDC propagation, ThreadLocal cleanup, tracing context, etc.
In short, use ThreadFactory for thread‑wide settings and TaskDecorator for per‑task hooks.
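The two phases of a task-level hook can be sketched without Spring. In the example below a UnaryOperator<Runnable> plays the role of the decorator; TRACE_ID and PROPAGATE are hypothetical names, and the trace id stands in for whatever context (MDC, tenant, tracing) needs to travel with the task.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

final class DecoratorPhasesDemo {
    // Hypothetical request-scoped context, for example a trace id.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Decorator shape: called on the submitting thread, returns what the worker will run. */
    static final UnaryOperator<Runnable> PROPAGATE = task -> {
        String captured = TRACE_ID.get();   // phase 1: submitting thread
        return () -> {
            TRACE_ID.set(captured);         // phase 2: worker thread, before the task
            try {
                task.run();
            } finally {
                TRACE_ID.remove();          // phase 2: worker thread, after the task
            }
        };
    };

    /** Returns {trace id seen inside the decorated task, trace id left on the worker afterwards}. */
    static String[] run(String traceId) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set(traceId);
            String[] seen = new String[2];
            Runnable business = () -> seen[0] = TRACE_ID.get();
            Runnable probe = () -> seen[1] = TRACE_ID.get();
            pool.submit(PROPAGATE.apply(business)).get(); // decorated task
            pool.submit(probe).get();                     // undecorated task on the same worker
            return seen;
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
            pool.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        String[] seen = run("trace-42");
        System.out.println("inside task:   " + seen[0]); // trace-42
        System.out.println("left on worker: " + seen[1]); // null
    }
}
```

The decorated task sees the caller's trace id, and the probe that runs next on the same worker finds nothing left behind.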
6.2 Best Practices for ThreadLocal in Thread Pools
Always clean up in a finally block by calling remove() on the ThreadLocal, so the cleanup also runs when the task throws.
Prefer a TaskDecorator as a safety net even if the business code attempts cleanup.
Consider alternatives such as passing data via method parameters or using TransmittableThreadLocal (Alibaba open‑source) for cross‑thread‑pool propagation.
Monitor GC logs and heap trends regularly to catch leaks early.
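One way to make the finally-block rule hard to forget in business code is to tie the ThreadLocal's lifetime to a try-with-resources scope. The sketch below is one possible design, not the article's ScopeCacheUtil: every name in it is hypothetical, and rejecting writes outside an open scope is an added design choice.

```java
import java.util.HashMap;
import java.util.Map;

final class ScopedCacheDemo {
    private static final ThreadLocal<Map<String, Object>> CACHE = new ThreadLocal<>();

    /** Opening a scope installs a fresh cache; closing it removes the ThreadLocal entry. */
    static final class Scope implements AutoCloseable {
        private Scope() {
            CACHE.set(new HashMap<>());
        }

        @Override
        public void close() {
            CACHE.remove();
        }
    }

    static Scope open() {
        return new Scope();
    }

    /** Writes are rejected unless a scope is open, so nothing can be cached without a matching cleanup. */
    static void put(String key, Object value) {
        Map<String, Object> map = CACHE.get();
        if (map == null) {
            throw new IllegalStateException("no cache scope is open on this thread");
        }
        map.put(key, value);
    }

    static boolean isActive() {
        return CACHE.get() != null;
    }

    public static void main(String[] args) {
        try (Scope scope = open()) {
            put("k", 42);
            System.out.println("inside scope: " + isActive()); // true
        }
        System.out.println("after scope:  " + isActive());     // false
    }
}
```

try-with-resources compiles down to the same finally-based cleanup, but the compiler now enforces the pairing of open and close.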
6.3 General Steps for Diagnosing Memory Leaks
1. Detect anomaly
   └── Monitoring alert / inspection shows frequent Full GC
2. Confirm leak
   ├── Analyze GC logs (GCeasy / GCViewer)
   └── Observe Old Generation trend: does memory drop after GC?
3. Locate leaking objects
   ├── jmap -dump to export heap snapshot
   ├── Analyze with MAT: Leak Suspects / Dominator Tree / Histogram
   └── Identify largest objects and their GC‑root reference chains
4. Code review
   ├── Trace reference chain back to write point
   └── Verify cleanup logic exists and actually runs
5. Fix & verify
   ├── Reproduce locally, debug
   └── After fix, monitor memory trend for recovery

7. Summary
The root cause was a ThreadLocal that was never cleared because the cleanup code was placed in a ThreadFactory where it never executed. By moving the cleanup to a Spring TaskDecorator, each task now reliably clears its ThreadLocal, eliminating the memory leak and the associated Full GC storms.
If you also combine thread pools with ThreadLocal, double‑check whether your cleanup resides in the ThreadFactory or the TaskDecorator; otherwise you may end up with cleanup code that never runs.
WeiLi Technology Team
Practicing data-driven principles and believing technology can change the world.
