
How a Bulk Thread Log4j Bug Causes 20GB Memory Leaks in Elasticsearch 5.x

A production Elasticsearch 5.3.2 cluster showed a persistent 80% heap usage due to a hidden memory leak caused by Log4j thread‑local objects retaining large bulk request data, and the article walks through the investigation, heap‑dump analysis, source‑code tracing, and a practical fix using a JVM flag.

360 Zhihui Cloud Developer

Background Introduction

During routine monitoring of an online Elasticsearch cluster, two data nodes reported heap usage continuously above the 80% warning threshold while the overall query load was low and old GC cycles could not reclaim memory.

Problem Investigation

The affected nodes were allocated a 30 GB heap, so 80% usage equated to roughly 24 GB. Yet the total index size across five nodes was under 10 GB, and segment memory and cache usage were only in megabytes. QPS was around 30, CPU under 10%, and thread‑pool activity was minimal.

The logs showed bulk update failures with DocumentMissingException, but no obvious resource-related errors.

org.elasticsearch.index.engine.DocumentMissingException: [type][纳格尔果德_1198]: document missing

These exceptions were deemed harmless for business logic, so the investigation moved to a JVM heap dump.

Heap Dump Analysis

The Eclipse Memory Analyzer (MAT) was used. After the binary heap dump was captured with jmap -dump:format=b,file=/tmp/es_heap.bin <pid>, it was opened in MAT on a machine with sufficient RAM.

MAT required a one‑time indexing step that consumed significant CPU and memory; the analysis was performed on a server with a 20 GB heap setting in MemoryAnalyzer.ini .

mat/ParseHeapDump.sh es_heap.bin org.eclipse.mat.api:suspects
mat/ParseHeapDump.sh es_heap.bin org.eclipse.mat.api:overview
mat/ParseHeapDump.sh es_heap.bin org.eclipse.mat.api:top_components

The generated reports revealed that over 20 GB of retained heap was held by a group of bulk thread instances, each retaining about 1.5 GB.

In the "dominator_tree" view sorted by "Retained Heap", multiple bulk threads showed extremely high memory usage.

Expanding a thread's reference chain highlighted a MutableLogEvent holding a ParameterizedMessage, which in turn referenced the large BulkShardRequest object.

Problem Reproduction

A single-node 5.3.2 test cluster was started, and bulk update requests were sent that deliberately targeted a missing doc_id. Setting processors: 1 reduced the bulk thread pool to a single thread, making the leak easier to observe.
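A bulk update that references a nonexistent document id is enough to trigger the failure path; the sketch below uses an illustrative index, type, and field name (any will do, as long as the _id does not exist):

```
POST /test/_bulk
{ "update": { "_index": "test", "_type": "type", "_id": "no_such_doc" } }
{ "doc": { "field": "value" } }
```

Each such item fails with DocumentMissingException, which Elasticsearch logs through the code path traced in the source-reading section.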

After the bulk request, a heap dump was taken and analyzed, reproducing the same leak. Varying the bulk size showed that the amount of leaked memory correlated with the size of the last failed bulk request, confirming the link to bulk exception logging.

The same behavior persisted in Elasticsearch 5.6.3, indicating an unfixed bug.

Reading the Source Code

Investigation of TransportShardBulkAction (line 209) revealed the logging of bulk failures:

if (isConflictException(failure)) {
    logger.trace(() -> new ParameterizedMessage("{} failed to execute bulk item ({}) {}",
            request.shardId(), docWriteRequest.opType().getLowercase(), request), failure);
} else {
    logger.debug(() -> new ParameterizedMessage("{} failed to execute bulk item ({}) {}",
            request.shardId(), docWriteRequest.opType().getLowercase(), request), failure);
}

The ParameterizedMessage holds the entire BulkShardRequest. Because the logger uses a thread-local MutableLogEvent, the message remains strongly referenced after the log call returns, preventing garbage collection.
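The retention mechanism can be sketched outside of Log4j. The class names below (ReusableEvent, ThreadLocalRetentionDemo) are illustrative stand-ins, not Log4j's actual implementation: a long-lived worker thread "logs" a large payload through a reusable thread-local event, and the payload stays reachable after the log call returns.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThreadLocalRetentionDemo {
    // Stand-in for Log4j's MutableLogEvent: one instance per thread, reused
    // across log calls, holding a strong reference to the last parameters.
    static class ReusableEvent {
        Object[] params;
        void set(Object... params) { this.params = params; }
    }

    static final ThreadLocal<ReusableEvent> EVENT =
            ThreadLocal.withInitial(ReusableEvent::new);

    // Stand-in for a logger call: the event keeps the parameters after return.
    static void log(String msg, Object... params) {
        EVENT.get().set(params);
    }

    static long retainedAfterGc(int payloadMb) throws InterruptedException {
        BlockingQueue<byte[]> work = new ArrayBlockingQueue<>(1);
        Thread bulkThread = new Thread(() -> {
            try {
                log("failed to execute bulk item {}", (Object) work.take());
                Thread.sleep(Long.MAX_VALUE); // pool thread stays alive, and so does its ThreadLocal
            } catch (InterruptedException ignored) { }
        });
        bulkThread.setDaemon(true);
        bulkThread.start();
        work.put(new byte[payloadMb * 1024 * 1024]); // stand-in for a large BulkShardRequest
        Thread.sleep(500); // let the worker consume and "log" the payload
        System.gc();
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        long usedBytes = retainedAfterGc(64);
        // The 64 MB payload is unreachable from main, but the worker's
        // thread-local ReusableEvent still pins it, so GC cannot reclaim it.
        System.out.println("retained after GC: " + usedBytes / (1024 * 1024) + " MB");
    }
}
```

This is exactly the shape of the Elasticsearch case: the bulk pool threads never die, so each one's last logged ParameterizedMessage, and the BulkShardRequest inside it, survives every GC cycle.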

Log4j creates the thread‑local event via ReusableLogEventFactory :

private static ThreadLocal<MutableLogEvent> mutableLogEventThreadLocal = new ThreadLocal<>();

The factory is chosen when Constants.ENABLE_THREADLOCALS is true, which is the default for non‑web applications like Elasticsearch.

public static final boolean ENABLE_THREADLOCALS = !IS_WEB_APP && PropertiesUtil.getProperties().getBooleanProperty("log4j2.enable.threadlocals", true);

Thus, Elasticsearch’s bulk threads retain the log event, causing the memory leak.

Mitigation

Adding the JVM option -Dlog4j2.enable.threadlocals=false to jvm.options disables the thread‑local log events and eliminates the leak, as confirmed by testing.
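In a default tar installation this is a one-line addition to the options file (path may differ for packaged installs):

```
# config/jvm.options
-Dlog4j2.enable.threadlocals=false
```

The trade-off is that Log4j then allocates a fresh log event per call instead of reusing a thread-local one, giving up garbage-free logging in exchange for not pinning the last message's parameters.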

The issue has been reported on GitHub: Memory leak upon partial TransportShardBulkAction failure.

Conclusion

Elasticsearch’s complexity can hide subtle bugs like this Log4j‑induced memory leak. Comprehensive monitoring and experienced support are essential for maintaining system stability.

Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
