
Why Is My Elasticsearch Cluster Using 15 GB Heap? A Deep Dive into Memory Bottlenecks

The article examines a 7‑node Elasticsearch cluster with 500 million documents, uncovering excessive heap usage, high OS memory pressure, numerous deleted documents, large translog, low query‑cache hit rate, and an over‑sharded design, then offers concrete tuning and redesign recommendations to restore performance.

ITPUB

Problem Description

The user is indexing 500 million documents on a 7-node cluster (each node 16 × 32 GB, presumably 16 cores and 32 GB of RAM). Each node hosts 480 shards, for a total of 3,360 shards. Heap usage reaches about 15 GB, but it is unclear where the rest of each node's memory is going.
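The shard arithmetic can be checked with a short sketch. The 20 GB heap figure comes from the heap_max_in_bytes value shown in section 2.5; the limit of about 20 shards per GB of heap is an older Elastic rule of thumb, used here only as an assumed yardstick rather than a hard limit.

```python
# Shard arithmetic for the cluster described in the problem statement.
NODES = 7
SHARDS_PER_NODE = 480
HEAP_GIB_PER_NODE = 20  # from heap_max_in_bytes in section 2.5

# Assumption: older Elastic guidance of roughly 20 shards per GiB of heap.
RULE_OF_THUMB_SHARDS_PER_GIB = 20

total_shards = NODES * SHARDS_PER_NODE
shards_per_gib = SHARDS_PER_NODE / HEAP_GIB_PER_NODE
suggested_max_per_node = RULE_OF_THUMB_SHARDS_PER_GIB * HEAP_GIB_PER_NODE

print(f"total shards: {total_shards}")
print(f"shards per GiB of heap: {shards_per_gib:.0f}")
print(f"rule-of-thumb ceiling per node: {suggested_max_per_node}")
```

Under that assumed yardstick, each node carries more shards than a 20 GB heap would comfortably support, which fits the over-sharding diagnosis later in the article.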

Analysis

2.1 Large number of deleted or updated documents

{
  "docs": {
    "count": 331681467,
    "deleted": 73434046
  }
}

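Deleted documents keep occupying disk space until a segment merge removes them. Here about 73 million deleted documents sit alongside 332 million live ones, roughly 18 % of all documents on disk. Frequent updates cause these deletions, because each update marks the old version as deleted and indexes a new one, which increases storage and merge overhead.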
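As a rough illustration, the share of deleted documents can be computed from the docs stats above. The function name deleted_fraction is hypothetical; the input dictionary mirrors the JSON snippet, where count holds live documents only.

```python
def deleted_fraction(docs: dict) -> float:
    """Share of all documents on disk (live + deleted) that are marked deleted."""
    live = docs["count"]        # live documents only
    deleted = docs["deleted"]   # deleted documents not yet merged away
    total = live + deleted
    return deleted / total if total else 0.0


docs_stats = {"count": 331681467, "deleted": 73434046}
fraction = deleted_fraction(docs_stats)
print(f"{fraction:.1%} of documents on disk are deleted")
```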

2.2 Uncleaned deleted documents

{
  "segments": {
    "count": 3705,
    "fixed_bit_set_memory_in_bytes": 50741120,
    ...
  }
}

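The segment stats show 3,705 segments and about 48 MB of fixed_bit_set_memory_in_bytes. Per the Elasticsearch documentation, this metric counts the fixed bit sets kept for nested object fields and join-field type filters, so it is at best an indirect sign of uncleaned deletions; the deleted count in 2.1 is the direct measure. Either way, deleted documents that have not been merged away still have to be skipped at query time, which can degrade query performance.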
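Deleted documents are only reclaimed when segments merge. The force merge that the recommendations call for could be requested roughly as sketched below, using the force merge API's only_expunge_deletes parameter. The host http://localhost:9200, the index name my-index, and the helper name are assumptions for illustration; the sketch only builds the request and does not send it.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_expunge_deletes_request(base_url: str, index: str) -> Request:
    """Build (but do not send) a force-merge request that only expunges deletes."""
    query = urlencode({"only_expunge_deletes": "true"})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Hypothetical host and index name.
request = build_expunge_deletes_request("http://localhost:9200", "my-index")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. Force merges are I/O heavy, so this belongs in a low-traffic window.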

2.3 Uncommitted translog operations

{
  "translog": {
    "operations": 4171567,
    "size_in_bytes": 2854130582,
    "uncommitted_operations": 4171567,
    "uncommitted_size_in_bytes": 2854130582
  }
}

All 4,171,567 translog operations (about 2.66 GiB) are uncommitted, meaning a flush has not yet written them into a Lucene commit. After a crash these operations must be replayed from the translog, which lengthens recovery time.
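A minimal sketch of how the translog figures above can be summarized follows. The helper name translog_summary is hypothetical and the input mirrors the JSON snippet.

```python
GIB = 1024 ** 3


def translog_summary(translog: dict) -> dict:
    """Return the uncommitted translog size in GiB and the uncommitted share of operations."""
    operations = translog["operations"]
    uncommitted = translog["uncommitted_operations"]
    return {
        "uncommitted_gib": translog["uncommitted_size_in_bytes"] / GIB,
        "uncommitted_share": uncommitted / operations if operations else 0.0,
    }


translog_stats = {
    "operations": 4171567,
    "size_in_bytes": 2854130582,
    "uncommitted_operations": 4171567,
    "uncommitted_size_in_bytes": 2854130582,
}
summary = translog_summary(translog_stats)
print(f"{summary['uncommitted_gib']:.2f} GiB uncommitted, "
      f"{summary['uncommitted_share']:.0%} of operations")
```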

2.4 OS memory usage

{
  "os": {
    "mem": {
      "total_in_bytes": 32822083584,
      "free_in_bytes": 260890624,
      "used_in_bytes": 32561192960,
      "free_percent": 1,
      "used_percent": 99
    }
  }
}

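Only 1 % of system memory is reported free. On Linux, memory held by the filesystem cache generally counts as used in this figure, and Elasticsearch relies on that cache, so the number alone does not prove a shortage. Combined with a 20 GB heap on a 32 GB node, though, it leaves little headroom and raises the risk of swapping or out-of-memory errors.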

2.5 JVM heap usage

{
  "jvm": {
    "mem": {
      "heap_used_in_bytes": 16480235136,
      "heap_used_percent": 76,
      "heap_committed_in_bytes": 21474836480,
      "heap_max_in_bytes": 21474836480
    },
    "gc": {
      "collectors": {
        "young": { "collection_count": 434416 },
        "old": { "collection_count": 0 }
      }
    }
  }
}

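Heap usage is at 76 % of the 20 GB heap (about 15.3 GiB used), approaching the warning threshold and increasing GC pressure. So far the young collector has run 434,416 times, while no old-generation collections have been recorded.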
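The heap figures can be cross-checked against the OS memory total from section 2.4. The sketch below assumes Elastic's general guidance that the heap should not exceed about 50 % of physical RAM; the variable names are illustrative.

```python
jvm_mem = {
    "heap_used_in_bytes": 16480235136,
    "heap_max_in_bytes": 21474836480,
}
os_total_in_bytes = 32822083584  # total_in_bytes from section 2.4

heap_used_pct = 100 * jvm_mem["heap_used_in_bytes"] / jvm_mem["heap_max_in_bytes"]
heap_share_of_ram = jvm_mem["heap_max_in_bytes"] / os_total_in_bytes

# Assumption: heap should stay at or below half of physical memory,
# leaving the rest for the filesystem cache.
exceeds_half_of_ram = heap_share_of_ram > 0.5

print(f"heap used: {heap_used_pct:.1f} %")
print(f"heap as share of RAM: {heap_share_of_ram:.1%}")
print(f"heap exceeds 50 % of RAM: {exceeds_half_of_ram}")
```

If that guidance applies here, the 20 GB heap takes roughly two thirds of the node's memory, which would help explain why so little is left for the OS.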

2.6 I/O statistics – reads dominate writes

{
  "io_stats": {
    "total": {
      "operations": 5250539512,
      "read_operations": 4478787246,
      "write_operations": 771752266,
      "read_kilobytes": 129711481927,
      "write_kilobytes": 23684659984
    }
  }
}

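Reads account for about 85 % of all I/O operations (roughly 5.8 reads per write). This is normal for query-heavy workloads, but it means read throughput and the filesystem cache are what may need I/O tuning.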

2.7 Low query‑cache hit rate

{
  "query_cache": {
    "memory_size_in_bytes": 422629063,
    "total_count": 18178614894,
    "hit_count": 4107645935,
    "miss_count": 14070968859,
    "cache_size": 405975,
    "evictions": 16464511
  }
}

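The cache hit ratio is only about 23 % (4.1 billion hits out of 18.2 billion lookups), with 16.5 million evictions against roughly 406,000 cached entries, suggesting diverse queries or insufficient cache size.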
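The hit ratio can be derived from the query_cache stats above. This is a small sketch with a hypothetical helper name; it divides hits by hits plus misses.

```python
def query_cache_hit_ratio(query_cache: dict) -> float:
    """Hit ratio of the query cache, computed as hits / (hits + misses)."""
    hits = query_cache["hit_count"]
    misses = query_cache["miss_count"]
    lookups = hits + misses
    return hits / lookups if lookups else 0.0


query_cache_stats = {
    "hit_count": 4107645935,
    "miss_count": 14070968859,
    "cache_size": 405975,
    "evictions": 16464511,
}
hit_ratio = query_cache_hit_ratio(query_cache_stats)
print(f"query cache hit ratio: {hit_ratio:.1%}")
```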

Problem Summary

OS memory usage at 99 % (free % = 1)

JVM heap at 76 % of its 20 GB limit

Large number of deleted documents (≈73 M) and high fixed_bit_set_memory_in_bytes

All translog operations uncommitted, risking long recovery

Over‑sharding: 300 logical indices → 480 shards per node, causing high per‑shard overhead

Low query‑cache efficiency

Root Causes

Mapping design creates oversized fields (e.g., hash fields analyzed as text unnecessarily)

Sharding strategy copied from MySQL – splitting a single index into hundreds of shards, inflating memory per query

Heavy update/delete workload generating many deleted docs

High concurrent query load increasing I/O and cache pressure

Recommendations

Merge shards: Reduce from hundreds to a handful (e.g., 8–10 shards) to lower per-shard memory overhead.

Refine mapping: Use keyword for IDs and hashes, avoid unnecessary text analysis, disable norms on fields that are not scored, and disable _source only if updates and reindexing are not needed.

Adjust index lifecycle: Adopt time-based indices, roll over daily, and delete old indices to limit deletions.

Force-merge to expunge deleted documents during low-traffic windows.

Increase OS memory or add nodes to bring free memory above 10 %.

Tune JVM: Consider raising heap only if needed, monitor GC, and keep heap_used_percent below 70 %.

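Resize the query cache (indices.queries.cache.size, 10 % of heap by default) or disable it on indices where it brings little benefit (index.queries.cache.enabled: false).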

Monitor key metrics (shard count, heap, OS mem, translog size) with _cluster/stats, _nodes/stats, and _cat/indices.
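The thresholds named in these recommendations (free memory above 10 %, heap_used_percent below 70 %) could be checked against stats output roughly as follows. The function check_node and its parameter names are hypothetical, and the input dictionaries mirror the os.mem and jvm.mem snippets from the analysis.

```python
def check_node(os_mem: dict, jvm_mem: dict,
               min_free_pct: int = 10, max_heap_pct: int = 70) -> list:
    """Return warnings for a node that misses the recommended thresholds."""
    warnings = []
    if os_mem["free_percent"] < min_free_pct:
        warnings.append(
            f"OS free memory {os_mem['free_percent']} % is below {min_free_pct} %"
        )
    if jvm_mem["heap_used_percent"] > max_heap_pct:
        warnings.append(
            f"heap usage {jvm_mem['heap_used_percent']} % is above {max_heap_pct} %"
        )
    return warnings


# Values taken from sections 2.4 and 2.5.
node_warnings = check_node(
    os_mem={"free_percent": 1, "used_percent": 99},
    jvm_mem={"heap_used_percent": 76},
)
for warning in node_warnings:
    print(warning)
```

In practice the inputs would come from the _nodes/stats response for each node; the sketch leaves fetching and alerting out.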

By aligning Elasticsearch design with its native concepts—moderate shard counts, precise mappings, and proactive memory management—the cluster can regain stability and performance.

