Operations 16 min read

Why Is My Elasticsearch Cluster Consuming 15 GB Heap? A Deep Dive into Memory, Sharding, and Performance Bottlenecks

A 7‑node Elasticsearch cluster handling 500 million documents shows excessive heap usage, many deleted documents, high translog size, saturated OS memory, and inefficient sharding, prompting a detailed analysis of stats, root‑cause identification, and concrete recommendations for mapping, shard design, and JVM tuning.

dbaplus Community

Aug 1, 2023

Why Is My Elasticsearch Cluster Consuming 15 GB Heap? A Deep Dive into Memory, Sharding, and Performance Bottlenecks

Problem Overview

The cluster consists of 7 nodes (16 CPU × 32 GB RAM each) indexing 500 million documents. Despite calculating fielddata, completion, segments, query_cache, and translog memory, the JVM heap reaches 15 GB, leaving the source of the remaining memory usage unclear.

Statistical Analysis

Running GET _cluster/stats revealed several concerning metrics.

Issue 1: High Delete/Update Volume

{
  "docs": {
    "count": 331681467,
    "deleted": 73434046
  }
}

Over 73 million documents are marked as deleted, indicating frequent updates that leave tombstones until segment merges occur, potentially degrading performance and consuming storage.

Issue 2: Large Fixed‑Bit‑Set Memory

{
  "segments": {
    "fixed_bit_set_memory_in_bytes": 50741120,
    ...
  }
}

This ~48 MB value stores metadata for deleted documents, confirming a backlog of un‑purged deletions.

Issue 3: Uncommitted Translog Operations

{
  "translog": {
    "operations": 4171567,
    "size_in_bytes": 2854130582,
    "uncommitted_operations": 4171567,
    "uncommitted_size_in_bytes": 2854130582
  }
}

All translog operations remain uncommitted, risking longer recovery times after a crash.

Issue 4: OS Memory Saturation

{
  "os": {
    "mem": {
      "total_in_bytes": 32822083584,
      "free_in_bytes": 260890624,
      "used_percent": 99,
      "free_percent": 1
    }
  }
}

Only 1 % of system memory is free, which can cause swapping and severe performance degradation.

Issue 5: High JVM Heap Usage

{
  "jvm": {
    "mem": {
      "heap_used_in_bytes": 16480235136,
      "heap_used_percent": 76,
      "heap_committed_in_bytes": 21474836480,
      "heap_max_in_bytes": 21474836480
    }
  }
}

Heap usage sits at 76 % of the 20 GB limit, approaching the threshold where GC pauses become frequent.

Issue 6: Read‑Heavy I/O

{
  "io_stats": {
    "total": {
      "operations": 5250539512,
      "read_operations": 4478787246,
      "write_operations": 771752266,
      "read_kilobytes": 129711481927,
      "write_kilobytes": 23684659984
    }
  }
}

Read operations outnumber writes by roughly 5‑to‑1, which is normal for query‑heavy workloads but still worth monitoring.

Issue 7: Low Query‑Cache Hit Rate

{
  "query_cache": {
    "memory_size_in_bytes": 422629063,
    "total_count": 18178614894,
    "hit_count": 4107645935,
    "miss_count": 14070968959,
    "cache_size": 405975,
    "evictions": 16464511
  }
}

Misses far exceed hits, indicating either highly diverse queries or insufficient cache sizing.

Root Causes

Mapping design creates oversized fields (e.g., tokenizing hash values) that bloat the index.

Sharding strategy mimics MySQL’s table‑splitting, creating ~300 shards per index, causing excessive per‑shard overhead and large buffer pool usage.

Frequent updates/deletes generate many tombstones and translog entries.

High concurrent query volume amplifies memory pressure.

Recommendations

Consolidate shards: reduce from hundreds to a handful (e.g., 8) to lower per‑shard memory and CPU overhead.

Refine mapping: use keyword for IDs/hashes, avoid unnecessary text analysis, disable _source or term vectors where not needed.

Adopt time‑based indices or rollover patterns to limit the lifespan of hot shards.

Force‑merge deleted segments during low‑traffic windows to reclaim space.

Commit translog regularly (or configure index.translog.flush_threshold_size) to keep recovery size manageable.

Increase OS memory or add nodes if workload growth is expected.

Monitor JVM heap, GC pauses, and OS memory; adjust heap.max only after confirming node stability.

Resize query cache or tune indices.queries.cache.size and consider disabling caching for highly variable queries.

Best‑Practice Takeaways

Elasticsearch sharding is not a direct analogue of relational database partitioning; over‑sharding inflates memory usage and CPU load. Proper field type selection, judicious shard sizing (20‑40 GB per shard), and regular monitoring of cluster stats are essential for a stable, performant deployment.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

JVM Indexing Memory optimization Cluster Monitoring

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Problem Overview

Statistical Analysis

Issue 1: High Delete/Update Volume

Issue 2: Large Fixed‑Bit‑Set Memory

Issue 3: Uncommitted Translog Operations

Issue 4: OS Memory Saturation

Issue 5: High JVM Heap Usage

Issue 6: Read‑Heavy I/O

Issue 7: Low Query‑Cache Hit Rate

Root Causes

Recommendations

Best‑Practice Takeaways

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Issue 1: High Delete/Update Volume

Issue 2: Large Fixed‑Bit‑Set Memory

Issue 3: Uncommitted Translog Operations

Issue 4: OS Memory Saturation

Issue 5: High JVM Heap Usage

Issue 6: Read‑Heavy I/O

Issue 7: Low Query‑Cache Hit Rate