
Why Is My Elasticsearch Cluster Using 99% Memory? Sharding, Translog & JVM Insights

This article analyzes a 7‑node Elasticsearch cluster with 500 million documents, revealing excessive shard count, high heap and OS memory usage, large translog, low query‑cache hit rate, and heavy I/O, and offers concrete recommendations on mapping, sharding, JVM tuning, and resource management to restore performance.

Programmer DD

Problem Description

In a 7‑node cluster (16 × 32 GB, presumably 16 cores and 32 GB of RAM per node) we attempted memory optimization for an index containing 500 million documents. The mapping and shard design were flawed, resulting in 480 shards per node, or roughly 3,360 shards across the cluster, which is far too many for this data volume.

Heap usage reached 15 GB, yet fielddata, completion, segments, query_cache and translog together accounted for only part of it, so the source of the remaining memory was unclear.
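The gap can be made concrete with a little arithmetic. The following is a minimal sketch that uses only the byte counts quoted in the findings of this article. The helper name unaccounted_heap is an assumption for illustration, and the list of consumers is incomplete because the article does not give figures for fielddata, completion or segments.

```python
# Heap figures quoted in this article (bytes).
HEAP_USED = 16_480_235_136            # jvm.mem.heap_used_in_bytes (Finding 5)

# Only the consumers for which the article gives a number.
KNOWN_CONSUMERS = {
    "query_cache": 422_629_063,       # query_cache.memory_size_in_bytes (Finding 7)
    "fixed_bit_set": 50_741_120,      # fixed_bit_set_memory_in_bytes (Finding 2)
    "index_writer": 54_801_608,       # index_writer_memory_in_bytes (Finding 2)
}


def unaccounted_heap(heap_used, consumers):
    """Return (bytes, fraction) of used heap not explained by the listed consumers."""
    explained = sum(consumers.values())
    remainder = heap_used - explained
    return remainder, remainder / heap_used


remainder, fraction = unaccounted_heap(HEAP_USED, KNOWN_CONSUMERS)
print(f"unaccounted: {remainder / 2**30:.2f} GiB ({fraction:.1%} of used heap)")
```

With these three consumers alone, well over 90 % of the used heap remains unexplained, which is why the analysis looks beyond the individual cache statistics.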

Analysis

Finding 1 – Massive Delete/Update Operations

GET _cluster/stats

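The response reports 73,434,046 deleted documents. In Elasticsearch an update or delete only marks the old version of a document as deleted; it keeps occupying space until a segment merge purges it, so a count this high points to frequent updates.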

Recommendation: Reduce update frequency, use time‑based indices, and force merge indices that no longer receive writes during low‑traffic periods.
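To judge how much of the index is dead weight, the deleted count can be set against the live count. This is a minimal sketch; it assumes the "500 million documents" in the problem description is the live document count (rounded), and the function name deleted_ratio is hypothetical.

```python
def deleted_ratio(live_docs, deleted_docs):
    """Share of all documents held in segments that are marked deleted."""
    total = live_docs + deleted_docs
    if total == 0:
        return 0.0
    return deleted_docs / total


LIVE_DOCS = 500_000_000      # assumption: rounded live count from the problem description
DELETED_DOCS = 73_434_046    # deleted count quoted above

ratio = deleted_ratio(LIVE_DOCS, DELETED_DOCS)
print(f"deleted share of segment docs: {ratio:.1%}")
```

Under that assumption, somewhat more than one document in ten stored in the segments is a deleted one waiting for a merge.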

Finding 2 – Large Amount of Uncleaned Deleted Docs

Key parameters:

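fixed_bit_set_memory_in_bytes : 50,741,120 bytes (~48 MB) – memory used by fixed bit sets. Elasticsearch documents these as serving nested fields and join-type filters, so treat this value as a heap consumer rather than a direct measure of deleted docs.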

index_writer_memory_in_bytes : 54,801,608 bytes – normal.

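Risk: Deleted docs that have not been merged away take up disk space and still have to be skipped at query time, which can degrade query performance. Use force_merge cautiously: it is I/O‑intensive and best reserved for indices that no longer receive writes.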

Finding 3 – Translog Holds Many Uncommitted Operations

{
  "translog": {
    "operations": 4171567,
    "size_in_bytes": 2854130582,
    "uncommitted_operations": 4171567,
    "uncommitted_size_in_bytes": 2854130582
  }
}

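All 4,171,567 operations (about 2.85 GB) are uncommitted, so they would have to be replayed during shard recovery after a crash, which prolongs recovery. A flush performs a Lucene commit and lets Elasticsearch trim the translog, so make sure flushes happen regularly.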
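The translog numbers above can be condensed into a few ratios. This is a minimal sketch over the JSON fragment as quoted; summarize_translog is a hypothetical helper, and whether the figures are per node or cluster-wide is not stated in the article.

```python
# Translog statistics exactly as quoted in Finding 3.
TRANSLOG = {
    "operations": 4171567,
    "size_in_bytes": 2854130582,
    "uncommitted_operations": 4171567,
    "uncommitted_size_in_bytes": 2854130582,
}


def summarize_translog(stats):
    """Derive the uncommitted share, size in GiB and average bytes per operation."""
    ops = stats["uncommitted_operations"]
    size = stats["uncommitted_size_in_bytes"]
    total_ops = stats["operations"]
    return {
        "uncommitted_fraction": ops / total_ops if total_ops else 0.0,
        "size_gib": size / 2**30,
        "avg_bytes_per_op": size / ops if ops else 0.0,
    }


summary = summarize_translog(TRANSLOG)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

An uncommitted fraction of 1.0 means nothing in the translog has been committed to Lucene yet, which is the condition the finding warns about.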

Finding 4 – OS Memory Usage Near 100%

{
  "os": {
    "mem": {
      "total_in_bytes": 32822083584,
      "free_in_bytes": 260890624,
      "used_percent": 99,
      "free_percent": 1
    }
  }
}

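Only 1 % of physical memory is reported as free. Read this figure with care: it typically counts the filesystem cache as used, and Elasticsearch relies on that cache, so a high value on its own is expected on a busy node. The real concern is how little room remains for the cache and other processes once the JVM heap is subtracted; if the node starts swapping, performance degrades severely. Relieve memory pressure first, and consider a hardware upgrade if that is not enough.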

Finding 5 – High JVM Heap Usage

{
  "jvm": {
    "mem": {
      "heap_used_in_bytes": 16480235136,
      "heap_used_percent": 76,
      "heap_committed_in_bytes": 21474836480,
      "heap_max_in_bytes": 21474836480
    },
    "gc": {
      "collectors": {
        "young": {"collection_count": 434416},
        "old": {"collection_count": 0}
      }
    }
  }
}

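Heap usage is 76 % of a 20 GiB heap (heap_max_in_bytes) and is approaching the warning threshold. The old collector has not run at all, while the young collector has run 434,416 times since the node started; if usage keeps climbing, GC pauses may affect latency. A 20 GiB heap on a 32 GB node also exceeds the usual guidance of giving the heap no more than half of physical memory, which leaves less for the filesystem cache. Monitor and consider adjusting heap size.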
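The relationship between heap and physical memory can be checked directly from the two stats fragments in Findings 4 and 5. This is a minimal sketch; heap_report is a hypothetical helper name, and the numbers are the ones quoted in this article.

```python
TOTAL_RAM = 32_822_083_584   # os.mem.total_in_bytes (Finding 4)
HEAP_MAX = 21_474_836_480    # jvm.mem.heap_max_in_bytes (Finding 5)
HEAP_USED = 16_480_235_136   # jvm.mem.heap_used_in_bytes (Finding 5)


def heap_report(total_ram, heap_max, heap_used):
    """Summarize heap pressure and how much RAM is left outside the heap."""
    return {
        "heap_used_percent": int(100 * heap_used / heap_max),
        "heap_share_of_ram": heap_max / total_ram,
        "left_outside_heap_gib": (total_ram - heap_max) / 2**30,
    }


report = heap_report(TOTAL_RAM, HEAP_MAX, HEAP_USED)
print(report)
```

The heap takes roughly two thirds of physical memory here, leaving only about 10.5 GiB for the filesystem cache, the operating system and everything else on the node.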

Finding 6 – Read Operations Far Exceed Writes

{
  "io_stats": {
    "total": {
      "read_operations": 4478787246,
      "write_operations": 771752266
    }
  }
}

Disk reads outnumber writes by roughly 5.8 to 1. Read‑heavy workloads are normal for search use‑cases, but the I/O subsystem must be sized accordingly.

Finding 7 – Low Query Cache Hit Rate

{
  "query_cache": {
    "memory_size_in_bytes": 422629063,
    "total_count": 18178614894,
    "hit_count": 4107645935,
    "miss_count": 14070968959,
    "evictions": 16464511
  }
}

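The hit ratio is only about 22.6 % (4,107,645,935 hits out of 18,178,614,894 lookups), and 16,464,511 evictions have occurred, suggesting diverse queries or insufficient cache size. Tune cache settings or disable caching for non‑beneficial queries.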
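The hit ratio follows directly from the counters in the fragment above. This is a minimal sketch; cache_hit_ratio is a hypothetical helper name.

```python
# Query cache counters exactly as quoted in Finding 7.
QUERY_CACHE = {
    "memory_size_in_bytes": 422629063,
    "total_count": 18178614894,
    "hit_count": 4107645935,
    "miss_count": 14070968959,
    "evictions": 16464511,
}


def cache_hit_ratio(stats):
    """Hits divided by total lookups; 0.0 when the cache has never been queried."""
    total = stats["total_count"]
    return stats["hit_count"] / total if total else 0.0


hit_ratio = cache_hit_ratio(QUERY_CACHE)
print(f"query cache hit ratio: {hit_ratio:.1%}")
```

Tracking this ratio over time, rather than as a single snapshot, shows whether a cache or query change actually helps.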

Root Causes

The primary issues stem from poor mapping (over‑indexing, unnecessary text analysis) and an over‑aggressive sharding strategy that mimics MySQL’s table‑splitting, creating ~300 indices per logical index.

Each shard is a full Lucene index; excessive shards inflate memory usage, increase buffer pool consumption, and degrade performance.
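A rough way to reason about shard counts is to work backwards from a target shard size. This is a minimal sketch, not a sizing rule: the 240 GB index size is purely hypothetical (the article does not state the data volume), the 30 GB default is simply the midpoint of the 20‑40 GB range recommended in this article, and both function names are assumptions.

```python
import math


def total_shards(nodes, shards_per_node):
    """Cluster-wide shard count for an evenly balanced cluster."""
    return nodes * shards_per_node


def primary_shards_for(index_size_gb, target_shard_gb=30):
    """Number of primary shards so that each holds about target_shard_gb."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(index_size_gb / target_shard_gb))


current = total_shards(nodes=7, shards_per_node=480)   # figures from the problem description
planned = primary_shards_for(index_size_gb=240)        # hypothetical index size
print(f"current shards: {current}, planned primaries: {planned}")
```

Even if the real data volume is several times the hypothetical figure, the resulting shard count stays orders of magnitude below the current one.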

Recommendations

Redesign mapping: use precise field types, store IDs as keyword, avoid unnecessary analyzers.

Consolidate shards: aim for 20‑40 GB per shard, reduce total shard count dramatically (e.g., from hundreds to ~8).

Perform regular force_merge on stale indices.

Commit translog data frequently to keep its size manageable.

Monitor JVM heap, OS memory, and query cache metrics; adjust heap size and cache parameters as needed.

Consider index aliases or routing to limit the number of shards touched per query.
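The mapping recommendation can be illustrated with a request body built as a Python dictionary. This is only a sketch: every field name is hypothetical because the article does not show the real mapping, and the helper text_fields is an assumption added to make over-analysis easy to spot.

```python
import json

# Hypothetical fields; the real mapping is not shown in the article.
MAPPING = {
    "mappings": {
        "dynamic": "strict",                          # reject unmapped fields
        "properties": {
            "order_id": {"type": "keyword"},          # IDs: exact match only
            "status": {"type": "keyword"},
            "created_at": {"type": "date"},
            "description": {"type": "text"},          # the only analyzed field
            "raw_payload": {"type": "object", "enabled": False},  # stored, not indexed
        },
    }
}


def text_fields(mapping):
    """List the fields that go through text analysis."""
    props = mapping["mappings"]["properties"]
    return sorted(name for name, spec in props.items() if spec.get("type") == "text")


print(json.dumps(MAPPING, indent=2))
print("analyzed fields:", text_fields(MAPPING))
```

Keeping the list of analyzed fields short and deliberate is the point: every text field adds indexing and memory cost that a keyword field avoids.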

Continuous monitoring and performance testing are essential to maintain a healthy Elasticsearch cluster.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
