
How to Diagnose and Fix Elasticsearch Cluster Health Issues

This guide explains how to monitor Elasticsearch cluster health, interpret green/yellow/red statuses, troubleshoot unassigned shards, adjust JVM and system settings, resolve common configuration errors, and use scripts and APIs to keep your ELK stack stable and performant.

MaGe Linux Operations

1. Elasticsearch Cluster Health

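An Elasticsearch cluster can consist of a single node or many nodes. The Cluster Health API reports the overall status (green, yellow, or red) as JSON, which is easy to parse for automation and alerting. The _cat APIs return similar information as plain-text tables; for example, _cat/nodes lists every node with its heap usage, load, and role. In the master column, * marks the elected master and m marks a master-eligible node: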

# curl -XGET 'http://10.0.8.47:9200/_cat/nodes?v'
host      ip          heap.percent ram.percent load node.role master name
10.0.8.47 10.0.8.47   53          85          0.16 d          *      elk-node03.kevin.cn
10.0.8.44 10.0.8.44   26          54          0.09 d          m      elk-node01.kevin.cn
10.0.8.45 10.0.8.45   71          81          0.02 d          m      elk-node02.kevin.cn

Typical health states:

green: all primary and replica shards are allocated; the cluster is fully available.

yellow: all primary shards are allocated but at least one replica is not; no data is missing, but redundancy is reduced.

red: at least one primary shard is unassigned; the data in that shard is unavailable, so searches return partial results and writes routed to it fail until the shard is allocated again.
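The three states map naturally onto alert levels. The following is a minimal sketch, assuming the JSON body of `GET /_cluster/health` has already been fetched; `classify_health`, the numeric severity levels, and the sample payload are illustrative assumptions, not output from the cluster shown above.

```python
import json

# Severity per status; the numeric levels are an assumption for illustration.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def classify_health(body: str) -> tuple:
    """Return (status, severity, unassigned_shards) from a _cluster/health JSON body."""
    health = json.loads(body)
    status = health["status"]
    return status, SEVERITY[status], health.get("unassigned_shards", 0)


# Illustrative payload using field names from the Cluster Health API.
sample = '{"cluster_name": "demo", "status": "yellow", "unassigned_shards": 5}'
print(classify_health(sample))
```

An alerting wrapper could, for instance, page on severity 2 and only log a warning on severity 1.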

2. Elasticsearch Index Status

# curl -XGET 'http://10.0.8.47:9200/_cat/indices?v'
health status index                                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   10.0.61.24-vfc-intf-ent-deposit.log-2019.03.15 5   1   159        0            324.9kb    162.4kb
... (additional rows omitted for brevity)

The index health uses the same green/yellow/red semantics as the cluster health.
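With hundreds of indices, the rows that matter are the ones that are not green. The sketch below filters `_cat/indices?v` text for those rows. It assumes every listed index is open (so the health column is never blank), and the index names in the sample are made up.

```python
def non_green_indices(cat_output: str) -> list:
    """Return (index, health) pairs for every index that is not green."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    health_col = header.index("health")
    index_col = header.index("index")
    result = []
    for line in lines[1:]:
        fields = line.split()
        if fields[health_col] != "green":
            result.append((fields[index_col], fields[health_col]))
    return result


# Made-up sample in the same column layout as _cat/indices?v.
sample = """\
health status index          pri rep docs.count docs.deleted store.size pri.store.size
green  open   app-2019.03.15 5   1   159        0            324.9kb    162.4kb
yellow open   app-2019.03.16 5   1   42         0            80.1kb     80.1kb
red    open   app-2019.03.17 5   1
"""
print(non_green_indices(sample))
```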

3. Related Concepts

A node runs an Elasticsearch instance. A cluster is a group of nodes sharing the same cluster.name. One node is elected master to manage metadata such as index creation, node addition/removal, and shard allocation.

Shards are the basic units of data storage; each index is split into primary shards and optional replica shards. The routing value (by default the document _id) is hashed to determine the target primary shard:

shard = hash(routing) % number_of_primary_shards

Writes must succeed on the primary shard before being replicated to its replicas.
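The routing formula can be illustrated in a few lines. This is a sketch only: Elasticsearch uses its own hash function internally, and CRC32 stands in here purely because it is deterministic and available in the standard library. `pick_shard` and the document IDs are hypothetical.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Map a routing value to a primary shard number."""
    # Stand-in hash (assumption): CRC32 is NOT the hash Elasticsearch uses.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


# The same document ID can land on a different shard when the divisor changes.
for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, pick_shard(doc_id, 5), pick_shard(doc_id, 6))
```

Because the result depends on number_of_primary_shards, changing that number would send existing documents to different shards, which is why the primary shard count is fixed when an index is created.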

4. Red Cluster Diagnosis

When the cluster status is red, at least one primary shard is unassigned. Elasticsearch records a reason code for each unassigned shard; common ones include:

INDEX_CREATED – the index was just created and its shards have not been allocated yet.

CLUSTER_RECOVERED – the shard is unassigned following a full cluster recovery.

INDEX_REOPENED – a closed index was reopened and its shards have not been allocated yet.

DANGLING_INDEX_IMPORTED – a dangling index was imported.

ALLOCATION_FAILED – an attempt to allocate the shard failed.

NODE_LEFT – the node that held the shard left the cluster.

... (other reasons omitted for brevity)
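To see which reasons dominate, the reason codes can be tallied from the cat shards API. The sketch below assumes the text came from `_cat/shards?h=index,shard,prirep,state,unassigned.reason` on a version that exposes the unassigned.reason column; the sample rows and the helper name `unassigned_reasons` are made up.

```python
from collections import Counter


def unassigned_reasons(cat_shards_output: str) -> Counter:
    """Count unassigned shards per reason code."""
    counts = Counter()
    for line in cat_shards_output.splitlines():
        fields = line.split()
        # Expected columns: index shard prirep state unassigned.reason
        if len(fields) >= 5 and fields[3] == "UNASSIGNED":
            counts[fields[4]] += 1
    return counts


# Made-up rows; started shards have an empty reason column.
sample = """\
app-2019.03.15 0 p STARTED
app-2019.03.15 0 r UNASSIGNED NODE_LEFT
app-2019.03.16 1 p UNASSIGNED ALLOCATION_FAILED
app-2019.03.16 1 r UNASSIGNED NODE_LEFT
"""
print(unassigned_reasons(sample).most_common())
```

A tally dominated by NODE_LEFT points at a missing node, while ALLOCATION_FAILED points at a problem on the nodes that are still present.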

Typical remediation steps:

Ensure the master node starts first, then data nodes.

Disable SELinux if it is not needed, and stop iptables (or open the Elasticsearch ports) so that the nodes can reach each other.

Verify Elasticsearch configuration on data nodes.

Increase the system’s maximum open file descriptors.

Adjust the JVM heap size (ES_HEAP_SIZE) and indices.fielddata.cache.size.

Delete unused or stale indices to reduce shard count.
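For the heap adjustment, a commonly cited rule of thumb from general Elasticsearch guidance (not from this article's cluster) is to give the heap at most half of physical RAM and to stay below roughly 32 GB. The sketch below encodes that rule; the function names, the 31 GB cap, and the 16 GB host are assumptions for illustration.

```python
def suggest_heap_gb(ram_gb: float, cap_gb: float = 31.0) -> float:
    """Suggest a heap size: half of physical RAM, capped below ~32 GB."""
    return min(ram_gb / 2, cap_gb)


def fielddata_budget_gb(heap_gb: float, cache_percent: float) -> float:
    """Heap share available to fielddata for a given indices.fielddata.cache.size."""
    return heap_gb * cache_percent / 100


# Hypothetical 16 GB host with a 40% fielddata cache limit.
heap = suggest_heap_gb(16)
print(heap, fielddata_budget_gb(heap, 40))
```

The remaining RAM is left to the operating system's file cache, which Lucene relies on heavily.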

5. Case Study: ELK Cluster Health Issue

In a production ELK deployment, growing log volume produced a large number of indices and put the nodes under memory pressure. The cluster turned red, Kibana stopped displaying logs, and the Head plugin hung.

# curl -XGET 'http://10.0.8.47:9200/_cat/health?v'
epoch   timestamp   cluster   status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1554689492 10:11:32 kevin-elk red   3         3        3587   3447 0    0    0      -        100.0

Resolution steps included:

Increase nofile limit to 65535 and disable swap.

Raise the JVM heap (e.g., ES_HEAP_SIZE=8g) and set bootstrap.mlockall: true so that the heap is locked in RAM and never swapped out.

Set indices.fielddata.cache.size: 40% to limit fielddata memory.

Delete old indices (e.g., keep only the last month’s data).

Restart all Elasticsearch nodes.

After the fixes, the cluster eventually reported green once unassigned shards reached zero.
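Pruning old indices is easy to automate when index names end in a date, as the log indices in this guide do. The following is a minimal sketch, assuming a trailing `-YYYY.MM.DD` suffix; the helper name `stale_indices`, the second index name, and the reference date are illustrative. It only selects candidates; deleting them (for example with `curl -XDELETE` against each index) is left as a separate, deliberate step.

```python
from datetime import date, datetime, timedelta


def stale_indices(names, today: date, keep_days: int = 30) -> list:
    """Return index names whose trailing YYYY.MM.DD date is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in names:
        suffix = name.rsplit("-", 1)[-1]
        try:
            day = datetime.strptime(suffix, "%Y.%m.%d").date()
        except ValueError:
            continue  # no date suffix (e.g. .kibana): never a deletion candidate
        if day < cutoff:
            stale.append(name)
    return stale


names = [
    "10.0.61.24-vfc-intf-ent-deposit.log-2019.03.15",
    "10.0.61.24-vfc-intf-ent-deposit.log-2019.01.02",  # hypothetical older index
    ".kibana",
]
print(stale_indices(names, today=date(2019, 4, 8)))
```

Skipping names without a date suffix keeps system indices such as .kibana out of the candidate list.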

6. Common Errors and Fixes

SettingsException: elasticsearch.yml is malformed. YAML requires a space after the colon that separates key and value (e.g., node.name: elk-node03.kevin.cn).

StartupException: Elasticsearch refuses to run as root. Create a dedicated elasticsearch user and run the service under that account.

JVM memory allocation failure: the configured heap exceeds the memory available. Reduce the heap size or add system memory; the heap is set in jvm.options or /etc/sysconfig/elasticsearch.

Bootstrap checks failed: increase vm.max_map_count (e.g., to 655360) and raise the nofile limits.
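The two limits behind the bootstrap failures can be verified before starting a node. The sketch below is a pure check over values the caller supplies. The 262144 floor for vm.max_map_count is the minimum the bootstrap check enforces, the nofile floor of 65535 follows this guide, and the "untuned host" numbers are merely typical examples.

```python
def check_limits(max_map_count: int, nofile: int,
                 min_map_count: int = 262144, min_nofile: int = 65535) -> list:
    """Return human-readable problems; an empty list means both limits pass."""
    problems = []
    if max_map_count < min_map_count:
        problems.append(
            f"vm.max_map_count is {max_map_count}, needs at least {min_map_count}"
        )
    if nofile < min_nofile:
        problems.append(f"nofile is {nofile}, needs at least {min_nofile}")
    return problems


# Values often seen on an untuned host, then the tuned values from this guide.
print(check_limits(max_map_count=65530, nofile=4096))
print(check_limits(max_map_count=655360, nofile=65535))
```

On Linux, the inputs could be read from /proc/sys/vm/max_map_count and from `resource.getrlimit(resource.RLIMIT_NOFILE)`, run as the user that starts Elasticsearch.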

7. Monitoring Elasticsearch

Simple shell commands can poll cluster health:

# curl -XGET 'http://10.0.8.47:9200/_cat/health?v'
... output shows status green/yellow/red ...

A Python script can automate this check and trigger alerts:

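import subprocess

# Query the cat health API; -s suppresses curl's progress output.
command = ['curl', '-s', '10.0.8.47:9200/_cat/health']
output = subprocess.check_output(command).decode()
# Columns are: epoch timestamp cluster status ...; status is the fourth field.
status = output.split()[3]
print(0 if status == 'red' else 1)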

8. Preventing Split‑Brain

Separate master-eligible nodes from data nodes, and set discovery.zen.minimum_master_nodes to a quorum of the master-eligible nodes, that is (master_eligible_nodes / 2) + 1 using integer division, to avoid split-brain scenarios.

# Master‑only node
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2

# Data node
node.master: false
node.data: true

After updating elasticsearch.yml, restart the nodes to apply the new roles.
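The quorum rule is simple integer arithmetic. The sketch below computes it for a few cluster sizes; the function name is only a label chosen here to mirror the setting.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum size: more than half of the master-eligible nodes."""
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))
```

With two master-eligible nodes the quorum is also two, so losing either node leaves the cluster without a master; three is the smallest count that tolerates one failure, which matches the value of 2 in the configuration above.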
