How to Diagnose and Fix Elasticsearch Cluster Health Issues
This guide explains how to monitor Elasticsearch cluster health, interpret green/yellow/red statuses, troubleshoot unassigned shards, adjust JVM and system settings, resolve common configuration errors, and use scripts and APIs to keep your ELK stack stable and performant.
1. Elasticsearch Cluster Health
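An Elasticsearch cluster can consist of a single node or many nodes. The Cluster Health API (GET /_cluster/health) returns a high-level status (green, yellow, or red) as JSON, which is easy to parse for automation and alerting. The _cat APIs expose similar information as plain-text tables; for example, _cat/nodes lists the nodes in the cluster: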
# curl -XGET 'http://10.0.8.47:9200/_cat/nodes?v'
host ip heap.percent ram.percent load node.role master name
10.0.8.47 10.0.8.47 53 85 0.16 d * elk-node03.kevin.cn
10.0.8.44 10.0.8.44 26 54 0.09 d m elk-node01.kevin.cn
10.0.8.45 10.0.8.45 71 81 0.02 d m elk-node02.kevin.cn
Typical health states:
green : all primary and replica shards are allocated; the cluster is 100 % available.
yellow : all primary shards are allocated but some replicas are missing; data is safe but redundancy is reduced.
red : at least one primary shard is unassigned; the data in that shard is unavailable, so searches on the affected index return partial results and writes routed to that shard fail.
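As a rough illustration of how these states can drive alerting, the sketch below interprets a JSON document shaped like a GET /_cluster/health response. The sample payload is invented for this example, and describe_health is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json

# Illustrative payload shaped like a GET /_cluster/health response.
# The values are invented for this example.
SAMPLE_HEALTH = (
    '{"cluster_name": "kevin-elk", "status": "yellow", '
    '"active_primary_shards": 10, "unassigned_shards": 2}'
)

MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, some replicas are not",
    "red": "at least one primary shard is unassigned",
}


def describe_health(raw_json):
    """Return (status, explanation) for a cluster health JSON document."""
    health = json.loads(raw_json)
    status = health["status"]
    return status, MEANING.get(status, "unknown status")


if __name__ == "__main__":
    status, explanation = describe_health(SAMPLE_HEALTH)
    print(status + ": " + explanation)
```

In a real check, the JSON text would come from the cluster rather than from a constant.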
2. Elasticsearch Index Status
# curl -XGET 'http://10.0.8.47:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open 10.0.61.24-vfc-intf-ent-deposit.log-2019.03.15 5 1 159 0 324.9kb 162.4kb
... (additional rows omitted for brevity)
The index health uses the same green/yellow/red semantics as the cluster health.
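To pick out problem indices from a listing like the one above, a small parser is usually enough. The sketch below assumes the output was requested with ?v (so the first line is a header) and that every listed index is open, since closed indices may leave columns blank. The second sample row and the helper name non_green_indices are made up for illustration.

```python
# Sample text in the shape of `_cat/indices?v` output. The first data row
# comes from this guide; the second row is invented for illustration.
SAMPLE_INDICES = """\
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open 10.0.61.24-vfc-intf-ent-deposit.log-2019.03.15 5 1 159 0 324.9kb 162.4kb
yellow open example-app.log-2019.03.16 5 1 42 0 80.1kb 80.1kb
"""


def non_green_indices(cat_output):
    """Return (index, health) pairs for every index that is not green."""
    lines = [line for line in cat_output.splitlines() if line.strip()]
    header = lines[0].split()
    health_col = header.index("health")
    index_col = header.index("index")
    result = []
    for line in lines[1:]:
        fields = line.split()
        if fields[health_col] != "green":
            result.append((fields[index_col], fields[health_col]))
    return result


if __name__ == "__main__":
    for name, health in non_green_indices(SAMPLE_INDICES):
        print(health, name)
```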
3. Related Concepts
A node runs an Elasticsearch instance. A cluster is a group of nodes sharing the same cluster.name. One node is elected master to manage metadata such as index creation, node addition/removal, and shard allocation.
Shards are the basic units of data storage; each index is split into primary shards and optional replica shards. The routing value (by default the document _id) is hashed to determine the target primary shard:
shard = hash(routing) % number_of_primary_shards
Writes must succeed on the primary shard before being replicated to its replicas.
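The routing formula above can be sketched in a few lines. Note the assumption: zlib.crc32 is only a stand-in hash for this example. Elasticsearch uses its own Murmur3-based hash, so the shard numbers printed here will not match a real cluster. The function name pick_shard is hypothetical.

```python
import zlib


def pick_shard(routing, number_of_primary_shards):
    """Map a routing value to a primary shard number.

    zlib.crc32 is a stand-in hash chosen for this sketch; Elasticsearch
    uses a different hash, so real shard assignments will differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


if __name__ == "__main__":
    # Compare where the same documents land with 5 and with 6 primaries.
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, pick_shard(doc_id, 5), pick_shard(doc_id, 6))
```

Because the modulus is the number of primary shards, changing that number would send existing documents to different shards, which is why the primary shard count is fixed when an index is created.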
4. Red Cluster Diagnosis
When the cluster status is red, at least one primary shard is unassigned. The unassigned.reason column of _cat/shards reports why a shard is unassigned; common values include:
INDEX_CREATED – unassigned because the index was just created.
CLUSTER_RECOVERED – unassigned as a result of a full cluster recovery.
INDEX_REOPENED – unassigned because a closed index was reopened.
DANGLING_INDEX_IMPORTED – unassigned because a dangling index was imported.
ALLOCATION_FAILED – unassigned because an allocation attempt for the shard failed.
NODE_LEFT – unassigned because the node that held the shard left the cluster.
... (other reasons omitted for brevity)
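To see which of these reasons dominate, the unassigned shards can be grouped by reason. The sketch below assumes the text was fetched from _cat/shards with h=index,shard,prirep,state,unassigned.reason and without ?v, so there is no header line. The sample rows and the helper name unassigned_by_reason are invented for illustration.

```python
from collections import Counter

# Invented rows in the shape of
# `_cat/shards?h=index,shard,prirep,state,unassigned.reason` output.
SAMPLE_SHARDS = """\
example-app.log-2019.03.15 0 p STARTED
example-app.log-2019.03.15 0 r UNASSIGNED NODE_LEFT
example-app.log-2019.03.16 1 p UNASSIGNED ALLOCATION_FAILED
"""


def unassigned_by_reason(cat_shards_output, primaries_only=False):
    """Count unassigned shards per reason, optionally primaries only."""
    counts = Counter()
    for line in cat_shards_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        if primaries_only and fields[2] != "p":
            continue
        reason = fields[4] if len(fields) > 4 else "UNKNOWN"
        counts[reason] += 1
    return dict(counts)


if __name__ == "__main__":
    print(unassigned_by_reason(SAMPLE_SHARDS))
    print(unassigned_by_reason(SAMPLE_SHARDS, primaries_only=True))
```

Passing primaries_only=True narrows the result to the shards that actually make the cluster red.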
Typical remediation steps:
Ensure the master node starts first, then data nodes.
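Disable SELinux if it is not required, and either stop iptables or open the Elasticsearch ports (9200 for HTTP, 9300 for node-to-node transport) between nodes.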
Verify Elasticsearch configuration on data nodes.
Increase the system’s maximum open file descriptors.
Adjust the JVM heap size (ES_HEAP_SIZE) and indices.fielddata.cache.size.
Delete unused or stale indices to reduce shard count.
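For the heap adjustment step, a common rule of thumb (assumed here, not stated in this guide) is to give the JVM about half of the machine's RAM and to stay below roughly 32 GB so the JVM can keep using compressed object pointers. The sketch below encodes that rule; suggest_heap_gb and the 31 GB cap are illustrative choices, not values from the original case.

```python
def suggest_heap_gb(total_ram_gb, cap_gb=31):
    """Suggest a heap size in GB: half of RAM, capped at cap_gb.

    Both the 50% ratio and the 31 GB cap are rule-of-thumb assumptions.
    """
    if total_ram_gb < 2:
        raise ValueError("need at least 2 GB of RAM for this rule of thumb")
    return min(total_ram_gb // 2, cap_gb)


if __name__ == "__main__":
    for ram in (8, 16, 64, 128):
        print("RAM %d GB -> ES_HEAP_SIZE=%dg" % (ram, suggest_heap_gb(ram)))
```

The remaining RAM is left to the operating system's file cache, which Lucene relies on heavily.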
5. Case Study: ELK Cluster Health Issue
As log volume grew in a production ELK deployment, the number of indices increased and the nodes came under memory pressure. The cluster turned red, Kibana stopped displaying logs, and the Head plugin hung.
# curl -XGET 'http://10.0.8.47:9200/_cat/health?v'
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1554689492 10:11:32 kevin-elk red 3 3 3587 3447 0 0 0 - 100.0
Resolution steps included:
Increase nofile limit to 65535 and disable swap.
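Raise the JVM heap (e.g., ES_HEAP_SIZE=8g) and set bootstrap.mlockall: true so the heap is locked in memory and never swapped out (the setting is named bootstrap.memory_lock in Elasticsearch 5.0 and later).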
Set indices.fielddata.cache.size: 40% to limit fielddata memory.
Delete old indices (e.g., keep only the last month’s data).
Restart all Elasticsearch nodes.
After the fixes, the cluster eventually reported green once unassigned shards reached zero.
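The index cleanup step could be automated along these lines. This is a sketch that assumes index names end in a -YYYY.MM.DD date suffix, as in the listing in section 2. The helper name stale_indices, the 30-day default, and the example.log name are illustrative.

```python
import re
from datetime import date, timedelta

# Matches a trailing date such as "-2019.03.15".
DATE_SUFFIX = re.compile(r"-(\d{4})\.(\d{2})\.(\d{2})$")


def stale_indices(index_names, today, keep_days=30):
    """Return the names whose date suffix is older than keep_days."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in index_names:
        match = DATE_SUFFIX.search(name)
        if not match:
            continue  # no date suffix (e.g. .kibana): never selected
        year, month, day = (int(part) for part in match.groups())
        try:
            index_day = date(year, month, day)
        except ValueError:
            continue  # suffix looks like a date but is not a valid one
        if index_day < cutoff:
            stale.append(name)
    return stale


if __name__ == "__main__":
    names = [
        "10.0.61.24-vfc-intf-ent-deposit.log-2019.03.15",
        "example.log-2019.02.01",  # invented name for illustration
        ".kibana",
    ]
    print(stale_indices(names, today=date(2019, 4, 8)))
```

Each returned name could then be removed with a DELETE request to that index. Reviewing the list before deleting anything is a sensible precaution.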
6. Common Errors and Fixes
SettingsException : malformed elasticsearch.yml. YAML requires a space after the colon that separates a key from its value (e.g., node.name: elk-node03.kevin.cn).
StartupException : running as root. Create a dedicated elasticsearch user and run the service under that account.
JVM memory allocation failure : reduce heap size or increase system memory; edit jvm.options or /etc/sysconfig/elasticsearch.
Bootstrap checks failed : increase vm.max_map_count (e.g., to 655360) and raise nofile limits.
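A pre-flight check along the following lines can catch the limit-related failures before Elasticsearch starts. The thresholds are assumptions to adjust: 262144 is the minimum vm.max_map_count that the bootstrap check asks for, and 65535 is the nofile value used in this guide. The function names are hypothetical, and read_current_limits is Linux-only and reports the limits of the user running the script, so it should run as the Elasticsearch service account.

```python
MIN_MAX_MAP_COUNT = 262144  # minimum requested by the bootstrap check
MIN_NOFILE = 65535          # nofile value used in this guide


def limit_problems(max_map_count, nofile_soft):
    """Return a list of human-readable problems; empty means OK."""
    problems = []
    if max_map_count < MIN_MAX_MAP_COUNT:
        problems.append(
            "vm.max_map_count is %d, need at least %d"
            % (max_map_count, MIN_MAX_MAP_COUNT)
        )
    if nofile_soft < MIN_NOFILE:
        problems.append(
            "nofile soft limit is %d, need at least %d"
            % (nofile_soft, MIN_NOFILE)
        )
    return problems


def read_current_limits():
    """Read both limits for the current user on a Linux host."""
    import resource  # Unix-only module, imported lazily

    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read().strip())
    nofile_soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return max_map_count, nofile_soft
```

On a Linux host, limit_problems(*read_current_limits()) returns an empty list when both limits are high enough.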
7. Monitoring Elasticsearch
Simple shell commands can poll cluster health:
# curl -XGET 'http://10.0.8.47:9200/_cat/health?v'
... output shows status green/yellow/red ...
A Python script can automate this check and trigger alerts:
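import subprocess
# Request only the status column (green, yellow, or red) from _cat/health.
command = ['curl', '-s', '10.0.8.47:9200/_cat/health?h=status']
status = subprocess.check_output(command).decode().strip()
# Print 0 when the cluster is red and 1 otherwise, for the alerting system to consume.
print(0 if status == 'red' else 1)
8. Preventing Split‑Brain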
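Separate master-eligible nodes from data nodes, and set discovery.zen.minimum_master_nodes to a majority of the master-eligible nodes, that is (number of master-eligible nodes / 2) + 1 using integer division, to avoid split-brain scenarios. With three master-eligible nodes the value is 2. Elasticsearch 7.0 and later manage the voting quorum automatically and ignore this setting.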
# Master‑only node
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2
# Data node
node.master: false
node.data: true
After updating elasticsearch.yml, restart the nodes to apply the new roles.
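The quorum formula in this section is plain integer arithmetic, sketched below. The function name minimum_master_nodes is chosen for this example to mirror the setting name.

```python
def minimum_master_nodes(master_eligible_count):
    """Return the majority quorum: eligible // 2 + 1."""
    if master_eligible_count < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible_count // 2 + 1


if __name__ == "__main__":
    for eligible in (1, 2, 3, 5):
        print(eligible, "master-eligible ->", minimum_master_nodes(eligible))
```

With two master-eligible nodes the quorum is also 2, so losing either node leaves the cluster without a master. This is why an odd number of master-eligible nodes, typically three, is the usual choice.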
MaGe Linux Operations