
Master Elasticsearch Monitoring: Key System, Cluster, and Resource Metrics Explained

This article expands on Elasticsearch performance monitoring by detailing host‑level system and network metrics, cluster health and node availability, resource saturation, thread‑pool behavior, cache usage, and common error indicators, offering practical guidance for maintaining a stable and efficient search cluster.

360 Zhihui Cloud Developer

Previous article recap

The first part covered two monitoring areas: query and indexing performance, and memory allocation with garbage collection.

Host‑level system and network metrics

Beyond Elasticsearch‑specific metrics, you should collect host‑level data from each node.

Disk space

If free disk space falls below 20%, delete large indices (e.g., with Curator) or add nodes to redistribute shards; keep in mind that analyzed fields consume more disk space than non-analyzed fields.
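The 20% rule translates directly into an alert check. A minimal sketch (the function names and the sample capacity figures are illustrative, not part of any Elasticsearch API):

```python
def disk_free_pct(total_bytes: int, free_bytes: int) -> float:
    """Return free disk space as a percentage of total capacity."""
    return 100.0 * free_bytes / total_bytes

def needs_attention(total_bytes: int, free_bytes: int,
                    threshold_pct: float = 20.0) -> bool:
    """True when free space on a node has fallen below the alert threshold."""
    return disk_free_pct(total_bytes, free_bytes) < threshold_pct

# A node with 500 GB capacity and 80 GB free is at 16% free — below the line.
print(needs_attention(500 * 1024**3, 80 * 1024**3))  # True
```

In practice you would feed this with the per-node `disk.total` and `disk.avail` figures that Elasticsearch exposes, and page only after the condition holds for several consecutive readings.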

I/O utilization

Heavy segment creation, search, and merging cause significant disk reads/writes; SSDs are recommended for write‑heavy clusters.

CPU utilization

Monitor CPU usage per node type (data, master, client) with visual charts; rising CPU often indicates heavy search or indexing load, prompting alerts and possible node scaling.

Network throughput

Track bytes sent/received between nodes to ensure the network can handle replication and shard relocation traffic.

Open file descriptors

File descriptors support inter‑node communication and client connections; if usage exceeds 80 % of the limit, increase the system’s max file descriptor count (e.g., to 64 000 on Linux).
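The nodes stats API reports both the open and the maximum file-descriptor counts per node, which makes the 80% check straightforward. A sketch, assuming a payload shaped like the `process` section of a nodes-stats response (the sample values are made up):

```python
def fd_usage_ratio(stats: dict) -> dict:
    """Map node name -> open/max file-descriptor ratio from a nodes-stats payload."""
    ratios = {}
    for node in stats["nodes"].values():
        proc = node["process"]
        ratios[node["name"]] = (proc["open_file_descriptors"]
                                / proc["max_file_descriptors"])
    return ratios

sample = {
    "nodes": {
        "abc123": {
            "name": "data-node-1",
            "process": {"open_file_descriptors": 52000,
                        "max_file_descriptors": 64000},
        }
    }
}
ratios = fd_usage_ratio(sample)
over_limit = {name for name, r in ratios.items() if r > 0.8}
print(over_limit)  # {'data-node-1'} — 52000/64000 is 81.25%
```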

HTTP connections

All non‑Java clients use the RESTful HTTP API; a growing number of open HTTP connections may indicate missing persistent connections and added latency.

Cluster health and node availability

Cluster health status

YELLOW means at least one replica shard is unassigned; RED indicates a primary shard loss, leading to partial search results and inability to index new documents. Set alerts for prolonged YELLOW or any RED state.
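The alerting rule above — page on any RED, page on YELLOW only once it has persisted — can be sketched as a small classifier. The function name, the grace period, and the severity labels are illustrative choices, not Elasticsearch conventions:

```python
def health_severity(status: str, seconds_in_state: int,
                    yellow_grace: int = 300) -> str:
    """Classify a cluster-health reading.

    RED pages immediately; YELLOW pages only after it has persisted
    past a grace period (shard recovery often clears it on its own).
    """
    if status == "red":
        return "page"
    if status == "yellow":
        return "page" if seconds_in_state > yellow_grace else "watch"
    return "ok"

print(health_severity("yellow", 60))   # watch — give recovery a chance
print(health_severity("yellow", 600))  # page — replicas have been missing too long
print(health_severity("red", 0))       # page — primaries are down
```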

Initializing and unassigned shards

Shards stay in initializing or unassigned states while the master assigns them; prolonged periods suggest cluster instability.

Resource saturation and related errors

Thread‑pool queues and rejections

Key thread pools include search, index, merge, and bulk. Queue size shows pending requests; when a pool reaches its max queue, further requests are rejected. Monitor queue growth and rejections to decide on scaling or throttling.
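Because rejection counters are cumulative, what matters is the delta between two polls of the thread-pool stats. A minimal sketch of that comparison (the snapshot format here is a simplified stand-in for the real per-pool stats):

```python
def new_rejections(prev: dict, curr: dict) -> dict:
    """Per-pool increase in rejected requests between two stat snapshots.

    Pools whose counter did not grow are omitted; a non-empty result
    means the cluster rejected work during the polling interval.
    """
    return {pool: curr[pool] - prev.get(pool, 0)
            for pool in curr
            if curr[pool] - prev.get(pool, 0) > 0}

prev_snapshot = {"search": 0, "bulk": 12}
curr_snapshot = {"search": 0, "bulk": 47}
print(new_rejections(prev_snapshot, curr_snapshot))  # {'bulk': 35}
```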

Bulk request queue and rejections

Bulk requests should be used instead of many single requests; rejections often stem from overly large bulk payloads. Implement back‑off strategies to handle them gracefully.
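A common back-off strategy is exponential delay with jitter: retry the rejected bulk request after a wait that doubles on each attempt. A sketch, assuming `send_bulk` is your own function that returns the HTTP status code of the bulk call (429 is what a rejection surfaces as):

```python
import random
import time

def bulk_with_backoff(send_bulk, payload, max_retries: int = 5,
                      base_delay: float = 0.5):
    """Retry a bulk request with exponential back-off plus jitter
    whenever the cluster rejects it (HTTP 429)."""
    for attempt in range(max_retries):
        status = send_bulk(payload)
        if status != 429:
            return status
        # 0.5s, 1s, 2s, ... plus a little jitter to avoid retry stampedes.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("bulk request still rejected after retries")
```

Shrinking the bulk payload on each retry is a worthwhile refinement, since oversized batches are a frequent cause of the rejections in the first place.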

Cache usage metrics

Elasticsearch employs two main caches: fielddata (used for sorting and aggregations) and filter (used for cached filtered queries). Fielddata can consume large heap memory; limit it to ~20 % of heap and prefer doc values when possible. Filter cache behavior changed after version 2.0, with automatic eviction based on segment size and usage frequency.

Fielddata cache eviction

When fielddata reaches the configured heap limit, the least‑recently used data is evicted; consider limiting fielddata to 20 % of heap and using doc values to reduce pressure.
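The 20% cap is set via the `indices.fielddata.cache.size` setting in `elasticsearch.yml`; a sketch of the relevant fragment:

```yaml
# elasticsearch.yml — cap fielddata at 20% of heap so LRU eviction
# kicks in before the cache can monopolize memory
indices.fielddata.cache.size: 20%
```

Doc values, by contrast, are built at index time and live on disk rather than on the heap, which is why preferring them over fielddata relieves memory pressure for sorting and aggregations.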

Filter cache eviction

Relevant only for Elasticsearch < 2.0; frequent evictions may indicate excessive creation of new filters. Optimize queries (e.g., use boolean queries) to improve cache reuse.

Pending tasks

Handled by the master node, pending tasks include index creation and shard allocation. A rising count indicates the master is overloaded and may signal cluster instability.

Failed GET requests

GET‑by‑ID failures mean the document was not found; while not common, monitoring them can reveal indexing gaps.

Conclusion

Across the two articles we covered the most important Elasticsearch monitoring aspects: query and indexing performance, memory allocation and GC, host‑level system and network metrics, cluster health and node availability, and resource saturation with related errors. Monitoring these metrics helps you identify the most relevant areas for your specific workload and keep the cluster stable as it grows.

Tags: backend, operations, elasticsearch, metrics, performance monitoring, cluster health
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
