Master Elasticsearch Monitoring: Key System, Cluster, and Resource Metrics Explained
This article expands on Elasticsearch performance monitoring by detailing host‑level system and network metrics, cluster health and node availability, resource saturation, thread‑pool behavior, cache usage, and common error indicators, offering practical guidance for maintaining a stable and efficient search cluster.
Previous article recap
The first part covered two monitoring areas: query and indexing performance, and memory allocation with garbage collection.
Host‑level system and network metrics
Beyond Elasticsearch‑specific metrics, you should collect host‑level data from each node.
Disk space
If free disk space falls below 20%, delete or archive large indices (e.g., with Curator) or add nodes so shards can be redistributed; note that analyzed fields consume considerably more disk space than non-analyzed fields.
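A disk-space alert like the one described above can be sketched as a simple threshold check. This is a minimal illustration: the function name, the 20% threshold default, and the sample numbers are all invented here, though the `free_in_bytes`/`total_in_bytes` field names mirror the shape of Elasticsearch's node filesystem stats.

```python
def disk_alerts(nodes, threshold=0.20):
    """Return names of nodes whose free-disk ratio is below threshold."""
    alerts = []
    for name, fs in nodes.items():
        free_ratio = fs["free_in_bytes"] / fs["total_in_bytes"]
        if free_ratio < threshold:
            alerts.append(name)
    return alerts

# Example with made-up numbers (500 GB disks):
sample = {
    "data-node-1": {"total_in_bytes": 500_000_000_000, "free_in_bytes": 80_000_000_000},   # 16% free
    "data-node-2": {"total_in_bytes": 500_000_000_000, "free_in_bytes": 200_000_000_000},  # 40% free
}
print(disk_alerts(sample))  # -> ['data-node-1']
```

In practice you would feed this from a periodic poll of each node's filesystem stats and page an operator (or trigger index cleanup) on any non-empty result.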
I/O utilization
Heavy segment creation, search, and merging cause significant disk reads/writes; SSDs are recommended for write‑heavy clusters.
CPU utilization
Monitor CPU usage per node type (data, master, client) with visual charts; rising CPU often indicates heavy search or indexing load, prompting alerts and possible node scaling.
Network throughput
Track bytes sent/received between nodes to ensure the network can handle replication and shard relocation traffic.
Open file descriptors
File descriptors support inter-node communication and client connections; if usage exceeds 80% of the limit, increase the system's maximum file descriptor count (e.g., to 64,000 on Linux).
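The 80% rule above reduces to a one-line ratio check; a sketch, with a hypothetical function name and sample counts:

```python
def fd_usage_alert(open_fds, max_fds, threshold=0.80):
    """True when open file descriptors exceed the given fraction of the limit."""
    return open_fds / max_fds > threshold

# 55,000 of 64,000 descriptors in use (~86%) should trip the alert.
print(fd_usage_alert(55_000, 64_000))  # -> True
```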
HTTP connections
All non‑Java clients use the RESTful HTTP API; a growing number of open HTTP connections may indicate missing persistent connections and added latency.
Cluster health and node availability
Cluster health status
YELLOW means at least one replica shard is unassigned; RED indicates a primary shard loss, leading to partial search results and inability to index new documents. Set alerts for prolonged YELLOW or any RED state.
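The alerting rule above can be expressed as a small classifier over the cluster health response. The `status` field name matches the real `/_cluster/health` API; the sample payload and the alert strings are illustrative assumptions.

```python
def health_alert(health):
    """Map a /_cluster/health-style response to an alert severity, or None."""
    status = health["status"]
    if status == "red":
        return "page: primary shard(s) unassigned"
    if status == "yellow":
        return "warn: replica shard(s) unassigned"
    return None  # green: no action needed

sample = {"status": "yellow", "unassigned_shards": 3}
print(health_alert(sample))  # -> warn: replica shard(s) unassigned
```

A real monitor would also track *how long* the cluster stays YELLOW, since a brief YELLOW during node restarts is normal while a prolonged one is not.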
Initializing and unassigned shards
Shards stay in initializing or unassigned states while the master assigns them; prolonged periods suggest cluster instability.
Resource saturation and related errors
Thread‑pool queues and rejections
Key thread pools include search, index, merge, and bulk. Queue size shows pending requests; when a pool reaches its max queue, further requests are rejected. Monitor queue growth and rejections to decide on scaling or throttling.
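Because rejection counters are cumulative, what matters is the *increase* between two polling samples. A sketch of that delta computation, with invented sample values:

```python
def rejection_deltas(prev, curr):
    """Per-pool increase in rejected requests between two cumulative samples."""
    return {pool: curr[pool] - prev.get(pool, 0)
            for pool in curr if curr[pool] > prev.get(pool, 0)}

# Cumulative rejection counts from two consecutive polls:
prev = {"search": 0, "bulk": 12}
curr = {"search": 0, "bulk": 40}
print(rejection_deltas(prev, curr))  # -> {'bulk': 28}
```

A steadily growing delta on a pool is the signal to either scale that resource or throttle the corresponding workload.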
Bulk request queue and rejections
Bulk requests should be used instead of many single requests; rejections often stem from overly large bulk payloads. Implement back‑off strategies to handle them gracefully.
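A back-off strategy for rejected bulk requests is commonly implemented as exponential delay with jitter. The sketch below assumes a caller-supplied `send_bulk` function that returns True on success and False on rejection (e.g., an HTTP 429); the retry counts and delays are illustrative.

```python
import random
import time

def bulk_with_backoff(send_bulk, payload, max_retries=5):
    """Retry a rejected bulk request with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        if send_bulk(payload):
            return True
        # 0.1s, 0.2s, 0.4s, ... plus a small random jitter to avoid
        # synchronized retries from many clients.
        time.sleep((2 ** attempt) * 0.1 + random.random() * 0.05)
    return False

# Demo: a sender that rejects twice, then succeeds.
state = {"attempts": 0}
def flaky_sender(payload):
    state["attempts"] += 1
    return state["attempts"] >= 3

print(bulk_with_backoff(flaky_sender, [{"index": {}}]))  # -> True
```

Shrinking the bulk payload on each retry is a useful refinement, since oversized payloads are a common cause of the rejection in the first place.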
Cache usage metrics
Elasticsearch employs two main caches: fielddata (used for sorting and aggregations) and filter (used to cache filtered queries). Fielddata can consume a large share of the heap; limit it to roughly 20% of heap and prefer doc values where possible. Filter cache behavior changed in version 2.0: entries are now evicted automatically based on segment size and usage frequency.
Fielddata cache eviction
When fielddata reaches the configured heap limit, the least‑recently used data is evicted; consider limiting fielddata to 20 % of heap and using doc values to reduce pressure.
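The 20% cap mentioned above is set via the `indices.fielddata.cache.size` setting in `elasticsearch.yml` (the exact value is a starting point to tune, not a universal rule):

```yaml
# elasticsearch.yml — cap fielddata at 20% of the JVM heap so LRU eviction
# kicks in before fielddata crowds out other heap users.
indices.fielddata.cache.size: 20%
```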
Filter cache eviction
Relevant only for Elasticsearch < 2.0; frequent evictions may indicate excessive creation of new filters. Optimize queries (e.g., use boolean queries) to improve cache reuse.
Pending tasks
Handled by the master node, pending tasks include index creation and shard allocation. A rising count indicates the master is overloaded and may signal cluster instability.
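A pending-tasks check can be sketched as a count with a warning threshold. The `tasks` list shape loosely mirrors the `/_cluster/pending_tasks` API; the threshold and sample task are assumptions for illustration.

```python
def pending_task_summary(response, warn_at=10):
    """Summarize a /_cluster/pending_tasks-style response."""
    tasks = response["tasks"]
    return {"count": len(tasks), "alert": len(tasks) > warn_at}

sample = {"tasks": [{"source": "create-index [logs-2024]", "priority": "URGENT"}]}
print(pending_task_summary(sample))  # -> {'count': 1, 'alert': False}
```

As with cluster status, a briefly non-empty queue is normal; a count that keeps rising across polls is the signal of an overloaded master.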
Failed GET requests
GET‑by‑ID failures mean the document was not found; while not common, monitoring them can reveal indexing gaps.
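A failure *ratio* is more useful to monitor than a raw count. The `exists_total`/`missing_total` field names match the `get` section of Elasticsearch's node indices stats; the sample numbers are invented.

```python
def get_failure_ratio(stats):
    """Fraction of GET-by-ID requests that found no document."""
    total = stats["exists_total"] + stats["missing_total"]
    return stats["missing_total"] / total if total else 0.0

sample = {"exists_total": 980, "missing_total": 20}
print(get_failure_ratio(sample))  # -> 0.02
```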
Conclusion
Across the two articles we covered the most important Elasticsearch monitoring aspects: query and indexing performance, memory allocation and GC, host‑level system and network metrics, cluster health and node availability, and resource saturation with related errors. Monitoring these metrics helps you identify the most relevant areas for your specific workload and keep the cluster stable as it grows.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.