Monitoring Elasticsearch Performance: Host‑Level System and Network Metrics, Cluster Health, and Resource Saturation
This article continues the Elasticsearch performance monitoring series, covering host-level system and network metrics, cluster health and node availability, and resource saturation and related errors. It offers practical guidance on disk space, I/O, CPU, network throughput, file descriptors, HTTP connections, thread pools, caches, pending tasks, and failed GET requests.
In the previous installment we covered query and indexing performance as well as memory allocation and garbage collection; this second part focuses on three additional monitoring categories: host‑level system and network metrics, cluster health and node availability, and resource saturation with related errors.
The content is a Chinese translation of Emily Chang's "How to monitor Elasticsearch performance" (original available on GitHub) and is reproduced from HULK technical discussions.
Host‑Level System and Network Metrics
Even though Elasticsearch exposes many internal metrics, you should also collect host‑level metrics from each node. Important areas include:
Disk space – keep at least 20% free; use Curator to delete large indices or add nodes to rebalance shards.
I/O utilization – heavy segment creation, search and merge cause intensive disk reads/writes; SSDs are recommended for write‑heavy clusters.
CPU utilization – monitor CPU usage separately for each node type (data, master, client); sustained spikes usually indicate heavy search or indexing load.
Network throughput – ensure inter‑node communication can keep up with replication and shard relocation; monitor byte‑rate metrics.
Open file descriptors – watch for limits (default 1024 on Linux); increase to values such as 64,000 for production clusters.
HTTP connections – a steadily rising number of open HTTP connections may indicate that clients are opening a new connection per request instead of reusing persistent (keep-alive) connections.
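As a minimal sketch of checking two of these host-level signals programmatically: the function below flags a node that is low on disk or close to its file-descriptor limit. The field names follow the shape of the Node Stats API response (`fs.total`, `process`); the sample document and thresholds are illustrative, not real cluster data.

```python
# Sketch: flag nodes low on disk or near their file-descriptor limit,
# based on the shape of a _nodes/stats response (sample data below).

def check_node(stats, min_free_pct=20.0, fd_warn_pct=80.0):
    """Return a list of warnings for one node's stats document."""
    warnings = []

    fs = stats["fs"]["total"]
    free_pct = 100.0 * fs["available_in_bytes"] / fs["total_in_bytes"]
    if free_pct < min_free_pct:
        warnings.append(f"disk: only {free_pct:.1f}% free (threshold {min_free_pct}%)")

    proc = stats["process"]
    fd_used_pct = 100.0 * proc["open_file_descriptors"] / proc["max_file_descriptors"]
    if fd_used_pct > fd_warn_pct:
        warnings.append(f"fds: {fd_used_pct:.1f}% of limit in use")

    return warnings

# Abbreviated sample node stats; real responses carry many more fields.
sample = {
    "fs": {"total": {"total_in_bytes": 100 * 1024**3,
                     "available_in_bytes": 15 * 1024**3}},
    "process": {"open_file_descriptors": 900,
                "max_file_descriptors": 1024},
}
warnings = check_node(sample)
```

In a real deployment the same check would run against each node's stats on a schedule, feeding whatever alerting pipeline you already use.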
Cluster Health and Node Availability
Cluster health states:
GREEN – all primary and replica shards are allocated; the cluster is fully operational.
YELLOW – at least one replica shard is unassigned; search results are still complete, but losing another node could mean losing data.
RED – at least one primary shard is missing; searches return partial results, and new documents cannot be indexed into the missing shard. Set alerts for prolonged YELLOW or RED states.
Shards may be in initializing or unassigned states during index creation or node restart; prolonged periods indicate instability.
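A simple health check can be built on top of the `_cluster/health` API. The sketch below assumes the documented response fields (`status`, `unassigned_shards`, `initializing_shards`); the sample response and thresholds are illustrative.

```python
# Sketch: derive alerts from a _cluster/health response (sample data below).

def health_alerts(health, max_unassigned=0):
    """Return a list of alert strings for a cluster health document."""
    alerts = []
    if health["status"] in ("yellow", "red"):
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > max_unassigned:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    if health.get("initializing_shards", 0) > 0:
        alerts.append(f"{health['initializing_shards']} shards still initializing")
    return alerts

# Abbreviated sample of a _cluster/health response.
sample = {"status": "yellow", "unassigned_shards": 2, "initializing_shards": 1}
alerts = health_alerts(sample)
```

Because initializing and unassigned shards are normal during index creation or node restarts, a production check should only alert when these conditions persist across several polling intervals.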
Resource Saturation and Related Errors
Elasticsearch uses thread pools (search, index, merge, bulk) to manage CPU and memory. Monitor queue sizes and rejections; growing queues suggest the need to throttle requests, add CPUs, or scale out the cluster.
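One way to watch for queue growth and rejections is to scan the `thread_pool` section of each node's stats. The sketch below uses the `queue` and `rejected` fields as exposed by the Node Stats API; the sample data and warning threshold are illustrative.

```python
# Sketch: flag thread pools with non-empty queues or any rejections,
# from the thread_pool section of a node stats response.

def threadpool_pressure(thread_pools, queue_warn=50):
    """Return the pools whose queue depth or rejection count looks unhealthy."""
    flagged = {}
    for name, tp in thread_pools.items():
        if tp["rejected"] > 0 or tp["queue"] >= queue_warn:
            flagged[name] = {"queue": tp["queue"], "rejected": tp["rejected"]}
    return flagged

# Abbreviated sample: the search pool is backed up, bulk is healthy.
sample = {
    "search": {"queue": 120, "rejected": 7},
    "bulk":   {"queue": 0,   "rejected": 0},
}
flagged = threadpool_pressure(sample)
```

Since `rejected` is a cumulative counter, a real monitor should track its rate of change rather than its absolute value.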
Bulk request rejections often stem from overly large bulk payloads; implement back‑off strategies.
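A back-off strategy for rejected bulk requests can be as simple as retrying with an exponentially growing delay. In this sketch, `BulkRejected` and `flaky_send` are hypothetical stand-ins for a real client's rejection error and a real bulk sender:

```python
import time

class BulkRejected(Exception):
    """Hypothetical stand-in for a bulk-queue rejection (e.g. HTTP 429)."""

def send_bulk_with_backoff(send, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(), retrying with exponential back-off when the cluster rejects."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return send()
        except BulkRejected:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            sleep(delay)
            delay *= 2  # double the wait each time the queue pushes back

# Example: a sender that is rejected twice, then succeeds.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise BulkRejected()
    return "ok"

result = send_bulk_with_backoff(flaky_send, sleep=lambda _: None)
```

Shrinking the bulk payload on each retry, in addition to waiting longer, is another common refinement when rejections stem from oversized requests.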
Cache usage:
Fielddata cache – used for sorting and aggregations; can consume large heap memory. Since version 1.3 a fielddata circuit‑breaker limits usage to 60% of heap.
Filter cache – in versions before 2.0, filtered-query results are cached per segment based on segment size and access frequency (2.0 replaced this with the query cache); monitor eviction metrics on older versions.
Prefer doc values over fielddata where possible, as they reside on disk and reduce heap pressure.
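Before doc values became the default in 2.0, they had to be enabled per field in the mapping. The sketch below shows the shape of such a mapping as a Python dict; the field name is hypothetical:

```python
# Hypothetical mapping fragment enabling doc values for a sortable field.
# In Elasticsearch 2.0+ this is the default for not_analyzed fields.
mapping = {
    "properties": {
        "timestamp": {
            "type": "date",
            "doc_values": True,  # column-stride values stored on disk, off-heap
        }
    }
}
```

Fields backed by doc values can be sorted and aggregated on without loading fielddata into the heap, which is exactly the pressure the circuit breaker exists to contain.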
Pending Tasks
Pending tasks (cluster-state updates such as index creation and shard allocation) are processed only by the master node; a growing queue of pending tasks indicates the master cannot keep up and may threaten cluster stability.
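A sketch of flagging long-waiting tasks, assuming the response shape of the `_cluster/pending_tasks` API (`tasks` array with `time_in_queue_millis`); the sample tasks and threshold below are illustrative:

```python
# Sketch: surface tasks that have sat too long in the master's queue,
# from a _cluster/pending_tasks-shaped response (sample data below).

def stale_pending_tasks(response, max_wait_ms=1000):
    """Return the tasks that have waited longer than max_wait_ms."""
    return [t for t in response["tasks"]
            if t["time_in_queue_millis"] > max_wait_ms]

# Abbreviated sample response with one fast and one slow task.
sample = {"tasks": [
    {"source": "create-index [logs-07]", "time_in_queue_millis": 86},
    {"source": "shard-started",          "time_in_queue_millis": 2400},
]}
stale = stale_pending_tasks(sample)
```

An empty `tasks` list is the healthy steady state; alerting on any task that lingers for more than a second or two is a reasonable starting point.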
Failed GET Requests
A failed GET-by-ID request simply means the requested document ID does not exist; tracking the failure rate can still reveal application-level problems, since a well-behaved client should rarely ask for missing IDs.
Conclusion
Across the two articles we highlighted the most important Elasticsearch monitoring aspects for scaling and growth:
Query and indexing performance
Memory allocation and garbage collection
Host‑level system and network metrics
Cluster health and node availability
Resource saturation and related errors
By tracking these metrics you can identify the most relevant areas for your specific workload.
360 Tech Engineering