Monitoring Elasticsearch Performance: Host‑Level System and Network Metrics, Cluster Health, and Resource Saturation
This article continues the Elasticsearch performance monitoring series, covering host-level system and network metrics, cluster health and node availability, and resource saturation and related errors. It offers practical guidance on disk space, I/O, CPU, network throughput, file descriptors, HTTP connections, thread pools, caches, pending tasks, and failed GET requests.
In the previous installment we covered query and indexing performance as well as memory allocation and garbage collection; this second part focuses on three additional monitoring categories: host‑level system and network metrics, cluster health and node availability, and resource saturation with related errors.
The content is a Chinese translation of Emily Chang's "How to monitor Elasticsearch performance" (original available on GitHub) and is reproduced from HULK technical discussions.
Host‑Level System and Network Metrics
Even though Elasticsearch exposes many internal metrics, you should also collect host‑level metrics from each node. Important areas include:
Disk space – keep at least 20% free; use Curator to delete large indices or add nodes to rebalance shards.
I/O utilization – heavy segment creation, search and merge cause intensive disk reads/writes; SSDs are recommended for write‑heavy clusters.
CPU utilization – monitor CPU usage separately for each node type (data, master, client); sustained spikes usually indicate heavy search or indexing load.
Network throughput – ensure inter‑node communication can keep up with replication and shard relocation; monitor byte‑rate metrics.
Open file descriptors – watch for limits (default 1024 on Linux); increase to values such as 64,000 for production clusters.
HTTP connections – a steadily rising number of open HTTP connections may indicate that clients are opening a new connection per request instead of reusing persistent (keep-alive) connections.
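As a minimal sketch of checking two of these host-level signals programmatically: the function below flags a node that is low on disk or close to its file-descriptor limit. The field names follow the shape of the Node Stats API response (`fs.total`, `process`); the sample document and thresholds are illustrative, not real cluster data.

```python
# Sketch: flag nodes low on disk or near their file-descriptor limit,
# based on the shape of a _nodes/stats response (sample data below).

def check_node(stats, min_free_pct=20.0, fd_warn_pct=80.0):
    """Return a list of warnings for one node's stats document."""
    warnings = []

    fs = stats["fs"]["total"]
    free_pct = 100.0 * fs["available_in_bytes"] / fs["total_in_bytes"]
    if free_pct < min_free_pct:
        warnings.append(f"disk: only {free_pct:.1f}% free (threshold {min_free_pct}%)")

    proc = stats["process"]
    fd_used_pct = 100.0 * proc["open_file_descriptors"] / proc["max_file_descriptors"]
    if fd_used_pct > fd_warn_pct:
        warnings.append(f"fds: {fd_used_pct:.1f}% of limit in use")

    return warnings

# Abbreviated sample node stats; real responses carry many more fields.
sample = {
    "fs": {"total": {"total_in_bytes": 100 * 1024**3,
                     "available_in_bytes": 15 * 1024**3}},
    "process": {"open_file_descriptors": 900,
                "max_file_descriptors": 1024},
}
warnings = check_node(sample)
```

In a real deployment the same check would run against each node's stats on a schedule, feeding whatever alerting pipeline you already use.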
Cluster Health and Node Availability
Cluster health states:
GREEN – all primary and replica shards are allocated; the cluster is fully operational.
YELLOW – at least one replica shard is unassigned; search results are still complete, but losing another node could mean losing data.
RED – at least one primary shard is missing; searches return partial results, and new documents cannot be indexed into the missing shard. Set alerts for prolonged YELLOW or RED states.
Shards may be in initializing or unassigned states during index creation or node restart; prolonged periods indicate instability.
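A simple health check can be built on top of the `_cluster/health` API. The sketch below assumes the documented response fields (`status`, `unassigned_shards`, `initializing_shards`); the sample response and thresholds are illustrative.

```python
# Sketch: derive alerts from a _cluster/health response (sample data below).

def health_alerts(health, max_unassigned=0):
    """Return a list of alert strings for a cluster health document."""
    alerts = []
    if health["status"] in ("yellow", "red"):
        alerts.append(f"cluster status is {health['status']}")
    if health["unassigned_shards"] > max_unassigned:
        alerts.append(f"{health['unassigned_shards']} unassigned shards")
    if health.get("initializing_shards", 0) > 0:
        alerts.append(f"{health['initializing_shards']} shards still initializing")
    return alerts

# Abbreviated sample of a _cluster/health response.
sample = {"status": "yellow", "unassigned_shards": 2, "initializing_shards": 1}
alerts = health_alerts(sample)
```

Because initializing and unassigned shards are normal during index creation or node restarts, a production check should only alert when these conditions persist across several polling intervals.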
Resource Saturation and Related Errors
Elasticsearch uses thread pools (search, index, merge, bulk) to manage CPU and memory. Monitor queue sizes and rejections; growing queues suggest the need to throttle requests, add CPUs, or scale out the cluster.
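One way to watch for queue growth and rejections is to scan the `thread_pool` section of each node's stats. The sketch below uses the `queue` and `rejected` fields as exposed by the Node Stats API; the sample data and warning threshold are illustrative.

```python
# Sketch: flag thread pools with non-empty queues or any rejections,
# from the thread_pool section of a node stats response.

def threadpool_pressure(thread_pools, queue_warn=50):
    """Return the pools whose queue depth or rejection count looks unhealthy."""
    flagged = {}
    for name, tp in thread_pools.items():
        if tp["rejected"] > 0 or tp["queue"] >= queue_warn:
            flagged[name] = {"queue": tp["queue"], "rejected": tp["rejected"]}
    return flagged

# Abbreviated sample: the search pool is backed up, bulk is healthy.
sample = {
    "search": {"queue": 120, "rejected": 7},
    "bulk":   {"queue": 0,   "rejected": 0},
}
flagged = threadpool_pressure(sample)
```

Since `rejected` is a cumulative counter, a real monitor should track its rate of change rather than its absolute value.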
Bulk request rejections often stem from overly large bulk payloads; implement back‑off strategies.
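A back-off strategy for rejected bulk requests can be as simple as retrying with an exponentially growing delay. In this sketch, `BulkRejected` and `flaky_send` are hypothetical stand-ins for a real client's rejection error and a real bulk sender:

```python
import time

class BulkRejected(Exception):
    """Hypothetical stand-in for a bulk-queue rejection (e.g. HTTP 429)."""

def send_bulk_with_backoff(send, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(), retrying with exponential back-off when the cluster rejects."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return send()
        except BulkRejected:
            if attempt == max_retries - 1:
                raise  # give up after the last retry
            sleep(delay)
            delay *= 2  # double the wait each time the queue pushes back

# Example: a sender that is rejected twice, then succeeds.
attempts = {"n": 0}
def flaky_send():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise BulkRejected()
    return "ok"

result = send_bulk_with_backoff(flaky_send, sleep=lambda _: None)
```

Shrinking the bulk payload on each retry, in addition to waiting longer, is another common refinement when rejections stem from oversized requests.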
Cache usage:
Fielddata cache – used for sorting and aggregations; can consume large heap memory. Since version 1.3 a fielddata circuit‑breaker limits usage to 60% of heap.
Filter cache – in versions before 2.0, filtered-query results are cached per segment based on segment size and access frequency (2.0 replaced this with the query cache); monitor eviction metrics on older versions.
Prefer doc values over fielddata where possible, as they reside on disk and reduce heap pressure.
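Before doc values became the default in 2.0, they had to be enabled per field in the mapping. The sketch below shows the shape of such a mapping as a Python dict; the field name is hypothetical:

```python
# Hypothetical mapping fragment enabling doc values for a sortable field.
# In Elasticsearch 2.0+ this is the default for not_analyzed fields.
mapping = {
    "properties": {
        "timestamp": {
            "type": "date",
            "doc_values": True,  # column-stride values stored on disk, off-heap
        }
    }
}
```

Fields backed by doc values can be sorted and aggregated on without loading fielddata into the heap, which is exactly the pressure the circuit breaker exists to contain.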
Pending Tasks
Pending tasks (cluster-state updates such as index creation and shard allocation) are processed only by the master node; a growing queue of pending tasks indicates the master cannot keep up and may threaten cluster stability.
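A sketch of flagging long-waiting tasks, assuming the response shape of the `_cluster/pending_tasks` API (`tasks` array with `time_in_queue_millis`); the sample tasks and threshold below are illustrative:

```python
# Sketch: surface tasks that have sat too long in the master's queue,
# from a _cluster/pending_tasks-shaped response (sample data below).

def stale_pending_tasks(response, max_wait_ms=1000):
    """Return the tasks that have waited longer than max_wait_ms."""
    return [t for t in response["tasks"]
            if t["time_in_queue_millis"] > max_wait_ms]

# Abbreviated sample response with one fast and one slow task.
sample = {"tasks": [
    {"source": "create-index [logs-07]", "time_in_queue_millis": 86},
    {"source": "shard-started",          "time_in_queue_millis": 2400},
]}
stale = stale_pending_tasks(sample)
```

An empty `tasks` list is the healthy steady state; alerting on any task that lingers for more than a second or two is a reasonable starting point.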
Failed GET Requests
A failed GET-by-ID request simply means the requested document ID does not exist; tracking the failure rate can still reveal application-level problems, since a well-behaved client should rarely ask for missing IDs.
Conclusion
Across the two articles we highlighted the most important Elasticsearch monitoring aspects for scaling and growth:
Query and indexing performance
Memory allocation and garbage collection
Host‑level system and network metrics
Cluster health and node availability
Resource saturation and related errors
By tracking these metrics you can identify the most relevant areas for your specific workload.
360 Tech Engineering