Five Key Steps to Diagnose Elasticsearch Write Performance Issues
This guide walks through five systematic checks—indexing delay vs refresh, thread‑pool rejections, segment merges and I/O, bulk size and refresh settings, and mapping/_source configuration—to pinpoint and resolve Elasticsearch write‑performance bottlenecks.
First, determine whether the latency is caused by indexing pressure or by the refresh cycle. Use
GET _cat/indices?v&h=index,pri,rep,docs.count,store.size,search.throttled
and
GET /_nodes/stats?pretty&filter_path=**.indexing
to inspect indexing.index_current, indexing.index_failed and indexing.delete_current. A consistently high index_current or rising index_failed indicates queuing or failures in the write path.
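As a rough illustration of reading those counters, the sketch below flattens an already-parsed node stats response into one row per node. It assumes the JSON shape returned when filter_path=**.indexing is applied (nodes, then node id, then indices.indexing); the function name and the sample values are made up for the example.

```python
from typing import Any, Dict, List


def summarize_indexing(nodes_stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a parsed `GET /_nodes/stats?filter_path=**.indexing` response."""
    rows = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        indexing = node.get("indices", {}).get("indexing", {})
        rows.append({
            "node": node_id,
            "index_current": indexing.get("index_current", 0),
            "index_failed": indexing.get("index_failed", 0),
            "delete_current": indexing.get("delete_current", 0),
        })
    # Busiest nodes first, so hot spots are easy to see.
    return sorted(rows, key=lambda r: r["index_current"], reverse=True)


# Hypothetical sample data, not taken from a real cluster.
sample = {
    "nodes": {
        "node-a": {"indices": {"indexing": {"index_current": 2, "index_failed": 0, "delete_current": 0}}},
        "node-b": {"indices": {"indexing": {"index_current": 41, "index_failed": 7, "delete_current": 1}}},
    }
}

for row in summarize_indexing(sample):
    print(row)
```

Sampling this every few seconds and comparing successive rows shows whether index_current stays high or index_failed keeps climbing.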
Next, assess the impact of refresh. Temporarily increase refresh_interval (e.g., to 30s, or to -1 to disable refresh entirely) and observe whether write throughput (QPS) or latency improves noticeably. If performance jumps, frequent refreshes are a bottleneck, especially for small batches with high QPS.
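A minimal sketch of that experiment, assuming the setting is changed through PUT /<index>/_settings and that throughput is measured separately by the client. The helper names are hypothetical; passing None produces a JSON null, which resets refresh_interval to its default.

```python
import json
from typing import Optional


def refresh_settings_body(interval: Optional[str]) -> str:
    """JSON body for `PUT /<index>/_settings`; None becomes null (reset to default)."""
    return json.dumps({"index": {"refresh_interval": interval}})


def relative_gain(before_docs_per_s: float, after_docs_per_s: float) -> float:
    """Fractional throughput change after the setting was altered."""
    if before_docs_per_s <= 0:
        raise ValueError("baseline throughput must be positive")
    return (after_docs_per_s - before_docs_per_s) / before_docs_per_s


print(refresh_settings_body("30s"))   # body for the test period
print(refresh_settings_body(None))    # body that restores the default afterwards
print(relative_gain(8000.0, 12000.0)) # throughput numbers here are invented
```

How large a gain counts as "noticeable" is a judgment call; the source gives no threshold.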
The second step examines thread‑pool saturation. Query the write and index pools with
GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed
GET _cat/thread_pool/index?v&h=node_name,name,active,queue,rejected,completed
and
GET _nodes/stats/thread_pool?pretty
The index pool exists only on older versions; it was removed in Elasticsearch 7.0, where all writes go through the write pool. Watch the rejected count trend and queue length. If rejections appear, a short‑term fix is to raise thread_pool.write.queue_size (or thread_pool.index.queue_size on older versions), which is a static node setting in elasticsearch.yml and needs a node restart; a long‑term solution is to add nodes or reduce write pressure.
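Because rejected is a cumulative counter, the trend matters more than the absolute value. The sketch below compares two snapshots taken some time apart, assuming they were fetched with format=json added to the _cat/thread_pool request (the cat APIs then return a list of objects with string values). The function name and sample rows are illustrative only.

```python
from typing import Dict, List, Tuple


def rejected_delta(before: List[Dict[str, str]],
                   after: List[Dict[str, str]]) -> Dict[Tuple[str, str], int]:
    """Growth in `rejected` per (node, pool) between two cat thread_pool snapshots."""
    baseline = {(r["node_name"], r["name"]): int(r["rejected"]) for r in before}
    deltas = {}
    for row in after:
        key = (row["node_name"], row["name"])
        # A negative delta usually means the node restarted and the counter reset.
        deltas[key] = int(row["rejected"]) - baseline.get(key, 0)
    return deltas


# Hypothetical snapshots taken a minute apart.
snap_1 = [
    {"node_name": "es-1", "name": "write", "active": "4", "queue": "12", "rejected": "100"},
    {"node_name": "es-2", "name": "write", "active": "1", "queue": "0", "rejected": "3"},
]
snap_2 = [
    {"node_name": "es-1", "name": "write", "active": "4", "queue": "180", "rejected": "250"},
    {"node_name": "es-2", "name": "write", "active": "2", "queue": "0", "rejected": "3"},
]

for key, delta in rejected_delta(snap_1, snap_2).items():
    print(key, delta)
```

A node whose delta keeps growing while its neighbours stay flat suggests uneven shard placement rather than cluster-wide overload.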
The third step focuses on segment merges and I/O. Retrieve segment information with GET _cat/segments?v and merge statistics via GET _nodes/stats/indices/merge?pretty (the metric in the URL is merge; the statistics appear under indices.merges in the response). High segment counts (hundreds to thousands per index) and a sustained non-zero merges.current or fast-growing merges.total_time_in_millis alongside high disk utilization point to merge‑related bottlenecks. Mitigations include increasing index.translog.flush_threshold_size, running _forcemerge during off‑peak hours, or upgrading disk I/O capacity.
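To put a number on "high segment counts", one option is to count rows of the cat segments output per index. This sketch assumes the rows come from GET _cat/segments?format=json (one object per segment, with index and prirep fields) and counts primary shards only so replicas are not double-counted. The threshold of 500 is an arbitrary placeholder, not a value from the source.

```python
from collections import Counter
from typing import Dict, List


def segments_per_index(rows: List[Dict[str, str]],
                       primaries_only: bool = True) -> Dict[str, int]:
    """Count segments per index from parsed `_cat/segments?format=json` rows."""
    counts: Counter = Counter()
    for row in rows:
        if primaries_only and row.get("prirep") != "p":
            continue
        counts[row["index"]] += 1
    return dict(counts)


def over_threshold(counts: Dict[str, int], threshold: int = 500) -> List[str]:
    """Index names whose segment count reaches the (hypothetical) threshold."""
    return sorted(name for name, n in counts.items() if n >= threshold)


# Hypothetical rows; a real response carries more columns per segment.
rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "segment": "_0"},
    {"index": "logs-1", "shard": "0", "prirep": "p", "segment": "_1"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "segment": "_0"},
    {"index": "metrics", "shard": "0", "prirep": "p", "segment": "_0"},
]

counts = segments_per_index(rows)
print(counts)
print(over_threshold(counts, threshold=2))
```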
The fourth step addresses bulk indexing strategy. Keep each bulk request between 5 MB and 15 MB (body size). For 1–2 KB documents, 2,000–5,000 documents per bulk request are typical; larger documents require fewer items to stay within the size window. During massive imports, set refresh_interval to -1, load the data, then restore the interval and issue a manual _refresh to reduce I/O and CPU spikes. Log bulk size, request count, and any 429/rejected errors; if rejections correlate with large bulk size or high concurrency, reduce bulk size or throttle concurrency.
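One way to stay inside the size window is to batch by serialized bytes rather than by document count. The sketch below builds NDJSON bodies for the _bulk endpoint using plain index actions; the function name, the 10 MB default, and the choice to send an oversized single document on its own are assumptions made for the example. Sending the bodies and handling 429 responses is left to the client.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def bulk_bodies(docs: Iterable[Dict[str, Any]], index: str,
                max_bytes: int = 10 * 1024 * 1024) -> Iterator[bytes]:
    """Yield NDJSON `_bulk` bodies, each at most `max_bytes` where possible."""
    action = json.dumps({"index": {"_index": index}})
    buf: List[bytes] = []
    size = 0
    for doc in docs:
        # Each item is an action line plus a source line, both newline-terminated.
        item = (action + "\n" + json.dumps(doc) + "\n").encode("utf-8")
        # Flush before this item would push the batch past the cap.
        if buf and size + len(item) > max_bytes:
            yield b"".join(buf)
            buf, size = [], 0
        buf.append(item)  # a single item larger than the cap is sent alone
        size += len(item)
    if buf:
        yield b"".join(buf)


# Tiny cap so the batching is visible with toy documents.
for body in bulk_bodies(({"n": i} for i in range(5)), "logs", max_bytes=100):
    print(len(body), body.count(b"\n") // 2, "docs")
```

Logging the byte length and document count of each body, as the loop above does, gives the data needed to correlate rejections with batch size.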
The fifth step reviews mapping and _source. Excessive dynamic mapping, deep nesting, or indexing fields that are never queried increase write cost. Set "index": false for non‑searchable fields and disable or limit _source (e.g., "enabled": false or use includes / excludes) when the original source is not needed. The trade‑off is that without the full _source, reindexing and the update and update-by-query APIs no longer work for that index. During bulk loading, temporarily set number_of_replicas to 0 to avoid replica overhead, then restore the desired replica count after the load.
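The settings above could be combined into a create-index body along the following lines. This is only a sketch: the field names are invented, the typeless mapping layout assumes Elasticsearch 7.x or later, and "dynamic": false (unmapped fields are kept in _source but not indexed) is one possible way to limit dynamic mapping that the source does not prescribe.

```python
import json
from typing import Any, Dict


def load_time_index_body() -> Dict[str, Any]:
    """Body for `PUT /<index>` tuned for a bulk load (hypothetical field names)."""
    return {
        "settings": {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}},
        "mappings": {
            "dynamic": False,                           # do not index unmapped fields
            "_source": {"excludes": ["body_text"]},     # searchable, but not stored in _source
            "properties": {
                "body_text": {"type": "text"},
                "trace_id": {"type": "keyword", "index": False},  # never searched
            },
        },
    }


def restore_settings_body(replicas: int) -> Dict[str, Any]:
    """Body for `PUT /<index>/_settings` after the load; null resets refresh_interval."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": None}}


print(json.dumps(load_time_index_body(), indent=2))
print(json.dumps(restore_settings_body(1)))
```

Excluding body_text from _source means it cannot be recovered by a later reindex, which is the trade-off the text describes.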
Finally, the article summarizes the steps in a table. It recommends that routine investigations start with the first two steps (indexing delay and thread‑pool rejections) to decide quickly whether the write path is blocked or refresh/merge is the limiting factor, and then proceed through the remaining checks as needed.
Mingyi World Elasticsearch
The leading WeChat public account for Elasticsearch fundamentals, advanced topics, and hands‑on practice. Join us to dive deep into the ELK Stack (Elasticsearch, Logstash, Kibana, Beats).
