Elasticsearch Performance Tuning: Configuration Settings to Boost Write Throughput
This article details practical Elasticsearch tuning steps (adjusting index buffers, thread pools, refresh intervals, and translog settings) that raised average write speed from 3,000 to 8,000 documents per second and kept the cluster stable under load.
Background
The original setup runs Elasticsearch 5.6.0 on three Alibaba Cloud ECS nodes (16 GB RAM, 4 CPUs, HDD each). Before optimization, write throughput averaged 3,000 docs/s and dropped sharply under stress testing, accompanied by GC pauses and OOM errors.
Production Configuration
Key configuration changes added to elasticsearch.yml:
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
# Search pool
thread_pool.search.size: 5
thread_pool.search.queue_size: 100
# Bulk pool
thread_pool.bulk.size: 16
thread_pool.bulk.queue_size: 300
# Index pool
thread_pool.index.size: 16
thread_pool.index.queue_size: 300
indices.fielddata.cache.size: 40%
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 30s
Template for log indices:
PUT /_template/elk
{
"order": 6,
"template": "logstash-*",
"settings": {
"number_of_replicas": 0,
"number_of_shards": 6,
"refresh_interval": "30s",
"index.translog.durability": "async",
"index.translog.sync_interval": "30s"
}
}
Optimization Parameter Details
Disable analysis for non-text fields: mark them not_analyzed (in 5.x, map them as keyword) to avoid unnecessary tokenization.
Disable the _all field: it is not needed for log/APM data.
Set the replica count to 0: logs are retained for only 7 days and the full data lives in Hadoop, so replicas can be omitted.
Use Elasticsearch-generated IDs: with auto-generated IDs, Elasticsearch can skip the lookup for an existing version of the document.
Increase index.refresh_interval to 30 s: this lowers refresh overhead when real-time visibility isn't required.
Limit segment merge threads to avoid heavy I/O on mechanical disks:
curl -XPUT 'your-es-host:9200/nginx_log-2018-03-20/_settings' -d '{
"index.merge.scheduler.max_thread_count" : 1
}'
This allows max_thread_count + 2 threads (here, 3) to operate on disk at once, balancing merge concurrency against disk I/O.
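To put the value 1 in context, here is a small sketch of the arithmetic. It assumes the default documented for Elasticsearch 5.x, max(1, min(4, processors / 2)), and reuses the max_thread_count + 2 rule stated above. The function names are illustrative and not part of any Elasticsearch API.

```python
def default_merge_threads(processors: int) -> int:
    """Default index.merge.scheduler.max_thread_count (assumed 5.x formula)."""
    return max(1, min(4, processors // 2))


def disk_threads(max_thread_count: int) -> int:
    """Threads that may operate on disk at once, per the max_thread_count + 2 rule."""
    return max_thread_count + 2


if __name__ == "__main__":
    cpus = 4  # CPU count of the ECS nodes described in the Background section
    default = default_merge_threads(cpus)
    print(f"default max_thread_count on {cpus} CPUs: {default}")
    print(f"disk threads at the default: {disk_threads(default)}")
    print(f"disk threads with max_thread_count=1: {disk_threads(1)}")
```

Under these assumptions, a 4-CPU node would default to 2 merge threads (up to 4 threads on disk), so setting the value to 1 lowers that ceiling to 3.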
Asynchronous translog (index.translog.durability: async and index.translog.sync_interval: 30s): the translog is fsynced every 30 s rather than on every request, so a crash can lose up to the last 30 s of writes. That is acceptable here because the logs are backed up in Hadoop.
Increase memory buffers: raise indices.memory.index_buffer_size to 20% and indices.memory.min_index_buffer_size to 96 MB to avoid frequent segment flushes.
Adjust the fielddata cache: around 15% suits aggregation-heavy queries; keep it low for log workloads, where aggregations are rare.
Extend the node fault-detection settings (ping timeout, retries, and interval) so that heavy network traffic during bulk ingestion does not cause nodes to be wrongly marked as failed.
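The first two items in the list above (no analysis for non-text fields, no _all field) can be expressed as mappings in the log template. The following is a minimal Python sketch that only builds the JSON fragment, assuming Elasticsearch 5.x mapping syntax. The function name build_log_mappings and the dynamic-template name strings_as_keyword are illustrative and not from the original setup. Mapping every string to keyword assumes no field needs full-text search.

```python
import json


def build_log_mappings() -> dict:
    """Return a "mappings" fragment for a 5.x index template (illustrative)."""
    return {
        "mappings": {
            # _default_ applies to every document type in 5.x indices
            "_default_": {
                # Drop the catch-all _all field, which log data does not need
                "_all": {"enabled": False},
                "dynamic_templates": [
                    {
                        # Assumption: every string field is an exact-value field
                        "strings_as_keyword": {
                            "match_mapping_type": "string",
                            "mapping": {"type": "keyword"},
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_log_mappings(), indent=2))
```

The printed fragment would sit alongside the "settings" block in the body of the elk template. A field that does need full-text search, such as a raw message field, would have to be mapped as text explicitly.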
Conclusion
Together, the listed settings raise average write speed to 8,000 docs/s. After a stress test, the cluster recovers within 30 minutes, with all metrics returning to normal.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.