Guidelines for Sizing and Benchmarking Elasticsearch Clusters
This article is a practical guide to allocating hardware resources, sizing an Elasticsearch cluster from data volume, and running indexing and search benchmarks. It offers formulas, test configurations, and performance conclusions to help engineers design stable, high-throughput clusters.
1. Hardware Resource Allocation
Performance depends on the tasks Elasticsearch performs and the platform it runs on. Key resources include:
Disk Storage – Prefer SSDs, especially for nodes handling search and indexing; use hot‑warm architecture to control costs; local disks are ideal on bare metal; Elasticsearch does not require RAID, and a single replica shard is sufficient for fault tolerance.
Memory – The JVM heap should use about 50% of available RAM; the remaining RAM is left to the OS filesystem cache, which speeds up full‑text search, aggregations, and sorting.
CPU – The number and speed of CPU cores determine average operation speed and peak throughput.
Network – Bandwidth and latency affect inter‑node communication and cross‑cluster features.
2. Determining Cluster Size Based on Data Volume
For metric and log use cases, estimate daily raw data, retention period, and required replica count. Add 5‑10% for error margin and 15% for disk headroom, plus a spare node for redundancy.
Formulas
Data total (GB) = daily_raw_data(GB) * retention_days * (replicas + 1)
Storage total (GB) = data_total(GB) * (1 + 0.15 + 0.1)
Number of data nodes = ROUNDUP(storage_total(GB) / (node_RAM(GB) * RAM_to_data_ratio)), where the ratio is the GB of data a node can hold per GB of RAM (roughly 30 for hot nodes and 160 for warm nodes, consistent with the examples below).
Examples:
Small cluster : 1 GB/day, 9 months retention, 8 GB RAM per node → 3 nodes.
Large cluster : 100 GB/day, 30 days hot tier, 12 months warm tier, 64 GB RAM per node → 5 hot nodes, 10 warm nodes.
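The formulas and examples above can be sketched as a small Python calculator. This is a minimal sketch, not an official Elastic tool; the RAM‑to‑data ratios (30 for hot, 160 for warm) and the 270/365‑day retention figures are assumptions chosen to reproduce the two examples, and the spare node is added only where the example implies one.

```python
import math

def data_total_gb(daily_gb, retention_days, replicas=1):
    # Data total (GB) = daily_raw_data * retention_days * (replicas + 1)
    return daily_gb * retention_days * (replicas + 1)

def storage_total_gb(data_gb, margin=0.10, headroom=0.15):
    # Add ~10% error margin and 15% disk headroom
    return data_gb * (1 + margin + headroom)

def data_nodes(storage_gb, node_ram_gb, ram_to_data_ratio, spare=0):
    # Each node holds roughly node_ram * ratio GB of data
    per_node_gb = node_ram_gb * ram_to_data_ratio
    return math.ceil(storage_gb / per_node_gb) + spare

# Small cluster: 1 GB/day, ~9 months (270 days), 8 GB RAM, assumed 1:30 hot ratio
small = data_nodes(storage_total_gb(data_total_gb(1, 270)), 8, 30)
# Large cluster, hot tier: 100 GB/day, 30 days, 64 GB RAM, 1:30 ratio, +1 spare
hot = data_nodes(storage_total_gb(data_total_gb(100, 30)), 64, 30, spare=1)
# Large cluster, warm tier: 12 months (365 days), assumed 1:160 warm ratio, +1 spare
warm = data_nodes(storage_total_gb(data_total_gb(100, 365)), 64, 160, spare=1)
print(small, hot, warm)  # → 3 5 10
```

Note that the ratios are the main lever: a warm node on cheap disk can carry far more data per GB of RAM than a hot node that must serve interactive queries.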
3. Benchmark Testing
Before going to production, run benchmarks using Rally. Run separate tests for indexing and for searching.
4. Index Benchmark
The test used a 3‑node cluster (8 vCPU, HDD, 32 GB RAM, 16 GB JVM heap per node) and a Metricbeat dataset: 1.2 GB, 1,079,600 docs.
The optimal bulk batch size was ≈ 12,000 docs (≈ 13.7 MB), indexing ~13,000 events/s with a single client. The best client count was 16, reaching ~62,000 events/s.
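To apply the batch-size finding in an indexing client, documents have to be grouped into fixed-size bulk requests. The helper below is a hypothetical sketch (not part of the article's test harness): it splits any document stream into batches of the ~12,000‑doc size found optimal above, which could then be fed to an Elasticsearch bulk API call.

```python
from itertools import islice

def bulk_batches(docs, batch_size=12_000):
    """Yield lists of up to batch_size docs, one list per _bulk request."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# The 1,079,600-doc Metricbeat dataset at 12,000 docs per request ≈ 90 bulk requests
n_requests = sum(1 for _ in bulk_batches(range(1_079_600)))
print(n_requests)  # → 90
```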
Results:
1 node, 1 shard → 22,000 events/s
2 nodes, 2 shards → 43,000 events/s
3 nodes, 3 shards → 62,000 events/s
5. Search Benchmark
Target: 20 clients at 1,000 operations per second (OPS). Queries were evaluated on the Metricbeat and HTTP logs datasets.
Observations:
For Metricbeat, the auto-date-histogram-with-tz query has the longest service time.
For HTTP logs, desc_sort_timestamp and asc_sort_timestamp are slower.
Running queries in parallel increases the 90th‑percentile service time.
When indexing and searching run concurrently, service time rises noticeably.
A combined run (32 indexing clients, 20 search clients) sustained an indexing throughput of ~173,000 events/s alongside 1,000 search requests per second.
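The 90th‑percentile service time under concurrent clients can be measured with a simple harness. This is a minimal sketch, not Rally itself: `query_fn` stands in for a real search call, and the thread counts here are illustrative.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def service_times(query_fn, clients=20, requests_per_client=50):
    """Run query_fn from several concurrent clients; collect per-request times."""
    def worker(_):
        times = []
        for _ in range(requests_per_client):
            t0 = time.perf_counter()
            query_fn()
            times.append(time.perf_counter() - t0)
        return times
    with ThreadPoolExecutor(max_workers=clients) as pool:
        results = pool.map(worker, range(clients))
    return [t for ts in results for t in ts]

def p90(times):
    # 90th-percentile service time (the tail-latency metric discussed above)
    return statistics.quantiles(times, n=10)[-1]

# Stand-in query: replace the lambda with a real Elasticsearch search call
times = service_times(lambda: time.sleep(0.001), clients=4, requests_per_client=10)
print(f"p90 = {p90(times) * 1000:.1f} ms")
```

Comparing the p90 from a search-only run against one taken while bulk indexing is in progress makes the interference described above directly visible.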
6. Conclusion
With this methodology you can systematically derive the required node count from data volume. Benchmark with realistic data and query patterns, and keep a safety margin plus a spare node when planning for future growth.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.