
Benchmarking and Sizing Your Elasticsearch Cluster for Logs and Metrics

This article explains how to assess hardware resources, calculate required Elasticsearch cluster size based on data volume, and perform indexing and search benchmark tests to ensure stable performance and optimal throughput for log and metric workloads in production environments.


New users can quickly set up a highly available Elasticsearch cluster, but production deployments require careful consideration of stability, throughput, performance, resource availability, scalability, and appropriate cluster size.

Hardware resources are divided into four key areas:

Disk storage: Prefer SSDs, use a hot‑warm architecture to control costs, skip redundant RAID levels (replica shards already provide fault tolerance), and keep at least one replica per shard.

Memory: Allocate roughly 50% of RAM to the JVM heap (which holds cluster metadata and indexing buffers) and leave the remainder to the OS filesystem cache to reduce disk reads during queries.

CPU: The number and speed of cores directly affect average operation speed and peak throughput.

Network: Bandwidth and latency impact inter‑node communication and cross‑cluster features.
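The memory guidance above can be expressed as a small helper. This is a sketch: the 50% rule and the practice of capping the heap just below ~32 GB (so the JVM keeps using compressed object pointers) are standard Elasticsearch guidance, but the function name and cap value here are our own choices.

```python
def jvm_heap_gb(ram_gb: float) -> float:
    """Suggest a JVM heap size: ~50% of node RAM, capped below ~32 GB
    so the JVM can keep using compressed object pointers."""
    COMPRESSED_OOPS_LIMIT_GB = 31  # stay safely under the ~32 GB threshold
    return min(ram_gb * 0.5, COMPRESSED_OOPS_LIMIT_GB)

# A 64 GB node gets a 31 GB heap; the remaining ~33 GB is left
# to the OS filesystem cache to speed up queries.
print(jvm_heap_gb(64))  # 31
print(jvm_heap_gb(8))   # 4.0
```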

Cluster sizing is driven by data volume and retention requirements. The basic formulas are:

Data total (GB) = Daily raw data (GB) × Retention days × (Replica count + 1)

Storage total (GB) = Data total × (1 + 0.15 disk‑watermark + 0.1 safety margin)

Number of data nodes = ROUNDUP(Storage total / (Memory per node × memory‑to‑data ratio)) + 1 for fail‑over.

Examples:

Small cluster: 1 GB daily data retained 9 months, 8 GB RAM per node → 3 data nodes, 675 GB total storage.

Large cluster: 100 GB daily data, 30 days hot tier, 12 months warm tier, 64 GB RAM per node → 5 hot‑tier nodes, 10 warm‑tier nodes, 91,250 GB total storage.
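The sizing formulas above can be sketched in code. The memory‑to‑data ratios used below (1:30 for hot nodes, 1:160 for warm nodes) are assumptions chosen to be consistent with the worked examples; the function names are ours.

```python
import math

def total_storage_gb(daily_gb, retention_days, replicas=1):
    """Raw data x retention x (replicas + 1), plus 15% disk-watermark
    headroom and a 10% safety margin."""
    data_total = daily_gb * retention_days * (replicas + 1)
    return data_total * (1 + 0.15 + 0.10)

def data_nodes(storage_gb, ram_per_node_gb, mem_to_data_ratio):
    """ROUNDUP(storage / (RAM x ratio)) + 1 node for failover."""
    return math.ceil(storage_gb / (ram_per_node_gb * mem_to_data_ratio)) + 1

# Small cluster: 1 GB/day retained 9 months (~270 days)
print(total_storage_gb(1, 270))      # 675.0 GB

# Large cluster: 100 GB/day, 30-day hot tier, 12-month warm tier
hot = total_storage_gb(100, 30)      # 7500.0 GB
warm = total_storage_gb(100, 365)    # 91250.0 GB
print(data_nodes(hot, 64, 30))       # 5 hot-tier nodes
print(data_nodes(warm, 64, 160))     # 10 warm-tier nodes
```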

Benchmarking is performed with the Rally tool, focusing on indexing and search performance.

Indexing benchmark: On a 3‑node cluster (8 vCPU, HDD, 32 GB heap) indexing Metricbeat data (1.2 GB, 1,079,600 docs), the optimal settings were a batch size of 12,000 documents with 16 client threads, reaching up to 62,000 indexing requests per second. A larger dataset (HTTP logs, 31.1 GB) reached 220,000 requests per second with 32 clients and 32 threads.
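On the client side, the batch size from the indexing benchmark corresponds to chunking the document stream before each bulk request. A minimal sketch (the 12,000‑document batch size comes from the benchmark above; the helper name is ours):

```python
from itertools import islice

def batched(docs, batch_size=12_000):
    """Yield lists of up to batch_size documents -- the unit sent
    in each bulk indexing request."""
    it = iter(docs)
    while batch := list(islice(it, batch_size)):
        yield batch

# 1,079,600 Metricbeat docs -> 89 full batches plus 1 partial batch
batches = list(batched(range(1_079_600)))
print(len(batches))      # 90
print(len(batches[-1]))  # 11600
```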

Search benchmark: A target throughput of 20 operations per second with 20 clients was tested across various query types (auto‑date‑histogram, term, range, etc.). 90th‑percentile service times were recorded, with noticeably higher latencies for certain histogram aggregations and timestamp sorts.
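The 90th‑percentile service time reported in the search benchmark can be reproduced from raw latency samples. This is a sketch using the nearest‑rank method; the sample values below are invented for illustration only.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the sorted service-time samples are <= it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical service times (ms) for one query type
latencies_ms = [12, 15, 14, 80, 13, 16, 18, 95, 17, 14]
print(percentile(latencies_ms, 90))  # 80
```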

Conclusions: By applying the sizing formulas and running realistic benchmark tests, you can determine the appropriate number of nodes for your workload, plan for future performance, and ensure the cluster meets SLA requirements.

Tags: performance, Elasticsearch, Metrics, Benchmarking, Logs, Cluster Sizing
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
