
Master Elasticsearch Performance: Practical Production‑Level Optimization Guide

This guide presents a production‑grade, step‑by‑step approach to boost Elasticsearch performance, covering advanced index design, mapping best practices, query and aggregation tuning, JVM and cluster settings, bulk write optimization, monitoring, and real‑world log‑system scenarios with concrete code examples and configuration snippets.

Ray's Galactic Tech

Index Design (Advanced)

Shard Count Engineering Formula

Formula: target_shards ≈ total_data / 30GB

Example: 1 year of data ≈ 3TB → 3000GB → 100 primary shards. Recommended split: time‑based rollover, daily or weekly indices, 3‑6 primary shards per index.
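The rule of thumb above can be expressed as a small helper. This is an illustrative sketch, not an Elasticsearch API; the function name and the 30 GB default are assumptions for the example.

```python
import math

def estimate_primary_shards(total_data_gb: float, target_shard_gb: float = 30.0) -> int:
    """Estimate total primary shards from the target_shards ~ total_data / 30GB rule."""
    return max(1, math.ceil(total_data_gb / target_shard_gb))

# 1 year of logs ~ 3 TB -> 3000 GB -> 100 primary shards overall,
# then spread across time-based indices with 3-6 primaries each.
print(estimate_primary_shards(3000))
```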

PUT logs-2026.01.31
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Core principle: more shards is not always better; query cost = cross‑shard concurrency + merge overhead.

Mapping Pitfalls (90% of performance issues)

Common mistakes

Using text for all string fields

Relying on the default text+keyword dual field

Enabling fielddata on large fields

Recommended template

"status": { "type": "keyword", "norms": false },
"title": { "type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart" }

High‑frequency optimizations:

Set index: false for fields not needed in search

Set doc_values: false for fields not used for sorting

Disable norms when scoring is unnecessary
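The three optimizations above can be combined in a single mapping fragment. The field names here are hypothetical examples, not from the original article:

```json
"properties": {
  "internal_flag": { "type": "keyword", "index": false },
  "raw_payload":   { "type": "keyword", "doc_values": false },
  "status":        { "type": "keyword", "norms": false }
}
```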

Query Optimization (Quick Wins)

Proper Filter + Query Combination

{
  "query": {
    "bool": {
      "must": [{ "match": { "title": "elasticsearch" } }],
      "filter": [
        { "term": { "status": "ONLINE" } },
        { "range": { "create_time": { "gte": "now-7d" } } }
      ]
    }
  }
}

Benefits: filter cache hits, no score contribution, significant CPU reduction.

Deep Pagination Alternatives

search_after — user paging (no random page jumps)

scroll — full export (sequential scan)

point in time (PIT) — paging against a consistent snapshot of the data

Example cursor: "search_after": ["2026-01-31T10:00:00", 12345]

Prohibited: from: 100000 (deep pagination).
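A minimal sketch of how a client chains search_after pages: each request copies the previous body and resumes after the last hit's sort values. The helper name is illustrative and no Elasticsearch client library is assumed.

```python
from typing import Any

def next_page_body(base_body: dict[str, Any], last_hit_sort: list[Any]) -> dict[str, Any]:
    """Build the request body for the next page.

    `last_hit_sort` is the `sort` array of the final hit in the previous response.
    """
    body = dict(base_body)
    body["search_after"] = last_hit_sort
    body.pop("from", None)  # `from` and `search_after` are mutually exclusive
    return body

# A sort with a unique tiebreaker field keeps the cursor unambiguous.
base = {
    "size": 20,
    "sort": [{"create_time": "asc"}, {"_id": "asc"}],
    "query": {"match_all": {}},
}
page2 = next_page_body(base, ["2026-01-31T10:00:00", 12345])
```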

Aggregation Performance Tricks

Control field cardinality

User ID / order number are high‑cardinality fields; avoid aggregating directly on them.
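When a high-cardinality field must be enumerated anyway, a composite aggregation pages through buckets instead of materializing them all at once. A sketch, with the field name assumed:

```json
{
  "size": 0,
  "aggs": {
    "users": {
      "composite": {
        "size": 1000,
        "sources": [{ "user": { "terms": { "field": "user_id" } } }]
      }
    }
  }
}
```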

Tune shard_size:

"terms": { "field": "user_id", "size": 20, "shard_size": 200 }

Split query execution

Apply filters first

Run aggregations after filtering
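In request form, the filter-first pattern means the bool filter narrows the document set before the aggregation runs. The filter fields reuse earlier examples; the error_code aggregation field is hypothetical:

```json
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "status": "ONLINE" } },
        { "range": { "create_time": { "gte": "now-7d" } } }
      ]
    }
  },
  "aggs": {
    "by_error_code": { "terms": { "field": "error_code", "size": 10 } }
  }
}
```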

Cluster & JVM Tuning (Stability Core)

JVM Heap Best Practice

-Xms16g
-Xmx16g

Reason: keeping the heap below 32GB enables compressed oops, reducing GC frequency and stabilizing latency.

Essential OS Settings

vm.max_map_count=262144
vm.swappiness=1
ulimit -n 65536
bootstrap.memory_lock: true

Circuit Breaker Settings

indices.breaker.total.limit: 70%
indices.breaker.request.limit: 40%
indices.breaker.fielddata.limit: 30%

Prevents OOM and long JVM full-GC pauses caused by large aggregations.

Write Performance (Logs / Tracing)

Bulk API Best Practices

POST _bulk
{ "index": {} }
{ "title": "doc1" }

Batch size 5‑15 MB

Concurrent bulk controlled by thread‑pool size

Retry on failures

thread_pool.write.queue_size: 1000
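Size-bounded batching can be sketched without any client library: accumulate NDJSON action/document pairs until a byte budget is reached, then flush. The 10 MB default sits inside the 5-15 MB recommendation; function and parameter names are illustrative.

```python
import json

def bulk_batches(docs, max_bytes=10 * 1024 * 1024):
    """Yield NDJSON _bulk payloads, each kept under max_bytes."""
    batch, size = [], 0
    for doc in docs:
        # One action line plus one document line per doc, newline-terminated.
        lines = json.dumps({"index": {}}) + "\n" + json.dumps(doc) + "\n"
        encoded = len(lines.encode("utf-8"))
        if batch and size + encoded > max_bytes:
            yield "".join(batch)
            batch, size = [], 0
        batch.append(lines)
        size += encoded
    if batch:
        yield "".join(batch)
```

Each yielded string is a ready-to-send POST _bulk body; retries on failure would wrap the send call, not this batching step.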

Write‑time Optimizations

"refresh_interval": "30s",
"number_of_replicas": 0

After ingestion completes, restore replicas to 1 and the refresh interval to 1s.
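Both settings can be toggled live with the _settings API; the index name follows the earlier examples:

```
PUT logs-2026.01.31/_settings
{ "index": { "refresh_interval": "30s", "number_of_replicas": 0 } }

# after the bulk load completes
PUT logs-2026.01.31/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }
```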

Monitoring & Load‑Testing Loop

Key Monitoring Metrics

JVM GC time

Search latency P99

Segment count

Thread‑pool rejections

GET _nodes/stats
GET _cat/indices?v

Slow Query Log (must enable)

index.search.slowlog.threshold.query.warn: 1s
index.search.slowlog.threshold.fetch.warn: 500ms

Load‑testing Recommendations

Tools: JMeter, Rally

Scenarios: high‑concurrency search, large aggregations, mixed write‑query workloads

Common Pitfalls Summary

Too many shards

Aggregating on text fields

Deep pagination with from

Excessively large JVM heap

Missing slow‑log configuration

Optimization order: Mapping → Query → Shards → JVM → Hardware

Scenario: Log System (High Write + Conditional Query)

Goals & Constraints

Write‑first, query‑second

Acceptable latency 30 s – 1 min

Hot data retained 7 – 30 days

Query pattern relatively fixed

Index & Lifecycle Design

Index Splitting Strategy

Time‑based rollover (mandatory)

Recommended daily indices

logs-2026.01.31
logs-2026.02.01

Shard Design (template)

Daily data < 100 GB → 3 primary shards

100 GB – 300 GB → 6 primary shards

> 300 GB → 9 – 12 primary shards

Target shard size 20 – 40 GB
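The tiers above reduce to a small lookup helper. A sketch under the stated assumptions, taking the upper bound (12) as a conservative default for the >300 GB tier:

```python
def daily_primary_shards(daily_gb: float) -> int:
    """Map daily index volume to a primary-shard count per the template,
    keeping each shard within the 20-40 GB target range."""
    if daily_gb < 100:
        return 3
    if daily_gb <= 300:
        return 6
    return 12  # template allows 9-12; upper bound chosen here
```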

ILM Policy (must)

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": { "actions": { "rollover": { "max_size": "40gb", "max_age": "1d" } } },
      "warm": { "min_age": "7d", "actions": { "forcemerge": { "max_num_segments": 1 } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}

Mapping Template for Logs

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": { "refresh_interval": "30s" },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "host": { "type": "keyword" },
        "trace_id": { "type": "keyword" },
        "message": { "type": "text", "analyzer": "standard" }
      }
    }
  }
}

All filter fields use keyword; only message uses full‑text search; avoid aggregating on text fields.

Write Optimizations

Bulk batch 5‑15 MB

Concurrent bulk threads ≈ CPU cores / 2

"number_of_replicas": 0 during peak ingestion; after peak load, restore replicas to 1.

Query Template (optimal)

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "service": "order-service" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ],
      "must": [{ "match": { "message": "error" } }]
    }
  }
}

Strong time + service filter; only message contributes to scoring.

JVM & Node Role Recommendations

Data node: 16 – 31 GB heap

Master node: 4 – 8 GB heap

Coordinator node (dedicated for log system): 8 – 16 GB heap

Monitoring & Alerting Focus

Write rejections > 0

JVM GC time > 5% of wall-clock time

Segment count surge

Disk usage > 70 %

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!
