Mastering Elasticsearch: Practical Tuning Strategies for Performance and Cost

This article shares a detailed, experience‑driven guide on Elasticsearch tuning, covering data model fundamentals, storage cost reductions, cluster stability tricks, performance‑boosting settings, and custom kernel improvements, all illustrated with real‑world diagrams and Q&A insights.

Tencent Cloud Developer

Nov 2, 2018

At an offline community meetup, a Tencent TEG infrastructure engineer presented practical Elasticsearch tuning techniques. The talk opened with a brief overview of the Elastic Stack (Beats, Logstash, Elasticsearch, Kibana) and its core strengths: high performance, scalability, reliability, and ease of management.

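The speaker then explained Elasticsearch's data model. Logically, schemaless documents are stored in indices, and each index is split into shards. Physically, each shard is a Lucene index. Every shard has one primary copy and zero or more replicas. On the write path, Elasticsearch records operations in a transaction log (translog), while refresh, flush, and merge control when Lucene segments become searchable, are committed to disk, and are consolidated.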

Storage Cost Optimization

Elasticsearch stores data in three forms: inverted index, row store, and column store. The inverted index supports search, the row store keeps the original document for retrieval, and the column store (doc values) accelerates aggregations and sorting. Disabling indexing on fields that are only retrieved and never searched (e.g., CPU metrics) and turning off doc values where aggregation and sorting are unnecessary was reported to cut storage costs by about 40%.
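The field-level switches described above can be sketched as a mapping built in Python. This is a minimal illustration, not the configuration used in the talk: the field names (`host`, `cpu_usage`, `raw_sample`) and the helper `metric_field` are hypothetical, while `index` and `doc_values` are standard mapping parameters.

```python
import json

def metric_field(searchable: bool, aggregatable: bool) -> dict:
    """Build a float field mapping, switching off structures that are not needed."""
    return {
        "type": "float",
        "index": searchable,         # False: the field cannot be queried
        "doc_values": aggregatable,  # False: no column store, so no sorting/aggregation
    }

# Hypothetical mapping for a metrics index.
mapping = {
    "properties": {
        "host": {"type": "keyword"},
        # Aggregated in dashboards but never used as a search condition.
        "cpu_usage": metric_field(searchable=False, aggregatable=True),
        # Only returned with the document; neither searched nor aggregated.
        "raw_sample": metric_field(searchable=False, aggregatable=False),
    }
}

print(json.dumps(mapping, indent=2))
```

Values of such fields are still returned with the original document; only the search and aggregation structures are dropped.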

For string fields, choosing the appropriate type matters: text fields are analyzed and support full‑text search, while keyword fields are not analyzed and provide faster exact‑match queries. If only storage and retrieval are required, keyword is preferable.
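To make the text/keyword distinction concrete, the sketch below picks a field type from the intended usage and imitates what analysis does to a value. The helper names are hypothetical, and `toy_analyze` is only a rough stand-in for an analyzer (lowercasing and splitting on non-word characters), not Elasticsearch's real analysis chain.

```python
import re

def string_field(full_text_search: bool) -> dict:
    """Choose a string field type based on how the field will be queried."""
    return {"type": "text"} if full_text_search else {"type": "keyword"}

def toy_analyze(value: str) -> list:
    """Rough stand-in for analysis: lowercase, then split into word tokens."""
    return re.findall(r"\w+", value.lower())

def toy_keyword(value: str) -> list:
    """A keyword field keeps the whole value as a single term."""
    return [value]

message = "Disk Full on node-1"
print(string_field(full_text_search=True), toy_analyze(message))
print(string_field(full_text_search=False), toy_keyword(message))
```

The text variant produces several terms to index per value, while the keyword variant produces exactly one, which is why exact-match lookups on keyword fields are cheaper.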

Cluster Stability Enhancements

Proper shard sizing is crucial. For indices under 100 GB, 3–5 primary shards are recommended, and a single shard should not exceed 50 GB. Replicas improve read throughput and availability, but each replica multiplies write and storage load, so the replica count should be chosen deliberately rather than raised by default.
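The sizing rule above can be turned into a small calculation. This is a sketch under the figures quoted in the talk (50 GB per shard, 3–5 shards for small indices); the function name and the choice of 3 as the small-index default are assumptions for illustration.

```python
import math

MAX_SHARD_GB = 50        # per-shard upper bound cited in the talk
SMALL_INDEX_GB = 100     # below this, a fixed small shard count is suggested

def suggest_primary_shards(index_size_gb: float, small_index_shards: int = 3) -> int:
    """Suggest a primary shard count so that no shard exceeds MAX_SHARD_GB."""
    if index_size_gb < SMALL_INDEX_GB:
        return small_index_shards
    by_size = math.ceil(index_size_gb / MAX_SHARD_GB)
    # Never go below the small-index recommendation.
    return max(small_index_shards, by_size)

for size in (40, 120, 1000):
    print(size, "GB ->", suggest_primary_shards(size), "primary shards")
```

The result is a starting point only; expected growth of the index should be included in `index_size_gb`, since the primary shard count is fixed when the index is created.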

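The bulk thread pool queue (cited in the talk as 100 by default; the actual default varies by Elasticsearch version) can become a bottleneck when the shard count is too high, because each bulk request is split into one sub-request per target shard. Balancing shard count against the bulk queue size helps avoid queue saturation and rejected writes.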
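A back-of-the-envelope model shows why shard count and queue size interact. It assumes the worst case in which every bulk request touches every shard of the index hosted on a node, so each request occupies one slot per shard; the function name and the example numbers are illustrative, not measurements from the talk.

```python
def max_concurrent_bulks(queue_size: int, write_threads: int, shards_on_node: int) -> int:
    """Estimate how many bulk requests a node can hold before rejecting.

    Assumes each bulk request produces one sub-request per shard on the node,
    and that a node can hold write_threads running plus queue_size waiting tasks.
    """
    if shards_on_node <= 0:
        raise ValueError("shards_on_node must be positive")
    capacity = queue_size + write_threads
    return capacity // shards_on_node

# Same node, same queue: more shards means fewer bulk requests fit.
print(max_concurrent_bulks(queue_size=100, write_threads=8, shards_on_node=4))
print(max_concurrent_bulks(queue_size=100, write_threads=8, shards_on_node=36))
```

Under this simplified model, raising the queue size only buys headroom linearly, while reducing the shard count per node attacks the fan-out itself.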

Performance Boosting Settings

Adjusting persistence parameters can reduce overhead: increase the refresh interval if near‑real‑time visibility is not required, switch translog syncing to asynchronous mode (accepting that the most recent writes can be lost if a node crashes), and tune merge thread settings to the CPU core count so that merging does not starve indexing and search.
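The three knobs above map to index settings, sketched here as a settings body built in Python. The concrete values (a 30 second refresh interval) are illustrative assumptions, not recommendations from the talk; the merge thread formula mirrors the default documented for `index.merge.scheduler.max_thread_count`.

```python
import json

def merge_threads_for(cpu_cores: int) -> int:
    """Merge thread count derived from core count, capped between 1 and 4."""
    return max(1, min(4, cpu_cores // 2))

def write_heavy_settings(cpu_cores: int, refresh_interval: str = "30s") -> dict:
    """Build an index settings body that trades freshness and durability for throughput."""
    return {
        "index": {
            # New documents become searchable at most this often.
            "refresh_interval": refresh_interval,
            # Sync the translog in the background instead of on every request.
            "translog": {"durability": "async"},
            # Limit concurrent merge threads so merging does not crowd out indexing.
            "merge": {"scheduler": {"max_thread_count": merge_threads_for(cpu_cores)}},
        }
    }

print(json.dumps(write_heavy_settings(cpu_cores=16), indent=2))
```

With asynchronous translog durability, writes acknowledged since the last background sync can be lost on a crash, so this setting suits logs and metrics better than data that must not be lost.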

A bulk size of 1,000–10,000 documents per request works well for most scenarios. Omitting explicit document IDs lets Elasticsearch generate unique IDs itself, which avoids the lookup otherwise needed to check whether a document with the same ID already exists.
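A batching helper along these lines can produce request bodies in the newline-delimited format the bulk API expects. It is a sketch that assumes the target index is given in the request URL (for example `/<index>/_bulk`) and a typeless API; older versions also require a document type. The action line carries no `_id`, so IDs are auto-generated.

```python
import json
from typing import Iterable, Iterator

def bulk_bodies(docs: Iterable[dict], batch_size: int = 5000) -> Iterator[str]:
    """Yield NDJSON bulk bodies of at most batch_size documents each."""
    lines = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line without _id
        lines.append(json.dumps(doc))            # document source line
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"        # body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"

# Tiny batch size just to show the splitting behaviour.
docs = ({"n": i} for i in range(5))
for body in bulk_bodies(docs, batch_size=2):
    print(repr(body))
```

Because the helper is a generator, it works on streams of documents without holding more than one batch in memory.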

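Routing can reduce query fan‑out by directing a request to the single shard that holds the relevant documents instead of every shard, and careful shard count planning prevents request rejections caused by too many shard‑level tasks queuing up.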
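The effect of routing on fan-out can be illustrated with a toy shard selector. Note the stand-in: Elasticsearch hashes the routing value with its own function, whereas this sketch uses `zlib.crc32` purely for illustration, so the shard numbers it prints will not match a real cluster. The routing key `tenant-42` is hypothetical.

```python
import zlib
from typing import List, Optional

def shard_for(routing_key: str, num_primary_shards: int) -> int:
    """Map a routing key to a shard (crc32 is a stand-in for the real hash)."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

def shards_queried(num_primary_shards: int, routing_key: Optional[str] = None) -> List[int]:
    """Return the shards a search has to visit, with or without routing."""
    if routing_key is None:
        return list(range(num_primary_shards))  # no routing: fan out to every shard
    return [shard_for(routing_key, num_primary_shards)]

print("without routing:", shards_queried(5))
print("with routing:   ", shards_queried(5, routing_key="tenant-42"))
```

The same routing value has to be supplied at index time and at query time; otherwise the query looks in a shard that does not contain the documents.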

Using filter queries instead of scoring queries eliminates the relevance‑scoring phase and enables result caching, further speeding up frequent queries.
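The difference between the two query styles is small in the request body. The sketch below builds both forms for the same conditions; the field names `status` and `timestamp` and the helper functions are hypothetical, while `bool`, `must`, `filter`, `term`, and `range` are standard query DSL.

```python
import json

def conditions(status: str, since: str) -> list:
    """Clauses shared by both query styles."""
    return [
        {"term": {"status": status}},
        {"range": {"timestamp": {"gte": since}}},
    ]

def scored_query(status: str, since: str) -> dict:
    """Clauses under `must` contribute to the relevance score."""
    return {"query": {"bool": {"must": conditions(status, since)}}}

def filtered_query(status: str, since: str) -> dict:
    """Clauses under `filter` only match or not: no scoring, and they can be cached."""
    return {"query": {"bool": {"filter": conditions(status, since)}}}

print(json.dumps(filtered_query("error", "now-1h"), indent=2))
```

For yes/no conditions such as status codes and time ranges, relevance scores carry no useful information, so the filter form is usually the better fit.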

Custom Kernel Improvements

The team built an internal management platform offering automated cluster upgrades, plugin pre‑installation, index lifecycle management (Rollover), cross‑region disaster recovery, and dedicated master nodes to enhance stability.

Memory usage for inverted indexes was reduced by 50% through block‑wise dictionary compression and off‑heap storage techniques, allowing larger nodes (20 TB) without exceeding JVM heap limits.

Cold‑warm data separation is achieved by tagging nodes and moving aged indices to low‑cost “cold” nodes via scheduled commands, cutting storage expenses.
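A scheduled job for this kind of cold-warm move might look like the sketch below. It assumes daily indices named `logs-YYYY.MM.DD` and nodes tagged with a custom attribute called `box_type`; both are common conventions but hypothetical here, and the talk does not specify the team's naming. The settings body uses the standard `index.routing.allocation.require.<attribute>` filter.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List

# Settings body to apply to each aged index (attribute name is an assumption).
COLD_SETTINGS = {"index.routing.allocation.require.box_type": "cold"}

def indices_to_cool(index_names: Iterable[str], today: date,
                    max_age_days: int, prefix: str = "logs-") -> List[str]:
    """Return the daily indices older than max_age_days, based on their name suffix."""
    cutoff = today - timedelta(days=max_age_days)
    aged = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date; leave the index alone
        if day < cutoff:
            aged.append(name)
    return aged

names = ["logs-2018.10.01", "logs-2018.11.01", "metrics-2018.10.01"]
for name in indices_to_cool(names, today=date(2018, 11, 2), max_age_days=7):
    print("PUT /%s/_settings" % name, COLD_SETTINGS)
```

Once the setting is applied, the cluster relocates the index's shards to nodes whose attribute matches, so the job itself only needs to update settings.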

Stability Optimizations for Large Clusters

Rebalancing logic was refined to avoid over‑loading newly added nodes with a disproportionate number of shards. A custom throttling mechanism prevents JVM OOM during heavy write bursts, and query caching is disabled for very large result sets to reduce memory pressure.

Q&A Highlights

Common issues such as node crash recovery failures (TCP half‑open queue, shard‑failed handling) and best practices for log analysis, data retention, and permission control were addressed.

Overall, the presentation delivered a systematic checklist for reducing storage cost, improving cluster stability, and boosting performance in production Elasticsearch deployments.
