How to Optimize Elasticsearch for Billions of Records: Practical Tuning Guide
An in‑depth guide walks through Elasticsearch’s underlying Lucene architecture, explains shard routing and DocValues, then presents concrete index‑ and search‑performance tweaks—bulk writes, refresh intervals, memory allocation, SSD usage, field mapping, pagination strategies—and shows benchmark results that reduce query latency to seconds for billions of records.
1. Introduction
The data platform has gone through three versions. This article shares notes from that work, focusing on Elasticsearch optimization, with brief references to the HBase and Hadoop design.
2. Requirements
Project background: a business system writes over a billion rows per day into daily‑partitioned tables. Hardware constraints limit retention to three months, and further sharding would be costly. The goals of the new version are cross‑month queries, query and export of more than one year of history, and query responses within seconds.
3. Elasticsearch Retrieval Principles
3.1 ES and Lucene Basic Architecture
Key concepts include Cluster (multiple Nodes), Node (service unit), Index (logical namespace of shards), Type (single type per index after 6.x), Document (basic JSON unit), Shard (working unit storing a subset of data), and Replica (shard backup for safety and load distribution).
3.2 Lucene Index Structure
Lucene stores data in segments; each segment contains multiple documents, and each document contains fields that are tokenized into terms. The index consists of a term dictionary, inverted lists, forward (stored‑field) files, and DocValues. In the on‑disk file layout, .fdt files (stored fields) take the most space, while .tim (term dictionary) and .doc (postings) files benefit from SSD storage for random reads.
3.3 DocValues
DocValues are a column‑store structure that maps document IDs to field values, which makes sorting, faceting, and aggregation fast. By default, ES enables DocValues for all fields except analyzed string fields; disabling them on fields that never need sorting or aggregation reduces resource consumption, chiefly disk space.
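The difference between the inverted list and DocValues can be illustrated with a toy model. This is only a sketch in plain Python, not how Lucene stores data on disk; the documents and the field names `state` and `ts` are made up for illustration.

```python
from collections import defaultdict

# Three toy documents; field names and values are hypothetical.
docs = {
    0: {"state": "open", "ts": 30},
    1: {"state": "closed", "ts": 10},
    2: {"state": "open", "ts": 20},
}

# Inverted list: term -> doc IDs. Answers "which documents contain this term?"
inverted = defaultdict(list)
for doc_id in sorted(docs):
    inverted[docs[doc_id]["state"]].append(doc_id)

# DocValues-like column: position = doc ID, value = field value.
# Answers "what is this document's value?", which sorting and aggregation need.
ts_column = [docs[doc_id]["ts"] for doc_id in sorted(docs)]

# Filter with the inverted list, then sort the hits using the column.
hits = inverted["open"]
sorted_hits = sorted(hits, key=lambda doc_id: ts_column[doc_id])
print(hits, sorted_hits)
```

A field that is only ever filtered on needs the first structure but not the second, which is why DocValues can be switched off for it.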
3.4 ES Shard Routing
Data placement follows shard = hash(routing) % number_of_primary_shards. By default, the routing value is the document ID (hashed with MurmurHash3), but a custom routing value (the routing parameter, stored in the _routing field) can force related documents onto the same shard, so a query that supplies the same routing value only has to search that shard.
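The routing formula can be sketched as follows. This is an illustration under stated assumptions: CRC32 stands in for the Murmur3 hash only because it ships with Python, so the shard numbers it produces will not match a real cluster, and `pick_shard` is a hypothetical helper name.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return a shard number for a routing value (illustrative only)."""
    # Stand-in hash: Elasticsearch uses Murmur3; CRC32 is an assumption
    # chosen here so the sketch has no third-party dependency.
    hashed = zlib.crc32(routing.encode("utf-8"))
    return hashed % number_of_primary_shards


# Documents indexed with the same routing value always land on the same shard.
shards = {pick_shard("customer-42", 5) for _ in range(3)}
print(len(shards))
```

The formula also shows why the number of primary shards cannot be changed freely after indexing: a different modulus would send existing routing values to different shards.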
4. Optimization Cases
In this case, queries are limited to specific fields (no full‑text search), which is a prerequisite for returning results within seconds on billions of rows.
ES stores only the HBase rowkey; actual data resides in HBase.
See the official Elasticsearch tuning guide for additional settings.
4.1 Index Performance Improvements
Bulk writes: ingest hundreds to thousands of records per request.
Multi‑threaded writes: match the number of threads to the number of machines; monitor performance via Kibana.
Increase the segment refresh interval, or disable refresh during bulk loading (e.g., set "refresh_interval": "-1"), and refresh manually once loading finishes.
Leave roughly 50 % of each node's memory to the Lucene file cache (the OS page cache) rather than the JVM heap; use nodes with ≥64 GB RAM for heavy workloads.
Prefer SSDs over mechanical RAID for random I/O.
Use custom keys aligned with HBase rowkeys instead of auto‑generated IDs to simplify updates and deletions.
Configure merge throttling based on disk performance, e.g., "indices.store.throttle.max_bytes_per_sec": "200mb", in Elasticsearch versions that support this setting.
Adjust merge thread count: for SSDs, a higher count (e.g., 6) is acceptable; for HDDs, limit to 1.
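The bulk‑write and refresh advice above can be sketched as request bodies. This is a minimal sketch, not the platform's actual loader: the index name, the `rowkey` and `state` fields, and the helper names are assumptions, and sending the bodies to the cluster (POST to `_bulk`, PUT to `<index>/_settings`) is left out. On 6.x clusters the action line may also need a `_type` unless it is given in the URL.

```python
import json


def chunked(rows, size):
    """Yield successive batches so each bulk request stays a manageable size."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def build_bulk_body(index, rows):
    """Build an NDJSON _bulk body, using the HBase rowkey as the document ID."""
    lines = []
    for row in rows:
        # Action line: names the target index and the custom document ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": row["rowkey"]}}))
        # Source line: the indexed fields, without the rowkey itself.
        lines.append(json.dumps({k: v for k, v in row.items() if k != "rowkey"}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Settings bodies: disable refresh while loading, restore the 1s default afterwards.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1"}}
AFTER_LOAD_SETTINGS = {"index": {"refresh_interval": "1s"}}

rows = [{"rowkey": f"r{i}", "state": "open"} for i in range(5)]
bodies = [build_bulk_body("data-idx", batch) for batch in chunked(rows, 2)]
print(len(bodies))
```

In practice the batch size would be in the hundreds to thousands, tuned by watching indexing throughput as the batch grows.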
4.2 Search Performance Improvements
Disable DocValues for fields that never require sorting or aggregation.
Prefer keyword type over numeric types for exact match queries; numeric types incur extra overhead.
Keep fields that are not needed in query results out of _source, for example by listing only the required fields under "_source" "includes" in the mapping.
Turn off scoring when irrelevant; use filter or constant_score queries to set score to 0 or 1.
Pagination: avoid deep from+size (default index.max_result_window = 10000); use search_after for efficient deep paging or scroll for large result sets.
Introduce a combined long field that stores timestamp + ID for fast sorting without noticeable performance loss.
Allocate ≥16 CPU cores for workloads that involve sorting.
Set "merge.policy.expunge_deletes_allowed": "0" so that an expunge‑deletes merge purges deleted documents from every segment, regardless of how few deletes a segment holds.
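The filter, pagination, and combined‑sort‑field advice above can be sketched together. This sketch rests on assumptions not found in the source: the field name `sort_key`, the 20‑bit split between timestamp and sequence number, and both helper names are hypothetical.

```python
SEQ_BITS = 20  # assumption: low 20 bits hold a per-millisecond sequence number


def pack_sort_key(timestamp_ms: int, sequence: int) -> int:
    """Combine a millisecond timestamp and a sequence number into one long."""
    if not 0 <= sequence < (1 << SEQ_BITS):
        raise ValueError("sequence does not fit in SEQ_BITS")
    # The timestamp occupies the high bits, so numeric order equals time order.
    return (timestamp_ms << SEQ_BITS) | sequence


def build_page_request(state, page_size, last_sort_key=None):
    """Build a search body that filters without scoring and pages with search_after."""
    body = {
        "size": page_size,
        # Filter context: no relevance scoring is computed for these clauses.
        "query": {"bool": {"filter": [{"term": {"state": state}}]}},
        # One unique long field gives a total order, which search_after requires.
        "sort": [{"sort_key": "asc"}],
    }
    if last_sort_key is not None:
        # Resume after the sort value of the last hit on the previous page.
        body["search_after"] = [last_sort_key]
    return body


first_page = build_page_request("open", 100)
next_page = build_page_request("open", 100, last_sort_key=pack_sort_key(1, 5))
print(next_page["search_after"])
```

Sorting on a single long avoids a second sort field as tiebreaker, and because each page starts from the previous page's last sort value, the cost of a page does not grow with its depth the way from+size does.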
A sample mapping that applies several of these settings (the "settings" body is elided in the source):
{
  "mappings": {
    "data": {
      "dynamic": "false",
      "_source": {
        "includes": ["XXX"]
      },
      "properties": {
        "state": {
          "type": "keyword",
          "doc_values": false
        },
        "b": {
          "type": "long"
        }
      }
    }
  },
  "settings": { ... }
}
5. Performance Testing
Run a baseline benchmark before making any changes, so that later improvements can be measured against it. The benchmark includes:
Single‑node test with 50 M–100 M records to assess single‑node capacity.
Cluster test with 100 M–3 B records to measure disk I/O, memory, CPU, and network consumption.
Random query combinations to evaluate response times at various data volumes.
Comparison of SSD versus mechanical disk performance.
6. Production Results
After applying the optimizations, the platform remains stable with tens of billions of rows; typical queries returning 100 rows complete within three seconds, and pagination stays fast. Future bottlenecks can be mitigated by adding nodes to distribute load.
dbaplus Community
