How to Optimize Elasticsearch for Billion‑Row Queries and Sub‑Second Responses
This guide explains the background, requirements, Elasticsearch architecture, Lucene fundamentals, and practical tuning steps—including indexing, shard routing, doc values, and hardware choices—to achieve cross‑month, sub‑second query performance on datasets exceeding a billion records.
Background and Requirements
A business system generates over a hundred million rows per day, partitioned by day, but the database retains only three months of data. The goal is to support cross‑month queries for more than one year of history and return filtered query results within seconds.
Enable queries across months with >1 year of data.
Return results in seconds for filtered queries.
Elasticsearch Retrieval Principles
Elasticsearch is built on Lucene. Core components include:
Cluster : a group of nodes.
Node : a single Elasticsearch instance.
Index : logical namespace for one or more physical shards.
Document : smallest searchable unit, stored as JSON.
Shard : a Lucene index segment holding a subset of data.
Replica : copy of a shard for fault tolerance and load balancing.
Lucene stores data in segments; each segment contains multiple documents and fields. The inverted index maps terms to document IDs, while DocValues provide column‑store access for sorting, faceting, and aggregations.
Optimization Cases
Queries are field‑based only; full‑text search is not required. Data resides in HBase, with only the RowKey indexed in Elasticsearch.
Use bulk indexing with batch sizes of hundreds to thousands of records per request.
Employ multi‑threaded ingestion matching the number of machines.
Set refresh_interval": "-1" during bulk loads and manually refresh after completion.
Allocate ~50 % of system memory to Lucene file cache; use nodes with 64 GB+ RAM.
Prefer SSDs over HDD RAID for random I/O.
Use custom IDs aligned with HBase RowKey to simplify updates and deletions.
Configure segment merge throttling, e.g., indices.store.throttle.max_bytes_per_sec": "200mb", and limit merge threads based on CPU cores.
Search Performance Tuning
Disable doc values for fields that are never sorted or aggregated.
Prefer keyword type over numeric types when range queries are unnecessary.
Turn off _source for fields not needed in the response to save disk space.
Use filters instead of scoring queries when relevance scores are not required.
Pagination strategies: from + size limited by index.max_result_window": 10000. search_after for deep pagination using the last hit of the previous page. scroll for large result sets, maintaining scroll_id.
Store a combined timestamp‑ID field as long to improve sorting performance.
Allocate >16 CPU cores for heavy sorting workloads.
Set merge.policy.expunge_deletes_allowed": "0" to force deletion of marked records during merges.
Performance Testing
Single‑node test with 50 M–100 M records to gauge node capacity.
Cluster test scaling from 100 M to 3 B records, measuring disk I/O, memory, CPU, and network usage.
Benchmark varied query combinations across data volumes.
Compare SSD vs. HDD performance.
Production Results
After applying the optimizations, the platform handles tens of billions of records, returning 100 rows within 3 seconds with fast pagination. Additional nodes can be added to address future bottlenecks.
Reference: Elasticsearch indexing performance guide (https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
