Backend Development 12 min read

How to Optimize Elasticsearch for Billion‑Row Queries and Sub‑Second Responses

This guide explains the background, requirements, Elasticsearch architecture, Lucene fundamentals, and practical tuning steps—including indexing, shard routing, doc values, and hardware choices—to achieve cross‑month, sub‑second query performance on datasets exceeding a billion records.

ITPUB

May 5, 2020

How to Optimize Elasticsearch for Billion‑Row Queries and Sub‑Second Responses

Background and Requirements

A business system generates over a hundred million rows per day, partitioned by day, but the database retains only three months of data. The goal is to support cross‑month queries for more than one year of history and return filtered query results within seconds.

Enable queries across months with >1 year of data.

Return results in seconds for filtered queries.

Elasticsearch Retrieval Principles

Elasticsearch is built on Lucene. Core components include:

Cluster : a group of nodes.

Node : a single Elasticsearch instance.

Index : logical namespace for one or more physical shards.

Document : smallest searchable unit, stored as JSON.

Shard : a Lucene index segment holding a subset of data.

Replica : copy of a shard for fault tolerance and load balancing.

Lucene stores data in segments; each segment contains multiple documents and fields. The inverted index maps terms to document IDs, while DocValues provide column‑store access for sorting, faceting, and aggregations.

Optimization Cases

Queries are field‑based only; full‑text search is not required. Data resides in HBase, with only the RowKey indexed in Elasticsearch.

Use bulk indexing with batch sizes of hundreds to thousands of records per request.

Employ multi‑threaded ingestion matching the number of machines.

Set refresh_interval": "-1" during bulk loads and manually refresh after completion.

Allocate ~50 % of system memory to Lucene file cache; use nodes with 64 GB+ RAM.

Prefer SSDs over HDD RAID for random I/O.

Use custom IDs aligned with HBase RowKey to simplify updates and deletions.

Configure segment merge throttling, e.g., indices.store.throttle.max_bytes_per_sec": "200mb", and limit merge threads based on CPU cores.

Search Performance Tuning

Disable doc values for fields that are never sorted or aggregated.

Prefer keyword type over numeric types when range queries are unnecessary.

Turn off _source for fields not needed in the response to save disk space.

Use filters instead of scoring queries when relevance scores are not required.

Pagination strategies: from + size limited by index.max_result_window": 10000. search_after for deep pagination using the last hit of the previous page. scroll for large result sets, maintaining scroll_id.

Store a combined timestamp‑ID field as long to improve sorting performance.

Allocate >16 CPU cores for heavy sorting workloads.

Set merge.policy.expunge_deletes_allowed": "0" to force deletion of marked records during merges.

Performance Testing

Single‑node test with 50 M–100 M records to gauge node capacity.

Cluster test scaling from 100 M to 3 B records, measuring disk I/O, memory, CPU, and network usage.

Benchmark varied query combinations across data volumes.

Compare SSD vs. HDD performance.

Production Results

After applying the optimizations, the platform handles tens of billions of records, returning 100 rows within 3 seconds with fast pagination. Additional nodes can be added to address future bottlenecks.

Reference: Elasticsearch indexing performance guide (https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

lucene

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.