Elasticsearch Indexing and Retrieval Optimization for Billion‑Scale Data
This article describes how Elasticsearch was optimized to handle billions of records, covering Lucene fundamentals, index and shard design, DocValues, query performance tuning, bulk indexing strategies, hardware considerations, and testing methods used to achieve second‑level query responses across multi‑year data ranges.
Project Background
In a business system, some tables generate over a hundred million rows per day. Data is partitioned by day, but only three months of data can be retained due to hardware limits, making cross‑month queries and long‑term data access challenging.
Improvement Goals
Enable cross‑month queries and support more than one year of historical data export.
Achieve second‑level response times for conditional queries.
Elasticsearch Retrieval Principles
3.1 Elasticsearch and Lucene Architecture
Understanding the underlying components is essential for optimization. An Elasticsearch index consists of one or more shards, each of which is a self‑contained Lucene index storing a subset of the data.
Key concepts include:
Cluster – a group of nodes.
Node – a single server in the cluster.
Index – a logical namespace for one or more physical shards.
Shard – the smallest unit of data distribution, implemented as a complete Lucene index; each shard is in turn made up of Lucene segments.
DocValues – column‑oriented storage for fast sorting, faceting, and aggregations.
3.2 Lucene Index Implementation
Lucene stores data in segments, each containing multiple documents and fields. The index files consist of dictionaries, inverted lists, forward files, and DocValues. Optimizing these structures directly improves Elasticsearch performance.
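The inverted list is the heart of this layout: it maps each term to the documents that contain it. A minimal in‑memory sketch (illustrative only — Lucene's on‑disk format uses compressed term dictionaries and posting lists, not Python dicts):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted posting lists allow efficient intersection for AND queries.
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["Elasticsearch stores data", "Lucene stores segments"]
idx = build_inverted_index(docs)
# A term lookup returns the posting list directly: idx["stores"] == [0, 1]
```

A term query then becomes a dictionary lookup rather than a scan over documents, which is why term-level filters stay fast at billion-document scale.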
3.3 Elasticsearch Sharding
Data is routed to a primary shard using shard = hash(routing) % number_of_primary_shards. By controlling the _routing parameter, related documents can be co‑located on the same shard, reducing query overhead.
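The routing formula can be illustrated in a few lines. This is a sketch: Elasticsearch actually hashes _routing with murmur3 (and newer versions add a routing factor for shard splitting); crc32 below is just a deterministic stand‑in.

```python
import zlib

def route_to_shard(routing: str, num_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards.
    crc32 stands in for Elasticsearch's murmur3 hash."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards

# Documents sharing a routing key always land on the same shard,
# so a query sent with that routing value touches one shard
# instead of fanning out to all of them.
shard = route_to_shard("user42", 5)
assert shard == route_to_shard("user42", 5)  # deterministic
```

This also shows why number_of_primary_shards cannot be changed in place: the modulus is baked into every document's placement.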
Optimization Cases
4.1 Index Performance
Batch writes with appropriate document size (hundreds to thousands of records).
Multi‑threaded ingestion, scaling thread count with the number of machines.
Increase refresh_interval (e.g., set it to -1 to disable automatic refreshes during bulk loading) and refresh manually once loading completes.
Reserve ~50% of system memory for the OS file‑system cache that Lucene depends on; use nodes with 64 GB+ RAM.
Prefer SSDs over HDDs for random I/O.
Use custom keys (e.g., HBase row keys) instead of auto‑generated IDs when appropriate.
Configure merge throttling and thread counts based on storage type (e.g., higher threads for SSD).
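The batch‑size guidance above can be sketched as a generator that groups records into actions of the shape the Elasticsearch bulk API expects. The (doc_id, source) input shape and index name are assumptions for illustration:

```python
def bulk_actions(records, index_name, batch_size=1000):
    """Yield batches of bulk-index actions.

    Batches of hundreds to thousands of documents amortize per-request
    overhead without producing oversized HTTP payloads. Before loading,
    set "index.refresh_interval" to "-1"; afterwards restore it and
    issue a manual _refresh.
    """
    batch = []
    for doc_id, source in records:
        # A custom _id (e.g., an HBase row key) replaces the
        # auto-generated ID when appropriate.
        batch.append({"_index": index_name, "_id": doc_id, "_source": source})
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

records = ((str(i), {"n": i}) for i in range(2500))
batches = list(bulk_actions(records, "events", batch_size=1000))
# 2500 records -> batches of 1000, 1000, 500
```

Each yielded batch maps directly to one bulk request; multiple ingest threads can consume batches in parallel, scaling thread count with the number of nodes as described above.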
4.2 Search Performance
Disable DocValues for fields that do not require sorting or aggregations.
Prefer keyword fields over numeric types for term queries.
Limit _source to the fields actually needed in search results (e.g., via includes), or disable it entirely when full documents are never returned.
Use filters or constant_score queries to avoid scoring overhead.
Pagination strategies:
from + size – standard paging, limited by index.max_result_window (default 10,000).
search_after – efficient deep pagination using the sort values of the last hit as a cursor.
scroll – suited to exporting large result sets, at the cost of maintaining a scroll ID.
Introduce a combined long field (timestamp + ID) to support fast sorting.
Allocate CPUs with ≥16 cores for sorting‑intensive workloads.
Set index.merge.policy.expunge_deletes_allowed to 0 so that expunge merges can reclaim space from any segment containing deleted documents, rather than only segments above the default 10% deletion threshold.
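The search_after strategy can be illustrated by simulating a cursor over an already‑sorted result list. This is a pure‑Python sketch of the semantics, not the Elasticsearch client API; the sort key must be unique per document, e.g. the combined timestamp+ID long described above:

```python
def search_after_page(hits, sort_key, after=None, size=100):
    """Return (page, cursor) over hits pre-sorted by sort_key.

    Pass the returned cursor back as `after` to fetch the next page.
    Unlike from+size, the cost does not grow with page depth, because
    the engine seeks past the cursor instead of collecting and
    discarding all preceding hits.
    """
    if after is not None:
        hits = [h for h in hits if sort_key(h) > after]
    page = hits[:size]
    cursor = sort_key(page[-1]) if page else None
    return page, cursor

docs = [{"id": i} for i in range(250)]
key = lambda h: h["id"]
page1, cur = search_after_page(docs, key, size=100)
page2, cur = search_after_page(docs, key, after=cur, size=100)
# page2 starts at id 100
```

In a real query, the cursor is the "sort" array of the last hit, echoed back in the "search_after" parameter of the next request.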
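One way to build the combined long field is Snowflake‑style bit packing. The bit widths below are illustrative assumptions, not values from the original system:

```python
SEQ_BITS = 22  # assumption: up to ~4M distinct IDs per millisecond

def pack_sort_key(timestamp_ms: int, seq_id: int) -> int:
    """Pack (timestamp, ID) into one long so a single numeric sort
    replaces a two-field sort. Higher bits hold the timestamp, so
    numeric order matches (timestamp, id) lexicographic order."""
    return (timestamp_ms << SEQ_BITS) | (seq_id & ((1 << SEQ_BITS) - 1))

a = pack_sort_key(1_700_000_000_000, 7)
b = pack_sort_key(1_700_000_000_000, 8)
c = pack_sort_key(1_700_000_000_001, 0)
assert a < b < c  # ordering is preserved across both components
```

Sorting on one long with DocValues is cheaper than a composite sort, and the same value doubles as a unique search_after cursor.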
Performance Testing
Single‑node tests with 50 M–100 M documents to assess node capacity.
Cluster tests ranging from 100 M to 3 B documents, monitoring disk I/O, memory, CPU, and network usage.
Benchmark various query combinations across data volumes.
Compare SSD vs. HDD performance.
Production Results
The platform now handles tens of billions of records, returning 100 rows within 3 seconds, with fast pagination. Future bottlenecks can be addressed by scaling out additional nodes.
Configuration Example
```json
{
  "mappings": {
    "data": {
      "dynamic": "false",
      "_source": {
        "includes": ["XXX"]
      },
      "properties": {
        "state": {
          "type": "keyword",
          "doc_values": false
        },
        "b": {
          "type": "long"
        }
      }
    }
  },
  "settings": {......}
}
```

Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.