Big Data 14 min read

How to Supercharge Elasticsearch for Billion‑Row Queries: Practical Optimization Guide

This article explains the architecture of Elasticsearch and Lucene, outlines common performance bottlenecks, and provides concrete indexing and search optimization techniques—including bulk writes, shard routing, doc values tuning, and pagination strategies—to achieve sub‑second query responses on billions of records.

macrozheng

Dec 20, 2019

How to Supercharge Elasticsearch for Billion‑Row Queries: Practical Optimization Guide

1. Introduction

Data platform has iterated three versions; many common challenges were encountered. This article focuses on Elasticsearch (ES) optimization, providing reference implementations to avoid pitfalls.

2. Requirement Description

Project background: a business system stores over a hundred million rows per day in daily tables, limited to three months of data in the DB, making sharding costly.

Enable cross‑month queries and support over a year of historical data export.

Return query results within seconds.

3. Elasticsearch Retrieval Principles

3.1 Basics of ES and Lucene

Understanding component fundamentals is essential for optimization. The basic ES architecture is shown below.

Key concepts:

Cluster – a group of nodes; Node – a single service unit; Index – logical namespace for one or more shards; Type – classification within an index (single type after 6.x); Document – basic indexed JSON; Shards – Lucene instances storing a subset of data (max docs 2,147,483,519); Replicas – shard copies for safety and load‑balancing.

Lucene is the core storage and retrieval engine. Its structure is illustrated below.

Lucene separates indexing (analyzers, tokenizers, filters) from searching (query parser).

A Lucene index consists of multiple segments, each containing documents and fields that are tokenized into terms.

Luke tool shows ES Lucene files, mainly adding _id and _source fields.

3.2 Lucene Index Implementation

Lucene index files include dictionary, inverted list, stored fields, DocValues, etc.

Note: source from Lucene official documentation.

Random disk reads are costly; .fdt files are large, while .tim and .doc benefit from SSD. Scoring is also expensive and can be disabled.

About DocValues:

Inverted index maps terms to doc IDs; DocValues provide column‑store access for sorting, grouping, aggregations.

Older Lucene versions used FieldCache, which was memory‑intensive and slow.

DocValues are enabled by default for all fields except analyzed strings; they can be disabled to save resources.

3.3 ES Index and Search Sharding

An ES index consists of one or more Lucene indexes, each made of segments; a segment is the smallest searchable unit.

Shard allocation: shard = hash(routing) % number_of_primary_shards By default routing is the document ID (MurmurHash3). The _routing parameter can force documents into the same shard, reducing search work.

4. Optimization Cases

In our case, queries are field‑based only, not full‑text, enabling sub‑second responses on billions of rows.

Data is stored in HBase; ES stores only the RowKey.

Reference the official tuning guide for further improvements.

4.1 Index Performance Optimizations

Bulk writes (hundreds to thousands of records per request).

Multi‑threaded writes, matching thread count to node count.

Increase segment refresh interval; set "refresh_interval": "-1" and manually refresh after bulk load.

Allocate ~50% of node memory to Lucene file cache; use nodes with 64 GB+ RAM.

Use SSDs; RAID5/10 on HDDs is slower for random I/O.

Prefer custom IDs aligned with HBase RowKey to enable efficient updates.

Control segment merging: throttle to 20 MB/s (adjust via indices.store.throttle.max_bytes_per_sec), limit merge threads, e.g., index.merge.scheduler.max_thread_count: 1 for HDD, higher for SSD.

4.2 Search Performance Optimizations

Disable doc values for fields that are not sorted or aggregated.

Prefer keyword type over numeric types for exact matches.

Disable _source storage for fields not needed in results.

Turn off scoring when not required; use filter or constant_score queries.

Pagination strategies: from + size: limited by index.max_result_window (default 10 000). search_after: use last hit’s sort values for deep paging. scroll: for large result sets, requires maintaining scroll_id.

Store a combined long field for time + ID to improve sorting.

Allocate ≥16 CPU cores for heavy sorting workloads.

Set "merge.policy.expunge_deletes_allowed": "0" to purge deleted docs during merge.

{
    "mappings": {
        "data": {
            "dynamic": "false",
            "_source": {
                "includes": ["XXX"]
            },
            "properties": {
                "state": {
                    "type": "keyword",
                    "doc_values": false
                },
                "b": {
                    "type": "long"
                }
            }
        }
    },
    "settings": {......}
}

5. Performance Testing

Benchmark before changes to gauge improvements. Tests performed:

Single‑node with 50 M–100 M records to assess point‑load capacity.

Cluster with 100 M–3 B records to measure disk I/O, memory, CPU, network usage.

Random query combinations across data volumes.

SSD vs. HDD performance comparison.

Understanding Lucene internals is crucial for effective ES tuning.

6. Production Results

The platform now handles billions of rows, returning 100 records within 3 seconds, with fast pagination; further scaling can be achieved by adding nodes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Search Engine Elasticsearch Performance Tuning Lucene

Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.