Big Data 13 min read

Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

This article explains the architecture and core concepts of Elasticsearch and Lucene, outlines the requirements for cross‑month and high‑speed queries on massive datasets, and provides detailed index and search performance tuning techniques—including bulk writes, shard routing, doc‑values management, and pagination strategies—to achieve sub‑second response times on billions of records.

Top Architect
Top Architect
Top Architect
Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

1. Introduction

The data platform has undergone three iterations, encountering many common challenges; this article shares refined documentation focusing on Elasticsearch (ES) optimization, with brief references to HBase and Hadoop design.

2. Requirement Description

Project background

In a business system, some tables generate over 100 million rows per day. Data is partitioned by day, but queries are limited to daily scope and the database can retain only three months of data, making sharding costly.

Improved version goals

Enable cross‑month queries and support more than one year of historical data export.

Condition‑based queries should return within seconds.

3. Elasticsearch Retrieval Principles

3.1 Basic Structure of ES and Lucene

Understanding component fundamentals is essential for optimization. ES is built on Lucene; the diagram (omitted) shows the architecture.

Cluster: a set of Nodes.

Node: a service unit.

Index: logical namespace for one or more physical shards.

Type: classification within an index (single type after 6.x).

Document: the smallest indexable unit, e.g., a JSON string.

Shard: a working unit that stores a subset of data; each shard is a Lucene instance.

Replica: copy of a shard for safety and load balancing.

Lucene separates indexing (analyzers, tokenizers, filters) from searching (query parser). A Lucene index consists of multiple segments, each containing documents and fields.

3.2 Lucene Index Implementation

Key file types include dictionary, inverted list, forward file, and DocValues (diagram omitted).

Source: Lucene official documentation.

Random disk reads are costly; .fdt files consume space, .tim and .doc benefit from SSD. Scoring also consumes resources and can be disabled if not needed.

About DocValues

DocValues provide column‑store access to field values for sorting, grouping, and aggregation, avoiding expensive inverted‑list scans.

In ES, DocValues are enabled for all fields except analyzed strings; they can be turned off to save resources.

3.3 ES Index and Search Sharding

An ES index consists of one or more Lucene indexes, each made of segments. Document routing determines the shard: shard = hash(routing) % number_of_primary_shards. By default routing uses the document ID (MurmurHash3), but a custom _routing parameter can force related documents onto the same shard, reducing search work.

4. Optimization Cases

In our case, queries are field‑based only, not full‑text, enabling sub‑second responses on billions of rows.

ES stores only the fields needed for search; actual data resides in HBase RowKey.

Reference to official ES indexing‑speed tuning guide.

4.1 Index Performance Optimizations

Bulk writes (hundreds to thousands of records per request).

Multi‑threaded writes matching the number of machines.

Increase segment refresh interval (e.g., set "refresh_interval": "-1" and manually refresh after bulk load).

Allocate ~50% of system memory to Lucene file cache; nodes should have ≥64 GB RAM.

Use SSDs instead of HDD RAID for random I/O.

Prefer auto‑generated IDs; in our case we use custom keys aligned with HBase RowKey.

Tune merge throttling and thread count (e.g., "indices.store.throttle.max_bytes_per_sec": "200mb", index.merge.scheduler.max_thread_count).

4.2 Search Performance Optimizations

Disable doc values for fields that do not need sorting or aggregation.

Prefer keyword over numeric types for term queries.

Disable _source storage for fields not required in results.

Use filters or constant_score queries to avoid scoring when not needed.

Pagination: avoid deep from+size queries; use search_after or scroll for large result sets.

Introduce a combined long field for time‑ID to improve sorting.

Allocate ≥16 CPU cores for sorting‑heavy workloads.

Set "merge.policy.expunge_deletes_allowed": "0" to purge deleted docs during merge.

{
  "mappings": {
    "data": {
      "dynamic": "false",
      "_source": {
        "includes": ["XXX"]
      },
      "properties": {
        "state": {
          "type": "keyword",
          "doc_values": false
        },
        "b": {
          "type": "long"
        }
      }
    }
  },
  "settings": { ... }
}

5. Performance Testing

Benchmark before and after changes. Example tests:

Single‑node with 50 M–100 M records to assess point capacity.

Cluster with 100 M–3 B records to evaluate disk I/O, memory, CPU, and network.

Random query combinations at various data volumes.

SSD vs. HDD performance comparison.

Testing is time‑consuming but essential for identifying bottlenecks.

6. Production Results

The platform now handles tens of billions of rows, returning 100 records within 3 seconds, with fast pagination. Future bottlenecks can be addressed by scaling nodes.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big Datasearch engineElasticsearchperformance tuningluceneIndex Optimization
Top Architect
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.