Understanding the Underlying Mechanics of Elasticsearch and Lucene

This article provides a comprehensive, top‑down and bottom‑up explanation of Elasticsearch’s internal architecture, covering clusters, nodes, shards, Lucene segments, inverted indexes, stored fields, document values, caching, merging, routing, scaling, and query processing, while addressing common performance questions.

Architecture Digest

Sep 14, 2020

Elasticsearch Overview

Elasticsearch is built on top of Lucene. A cluster consists of one or more nodes, each index is divided into shards that are distributed across those nodes, and each shard is essentially a Lucene index.

Cluster and Nodes

A cluster is made up of nodes. In the original diagram, each white square inside the cloud represents one node.

Shards and Segments

An Elasticsearch index is split into shards that are spread across the nodes. In the original diagram, each small green square is a shard, and the green squares grouped together form one index.

A shard is fundamentally a Lucene index.

Lucene Fundamentals

Lucene is a full‑text search library; Elasticsearch leverages Lucene for its core search capabilities.

Segments (Mini‑Indexes)

Lucene stores data in many small segments, each acting like a mini‑index.

Segment Internals

Each segment contains several data structures: an inverted index, stored fields, document values, and a cache.

Inverted Index Details

The inverted index has two parts:

A sorted dictionary of terms, together with the frequency of each term.

Postings lists that map each term to the documents containing it.

During a search, the query is tokenized, the terms are looked up in the dictionary, and the corresponding postings are retrieved.
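The lookup described above can be sketched in a few lines of Python. This is a toy model, not Lucene's actual data layout: the whitespace tokenizer, the helper names, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Keep the terms in sorted order, as a sorted dictionary would.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    result = None
    for term in tokenize(query):
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Hypothetical sample documents.
docs = {
    1: "the fury of winter",
    2: "winter is coming",
    3: "ours is the fury",
}
index = build_inverted_index(docs)
print(search(index, "the fury"))  # ids of documents containing both terms
```

The query goes through the same tokenizer as the documents, which is what makes the dictionary lookup line up with what was indexed.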

Auto‑Completion (Prefix Search)

Because the term dictionary is sorted, a binary search can quickly find all terms that start with a given prefix, such as "c" → "choice", "coming".
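A minimal sketch of prefix lookup over a sorted term list, using Python's standard bisect module. The function name and the term list are assumptions for illustration; Lucene's real term dictionary is a more compact structure.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Return every term that starts with prefix, using binary search."""
    # Binary search finds the first position where the prefix could occur.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches

# Hypothetical term dictionary.
terms = sorted(["choice", "coming", "fury", "is", "ours", "the", "winter"])
print(terms_with_prefix(terms, "c"))
```

Only the matching terms are visited after the initial binary search, so the cost depends on the number of matches rather than on the size of the dictionary.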

Expensive Lookups

Finding every term that contains a substring (e.g., "our") requires scanning the entire term dictionary, which is costly. The remedy is to generate suitable terms at index time so that the lookup becomes cheap.

Term Generation Strategies

Suffix search → index the reversed term (e.g., "suffix" → "xiffus"), so that a suffix query becomes a prefix query.

Geo coordinates → GeoHash (e.g., (60.6384, 6.5017) → "u4u8gyykk").

Numbers → multiple representations (e.g., 123 → {"1‑hundreds", "12‑tens", "123"}).
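Two of the strategies above are easy to sketch in Python. The function names are hypothetical, and the numeric helper only covers the three-digit case shown in the example; the GeoHash encoding is left out.

```python
def reversed_term(term):
    """Index the reversed term so a suffix query becomes a prefix query."""
    return term[::-1]

def numeric_terms(n):
    """Emit coarse-to-fine terms for a three-digit number such as 123."""
    return {f"{n // 100}-hundreds", f"{n // 10}-tens", str(n)}

# Suffix search for "ing" turns into a prefix search for "gni".
reversed_terms = sorted(reversed_term(t) for t in ["coming", "winter", "string"])
query = reversed_term("ing")
hits = [t[::-1] for t in reversed_terms if t.startswith(query)]

print(reversed_term("suffix"))
print(sorted(numeric_terms(123)))
print(hits)
```

With the coarser numeric terms in the index, a range such as 100 to 199 can be answered by looking up a single term instead of a hundred.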

Handling Misspellings

A Python library misspellings provides a tree‑based state machine to correct spelling errors.

https://pypi.python.org/pypi/misspellings

Stored Fields

The inverted index maps terms to documents, so it is a poor fit for retrieving a document's original field contents (e.g., its title). For that, Lucene provides stored fields, which act as a simple key-value store. By default, Elasticsearch stores the full JSON source of each document this way.

Document Values

For sorting, aggregations, and faceting, Lucene uses column-oriented document values (doc values), which store all values of a field together. Elasticsearch can load a field's document values into memory for fast access, at the cost of higher memory consumption.
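The difference between the row-oriented and the column-oriented layout can be illustrated with plain Python containers. This is only a conceptual sketch: the field names and sample data are made up, and the real on-disk formats are compressed and far more elaborate.

```python
# Stored fields: row-oriented, one record per document id.
stored_fields = {
    1: {"title": "Winter is coming", "year": 2011},
    2: {"title": "Ours is the fury", "year": 2012},
    3: {"title": "The choice is yours", "year": 2011},
}

# Document values: column-oriented, one array per field.
doc_ids = sorted(stored_fields)
doc_values = {
    field: [stored_fields[d][field] for d in doc_ids]
    for field in ("title", "year")
}

def sort_by(field):
    """Sort document ids by one column without reading any other field."""
    column = doc_values[field]
    order = sorted(range(len(column)), key=column.__getitem__)
    return [doc_ids[i] for i in order]

def count_by(field):
    """A terms-style aggregation: count documents per distinct value."""
    counts = {}
    for value in doc_values[field]:
        counts[value] = counts.get(value, 0) + 1
    return counts

print(stored_fields[2]["title"])  # fetch one document's field
print(sort_by("year"))            # scan a single column
print(count_by("year"))
```

Fetching a document reads one row, while sorting and aggregating read one column, which is why the two structures coexist.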

Search Execution

When a search is issued, Lucene queries every segment, merges the results, and returns them to the client. Key characteristics:

Segments are immutable; deletions are marked but files remain unchanged.

Updates are performed as delete‑then‑reindex.

Lucene heavily compresses data and caches information to improve query speed.
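The first two characteristics can be modeled with a small Python sketch. The Segment and Index classes below are hypothetical stand-ins, not Lucene classes, and matching is reduced to a naive whitespace term check.

```python
class Segment:
    """An immutable mini-index; only the deletion marks may change."""

    def __init__(self, docs):
        self._docs = dict(docs)  # never modified after construction
        self.deleted = set()     # ids marked as deleted

    def search(self, term):
        return [
            doc_id for doc_id, text in self._docs.items()
            if doc_id not in self.deleted and term in text.lower().split()
        ]

    def mark_deleted(self, doc_id):
        if doc_id in self._docs:
            self.deleted.add(doc_id)


class Index:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        """New documents always go into a fresh segment."""
        self.segments.append(Segment(docs))

    def update(self, doc_id, text):
        """An update is a delete followed by a reindex."""
        for segment in self.segments:
            segment.mark_deleted(doc_id)
        self.add({doc_id: text})

    def search(self, term):
        """Query every segment and merge the per-segment hits."""
        hits = []
        for segment in self.segments:
            hits.extend(segment.search(term))
        return sorted(hits)


idx = Index()
idx.add({1: "winter is coming", 2: "ours is the fury"})
idx.update(1, "summer is coming")
print(idx.search("winter"), idx.search("coming"))
```

After the update, the old version of document 1 still sits in the first segment; it is merely filtered out at search time.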

Cache Management

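As documents are indexed, Elasticsearch creates caches for each new segment and refreshes them periodically. Segments accumulate over time, so they are merged into larger ones; a merge drops documents marked as deleted and can compress the data better, which is why the index may shrink afterwards.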
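One reason a merge can shrink an index is that documents marked as deleted are not copied into the merged segment. The sketch below shows only that aspect; the dictionary layout is an assumption for illustration, and compression is not modeled.

```python
def merge_segments(segments):
    """Combine segments into one, dropping documents marked as deleted."""
    merged_docs = {}
    for segment in segments:
        for doc_id, text in segment["docs"].items():
            if doc_id not in segment["deleted"]:
                merged_docs[doc_id] = text
    # The merged segment starts with no deletion marks.
    return {"docs": merged_docs, "deleted": set()}

# Hypothetical segments: document 1 was updated, so its old copy is marked deleted.
old = {"docs": {1: "winter is coming", 2: "ours is the fury"}, "deleted": {1}}
new = {"docs": {1: "summer is coming"}, "deleted": set()}

merged = merge_segments([old, new])
print(merged["docs"])
```

Three stored documents across two segments become two documents in one segment, and the source segments can then be discarded.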

Shard‑Level Search

Searching a shard mirrors Lucene segment search, but shards may reside on different nodes, requiring network communication. Each shard query is executed independently.
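A rough sketch of this scatter-gather pattern: each shard computes its own top hits, and the partial lists are merged into a global ranking. Scoring by raw term count and the in-process shards are simplifying assumptions; a real cluster would make these calls over the network.

```python
import heapq

def search_shard(shard_docs, term, size):
    """Score documents on one shard by term count and return its top hits."""
    scored = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(size, scored)

def search_index(shards, term, size):
    """Query every shard independently, then merge into a global top list."""
    partial = []
    for shard_docs in shards:
        partial.extend(search_shard(shard_docs, term, size))
    return heapq.nlargest(size, partial)

# Hypothetical index with two shards.
shards = [
    {1: "winter is coming", 2: "winter winter winter"},
    {3: "the winter of winter", 4: "ours is the fury"},
]
top = search_index(shards, "winter", 2)
print(top)  # (score, doc_id) pairs, best first
```

Each shard only needs to return its own best candidates, which keeps the amount of data sent back for merging small.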

Log File Handling

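Splitting log data into time-based indices (for example, one index per day) enables fast date-range queries, because only the indices covering the range are searched, and makes removing old data as easy as deleting an entire index.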
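One common way to organize logs by timestamp is to write each day's events into its own index. The sketch below assumes that layout; the logs prefix, the date format, and the helper names are hypothetical choices, not something the article prescribes.

```python
from datetime import date, timedelta

def index_name(day, prefix="logs"):
    """Name of the daily index that holds a given day's log events."""
    return f"{prefix}-{day:%Y.%m.%d}"

def indices_for_range(start, end, prefix="logs"):
    """Daily indices a date-range query must touch (both ends inclusive)."""
    days = (end - start).days
    return [index_name(start + timedelta(days=i), prefix) for i in range(days + 1)]

def expired_indices(existing, today, keep_days, prefix="logs"):
    """Indices older than the retention window; each can be dropped whole."""
    cutoff = index_name(today - timedelta(days=keep_days), prefix)
    # Zero-padded dates sort lexicographically in chronological order.
    return [name for name in existing if name < cutoff]

print(indices_for_range(date(2020, 9, 13), date(2020, 9, 15)))
print(expired_indices(
    ["logs-2020.09.01", "logs-2020.09.10", "logs-2020.09.14"],
    today=date(2020, 9, 14),
    keep_days=7,
))
```

A date-range query only needs to address the listed indices, and expiring data becomes a matter of dropping whole indices rather than deleting individual documents.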

Scaling Strategies

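A shard is not split further as it grows, but it can be moved to another node. Because the number of primary shards is fixed when the index is created, growing the cluster beyond that count may require reindexing into an index with more shards, so capacity planning should balance node count and data distribution up front.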

Node Allocation & Shard Optimization

Allocate important indices to high‑performance machines.

Ensure each shard has a replica for high availability.

Routing

Every node holds a copy of the routing table, so a request arriving at any node can be forwarded to the node that hosts the appropriate shard.
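In its simplest form, choosing the shard for a document comes down to hashing a routing key (by default the document id) modulo the number of primary shards. The sketch below uses CRC32 from the standard library purely as a stand-in hash; it is not the hash function Elasticsearch uses, and the function name is hypothetical.

```python
import zlib

def shard_for(routing_key, num_primary_shards):
    """Pick a shard by hashing the routing key modulo the shard count."""
    # crc32 is a stand-in; unlike the built-in hash(), it is stable across runs.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

for doc_id in ("doc-1", "doc-2", "doc-3"):
    print(doc_id, "-> shard", shard_for(doc_id, 5))
```

Since the result depends on the shard count, changing that count would send existing documents to different shards, which is one way to see why the number of primary shards is normally fixed up front.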

Real‑World Request Flow

A request may be received by any node, which becomes the coordinator. The coordinator determines the target shards, selects available replicas, and orchestrates the query.
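The replica-selection step can be pictured as follows. Round-robin is used here only as an assumed, simple policy for spreading load across copies; the node names and the shard layout are hypothetical.

```python
from itertools import cycle

def make_copy_selector(shard_copies):
    """shard_copies maps a shard id to the nodes holding a copy of it."""
    iterators = {shard: cycle(nodes) for shard, nodes in shard_copies.items()}

    def pick_targets():
        # One node per shard, rotating through the copies on every call.
        return {shard: next(it) for shard, it in iterators.items()}

    return pick_targets

# Hypothetical layout: two shards, each with a primary and one replica.
copies = {0: ["node-a", "node-b"], 1: ["node-b", "node-c"]}
pick = make_copy_selector(copies)
first = pick()
second = pick()
print(first)
print(second)
```

Every shard is queried exactly once per request, but successive requests land on different copies.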

Query and Aggregation Example

The example request uses a filtered query wrapping a multi_match clause, and an aggregation groups the results by author to return the top 10 authors.
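The original request body is not reproduced here, so the following is a reconstruction under stated assumptions. The search text, the field names (title, summary, rating, author), and the aggregation name top_authors are all hypothetical. Note that the filtered query is legacy syntax from early Elasticsearch versions; later versions express the same idea with a bool query and its filter clause.

```python
import json

# Hypothetical request body: full-text match, a filter, and a top-10 aggregation.
search_body = {
    "query": {
        "filtered": {
            "query": {
                "multi_match": {
                    "query": "elasticsearch",
                    "fields": ["title", "summary"],
                }
            },
            "filter": {"range": {"rating": {"gte": 4}}},
        }
    },
    "aggs": {
        "top_authors": {"terms": {"field": "author", "size": 10}}
    },
}

print(json.dumps(search_body, indent=2))
```

The multi_match clause drives relevance scoring, the filter narrows the candidate set without affecting scores, and the terms aggregation counts matching documents per author.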

Result Return Path

After execution, results travel back up the coordination chain to the originating client.
