Databases 13 min read

Understanding Elasticsearch Internals: Architecture, Lucene Indexing, Sharding, and Scaling

This article explains the internal workings of Elasticsearch, covering its cloud‑based cluster architecture, Lucene‑based indexing structures such as segments, shards, inverted indexes, stored fields and doc values, as well as search processing, caching, merging, routing, and scaling strategies.

Selected Java Interview Questions

Sep 12, 2020

Understanding Elasticsearch Internals: Architecture, Lucene Indexing, Sharding, and Scaling

Summary

This article introduces Elasticsearch’s underlying mechanisms from top‑down and bottom‑up perspectives, aiming to answer why certain queries fail, why adding more files compresses the index, and why Elasticsearch consumes a lot of memory.

Version

Elasticsearch version: elasticsearch‑2.2.0

ElasticSearch Architecture

Cluster in the Cloud

Boxes Inside the Cluster

Each white square in the cloud represents a node (Node).

Nodes and Shards

One or more green squares together form an Elasticsearch index.

Shards in an Index

Green squares distributed across nodes are shards.

Shard = Lucene Index

A shard is essentially a Lucene index.

Lucene is a full‑text search library; Elasticsearch builds on top of Lucene, and most of the following discussion explains how Elasticsearch leverages Lucene.

Lucene Overview

Mini‑Index (Segment)

Lucene contains many small segments, which can be viewed as mini‑indexes.

Segment Internals

Each segment holds several data structures:

Inverted Index

Stored Fields

Document Values

Cache

The Crucial Inverted Index

The inverted index consists of a sorted dictionary of terms and their postings (the documents containing each term). When searching, the query is tokenized, the corresponding terms are looked up in the dictionary, and the matching documents are retrieved.

Query Example: "the fury"

Auto‑Completion (Prefix)

To find terms starting with a letter, a binary search on the inverted index can locate words like "choice" or "coming".

Expensive Lookups

Finding all words containing a substring (e.g., "our") requires scanning the entire inverted index, which is costly.

Optimizing such queries involves generating suitable terms, for example:

* suffix → xiffus * (reverse the term for suffix searches)

(60.6384, 6.5017) → u4u8gyykk (geohash for geographic coordinates)

123 → {1‑hundreds, 12‑tens, 123} (multiple representations for numbers)

Spelling‑Error Handling

A Python library builds a finite‑state machine containing common misspellings to correct queries.

Stored Fields

When searching by full document titles, the inverted index is insufficient; Stored Fields provide a simple key‑value store (the JSON source of the document).

Document Values (for Sorting & Aggregation)

To support sorting, aggregations, and faceting without loading unnecessary data, Document Values store column‑oriented data, greatly improving performance but increasing memory usage.

Elasticsearch can load all Document Values of a shard into memory for fast operations, which explains its high memory consumption.

In summary, the segment contains Inverted Index, Stored Fields, Document Values, and caches.

Search Execution

During a search, Lucene scans all segments, merges their results, and returns them to the client.

Segments are immutable.

Delete marks a document as deleted without removing the file.

Update is implemented as delete + re‑index.

Lucene performs aggressive compression.

All information is cached for fast lookup.

Cache Mechanics

When indexing a document, Elasticsearch creates a cache for it and refreshes the cache every second, making the document searchable.

Over time many segments accumulate; Elasticsearch merges them, deleting the old segments, which can reduce index size despite adding more files.

Example Merge

Two segments are merged into a new one, after which the old segments are removed.

The new segment starts in a cold cache state, while most existing segments remain warm.

Searching Within Shards

Shard search follows the same process as Lucene segment search, but shards may reside on different nodes, requiring network communication.

One query may hit multiple shards, each executed independently.

Log File Handling

Indexing logs by timestamp allows fast date‑range queries and easy deletion of old data by dropping old indices.

Scaling

Shards are not split further but can be moved to different nodes; adding nodes may require re‑indexing, so capacity planning must balance node count and data distribution.

Node Allocation & Shard Optimization

Allocate important indices to high‑performance machines.

Ensure each shard has replica copies.

Routing

Every node holds a routing table; any incoming request can be forwarded to the appropriate shard.

A Real Request Flow

Query

The query uses a filtered type with a multi_match clause.

Aggregation

Aggregates by author to obtain the top‑10 hits per author.

Request Dispatch

The request may be sent to any node in the cluster.

Coordinator Node

The coordinator decides which primary and replica shards should handle the request.

Routing Path

Pre‑Search Processing

Elasticsearch converts the query to a Lucene query, then executes it across all segments.

Filters are cached; queries are not cached, so applications must implement their own query caching if needed.

Result Return

After execution, results travel back up the node hierarchy to the client.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data indexing search engine Elasticsearch sharding lucene

Written by

Selected Java Interview Questions

A professional Java tech channel sharing common knowledge to help developers fill gaps. Follow us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.