Unveiling Elasticsearch: Inside Nodes, Shards, and Lucene’s Inverted Index

This article explains Elasticsearch’s internal architecture, from cloud clusters and nodes to shards and Lucene’s inverted index, covering indexing, storage structures, query processing, caching, scaling, routing, and real‑world request handling, with detailed diagrams and examples.

Programmer DD
Version

Elasticsearch version: elasticsearch-2.2.0

Content

Diagram of Elasticsearch

Cluster in the Cloud

Boxes in the Cluster

Each white square in the diagram represents a node.

Between Nodes

Across one or more nodes, multiple small green squares together form an Elasticsearch index.

Small Blocks in Index

Within an index, the small green squares distributed across the nodes are called shards.

Shard = Lucene Index

A shard is essentially a Lucene index.

Diagram of Lucene

Mini Index – Segment

A Lucene index contains many small segments, each of which is a self‑contained mini‑index.

Segment Internals

Each segment contains several data structures:

Inverted Index

Stored Fields

Document Values

Cache

The Most Important Inverted Index

The inverted index has two main parts:

A sorted dictionary of terms, each with its frequency.

Postings for each term, that is, the list of documents that contain it.

During a search, the query is first split into terms, each term is looked up in the dictionary, and the postings of the matching terms identify the relevant documents.

Query “the fury”
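As a rough illustration of the lookup described above, here is a minimal Python sketch of an inverted index. The sample documents, the whitespace tokenizer, and the AND semantics of `search` are assumptions made for this example; Lucene's real analyzers and data structures are far more elaborate.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorted terms mirror the sorted dictionary; sorted IDs mirror postings lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

def search(index, query):
    """Return IDs of documents that contain every term of the query."""
    result = None
    for term in tokenize(query):
        ids = set(index.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Illustrative documents, not taken from a real index.
docs = {
    1: "Winter is coming",
    2: "Ours is the fury",
    3: "The choice is yours",
}
index = build_inverted_index(docs)
print(search(index, "the fury"))  # [2]
```

The term "the" appears in documents 2 and 3, "fury" only in document 2, so intersecting the two postings lists leaves document 2.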

Auto‑completion (Prefix)

Because the dictionary is sorted, a binary search can quickly locate all terms that start with a given prefix, e.g., “c”.
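A prefix lookup over a sorted term list can be sketched with Python's standard `bisect` module. The term list below is made up for the example, and the helper name `terms_with_prefix` is hypothetical.

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    """Binary-search the start of the prefix range, then walk while terms match."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches

sorted_terms = sorted(["choice", "coming", "fury", "is", "ours", "the", "winter", "yours"])
print(terms_with_prefix(sorted_terms, "c"))  # ['choice', 'coming']
```

Only the matching range is visited; the rest of the dictionary is never touched.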

Expensive Look‑ups

Finding every term that contains a substring such as “our” means scanning the entire inverted index, which is expensive.

Problem Transformation

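The fix is to transform the problem so that it becomes a cheap dictionary lookup again. Possible transformations include reversing terms so that a suffix search becomes a prefix search, encoding geographic coordinates as geohashes, and expanding a number into several tokens of different precision.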
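The suffix case can be sketched as follows: if every term is also stored reversed, a suffix query turns into a prefix query on the reversed dictionary. The function names and the term list are assumptions for illustration only.

```python
from bisect import bisect_left

def build_reversed_terms(terms):
    """Store every term reversed so that a suffix query becomes a prefix query."""
    return sorted(term[::-1] for term in terms)

def terms_with_suffix(reversed_terms, suffix):
    """Find terms ending in `suffix` via a prefix scan over the reversed dictionary."""
    needle = suffix[::-1]
    start = bisect_left(reversed_terms, needle)
    matches = []
    for reversed_term in reversed_terms[start:]:
        if not reversed_term.startswith(needle):
            break
        matches.append(reversed_term[::-1])  # restore the original spelling
    return sorted(matches)

reversed_terms = build_reversed_terms(["coming", "fury", "ours", "yours"])
print(terms_with_suffix(reversed_terms, "ours"))  # ['ours', 'yours']
```

The price of this transformation is a second copy of the dictionary, which is the usual trade of index size for query speed.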

Handling Misspellings

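A Python library builds a finite‑state machine for a word that also accepts its misspelled variants, which makes it possible to match terms in the dictionary despite spelling errors.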
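The sketch below is not the finite‑state machine approach mentioned above. It swaps in a plain dynamic‑programming edit distance, which is slower but shows the same idea of accepting terms within a small number of edits. The names `edit_distance` and `fuzzy_terms` and the term list are assumptions for the example.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_terms(terms, word, max_edits=1):
    """Return the dictionary terms within `max_edits` edits of `word`."""
    return [term for term in terms if edit_distance(term, word) <= max_edits]

terms = ["choice", "coming", "fury", "ours", "winter", "yours"]
print(fuzzy_terms(terms, "furry"))  # ['fury']
```

This version compares the query against every term, which is exactly the full scan that a precomputed automaton is meant to avoid.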

Stored Fields Lookup

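The inverted index cannot return the original content of a field, so Lucene provides a second structure, stored fields, for that purpose. Stored fields are essentially simple key‑value pairs. By default, Elasticsearch stores the whole JSON source of each document.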

Document Values for Sorting and Aggregation

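Sorting, aggregation, and faceting are still inefficient with the structures above, because they would read a lot of data that is not needed. Document values solve this with a column‑oriented layout that is optimized for storing values of the same type together. Elasticsearch can load all document values of a field into memory, which makes access much faster but can consume a lot of memory.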
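The difference between the row‑oriented and the column‑oriented layout can be sketched in a few lines. The field names and values below are invented for the example, and plain Python lists stand in for the real on‑disk formats.

```python
from collections import Counter

# Row-oriented "stored fields": document ID -> all field values of that document.
stored_fields = {
    0: {"title": "Winter is coming", "author": "stark", "year": 1996},
    1: {"title": "Ours is the fury", "author": "baratheon", "year": 1998},
    2: {"title": "The choice is yours", "author": "stark", "year": 2000},
}

def to_doc_values(stored, field):
    """Build one column: position i holds the field value of document i."""
    return [stored[doc_id][field] for doc_id in sorted(stored)]

author_column = to_doc_values(stored_fields, "author")
year_column = to_doc_values(stored_fields, "year")

# Sorting and aggregating read only the column they need, never whole documents.
docs_by_year_desc = sorted(range(len(year_column)), key=lambda d: year_column[d], reverse=True)
author_counts = Counter(author_column)

print(docs_by_year_desc)             # [2, 1, 0]
print(author_counts.most_common(1))  # [('stark', 2)]
```

Fetching a title for display goes through `stored_fields`, while sorting by year or counting authors touches only one compact column.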

Search Execution

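Lucene searches every segment, merges the per‑segment results, and returns them to the client. Several properties of Lucene matter here:

Segments are immutable. A delete only marks the document as deleted and leaves the data where it is, so an update is performed as a delete followed by reindexing the new version.

Compression is used everywhere: Lucene is very good at compressing data.

Lucene also caches this information, which greatly improves query speed.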
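The segment behaviour described above can be modelled with a small toy index. The classes `Segment` and `Index` are hypothetical, and the sketch uses external document IDs for simplicity, which is an assumption and not how Lucene numbers documents internally.

```python
class Segment:
    """A mini-index whose postings never change after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> text
        self.deleted = set()     # deletion markers, the only part that changes
        self.postings = {}
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Return matching doc IDs, hiding those marked as deleted."""
        return self.postings.get(term, set()) - self.deleted

class Index:
    def __init__(self):
        self.segments = []

    def add(self, docs):
        """New documents always go into a new segment."""
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        """Mark the document as deleted; the segment data itself is untouched."""
        for segment in self.segments:
            if doc_id in segment.docs:
                segment.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is a delete followed by indexing the new version."""
        self.delete(doc_id)
        self.add({doc_id: text})

    def search(self, term):
        """Search every segment and merge the results."""
        hits = set()
        for segment in self.segments:
            hits |= segment.search(term)
        return sorted(hits)

index = Index()
index.add({1: "winter is coming", 2: "ours is the fury"})
index.update(2, "ours is the glory")
print(index.search("fury"))   # []
print(index.search("glory"))  # [2]
print(index.search("is"))     # [1, 2]
```

After the update, the old text of document 2 still sits in the first segment, but its deletion marker keeps it out of every result.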

Cache Story

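When Elasticsearch indexes a document, it first holds the document in an in‑memory buffer. The buffer is refreshed periodically (every second), after which the document becomes searchable.

Over time this produces many segments, so Elasticsearch merges them and eventually deletes the old ones. This is why adding documents can make an index smaller: the addition can trigger a merge, and the merged segment may compress better.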
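One concrete reason a merge can shrink an index is that documents marked as deleted are not copied into the merged segment. The sketch below models only that effect, under the assumption that a segment is a plain dictionary with `docs` and `deleted` entries; it does not model compression.

```python
def merge_segments(segments):
    """Combine several segments into one, dropping documents marked as deleted."""
    live_docs = {}
    for segment in segments:
        for doc_id, text in segment["docs"].items():
            if doc_id not in segment["deleted"]:
                live_docs[doc_id] = text
    return {"docs": live_docs, "deleted": set()}

segments = [
    {"docs": {1: "winter is coming", 2: "ours is the fury"}, "deleted": {2}},
    {"docs": {2: "ours is the glory"}, "deleted": set()},
    {"docs": {3: "the choice is yours"}, "deleted": set()},
]
merged = merge_segments(segments)
stored_before = sum(len(segment["docs"]) for segment in segments)
print(stored_before, len(merged["docs"]))  # 4 3
```

Three segments holding four stored documents become one segment holding three, because the superseded version of document 2 is finally discarded.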

Searching Within a Shard

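Searching a shard works the same way as searching the segments of a Lucene index. Unlike segments, however, shards may live on different nodes, so the request and the results have to travel over the network.

A single search that hits several shards is executed as a separate search on each shard.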
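The per‑shard search and the final merge can be sketched like this. The scoring (a raw term count) and the shard contents are assumptions chosen to keep the example small; real relevance scoring is more involved.

```python
import heapq

def search_shard(shard_docs, term):
    """Score the documents of one shard; the score is simply the term count."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)

def search_index(shards, term, size=10):
    """Search each shard independently, then merge the hits into one top list."""
    per_shard = [search_shard(docs, term) for docs in shards]
    return heapq.nlargest(size, (hit for hits in per_shard for hit in hits))

shards = [
    {1: "winter is coming", 2: "ours is the fury"},
    {3: "the choice is yours the end"},
]
print(search_index(shards, "the", size=2))  # [(2, 3), (1, 2)]
```

In a cluster, each call to `search_shard` would run on the node that holds the shard, and only the small per‑shard hit lists would cross the network.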

Log File Handling

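Splitting logs into indices by timestamp makes searches for a specific date much faster, because only the indices for that date are searched. It also makes removing old data easy: simply delete the old indices.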
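A time‑based naming scheme might look like the following sketch. The `logs-YYYY.MM.DD` pattern, the helper names, and the retention rule are assumptions for illustration and not something the source prescribes.

```python
from datetime import date, timedelta

def daily_index_names(start, end, prefix="logs-"):
    """List one index name per day in the inclusive range [start, end]."""
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(days=n)):%Y.%m.%d}" for n in range(days + 1)]

def expired_indices(existing, today, keep_days):
    """Return index names that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    keep = set(daily_index_names(cutoff, today))
    return sorted(name for name in existing if name not in keep)

names = daily_index_names(date(2016, 3, 1), date(2016, 3, 3))
print(names)  # ['logs-2016.03.01', 'logs-2016.03.02', 'logs-2016.03.03']
print(expired_indices(names, today=date(2016, 3, 3), keep_days=1))  # ['logs-2016.03.01']
```

A search for one day only needs the index for that day, and expiring old data is a matter of dropping whole indices instead of deleting individual documents.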

Scaling

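A shard is never split further, but it can be moved to a different node. If the load on the cluster grows far enough, adding new nodes may require reindexing all data, which is best avoided. The balance between too many and too few nodes and shards should therefore be planned up front.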

Node Allocation and Shard Optimization

Allocate important indices to high‑performance machines.

Ensure each shard has replica copies.

Routing

Every node keeps a copy of the routing table, so whichever node receives a request can forward it to the node that holds the appropriate shard or replica.
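Elasticsearch picks the shard for a document as a hash of the routing value modulo the number of primary shards. The sketch below shows that formula; `zlib.crc32` is only a stand‑in hash chosen because it is deterministic, and the routing table and node names are hypothetical.

```python
import zlib

def shard_for(routing_key, number_of_primary_shards):
    """Pick a shard as hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in for the real hash function, used here only
    because it gives the same value on every run.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % number_of_primary_shards

# Hypothetical routing table: shard number -> nodes that hold a copy of it.
routing_table = {
    0: ["node-1", "node-2"],
    1: ["node-2", "node-3"],
    2: ["node-3", "node-1"],
}

shard = shard_for("doc-42", len(routing_table))
print(shard, routing_table[shard])
```

Because the result depends on the number of primary shards, changing that number would send existing documents to different shards, which is why a different shard count means reindexing.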

A Real Request

Query

The query is of type filtered and contains a multi_match query.

Aggregation

The aggregation groups the hits by author and returns the top 10 authors by hit count.
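A request of this shape might be built as follows. The search text, the field names, the filter clause, and the aggregation name `top_authors` are all hypothetical, since the source does not show the request body; only the `filtered`, `multi_match`, and `terms` building blocks come from the Elasticsearch 2.x query DSL.

```python
import json

request_body = {
    "query": {
        "filtered": {
            "query": {
                "multi_match": {
                    "query": "fire",              # hypothetical search text
                    "fields": ["title", "text"],  # hypothetical field names
                }
            },
            "filter": {
                "term": {"published": True}       # hypothetical filter clause
            },
        }
    },
    "aggs": {
        "top_authors": {                          # hypothetical aggregation name
            "terms": {"field": "author", "size": 10}
        }
    },
}

print(json.dumps(request_body, indent=2))
```

The printed JSON is what would be sent as the body of a search request.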

Request Dispatch

The request may arrive at any node in the cluster.

Coordinator Node

The node that receives the request becomes its coordinator. Based on the index metadata, it decides:

Which nodes the request should be routed to.

Which replica is available.
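The two decisions above can be sketched as a small planner that picks one available copy per shard. The `Coordinator` class, the rotation strategy, and the node names are assumptions for the example and do not describe Elasticsearch's actual selection logic.

```python
import itertools

class Coordinator:
    """Chooses, for each shard of an index, one available copy to query."""

    def __init__(self, routing_table, unavailable_nodes=()):
        self.routing_table = routing_table         # shard -> nodes holding a copy
        self.unavailable = set(unavailable_nodes)
        self._counter = itertools.count()

    def pick_copy(self, shard):
        """Return a node that holds the shard and is currently available."""
        candidates = [n for n in self.routing_table[shard] if n not in self.unavailable]
        if not candidates:
            raise RuntimeError(f"no available copy for shard {shard}")
        # Rotate through the candidates so repeated requests spread across copies.
        return candidates[next(self._counter) % len(candidates)]

    def plan(self):
        """Map every shard of the index to the node that will serve it."""
        return {shard: self.pick_copy(shard) for shard in sorted(self.routing_table)}

routing_table = {0: ["node-1", "node-2"], 1: ["node-2", "node-3"]}
coordinator = Coordinator(routing_table, unavailable_nodes=["node-2"])
print(coordinator.plan())  # {0: 'node-1', 1: 'node-3'}
```

With node-2 unavailable, each shard has exactly one remaining copy, so the plan is fully determined.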

Routing Diagram

Pre‑Search Processing

Elasticsearch converts the query to a Lucene query, then executes it across all segments.

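Filter results are cached, but query results are not, so an application that runs the same query repeatedly has to cache the results itself. As a rule, filters can be used at any time, while queries should be used only when scoring is needed.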
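The benefit of caching filters can be illustrated with a toy cache that remembers the set of matching document IDs per filter. The `FilterCache` class, the documents, and the term‑count scoring are all assumptions made for this sketch.

```python
class FilterCache:
    """Caches the set of matching document IDs per filter, keyed by (field, value)."""

    def __init__(self, docs):
        self.docs = docs
        self.cache = {}
        self.computations = 0  # counts how often a filter was actually evaluated

    def matching_ids(self, field, value):
        key = (field, value)
        if key not in self.cache:
            self.computations += 1
            self.cache[key] = frozenset(
                doc_id for doc_id, doc in self.docs.items() if doc.get(field) == value
            )
        return self.cache[key]

docs = {
    1: {"author": "stark", "text": "winter is coming"},
    2: {"author": "baratheon", "text": "ours is the fury"},
    3: {"author": "stark", "text": "the choice is yours"},
}
filters = FilterCache(docs)

def search(term, field, value):
    """Apply the cached filter first, then score only the surviving documents."""
    allowed = filters.matching_ids(field, value)
    scored = [(docs[doc_id]["text"].split().count(term), doc_id) for doc_id in allowed]
    return sorted((score, doc_id) for score, doc_id in scored if score > 0)

print(search("the", "author", "stark"))  # [(1, 3)]
print(search("is", "author", "stark"))   # [(1, 1), (1, 3)]
print(filters.computations)              # 1
```

Two different scored queries reuse the same filter result, so the filter is evaluated only once. A filter is a yes or no decision per document, which is what makes its result reusable across queries.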

Return Path

Results travel back up the hierarchy to the client.

References

SlideShare: Elasticsearch From the Bottom Up

YouTube: Elasticsearch from the bottom up

Wikipedia: Document‑term matrix

Wikipedia: Search engine indexing

Skip list

Stanford: Faster postings list intersection via skip pointers

StackOverflow: How an search index works when querying many words?

StackOverflow: How does Lucene calculate intersection of documents so fast?

LinkedIn: Lucene and its magical indexes

misspellings 2.0c: A tool to detect misspellings

Written by Programmer DD, a tinkering programmer and author of "Spring Cloud Microservices in Action".
