Unveiling Elasticsearch: Inside Nodes, Shards, and Lucene’s Inverted Index
This article explains Elasticsearch’s internal architecture, from cloud clusters and nodes to shards and Lucene’s inverted index, covering indexing, storage structures, query processing, caching, scaling, routing, and real‑world request handling, with detailed diagrams and examples.
Version
Elasticsearch version: elasticsearch-2.2.0
Content
Diagram of Elasticsearch
Cluster in the Cloud
Boxes in the Cluster
Each white square in the cluster represents a node.
Between Nodes
Across one or more nodes, multiple green squares together form an Elasticsearch index.
Small Blocks in Index
Green squares distributed across nodes are shards.
Shard = Lucene Index
A shard is essentially a Lucene index.
Diagram of Lucene
Mini Index – Segment
A Lucene index contains many small segments, each of which can be seen as a mini-index.
Segment Internals
Each segment contains several data structures:
Inverted Index
Stored Fields
Document Values
Cache
The Most Important Inverted Index
The inverted index consists of two parts:
A sorted dictionary of terms and their frequencies.
Postings for each term, listing the documents that contain it.
During a search the query is tokenized, the dictionary is consulted, and matching documents are retrieved.
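The lookup described above can be sketched in a few lines of Python. This is a toy model, not Lucene's implementation: the whitespace tokenizer, the function names, and the three sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def tokenize(text):
    # Naive analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def build_inverted_index(docs):
    # postings: term -> ids of the documents containing that term
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # The dictionary is kept sorted so terms can be located by binary search.
    dictionary = sorted(postings)
    return dictionary, {term: sorted(ids) for term, ids in postings.items()}

def search(postings, query):
    # Look up every query term and intersect the postings (AND semantics).
    result = None
    for term in tokenize(query):
        ids = set(postings.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Hypothetical sample documents.
docs = {
    1: "Winter is coming",
    2: "Ours is the fury",
    3: "The choice is yours",
}
dictionary, postings = build_inverted_index(docs)
print(dictionary)
print(search(postings, "the fury"))
```

For the query "the fury", the term "the" matches documents 2 and 3, "fury" matches only document 2, and the intersection leaves document 2.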
Query “the fury”
Auto‑completion (Prefix)
Binary search can find terms starting with a given prefix, e.g., “c”.
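Because the term dictionary is sorted, all terms sharing a prefix sit next to each other. A minimal sketch of the prefix lookup, assuming the dictionary is a plain sorted Python list (the function name and sample terms are hypothetical):

```python
from bisect import bisect_left

def terms_with_prefix(sorted_terms, prefix):
    # Binary search for the first term >= prefix, then walk forward
    # while the terms still start with the prefix.
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

terms = sorted(["choice", "coming", "fury", "is", "ours", "the", "winter", "yours"])
print(terms_with_prefix(terms, "c"))
```

Only the matching run of terms is visited; the rest of the dictionary is skipped by the binary search.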
Expensive Look‑ups
Scanning the entire inverted index for a substring like “our” is costly.
Problem Transformation
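The idea is to index terms in a form that turns the expensive look-up into a cheap prefix look-up. Possible transformations include reversing terms to support suffix search, geohashing coordinates, and expanding numbers into several tokens.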
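Two of these transformations can be sketched as follows. The function names and the "1xx"-style numeric token format are assumptions chosen for illustration, and the numeric helper assumes non-negative integers; Lucene's actual encodings differ.

```python
from bisect import bisect_left

def reversed_terms(terms):
    # Index each term reversed so that a suffix query becomes a prefix query.
    return sorted(term[::-1] for term in terms)

def terms_with_suffix(reversed_sorted, suffix):
    # Reverse the suffix and run an ordinary prefix scan on the reversed terms.
    prefix = suffix[::-1]
    start = bisect_left(reversed_sorted, prefix)
    matches = []
    for rev in reversed_sorted[start:]:
        if not rev.startswith(prefix):
            break
        matches.append(rev[::-1])
    return sorted(matches)

def numeric_tokens(n):
    # Expand a non-negative integer into one token per decimal prefix
    # (hypothetical format), so coarse ranges can match a single token.
    digits = str(n)
    return [digits[:i] + "x" * (len(digits) - i) for i in range(1, len(digits) + 1)]

rev = reversed_terms(["ours", "yours", "fury", "hours"])
print(terms_with_suffix(rev, "ours"))
print(numeric_tokens(123))
```

The cost moves to index time: each extra representation makes the index larger in exchange for faster queries.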
Handling Misspellings
A Python library builds a finite‑state machine to correct spelling errors.
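The sketch below is not the finite-state machine mentioned above. It swaps in a simpler stand-in technique, generating every string one edit away from the query term and keeping those present in the dictionary, just to show what tolerating a misspelling means. All names are hypothetical and only lowercase ASCII letters are considered.

```python
import string

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def fuzzy_lookup(dictionary, word):
    # Exact match first, otherwise any dictionary term one edit away.
    if word in dictionary:
        return [word]
    return sorted(edits1(word) & set(dictionary))

print(fuzzy_lookup(["choice", "coming", "fury", "winter"], "furry"))
```

An automaton gives the same kind of answer without enumerating every candidate string, which matters for long terms and larger edit distances.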
Stored Fields Lookup
When exact field values are needed, Lucene uses stored fields, essentially key‑value pairs; Elasticsearch stores the whole JSON source by default.
Document Values for Sorting and Aggregation
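Document values are a column-oriented structure that makes sorting, aggregation, and faceting efficient. Elasticsearch can load all the document values of a field into memory for faster access, at the cost of significant memory consumption.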
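The difference between the row-oriented stored fields and the column-oriented document values can be illustrated with plain dictionaries. The field names and sample records below are made up, and real segments use compact on-disk encodings rather than Python dicts.

```python
from collections import Counter

# Row-oriented: one record per document (what stored fields resemble).
stored_fields = {
    1: {"title": "Winter is coming", "author": "stark", "year": 1996},
    2: {"title": "Ours is the fury", "author": "baratheon", "year": 1998},
    3: {"title": "The choice is yours", "author": "stark", "year": 2000},
}

def to_columns(rows, fields):
    # Column-oriented: one doc_id -> value mapping per field.
    return {f: {doc_id: row[f] for doc_id, row in rows.items()} for f in fields}

doc_values = to_columns(stored_fields, ["author", "year"])

hits = [1, 2, 3]
# Sorting by year reads only the "year" column, not whole documents.
by_year = sorted(hits, key=doc_values["year"].__getitem__, reverse=True)
# Aggregating by author reads only the "author" column.
author_counts = Counter(doc_values["author"][d] for d in hits)

print(by_year)
print(author_counts.most_common())
```

Fetching a document for display goes through the row-oriented structure, while sorting and aggregating over many hits touch a single column each.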
Search Execution
Lucene searches all segments, merges results, and returns them to the client. Segments are immutable; deletions are marked, updates are performed as delete‑then‑reindex.
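A toy model of this behaviour, assuming each segment is an immutable set of documents plus a set of deletion marks. The class and method names are hypothetical, and using one document id across segments is a simplification of how Lucene identifies documents.

```python
class Segment:
    def __init__(self, docs):
        # docs: doc_id -> text; the contents never change after creation.
        self.docs = dict(docs)
        self.deleted = set()  # deletions are only marked, never applied in place

    def search(self, term):
        return [d for d, text in self.docs.items()
                if d not in self.deleted and term in text.lower().split()]

class Index:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def update(self, doc_id, text):
        # An update marks the old version deleted, then indexes the new
        # version into a fresh segment.
        self.delete(doc_id)
        self.add_segment({doc_id: text})

    def search(self, term):
        # Search every segment and merge the per-segment hits.
        hits = []
        for seg in self.segments:
            hits.extend(seg.search(term))
        return sorted(hits)

idx = Index()
idx.add_segment({1: "winter is coming", 2: "ours is the fury"})
idx.update(2, "ours is the glory")
print(idx.search("fury"), idx.search("glory"))
```

After the update, the old text of document 2 still sits in the first segment, but its deletion mark hides it from search.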
Segments are heavily compressed.
All information is cached for fast access.
Cache Story
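When Elasticsearch indexes a document, it first buffers it and then refreshes periodically (every second by default), after which the document becomes searchable.
Over time the number of segments grows, so Elasticsearch merges them and deletes the old ones. This is why adding documents can make an index smaller: the addition may trigger a merge, which drops documents marked as deleted and can compress the data better.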
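A minimal sketch of a merge, assuming each segment is represented as a dict holding its documents and its deletion marks (a hypothetical layout, not Lucene's file format):

```python
def merge_segments(segments):
    # Each segment: {"docs": {doc_id: text}, "deleted": set_of_doc_ids}.
    merged_docs = {}
    for seg in segments:
        for doc_id, text in seg["docs"].items():
            # Documents marked as deleted are not copied into the new segment.
            if doc_id not in seg["deleted"]:
                merged_docs[doc_id] = text
    return {"docs": merged_docs, "deleted": set()}

segments = [
    {"docs": {1: "winter is coming", 2: "ours is the fury"}, "deleted": {2}},
    {"docs": {2: "ours is the glory", 3: "the choice is yours"}, "deleted": set()},
]
merged = merge_segments(segments)
stored_before = sum(len(seg["docs"]) for seg in segments)
stored_after = len(merged["docs"])
print(stored_before, stored_after)
```

The two input segments store four document versions between them; the merged segment stores three, because the deleted version of document 2 is dropped.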
Searching Within a Shard
Shard search mirrors Lucene segment search, but shards may reside on different nodes, requiring network transfer.
One query may hit multiple shards, each searched independently.
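The scatter and gather across shards can be sketched as follows. Scoring by raw term frequency is a stand-in for Lucene's relevance scoring, and the shard contents and function names are assumptions.

```python
import heapq

def search_shard(shard_docs, term, k):
    # Score = term frequency (stand-in scoring); return the shard's
    # local top-k as (score, doc_id) pairs.
    scored = []
    for doc_id, text in shard_docs.items():
        tf = text.lower().split().count(term)
        if tf:
            scored.append((tf, doc_id))
    return heapq.nlargest(k, scored)

def search_index(shards, term, k):
    # Each shard is searched independently; only its top-k travels back
    # to be merged into the global top-k.
    partial = []
    for shard_docs in shards:
        partial.extend(search_shard(shard_docs, term, k))
    return heapq.nlargest(k, partial)

shards = [
    {"a1": "fury fury fury", "a2": "winter"},
    {"b1": "the fury", "b2": "fury and fury"},
]
print(search_index(shards, "fury", 2))
```

In a real cluster each call to `search_shard` would run on the node holding that shard, so only the small per-shard result lists cross the network.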
Log File Handling
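Creating log indices by timestamp (for example, one index per day) improves search speed, because a query only needs to touch the indices covering its time range, and it simplifies removing old data, because a whole index can simply be deleted.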
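A small sketch of the naming scheme, assuming one index per day named with a hypothetical `logs-YYYY.MM.DD` pattern. The helpers only compute index names; they do not talk to a cluster.

```python
from datetime import date, timedelta

def index_for(day, prefix="logs-"):
    # One index per day, e.g. logs-2016.03.01 (hypothetical naming pattern).
    return prefix + day.strftime("%Y.%m.%d")

def indices_for_range(start, end, prefix="logs-"):
    # Only the indices covering the queried time range need to be searched.
    days = (end - start).days
    return [index_for(start + timedelta(days=i), prefix) for i in range(days + 1)]

def expired_indices(existing, today, keep_days, prefix="logs-"):
    # Zero-padded year.month.day names sort chronologically as strings.
    cutoff = index_for(today - timedelta(days=keep_days), prefix)
    return sorted(n for n in existing if n.startswith(prefix) and n < cutoff)

names = indices_for_range(date(2016, 2, 28), date(2016, 3, 1))
print(names)
print(expired_indices(names, date(2016, 3, 1), keep_days=1))
```

Dropping an expired index removes its files outright, which is far cheaper than deleting the same documents one by one from a single large index.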
Scaling
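Shards are not split further, but they can be moved to other nodes. Because the number of primary shards is fixed when an index is created, growing beyond what the existing shards can absorb means reindexing all the data, so the shard count should be planned up front.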
Node Allocation and Shard Optimization
Allocate important indices to high‑performance machines.
Ensure each shard has replica copies.
Routing
Each node holds a routing table; the coordinator node directs requests to the appropriate shard and replica.
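Elasticsearch picks the shard for a document by hashing a routing value (the document id by default) modulo the number of primary shards. The sketch below uses CRC32 purely as a stand-in hash, and the function name is hypothetical.

```python
import zlib

def shard_for(routing_key, num_primary_shards):
    # Stand-in hash (assumption): Elasticsearch uses its own hash function,
    # not CRC32. The modulo step is the part being illustrated.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

doc_ids = ["doc-%d" % i for i in range(1000)]
placement_5 = {d: shard_for(d, 5) for d in doc_ids}
placement_6 = {d: shard_for(d, 6) for d in doc_ids}
moved = sum(1 for d in doc_ids if placement_5[d] != placement_6[d])
print(moved, "of", len(doc_ids), "documents would map to a different shard")
```

Since the shard number depends on the shard count, changing that count would send look-ups for existing documents to the wrong shard, which is why the count stays fixed and growing it means reindexing.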
A Real Request
Query
The query uses a filtered type with a multi_match clause.
Aggregation
Aggregates the top‑10 authors by hit count.
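A request of this shape might look like the following body, written here as a Python dict. The field names (`title`, `body`, `status`, `author`), the search text, and the aggregation name `top_authors` are assumptions; only the overall structure (a `filtered` query wrapping a `multi_match`, plus a `terms` aggregation of size 10) follows the description above.

```python
import json

request_body = {
    "query": {
        "filtered": {
            # Scored part: full-text match over several fields.
            "query": {
                "multi_match": {
                    "query": "the fury",
                    "fields": ["title", "body"],
                }
            },
            # Unscored part: a plain yes/no filter.
            "filter": {"term": {"status": "published"}},
        }
    },
    # Top 10 authors by number of matching documents.
    "aggs": {
        "top_authors": {"terms": {"field": "author", "size": 10}}
    },
}
print(json.dumps(request_body, indent=2))
```

The `terms` aggregation orders its buckets by document count by default, which gives the top authors by hit count.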
Request Dispatch
The request may be sent to any node in the cluster. The node that receives it becomes the coordinator for that request.
Coordinator Node
The coordinator uses the index metadata to decide:
Which shards the request must be routed to.
Which copy of each shard (primary or replica) is available to serve it.
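The coordinator's choice can be sketched as picking one live copy per shard. The routing-table layout, the class name, and the round-robin policy are all assumptions made for illustration.

```python
import itertools

class Coordinator:
    def __init__(self, routing_table):
        # routing_table: shard_id -> list of (node, is_available) copies
        self.routing_table = routing_table
        self._counter = itertools.count()

    def plan_search(self):
        # Pick one available copy per shard, rotating between copies on
        # successive requests (assumed round-robin policy).
        n = next(self._counter)
        plan = {}
        for shard_id, copies in self.routing_table.items():
            live = [node for node, up in copies if up]
            if not live:
                raise RuntimeError("no available copy of shard %d" % shard_id)
            plan[shard_id] = live[n % len(live)]
        return plan

table = {
    0: [("node1", True), ("node2", True)],
    1: [("node2", False), ("node3", True)],
}
coordinator = Coordinator(table)
first = coordinator.plan_search()
second = coordinator.plan_search()
print(first, second)
```

Shard 0 alternates between its two live copies, while shard 1 always goes to `node3` because its other copy is marked unavailable.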
Routing Diagram
Pre‑Search Processing
Elasticsearch converts the query to a Lucene query, then executes it across all segments.
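Filters can be used at any time and their results can be cached. Queries are not cached, so an application that repeats the same query has to cache the results itself, and a query should be used only when scoring is needed.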
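Filter results are easy to cache because a filter gives a yes/no answer per document and segments are immutable. A minimal sketch, assuming the cache is keyed by segment, field, and value; the class name and layout are hypothetical.

```python
class FilterCache:
    def __init__(self):
        self._cache = {}
        self.misses = 0

    def matching_docs(self, segment_id, field, value, column):
        # column: doc_id -> field value for one segment. Because a segment
        # never changes, a cached entry for it never goes stale.
        key = (segment_id, field, value)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = frozenset(
                d for d, v in column.items() if v == value
            )
        return self._cache[key]

cache = FilterCache()
authors = {1: "stark", 2: "baratheon", 3: "stark"}
first = cache.matching_docs("seg0", "author", "stark", authors)
second = cache.matching_docs("seg0", "author", "stark", authors)
print(sorted(first), cache.misses)
```

The second call is served from the cache. A scored query cannot be reused as simply, since its result carries per-document scores tied to the exact query.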
Return Path
Results travel back up the hierarchy to the client.
References
SlideShare: Elasticsearch From the Bottom Up
YouTube: Elasticsearch from the bottom up
Wikipedia: Document‑term matrix
Wikipedia: Search engine indexing
Skip list
Stanford: Faster postings list intersection via skip pointers
StackOverflow: How an search index works when querying many words?
StackOverflow: How does Lucene calculate intersection of documents so fast?
LinkedIn: Lucene and its magical indexes
misspellings 2.0c: A tool to detect misspellings
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
