Understanding Elasticsearch Architecture: Inverted Index, Term Dictionary, Segments, and Distributed Search
This article explains how Elasticsearch transforms simple keyword matching into a high‑performance, scalable search engine by using inverted indexes, term dictionaries, posting lists, term indexes, stored fields, doc values, segments, and distributed node architectures to achieve fast, reliable full‑text search on massive data sets.
The article demonstrates how to move from a naïve linear scan for a keyword like "xiaobai" to an efficient, distributed search solution using Elasticsearch (ES), an open‑source search engine built on top of Lucene.
What is Elasticsearch
Elasticsearch (ES) is an open‑source search engine that sits between applications and data, allowing applications to retrieve data via keyword queries much like a web search engine.
What is an Inverted Index
An inverted index maps terms (tokens) to the IDs of the documents that contain them. The article shows a simple example where three text snippets are tokenized into terms such as I, like, and xiaobai, and a posting list records which document IDs each term appears in.
I like xiaobai (like)
I follow xiaobai (follow)
I forward the video (forward)
Searching for a term like xiaobai then reduces to looking up its posting list, which yields document IDs 0 and 1. To avoid linear-time scans (O(N)), the term dictionary can be sorted and binary-searched, achieving O(log N) lookup.
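As a minimal sketch of the idea, the three snippets can be turned into an inverted index in a few lines of Python. Whitespace tokenization and the function name build_inverted_index are assumptions made for illustration; real analyzers do considerably more.

```python
from collections import defaultdict

# The three example snippets; the list position doubles as the document ID.
docs = ["I like xiaobai", "I follow xiaobai", "I forward the video"]


def build_inverted_index(documents):
    """Map each term to the ascending list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Splitting on whitespace is a simplification of real tokenization.
        # set() ensures a term is recorded once per document.
        for term in set(text.split()):
            index[term].append(doc_id)
    return dict(index)


inverted_index = build_inverted_index(docs)
print(inverted_index["xiaobai"])  # [0, 1]
```

Because documents are visited in ID order, every posting list comes out sorted without an extra sorting step.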
Term Dictionary and Posting List
The sorted term dictionary together with posting lists forms the core of the inverted index. The article includes a table showing terms and their associated document IDs.
term        document IDs
I           0, 1, 2
like        0
xiaobai     0, 1
follow      1
forward     2
the         2
video       2
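The lookup over a sorted term dictionary can be sketched with Python's standard bisect module. The parallel-list layout and the lookup function are assumptions for illustration, not how Lucene lays out its files.

```python
from bisect import bisect_left

# Term dictionary in sorted order with a parallel list of posting lists.
# Python compares by code point, so the uppercase "I" sorts first.
terms = ["I", "follow", "forward", "like", "the", "video", "xiaobai"]
postings = [[0, 1, 2], [1], [2], [0], [2], [2], [0, 1]]


def lookup(term):
    """Binary-search the sorted dictionary: O(log N) comparisons for N terms."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []  # term does not occur in any document


print(lookup("xiaobai"))  # [0, 1]
print(lookup("unknown"))  # []
```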
Term Index
Because many terms share common prefixes (e.g., follow and forward share fo), a term index keeps only a prefix tree in memory, with each entry pointing to the on-disk location of the full terms. This reduces memory usage while still enabling fast term lookup.
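The following sketch illustrates the principle with a deliberately simplified structure: a flat dictionary of fixed-length prefixes instead of a real prefix tree. The constant PREFIX_LEN, the list standing in for the on-disk dictionary, and both function names are hypothetical.

```python
# Stand-in for the sorted term dictionary that lives on disk.
on_disk_terms = ["I", "follow", "forward", "like", "the", "video", "xiaobai"]

PREFIX_LEN = 2  # hypothetical prefix length chosen for this sketch


def build_term_index(sorted_terms):
    """Keep in memory only each prefix and the offset of its first term."""
    term_index = {}
    for offset, term in enumerate(sorted_terms):
        term_index.setdefault(term[:PREFIX_LEN], offset)
    return term_index


term_index = build_term_index(on_disk_terms)


def find_term(term):
    """Jump to the prefix's offset, then scan only terms sharing that prefix."""
    prefix = term[:PREFIX_LEN]
    offset = term_index.get(prefix)
    if offset is None:
        return None
    while offset < len(on_disk_terms) and on_disk_terms[offset].startswith(prefix):
        if on_disk_terms[offset] == term:
            return offset
        offset += 1
    return None


print(term_index["fo"])      # 1
print(find_term("forward"))  # 2
```

The in-memory structure holds six short prefixes rather than seven full terms; on a realistic vocabulary the saving is far larger, since many terms share each prefix.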
Stored Fields
While the inverted index returns document IDs, the actual document contents are kept in Stored Fields, a row-oriented storage that allows the system to retrieve the full source when needed.
Doc Values
For operations such as sorting or aggregations, fields are also stored in a column-oriented structure called Doc Values, which enables efficient access without scanning entire documents.
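The contrast between the row-oriented and column-oriented layouts can be sketched as follows. The numeric views field and its values are invented for this example and do not come from the article.

```python
# Stored fields: row-oriented, one complete record per document ID.
stored_fields = {
    0: {"text": "I like xiaobai", "views": 30},
    1: {"text": "I follow xiaobai", "views": 10},
    2: {"text": "I forward the video", "views": 20},
}

# Doc values: column-oriented, one array per field, indexed by document ID.
doc_values = {"views": [30, 10, 20]}


def sort_by_field(doc_ids, field):
    """Sort using the single column only; full documents are never touched."""
    column = doc_values[field]
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])


def fetch(doc_id):
    """Retrieve the whole original record from the row-oriented store."""
    return stored_fields[doc_id]


print(sort_by_field([0, 1, 2], "views"))  # [1, 2, 0]
print(fetch(1)["text"])                   # I follow xiaobai
```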
Segment
A segment is the smallest searchable unit in Lucene/ES and contains its own inverted index, term index, stored fields, and doc values. Segments are immutable; new data creates new segments.
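A rough model of immutable segments, assuming a segment holds just an inverted index and stored fields (term index and doc values are left out for brevity). The Segment class and helper names are hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class Segment:
    inverted_index: MappingProxyType  # term -> tuple of segment-local doc IDs
    stored_fields: tuple              # original text per segment-local doc ID


def make_segment(documents):
    """Build a read-only segment from a batch of documents."""
    index = {}
    for doc_id, text in enumerate(documents):
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    frozen = MappingProxyType({t: tuple(ids) for t, ids in index.items()})
    return Segment(frozen, tuple(documents))


def search(segments, term):
    """Query every segment in turn and collect the matching documents."""
    hits = []
    for segment in segments:
        for doc_id in segment.inverted_index.get(term, ()):
            hits.append(segment.stored_fields[doc_id])
    return hits


segments = [make_segment(["I like xiaobai", "I follow xiaobai"])]
# New data never modifies an existing segment; it becomes a new one.
segments.append(make_segment(["I forward the video"]))

print(search(segments, "xiaobai"))  # ['I like xiaobai', 'I follow xiaobai']
```

A search has to visit every segment, which is why the number of segments affects query cost.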
Lucene
Lucene is the single-node text-search library underlying ES. It stores data in immutable segments: because existing segments are never modified in place, searches can keep reading them while new writes and updates go into new segments.
High Performance
To avoid contention, ES separates data by Index Name (similar to Kafka topics) and further splits each index into multiple shards. Each shard is an independent Lucene instance, allowing parallel reads and writes.
High Scalability
Shards can be distributed across multiple nodes. Adding more nodes spreads the load, improving CPU and memory utilization.
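To illustrate how adding nodes spreads shards out, here is a sketch that assumes plain round-robin placement. The actual allocation logic in ES weighs more factors, and the node names are hypothetical.

```python
def allocate(num_shards, nodes):
    """Assign shard numbers to nodes round-robin (a simplified stand-in)."""
    assignment = {node: [] for node in nodes}
    for shard in range(num_shards):
        assignment[nodes[shard % len(nodes)]].append(shard)
    return assignment


print(allocate(6, ["node-1", "node-2"]))
# {'node-1': [0, 2, 4], 'node-2': [1, 3, 5]}

print(allocate(6, ["node-1", "node-2", "node-3"]))
# {'node-1': [0, 3], 'node-2': [1, 4], 'node-3': [2, 5]}
```

Going from two nodes to three lowers the per-node load from three shards to two.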
High Availability
Each primary shard has one or more replica shards. Replicas serve read requests and can be promoted to primary if the primary fails, ensuring continuous service.
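Replica promotion can be sketched as a small state change. The dictionary layout and the rule of promoting the first remaining replica are assumptions for illustration; they do not describe how ES picks the new primary.

```python
def handle_node_failure(shard, failed_node):
    """Return the shard's copy layout after failed_node leaves the cluster."""
    replicas = [node for node in shard["replicas"] if node != failed_node]
    primary = shard["primary"]
    if primary == failed_node:
        if not replicas:
            raise RuntimeError("no replica left to promote")
        # Promote one surviving replica so the shard keeps serving requests.
        primary, replicas = replicas[0], replicas[1:]
    return {"primary": primary, "replicas": replicas}


shard = {"primary": "node-1", "replicas": ["node-2", "node-3"]}
print(handle_node_failure(shard, "node-1"))
# {'primary': 'node-2', 'replicas': ['node-3']}
```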
Node Role Separation
Nodes can assume specific roles: Master Node (cluster management), Data Node (stores shards), and Coordinating Node (routes client requests). In small clusters a node may play multiple roles; in larger clusters roles are separated for efficiency.
Decentralized Coordination
Instead of relying on an external coordination service such as ZooKeeper, ES can use a Raft-like consensus algorithm among its own nodes for leader election and cluster-state synchronization, achieving a decentralized architecture.
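The core rule shared by majority-based consensus algorithms of this kind is that a leader needs votes from more than half of the voting nodes. The sketch below shows only that rule; it is not the election protocol ES implements, and the function names are hypothetical.

```python
from collections import Counter


def quorum(num_voting_nodes):
    """Smallest number of votes that forms a strict majority."""
    return num_voting_nodes // 2 + 1


def elect(votes, num_voting_nodes):
    """Return the candidate holding a majority of votes, or None.

    votes maps each voter to the candidate it voted for.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    return candidate if count >= quorum(num_voting_nodes) else None


print(elect({"n1": "n1", "n2": "n1", "n3": "n3"}, 3))  # n1
print(elect({"n1": "n1", "n2": "n2"}, 3))              # None
```

Requiring a strict majority means two disjoint groups of nodes can never both elect a leader at the same time.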
ES vs. Kafka Architecture
The article maps ES concepts to Kafka equivalents: Index Name ↔ topic, Shard ↔ partition, Node ↔ broker, highlighting the similarity of their designs.
ES Write Process
Client sends a write request to a coordinating node.
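The coordinating node hashes the document's routing value (by default, the document ID) to determine the target shard, then forwards the request to the data node holding that shard's primary.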
The primary shard indexes the document in Lucene, where it ends up in a new segment with its own inverted index, stored fields, and doc values.
The primary shard replicates the write to its replica shards.
After replicas acknowledge, the coordinating node returns an ACK to the client.
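The write steps above can be sketched as a small in-memory simulation. Everything here is an assumption for illustration: the class names, the shard count, and the use of zlib.crc32 as a deterministic stand-in for the real routing hash.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical shard count for this sketch


def route(doc_id):
    """Pick a shard by hashing the document ID (crc32 is only a stand-in)."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_PRIMARY_SHARDS


class ShardCopy:
    """One copy of a shard, reduced to a dictionary of documents."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        self.docs[doc_id] = source
        return True  # acknowledgement


class ReplicationGroup:
    """A primary shard together with its replica copies."""

    def __init__(self, num_replicas):
        self.primary = ShardCopy()
        self.replicas = [ShardCopy() for _ in range(num_replicas)]

    def write(self, doc_id, source):
        self.primary.index(doc_id, source)                        # primary first
        acks = [r.index(doc_id, source) for r in self.replicas]   # then replicas
        return all(acks)


groups = [ReplicationGroup(num_replicas=1) for _ in range(NUM_PRIMARY_SHARDS)]


def coordinate_write(doc_id, source):
    """Coordinating-node role: route, delegate, and report the ACK."""
    shard = route(doc_id)
    acknowledged = groups[shard].write(doc_id, source)
    return {"shard": shard, "acknowledged": acknowledged}


result = coordinate_write("doc-1", {"text": "I like xiaobai"})
print(result["acknowledged"])  # True
```

Hashing the same ID always selects the same shard, which is what later lets a request for that document be routed without searching every shard.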
ES Search Process
The search consists of two phases: Query Phase and Fetch Phase.
Query Phase
Client sends a search request to a coordinating node.
The coordinating node routes the request to relevant shards based on the index name.
Each shard concurrently searches its segments using inverted indexes to obtain matching document IDs and uses doc values for sorting.
Shard results are sent back to the coordinating node, which merges and sorts them.
Fetch Phase
The coordinating node requests the full documents (stored fields) for the top‑ranked IDs from the appropriate shards.
Shards return the complete documents, and the coordinating node forwards them to the client.
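The two phases can be sketched as a scatter-gather over in-memory shards. The relevance scores, document IDs, and the third document's text are invented for this example, and the shards are queried one after another here rather than concurrently.

```python
import heapq

# Each shard maps doc ID -> (score, full document). All values are made up.
shards = {
    0: {"a": (0.9, "I like xiaobai"), "b": (0.4, "I follow xiaobai")},
    1: {"c": (0.7, "xiaobai posts a video")},
}


def query_phase(shard, size):
    """Return only (score, doc_id) pairs for the shard's local top hits."""
    pairs = ((score, doc_id) for doc_id, (score, _source) in shard.items())
    return heapq.nlargest(size, pairs)


def fetch_phase(shard_id, doc_id):
    """Return the full document for one ID from the shard that owns it."""
    return shards[shard_id][doc_id][1]


def search(size):
    """Coordinating-node role: gather IDs, merge, then fetch the winners."""
    candidates = []
    for shard_id, shard in shards.items():
        for score, doc_id in query_phase(shard, size):
            candidates.append((score, shard_id, doc_id))
    top = heapq.nlargest(size, candidates)  # global merge and sort
    return [fetch_phase(shard_id, doc_id) for _score, shard_id, doc_id in top]


print(search(2))  # ['I like xiaobai', 'xiaobai posts a video']
```

Only IDs and scores travel in the first phase; full documents are transferred for the final top hits alone, which keeps network traffic small.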
In summary, Elasticsearch builds a distributed search engine by combining Lucene’s immutable segments (inverted index, term index, stored fields, doc values) with sharding, replication, node role separation, and optional decentralized coordination, achieving high performance, scalability, and availability.
IT Services Circle