Master ElasticSearch: Core Concepts, Architecture, and Search Process Explained
This article provides a comprehensive overview of ElasticSearch, covering its role as a distributed full‑text search engine built on Lucene, key concepts such as index, type, document, field, shard and replica, the analysis pipeline, inverted index mechanics, and the two‑phase query‑fetch search workflow.
What is ElasticSearch?
ElasticSearch is a distributed full‑text search engine built on Apache Lucene, widely used in big‑data scenarios.
Core Concepts
Index : a collection of documents with similar characteristics, consisting of mapping and inverted‑index files; data may reside on one or many nodes.
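Type : logical grouping of similar documents within an index, analogous to a table in a relational database. Mapping types were deprecated starting with ElasticSearch 6.x and removed in 8.0, so a current index effectively holds a single document type.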
Document : the basic searchable unit, represented as JSON, similar to a row.
Field : the smallest unit inside a document, comparable to a column.
Shard : a slice of an index that enables horizontal scaling; each shard is a physical Lucene index.
Replica : a copy of a primary shard that provides fault tolerance and can serve read requests.
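To make the shard concept concrete, here is a minimal sketch of how documents could be spread across primary shards. It assumes the simplified routing rule "hash of the routing value modulo the number of primary shards", with the document ID as the routing value. `zlib.crc32` is only a stand-in for the hash ElasticSearch uses internally, and the names `route_to_shard` and `distribute` are hypothetical.

```python
import zlib


def route_to_shard(routing_value, num_primary_shards):
    """Pick a primary shard number for a routing value.

    Assumption: crc32 stands in for the real hash function; only the
    "hash modulo shard count" idea is being illustrated.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard they would be routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[route_to_shard(doc_id, num_primary_shards)].append(doc_id)
    return shards


# Ten hypothetical documents spread over three primary shards.
layout = distribute([f"doc-{i}" for i in range(10)], num_primary_shards=3)
for shard_number, ids in layout.items():
    print(shard_number, ids)
```

Because the same ID always maps to the same shard for a given shard count, changing the number of primary shards would move documents around, which is one reason the primary shard count is chosen when the index is created.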
Analysis Process
ElasticSearch uses an analyzer composed of three components:
Character filter : preprocesses raw text (e.g., removes HTML tags).
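Tokenizer : splits text into tokens; the default standard tokenizer breaks text on Unicode word boundaries, so English is separated at whitespace and punctuation while Chinese is emitted one character at a time. Optional plugin tokenizers (dictionary‑based or machine‑learning based) provide proper Chinese word segmentation.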
Token filter : further processes tokens (e.g., lower‑casing, stop‑word removal).
Built‑in analyzers include Standard, Simple, Stop, Whitespace, Keyword, Pattern, and language‑specific analyzers.
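The three analyzer stages described above can be sketched in plain Python. This is an illustration of the pipeline order only, not ElasticSearch's implementation: the regular expressions, the tiny `STOP_WORDS` set, and all function names are assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for the example.
STOP_WORDS = {"a", "an", "the", "is", "of", "and"}


def char_filter(text):
    """Character filter: replace HTML tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Tokenizer: emit runs of word characters as tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Token filters: lower-case every token, then drop stop words."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return token_filters(tokenize(char_filter(text)))


print(analyze("<p>The Quick brown fox is FAST</p>"))
# prints ['quick', 'brown', 'fox', 'fast']
```

The same analysis normally has to be applied to both indexed text and query text, otherwise a query for "Quick" would not match the stored term "quick".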
Inverted Index
The inverted index maps terms to the list of document IDs containing them, enabling fast full‑text search, in contrast to a forward index that maps document IDs to their content.
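As a rough sketch of the term-to-document mapping described above, the following builds an inverted index from a small forward index (document ID to text) and answers a query by intersecting posting sets. The sample documents, the whitespace-only tokenization, and the function names are assumptions for illustration; a real Lucene index also stores positions, frequencies, and other data omitted here.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Forward index: document ID -> content (hypothetical sample data).
docs = {
    1: "ElasticSearch is built on Lucene",
    2: "Lucene uses an inverted index",
    3: "a forward index maps documents to content",
}

index = build_inverted_index(docs)
print(search(index, "lucene"))          # {1, 2}
print(search(index, "inverted index"))  # {2}
```

A lookup touches only the posting sets of the query terms, whereas answering the same query from the forward index would require scanning the text of every document.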
Search Workflow
Search executes in two phases:
Query phase
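The coordinating node forwards the request to one copy (either the primary or a replica) of every relevant shard.
Each shard performs the query locally and builds a priority queue of its best matching documents; the queue holds from + size entries, where from is the result offset and size is the page size.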
Shards return document IDs and scores; the coordinating node merges, sorts, and paginates the results.
Fetch phase
The coordinating node retrieves the actual document source for the selected IDs from the appropriate shards and returns the final result set to the client.
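The two phases can be simulated in a few lines of Python. This is a sketch under stated assumptions: shards are modelled as in-memory dictionaries with precomputed scores, and `SHARDS`, `query_phase`, and `query_then_fetch` are hypothetical names rather than ElasticSearch APIs.

```python
import heapq

# Hypothetical shards: doc_id -> (score, source document).
SHARDS = {
    0: {"a": (0.9, {"title": "A"}),
        "b": (0.4, {"title": "B"}),
        "c": (0.7, {"title": "C"})},
    1: {"d": (0.8, {"title": "D"}),
        "e": (0.3, {"title": "E"}),
        "f": (0.6, {"title": "F"})},
}


def query_phase(shard, from_, size):
    """Return the shard's top (from + size) hits as (score, doc_id) pairs."""
    hits = [(score, doc_id) for doc_id, (score, _source) in shard.items()]
    return heapq.nlargest(from_ + size, hits)


def query_then_fetch(shards, from_=0, size=2):
    # Query phase: every shard reports only IDs and scores.
    candidates = []
    for shard_id, shard in shards.items():
        for score, doc_id in query_phase(shard, from_, size):
            candidates.append((score, doc_id, shard_id))

    # The coordinating node merges, sorts globally, and cuts out the page.
    candidates.sort(reverse=True)
    page = candidates[from_:from_ + size]

    # Fetch phase: load the source only for documents on the page.
    return [shards[shard_id][doc_id][1] for _score, doc_id, shard_id in page]


print(query_then_fetch(SHARDS, from_=0, size=2))
# prints [{'title': 'A'}, {'title': 'D'}]
```

Each shard must return from + size candidates because the coordinating node cannot know in advance which shard holds the globally best hits; this is also why very deep pagination becomes expensive.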
Mike Chen's Internet Architecture
Over ten years of BAT architecture experience, shared generously!
