Mastering Elasticsearch: Core Concepts, Cluster Architecture, and Indexing Mechanics
This article explains Elasticsearch’s fundamental building blocks, cluster roles, shard and replica strategies, master election, split‑brain prevention, inverted index structure, and the complete search and indexing lifecycle for handling large‑scale data efficiently.
Fundamentals of Elasticsearch
Elasticsearch is an open‑source, distributed search and analytics engine built on top of Apache Lucene. It stores data as JSON documents, supports near‑real‑time search, and scales horizontally by distributing shards across multiple nodes.
Core Concepts
Index: a logical namespace, analogous to a relational table; its mappings define the field types it contains.
Document: a JSON record stored in an index, identified by a unique _id.
Field: a key-value pair inside a document; values can be primitives, objects, or arrays.
Shard: a self-contained Lucene index. An Elasticsearch index is split into one or more primary shards, and each primary can have zero or more replica shards.
Replica: a copy of a primary shard that provides fault tolerance and additional read capacity. A replica is never allocated to the same node as its primary.
Cluster: a group of Elasticsearch nodes that share the same cluster name and cooperate to store and search data.
Node: a single running Elasticsearch JVM process. A node can take one or more roles (master-eligible, data, coordinating, ingest).
Node Roles
Master-eligible node: can be elected master via the discovery module (Zen Discovery in pre-7.x versions); the elected master manages index metadata, shard allocation, and cluster-state changes.
Data node: holds primary and replica shards; performs indexing, search, and aggregations, so it needs ample CPU, memory, and disk I/O.
Coordinating node: receives client requests, routes them to the relevant data nodes, and merges the partial results into the final response.
Sharding and Replication
An index is divided into a configurable number of primary shards (default 1 since version 7.0, previously 5). The primary-shard count is fixed at index creation and cannot be changed afterward without reindexing.
Each primary shard can have n replicas (default 1). Replicas increase read throughput and provide high availability, but add indexing overhead because every write must also be applied on each replica.
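As a sketch of the trade-off above, the following snippet shows illustrative index settings (the setting names are real Elasticsearch keys; the values are arbitrary examples) and the total number of shard copies the cluster must allocate:

```python
# Illustrative index-creation settings. number_of_shards is fixed at
# creation time; number_of_replicas can be changed on a live index.
settings = {
    "settings": {
        "number_of_shards": 3,    # primary shards, fixed at index creation
        "number_of_replicas": 1,  # replica copies per primary, adjustable
    }
}

def total_shards(primaries: int, replicas: int) -> int:
    """Total shard copies the cluster must allocate for one index."""
    return primaries * (1 + replicas)

print(total_shards(3, 1))  # 3 primaries + 3 replicas -> 6
```

With 3 primaries and 1 replica, the cluster places 6 shards in total; each extra replica adds another full copy of the data and of the indexing work.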
Master Election and Split‑Brain Prevention
Nodes discover each other via unicast ping. Among the master-eligible nodes that respond, the one with the lowest node ID is elected master once a quorum is reached.
Quorum = floor(master-eligible-nodes / 2) + 1. An election proceeds only when at least a quorum of master-eligible nodes is visible, which prevents split-brain scenarios where two network partitions each elect their own master.
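The quorum formula is simple enough to verify directly:

```python
def quorum(master_eligible_nodes: int) -> int:
    # floor(n / 2) + 1, as defined above
    return master_eligible_nodes // 2 + 1

# With 3 master-eligible nodes, a partition holding 2 of them can still
# elect a master, while the isolated single node cannot -- no split brain.
print(quorum(3))  # -> 2
print(quorum(5))  # -> 3
```

This is also why an odd number of master-eligible nodes (typically 3) is recommended: with 2 nodes the quorum is 2, so losing either node halts elections.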
Inverted Index File Types (Lucene)
.tim: term dictionary (the sorted list of terms).
.tip: term index (in-memory pointers into the term dictionary).
.doc: postings lists, i.e. the document IDs and term frequencies for each term.
.fnm: field metadata.
.fdx: stored-fields index (offsets into the .fdt file).
.fdt: stored field values.
.dvd / .dvm: doc values, used for fast sorting and aggregations.
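A toy in-memory version makes the roles of the term dictionary and postings lists concrete. This is an illustrative sketch only, not Lucene's on-disk encoding:

```python
from collections import Counter, defaultdict

def build_index(docs: dict) -> dict:
    """Toy inverted index: term -> {doc_id: term_frequency}.

    Conceptually, the outer keys mirror the term dictionary (.tim)
    and the inner maps mirror the postings lists (.doc).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)

docs = {1: "fast search engine", 2: "search at scale"}
index = build_index(docs)
print(index["search"])  # -> {1: 1, 2: 1}
```

Looking up a term is then a single dictionary access that immediately yields every document containing it, which is the whole point of inverting the document-to-term relationship.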
Search (Query) Flow
Client sends a request to a coordinating node.
The coordinating node forwards the request to the relevant shards on data nodes.
Each shard executes the query locally and returns the IDs and scores (or sort values) of its top matching documents.
The coordinating node merges the partial results into a single globally sorted list, then sends fetch requests to the shards holding the winning documents to retrieve their full contents.
Fetched documents are returned to the client.
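The steps above (the query-then-fetch pattern) can be simulated with plain data structures. The shard contents and scores here are made-up examples:

```python
import heapq

# Each "shard" maps doc_id -> (document body, relevance score).
shards = [
    {"d1": ("quick fox", 2.3), "d2": ("lazy dog", 0.9)},
    {"d3": ("quick dog", 1.7)},
]

def query_phase(shard, size):
    # Query phase: each shard returns only (score, doc_id) pairs for its
    # local top-N -- no document bodies cross the wire yet.
    return heapq.nlargest(size, [(score, doc_id)
                                 for doc_id, (_, score) in shard.items()])

def search(size=2):
    # Coordinating node merges per-shard results into a global top-N...
    merged = heapq.nlargest(size, [hit for s in shards
                                   for hit in query_phase(s, size)])
    # ...then the fetch phase retrieves the full documents for the winners.
    by_id = {doc_id: text for s in shards
             for doc_id, (text, _) in s.items()}
    return [(doc_id, by_id[doc_id]) for _, doc_id in merged]

print(search())  # -> [('d1', 'quick fox'), ('d3', 'quick dog')]
```

Note that every shard must compute a full local top-N even though most of those hits are discarded in the merge, which is why deep pagination is expensive.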
Indexing Flow
Client issues create, update, or delete to a coordinating node.
The coordinating node routes the operation to the appropriate primary shard based on the routing key (default _id hash).
The primary shard writes the operation to its in‑memory buffer and translog, then forwards it to its replicas.
Once the in-sync replicas acknowledge the operation, the primary reports success and the coordinating node replies to the client.
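Step 2's routing can be sketched as a hash modulo the primary-shard count. Elasticsearch actually uses a Murmur3 hash of the routing value; md5 stands in here purely for illustration:

```python
import hashlib

def route(routing_value: str, num_primaries: int) -> int:
    """Pick the primary shard for a routing value (default: the _id).

    Simplified stand-in for shard = hash(_routing) % num_primary_shards;
    real Elasticsearch uses Murmur3, not md5.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_primaries

num_primaries = 3
for doc_id in ["a", "b", "c"]:
    shard = route(doc_id, num_primaries)
    assert 0 <= shard < num_primaries  # each doc maps to exactly one primary
```

Because the shard is derived from a hash modulo the primary count, changing the number of primaries would re-route existing documents, which is exactly why that number is fixed at index creation.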
Refresh, Flush, and Segment Merging
Writes are first stored in a memory buffer and recorded in the transaction log (translog); no inverted index is built yet.
Every refresh (default 1 second) the buffer is flushed to the OS cache, a new segment is created, and the buffer is cleared, making the data visible for search.
Excessive small segments degrade performance. Elasticsearch periodically merges segments into larger ones; deletions become permanent only after a merge.
How the translog is fsynced to disk is configurable: with the default request durability, every write is fsynced before it is acknowledged; with async durability, the fsync runs on a timer (default every 5 seconds), so a node crash can lose up to 5 seconds of writes in exchange for higher indexing throughput.
A flush commits the OS-cached segments durably to disk and clears the translog; it is also triggered automatically when the translog reaches its size threshold (default 512 MB).
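The knobs described above correspond to real index settings; the values below are just the defaults discussed in this section, collected for reference:

```python
# Visibility/durability trade-offs as index settings. "request" fsyncs the
# translog on every operation; "async" fsyncs on a timer and can lose up
# to sync_interval worth of acknowledged writes after a crash.
translog_settings = {
    "index.refresh_interval": "1s",             # how often new segments become searchable
    "index.translog.durability": "request",     # or "async" for higher throughput
    "index.translog.sync_interval": "5s",       # fsync cadence when durability=async
    "index.translog.flush_threshold_size": "512mb",  # translog size that triggers a flush
}
```

Raising refresh_interval (or setting durability to async) is a common bulk-loading optimization, at the cost of search freshness and crash durability respectively.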