Elasticsearch Architecture: Inverted Index, Sharding, and Data Operations
This article explains the core concepts of Elasticsearch, including how its inverted index works, the structure of term dictionaries and posting lists, shard and replica configuration, cluster node roles, the detailed write, refresh, flush, and merge processes, as well as how search queries are executed across distributed shards.
Elasticsearch uses a clustered architecture similar to Kafka and Redis Cluster. Each index is backed by inverted indexes, each consisting of a term dictionary and posting lists: the term dictionary maps every term to its posting list, which records the IDs of the documents containing the term along with term frequencies, positions, and offsets.
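The structure can be illustrated with a minimal sketch (this is a conceptual model, not Lucene's actual on-disk format): a term dictionary mapping each term to a posting list of (document ID, term frequency, positions) entries.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Toy inverted index: term -> [(doc_id, term_freq, [positions]), ...]."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {1: "quick brown fox", 2: "quick quick dog"}
index = build_inverted_index(docs)
print(index["quick"])  # [(1, 1, [0]), (2, 2, [0, 1])]
```

A query for "quick" only has to look up one dictionary entry and walk its posting list, rather than scan every document; the stored positions are what make phrase and proximity queries possible.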
Each field in an Elasticsearch JSON document has its own inverted index; fields can be excluded from indexing to save storage at the cost of searchability.
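As a hypothetical illustration (the index and field names here are invented), a mapping can disable indexing for a field that never needs to be searched:

```
PUT /sku_index
{
  "mappings": {
    "properties": {
      "internal_note": { "type": "keyword", "index": false }
    }
  }
}
```

The field is still stored in `_source` and returned with the document, but no inverted index is built for it, so it cannot be queried.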
Data is distributed across shards. An index can be split into multiple shards, each consisting of one primary shard and one or more replica shards. In a typical three-node cluster (esnode1, esnode2, esnode3), shards are spread across the nodes, and one node is elected master to manage cluster metadata and shard allocation.
When a document is written, the client contacts a coordinating node, which determines the target primary shard by hashing the routing value (the document ID by default) and routes the request to that shard. The primary shard writes the document and forwards the operation to its replica shards; the client receives a success response only after the primary and its in-sync replicas have acknowledged the change.
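The routing rule is conceptually `shard = hash(_routing) % number_of_primary_shards`. A small sketch of the idea (Elasticsearch uses murmur3; `crc32` stands in here purely for illustration):

```python
import zlib

def route(doc_id: str, num_primary_shards: int = 3) -> int:
    """Map a document ID to a primary shard number, as in
    shard = hash(_routing) % number_of_primary_shards."""
    return zlib.crc32(doc_id.encode()) % num_primary_shards

# Every request carrying the same ID deterministically lands on
# the same primary shard, so reads and writes stay consistent.
print(route("sku-1001") == route("sku-1001"))  # True
```

This formula is also why `number_of_shards` cannot be changed on a live index: altering the modulus would silently reroute existing documents to the wrong shards.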
Shard and replica counts are set when the index is created; `number_of_shards` is fixed for the lifetime of the index, while `number_of_replicas` can later be changed through the `_settings` API.

PUT /sku_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
Response:

{
  "acknowledged" : true
}

Elasticsearch writes are first buffered in memory and recorded in a translog (transaction log) file. A refresh (every 1 s by default) makes the in-memory data visible to searches by writing a new segment into the filesystem cache. A flush (by default every 30 min, or when the translog reaches 512 MB) persists segments to disk, creates a commit point, and truncates the translog.
Deletion creates a *.del file that marks documents as deleted rather than removing them immediately; an update is implemented as a delete of the old version followed by an index of the new one. Periodic merge operations combine multiple segment files into larger ones, physically removing the marked documents and producing a new commit point.
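A toy model of a merge: each segment is a map of document ID to document, and a shared set of IDs plays the role of the *.del tombstones. Merging rewrites live documents into one new segment and drops the tombstoned ones for good.

```python
def merge_segments(segments, deleted):
    """Combine several segments into one, skipping docs marked deleted."""
    merged = {}
    for segment in segments:
        for doc_id, doc in segment.items():
            if doc_id not in deleted:
                merged[doc_id] = doc
    return merged

segments = [{1: "doc 1", 2: "doc 2"}, {3: "doc 3"}]
deleted = {2}  # doc 2 carries a tombstone in the .del file
merged = merge_segments(segments, deleted)
print(merged)  # {1: 'doc 1', 3: 'doc 3'}
```

Until a merge runs, "deleted" documents still occupy disk space and are filtered out of search results at query time, which is why deletes in Elasticsearch do not free space immediately.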
Search queries are executed in a query-then-fetch flow: the coordinating node forwards the query to one copy of each relevant shard; each shard runs the query phase and returns the IDs and sort values of its matching documents; the coordinating node then merges, sorts, and paginates these lightweight results and runs the fetch phase to retrieve the full documents from the shards that hold them.
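The flow above can be sketched in a few lines (a simplification: real shards also apply `from + size` locally before returning hits to the coordinator):

```python
def search(shard_hits, documents, size=2, frm=0):
    """Query-then-fetch sketch: merge lightweight per-shard hits,
    sort and paginate globally, then fetch only the final page."""
    # Query phase: each shard has returned (doc_id, score) pairs only.
    hits = [hit for shard in shard_hits for hit in shard]
    # Coordinating node: global sort by score, then pagination.
    page = sorted(hits, key=lambda h: h[1], reverse=True)[frm:frm + size]
    # Fetch phase: retrieve full documents for just the selected IDs.
    return [documents[doc_id] for doc_id, _ in page]

shard_hits = [[(1, 0.9), (2, 0.4)], [(3, 0.7)]]  # hits from two shards
docs = {1: "doc one", 2: "doc two", 3: "doc three"}
print(search(shard_hits, docs))  # ['doc one', 'doc three']
```

Splitting query and fetch keeps network traffic small: full documents cross the wire only for the page actually returned to the client, not for every match on every shard.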
The article also includes practical tips, such as configuring shard count and replica number, handling node failures (master re‑election and replica promotion), and understanding the near‑real‑time nature of Elasticsearch where newly indexed data becomes searchable after the refresh interval.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.