How Elasticsearch Writes, Reads, and Searches Data: Deep Dive into ES Internals

This article explains Elasticsearch's core mechanisms for indexing, querying, and searching data, covering the roles of coordinating nodes, primary and replica shards, refresh cycles, translog, commit/flush processes, and the underlying Lucene inverted index.

Java Interview Crash Guide

Sep 23, 2021

Interview Questions

How does ES write data? How does ES query data? Can you briefly describe the underlying Lucene library? Do you understand the inverted index?

Interviewer Psychology

The interviewer wants to see whether you understand the basic principles of Elasticsearch, because using ES essentially means writing and searching data. If you cannot explain what ES does when you issue a write or search request, you are just using the API as a black box.

Interview Question Analysis

ES Write Process

The client selects a node and sends the request to it; this node is the coordinating node.

The coordinating node routes the document to the node that holds the primary shard.

The primary shard processes the request and replicates the data to the replica nodes.

After the primary and all replicas have processed the request, the coordinating node returns the response to the client.
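The routing and replication steps above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch code: zlib.crc32 stands in for the hash ES applies to the routing value (the document ID by default), and the names NUM_PRIMARY_SHARDS, route_to_shard, and index_document are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # assumed value; chosen when the index is created


def route_to_shard(doc_id: str, num_primary_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard as hash(routing value) % number of primary shards."""
    # crc32 is a stand-in hash used only for illustration.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(cluster: dict, doc_id: str, doc: dict) -> int:
    """Write to the primary copy first, then replicate to every replica copy."""
    shard = route_to_shard(doc_id, len(cluster))
    copies = cluster[shard]
    copies["primary"][doc_id] = doc
    for replica in copies["replicas"]:
        replica[doc_id] = doc
    return shard


# One primary and one replica per shard, each modeled as a plain dict.
cluster = {n: {"primary": {}, "replicas": [{}]} for n in range(NUM_PRIMARY_SHARDS)}
shard = index_document(cluster, "doc-1", {"title": "java is fun"})
print(f"doc-1 was written to shard {shard}")
```

Because the shard number depends on the primary shard count, changing that count would send existing IDs to different shards, which is why the number of primary shards is fixed when an index is created.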

ES Read Process

A document can be read directly by its doc ID. ES hashes the doc ID to determine which shard the document was assigned to at write time, then queries that shard.

The client sends the request to any node, which becomes the coordinating node.

The coordinating node hashes the doc ID to find the shard, then uses round-robin selection among that shard's primary and replica copies to balance the read load.

The node holding the chosen copy returns the document to the coordinating node.

The coordinating node returns the document to the client.
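A minimal sketch of the read path, assuming a toy model in which every copy of a shard sees the same documents. The ShardCopies class and the copy names are hypothetical, and zlib.crc32 again stands in for the real routing hash.

```python
import itertools
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    # Stand-in hash; ES uses its own hash of the routing value.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


class ShardCopies:
    """The primary and replica copies of one shard, read in round-robin order."""

    def __init__(self, copy_names, docs):
        self.docs = docs  # shared by all copies in this toy model
        self._next_copy = itertools.cycle(copy_names)

    def get(self, doc_id):
        served_by = next(self._next_copy)  # rotate through the copies
        return served_by, self.docs.get(doc_id)


shards = [ShardCopies(["primary", "replica-1"], {}) for _ in range(2)]
target = shards[route_to_shard("doc-1", len(shards))]
target.docs["doc-1"] = {"title": "java is fun"}

first = target.get("doc-1")
second = target.get("doc-1")
print(first[0], second[0])  # primary replica-1
```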

ES Search Process

ES is at its strongest in full-text search. Suppose three documents like these are indexed:

java is really fun
java is so hard to learn
j2ee is especially great

Searching for the keyword java returns the documents that contain the term java, here the first two.

The client sends the request to a coordinating node.

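The coordinating node forwards the search request to one copy, primary or replica, of every shard of the index.

In the query phase, each shard runs the search locally and returns its matching doc IDs (with the values needed for sorting, such as scores) to the coordinating node, which merges, sorts, and paginates them into the final result list.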

In the fetch phase, the coordinating node retrieves the actual documents from the shards and returns them to the client.
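The two phases can be illustrated with a toy scatter/gather in Python. The shard contents, the term-frequency score, and the function names are all assumptions made for this sketch; real ES scoring and shard selection are far more involved.

```python
import heapq

# Each shard maps doc_id -> text. Hypothetical data for illustration.
SHARDS = [
    {"1": "java is really fun", "3": "j2ee is especially great"},
    {"2": "java is so hard to learn"},
]


def query_phase(shard, term, size):
    """Return (score, doc_id) pairs for the shard's top hits, without bodies."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)  # toy score: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def search(term, size=10):
    # Query phase: gather lightweight hits from every shard.
    candidates = []
    for shard_no, shard in enumerate(SHARDS):
        for score, doc_id in query_phase(shard, term, size):
            candidates.append((score, doc_id, shard_no))
    # Merge and sort globally, then keep only the requested page.
    candidates.sort(key=lambda hit: (-hit[0], hit[1]))
    page = candidates[:size]
    # Fetch phase: load full documents only for the hits on the page.
    return [
        {"_id": doc_id, "_source": SHARDS[shard_no][doc_id]}
        for _, doc_id, shard_no in page
    ]


results = search("java")
print([hit["_id"] for hit in results])  # ['1', '2']
```

Splitting the work this way keeps the query phase cheap: only IDs and sort values cross the network until the final page is known.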

Write requests go to the primary shard and are then synchronized to all replica shards; read requests can be served by either the primary or a replica shard, chosen by round-robin.

Write Underlying Principle

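Data is first written to an in-memory buffer, where it is not yet searchable, and at the same time to the translog. When the buffer is nearly full, or after a set interval (1 second by default), ES refreshes the buffer into a new segment file. The segment file does not go straight to disk; it first lands in the OS cache, and once it is there the data becomes searchable. This is why ES is called near-real-time (NRT): data written now can be searched about 1 second later. As long as the buffer holds data, a new segment file is created every second; if the buffer is empty, no refresh occurs.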

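To summarize the timing: every second the buffer is refreshed into a segment file in the OS cache, and every 5 seconds the translog is fsynced to disk. When the translog grows past a size threshold, or after 30 minutes by default, a commit is performed. The commit refreshes whatever is left in the buffer, writes a commit point to disk that lists the current segment files, fsyncs the segment files in the OS cache to disk, and then clears the translog and starts a new one. In ES this commit operation is called a flush.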
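These timings map onto index-level settings. The sketch below only builds the JSON body one might send to the index settings API; the values mirror the timings described in this section rather than a tuning recommendation, the 512mb threshold is an assumed value, and whether a given ES version accepts every key unchanged is also an assumption.

```python
import json

# Index settings mirroring the timings described above (illustrative values).
settings = {
    "index": {
        "refresh_interval": "1s",  # buffer -> new segment in the OS cache
        "translog": {
            "durability": "async",            # fsync the translog on a timer
            "sync_interval": "5s",            # how often that timer fires
            "flush_threshold_size": "512mb",  # assumed size that triggers a flush
        },
    }
}

body = json.dumps(settings, indent=2)
print(body)
```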

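The translog provides durability: data in the buffer or the OS cache is lost if the node crashes, so on restart ES replays the translog to rebuild the buffer and OS cache. Because the translog itself is only fsynced every 5 seconds in this setup, up to 5 seconds of writes may be lost in a crash. Fsyncing the translog on every write closes that window but costs write performance; in recent ES versions this per-request fsync (index.translog.durability set to request) is the default, and the 5-second behavior corresponds to the async setting.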
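The interplay of buffer, OS cache, translog, and flush can be modeled with a small in-memory class. This is a teaching sketch, not ES code: ToyShard and its methods are hypothetical, and the translog is assumed to have been fsynced before the simulated crash.

```python
class ToyShard:
    """In-memory model of buffer, OS cache, disk, and translog."""

    def __init__(self):
        self.buffer = []             # indexed but not yet searchable
        self.os_cache_segments = []  # searchable, not yet fsynced
        self.disk_segments = []      # fsynced at the last flush
        self.translog = []           # operations since the last flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:  # an empty buffer produces no segment
            self.os_cache_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        self.refresh()
        self.disk_segments.extend(self.os_cache_segments)  # fsync + commit point
        self.os_cache_segments.clear()
        self.translog.clear()  # a new, empty translog starts

    def search(self):
        segments = self.disk_segments + self.os_cache_segments
        return [doc for segment in segments for doc in segment]

    def crash_and_recover(self):
        self.buffer.clear()             # memory is lost
        self.os_cache_segments.clear()  # unsynced cache is lost
        self.buffer.extend(self.translog)  # replay operations after the last commit
        self.refresh()


shard = ToyShard()
shard.index("doc-1")
print(shard.search())  # []
shard.refresh()
print(shard.search())  # ['doc-1']
shard.flush()
shard.index("doc-2")
shard.crash_and_recover()
print(shard.search())  # ['doc-1', 'doc-2']
```

In this model doc-2 survives the crash only because its translog entry survived; an entry written after the last translog fsync would be gone.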

Data written to a segment file is already in inverted-index form: the inverted index is built as the segment is created.

Delete/Update Underlying Principle

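A delete does not remove the document right away. At commit time a .del file is generated that marks the document as deleted, and searches use it to skip that document. An update marks the original document as deleted and then writes a new document. Because every refresh produces a segment file, segment files accumulate, so ES periodically merges several of them into one larger segment file; during the merge, documents marked as deleted are physically removed, the new segment file is written to disk along with a new commit point listing it, and the old segment files are deleted.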
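The mark-then-merge behavior can be sketched as follows. ToySegments is a hypothetical model: a Python set plays the role of the .del file, and each write creates its own segment purely to keep the example short.

```python
class ToySegments:
    """Immutable segments plus a set of deletion marks (stand-in for .del files)."""

    def __init__(self):
        self.segments = []    # each segment: list of (seq, doc_id, body)
        self.deleted = set()  # sequence numbers marked as deleted
        self.live = {}        # doc_id -> sequence number of the live copy
        self._seq = 0

    def delete(self, doc_id):
        seq = self.live.pop(doc_id, None)
        if seq is not None:
            self.deleted.add(seq)  # mark only; the segment is not rewritten

    def write(self, doc_id, body):
        self.delete(doc_id)  # an update marks the old copy as deleted first
        self._seq += 1
        self.live[doc_id] = self._seq
        self.segments.append([(self._seq, doc_id, body)])

    def search(self):
        return {
            doc_id: body
            for segment in self.segments
            for seq, doc_id, body in segment
            if seq not in self.deleted  # filter out marked documents
        }

    def merge(self):
        merged = [
            entry
            for segment in self.segments
            for entry in segment
            if entry[0] not in self.deleted  # physically drop deleted docs
        ]
        self.segments = [merged] if merged else []
        self.deleted.clear()


index = ToySegments()
index.write("1", "java is fun")
index.write("1", "java is hard")  # update
index.write("2", "j2ee")
index.delete("2")
print(len(index.segments), index.search())  # 3 {'1': 'java is hard'}
index.merge()
print(len(index.segments), index.search())  # 1 {'1': 'java is hard'}
```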

Underlying Lucene

Lucene is a Java library that provides the algorithms for building inverted indexes. By adding the Lucene JAR to a project, developers can use its API to create and query indexes.

Inverted Index

An inverted index maps terms to the list of document IDs containing those terms. Example tables illustrate how documents are tokenized and how terms map to document IDs. The index also stores term frequencies and positions, enabling efficient full‑text search.

All terms map to one or more documents.

Terms are stored in lexicographic order.

The example shown does not strictly follow lexicographic ordering.
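A small Python sketch shows the idea on three short hypothetical documents. Whitespace tokenization and the function name build_inverted_index are assumptions of this example; Lucene's analyzers and on-disk structures are considerably more sophisticated.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, with terms in lexicographic order."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    # Emit the terms sorted, as a term dictionary would store them.
    return {term: postings[term] for term in sorted(postings)}


docs = {
    1: "java is really fun",
    2: "java is so hard to learn",
    3: "j2ee is especially great",
}
index = build_inverted_index(docs)

print(index["java"])    # {1: [0], 2: [0]}
print(list(index)[:3])  # ['especially', 'fun', 'great']
```

The length of each position list is the term frequency in that document, and keeping the terms sorted is what makes a fast lookup of a single term possible.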
