How Elasticsearch Writes, Reads, and Searches Data: Deep Dive into ES Internals
This article explains Elasticsearch's core mechanisms for indexing, querying, and searching data, covering the roles of coordinating nodes, primary and replica shards, refresh cycles, translog, commit/flush processes, and the underlying Lucene inverted index.
Interview Questions
How does ES write data? How does ES query data? Can you briefly introduce the underlying Lucene library? Do you understand the inverted index?
Interviewer Psychology
The interviewer wants to see whether you understand the basic principles of Elasticsearch, because using ES essentially means writing and searching data. If you cannot explain what ES does when you issue a write or search request, you are just using the API as a black box.
Interview Question Analysis
ES Write Process
The client selects a node and sends the request to it; this node is the coordinating node.
The coordinating node routes the document to the node that holds the primary shard.
The primary shard processes the request and replicates the data to the replica nodes.
After the primary and all replicas have processed the request, the coordinating node returns the response to the client.
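The routing and replication steps above can be sketched roughly as follows. This is a minimal illustration, not Elasticsearch's implementation: ES hashes the routing value (the document ID by default) with Murmur3, while the sketch substitutes `zlib.crc32` so that it runs on the standard library alone. The names `route_to_shard`, `index_document`, and `NUM_PRIMARY_SHARDS` are hypothetical.

```python
import zlib

NUM_PRIMARY_SHARDS = 3  # hypothetical index setting, fixed when the index is created


def route_to_shard(doc_id: str, num_primary_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Map a document ID to a primary shard number.

    Assumption: crc32 stands in for the hash function Elasticsearch really uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def index_document(doc_id: str, doc: dict, primaries: list, replicas: list) -> int:
    """Write to the primary shard first, then copy the document to its replicas."""
    shard = route_to_shard(doc_id, len(primaries))
    primaries[shard][doc_id] = doc            # the primary shard processes the write
    for replica_set in replicas:              # one list of shard copies per replica set
        replica_set[shard][doc_id] = doc      # replicate to the same shard number
    return shard                              # the coordinating node can now respond


# Three primary shards, each with one replica, modeled as plain dicts.
primaries = [{} for _ in range(NUM_PRIMARY_SHARDS)]
replicas = [[{} for _ in range(NUM_PRIMARY_SHARDS)]]
shard = index_document("42", {"title": "hello"}, primaries, replicas)
```

Because the shard number is derived from the hash modulo the number of primary shards, the same ID always lands on the same shard, which is also what lets a later read by ID find the document.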
ES Read Process
Reading can be done by document ID. The client sends the request to any node, which becomes the coordinating node. The coordinating node hashes the doc ID, routes the request to the appropriate node, and uses a round‑robin algorithm to select a primary or replica shard for load balancing. The selected shard returns the document to the coordinating node, which then returns it to the client.
The client sends the request to any node, which becomes the coordinating node.
The coordinating node hashes the doc ID, routes the request to the appropriate node, and uses round-robin to select either the primary shard or one of its replicas.
The chosen shard returns the document to the coordinating node.
The coordinating node returns the document to the client.
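The round-robin choice between a shard's primary and replica copies can be sketched as below. This is a toy model under stated assumptions: `ShardCopies` and the node names are hypothetical, and each copy is just an in-memory dict.

```python
import itertools


class ShardCopies:
    """All copies (primary plus replicas) of one shard, read in round-robin order."""

    def __init__(self, copies):
        self.copies = copies                              # list of (node_name, {doc_id: doc})
        self._next = itertools.cycle(range(len(copies)))  # endless 0, 1, ..., 0, 1, ...

    def get(self, doc_id):
        """Serve the read from the next copy in turn; return (node_name, doc or None)."""
        node_name, store = self.copies[next(self._next)]
        return node_name, store.get(doc_id)


doc = {"title": "hello"}
shard0 = ShardCopies([("node-1", {"42": doc}), ("node-2", {"42": doc})])
served_by = [shard0.get("42")[0] for _ in range(4)]
# served_by == ["node-1", "node-2", "node-1", "node-2"]
```

Successive reads alternate between the copies, so replicas share the read load instead of every request hitting the primary.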
ES Search Process
The core strength of ES is full-text search. For example, suppose these three documents are indexed:
Java is really fun
Java is so hard to learn
J2EE is really awesome
A search for the term "java" returns the two documents that contain it.
The client sends the request to a coordinating node.
The coordinating node forwards the search request to one copy, primary or replica, of every shard in the index.
In the query phase, each shard returns matching doc IDs to the coordinating node, which merges, sorts, and paginates the results.
In the fetch phase, the coordinating node retrieves the actual documents from the shards and returns them to the client.
Write requests go to the primary shard and are then synchronized to all replica shards; read requests can be served by either the primary or a replica shard, chosen by round-robin.
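The two search phases can be sketched as a scatter-gather over in-memory shards. Everything here is an assumption made for illustration: the `SHARDS` data, the function names, and the toy score (raw term frequency), which is far simpler than real relevance scoring.

```python
import heapq

# Hypothetical shards: each maps doc_id -> text.
SHARDS = [
    {"d1": "java is really fun", "d4": "python is fun too"},
    {"d2": "java is so hard to learn java"},
    {"d3": "j2ee is really awesome"},
]


def query_phase(shard: dict, term: str, size: int) -> list:
    """Each shard returns only (score, doc_id) pairs for its local top hits."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)      # toy score: raw term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)


def search(term: str, size: int = 10) -> list:
    """Query-then-fetch: gather IDs and scores first, full documents second."""
    # Query phase: ask one copy of every shard for its matching IDs and scores.
    candidates = []
    for shard_no, shard in enumerate(SHARDS):
        for score, doc_id in query_phase(shard, term, size):
            candidates.append((score, doc_id, shard_no))
    top = heapq.nlargest(size, candidates)    # merge, sort, and cut to page size
    # Fetch phase: only now pull the full documents for the winning IDs.
    return [(doc_id, SHARDS[shard_no][doc_id]) for _, doc_id, shard_no in top]
```

Calling `search("java")` returns `d2` first (the term appears twice) and then `d1`. Splitting the work this way means only small ID and score pairs cross the network during the query phase, and full documents are transferred only for the hits that make the final page.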
Write Underlying Principle
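Data is first written to an in-memory buffer and appended to the translog. When the buffer is nearly full, or after a fixed interval (one second by default), ES refreshes the buffered data into a new segment file. The segment is not written straight to disk; it first lands in the OS cache. Once the data is in the OS cache it becomes searchable, which is why ES is described as near-real-time (NRT): a document can be found about one second after it is written. As long as the buffer holds data, each refresh creates a new segment file; if the buffer is empty, no refresh occurs.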
Every second the buffer is refreshed into the OS cache, and every five seconds the translog is fsynced to disk. When the translog grows large, or after 30 minutes, ES performs a commit: it refreshes whatever is left in the buffer, writes a commit point to disk that identifies the current segment files, fsyncs all segment data in the OS cache to disk, and then clears the translog and starts a new one. This commit operation is what ES calls a flush.
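The translog provides durability: data held in the buffer and the OS cache is lost if the node crashes, so on restart ES replays the translog to rebuild it. When the translog is fsynced on a five-second interval (index.translog.durability set to async), up to five seconds of acknowledged writes can be lost in a crash. Fsyncing the translog on every write request (index.translog.durability set to request, the default in current ES versions) closes that window at some cost in write performance.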
By the time data has been written to a segment file, its inverted index has been built as well.
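The interplay of buffer, translog, refresh, and flush can be modeled with a small toy class. This is a sketch under simplifying assumptions, not how ES stores data: `ShardWritePath` is a hypothetical name, segments are plain lists, and the translog is assumed to be fully synced to disk so that it survives the simulated crash.

```python
class ShardWritePath:
    """Toy model of buffer -> segment (OS cache) -> disk, guarded by a translog."""

    def __init__(self):
        self.buffer = []            # in-memory buffer: not searchable yet
        self.translog = []          # operation log, replayed after a crash
        self.cached_segments = []   # refreshed segments in the OS cache: searchable
        self.disk_segments = []     # segments made durable by a flush (commit)

    def index(self, doc):
        """A write goes to both the buffer and the translog."""
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment; skip if the buffer is empty."""
        if not self.buffer:
            return
        self.cached_segments.append(list(self.buffer))
        self.buffer.clear()

    def flush(self):
        """Commit: refresh, persist all cached segments, then start a fresh translog."""
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments.clear()
        self.translog.clear()

    def searchable_docs(self):
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment]

    def crash_and_recover(self):
        """Lose everything held in memory, then replay the translog."""
        replay = list(self.translog)
        self.buffer, self.cached_segments, self.translog = [], [], []
        for doc in replay:
            self.index(doc)
```

In this model a document is invisible to `searchable_docs()` until `refresh()` runs, and anything written since the last `flush()` survives `crash_and_recover()` only because it is still in the translog.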
Delete/Update Underlying Principle
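A delete does not remove the document right away: it creates a .del file that marks the document as deleted, and searches use that marker to filter the document out. An update marks the existing document as deleted and then writes a new document. Segment files are merged periodically; a merge physically removes the documents marked as deleted, writes a new merged segment file, and records a new commit point.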
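Marking deletes and cleaning them up at merge time can be sketched like this. The sketch assumes a simplified model in which a `Segment` is an immutable dict of documents and a Python set plays the role of the .del file; all names are hypothetical.

```python
class Segment:
    """Immutable batch of documents plus a set of IDs marked as deleted."""

    def __init__(self, docs: dict):
        self.docs = dict(docs)      # doc_id -> document; never modified after creation
        self.deleted = set()        # plays the role of the .del file

    def live_docs(self) -> dict:
        return {i: d for i, d in self.docs.items() if i not in self.deleted}


def delete(segments: list, doc_id: str) -> None:
    """Mark the document as deleted in every segment that holds it."""
    for segment in segments:
        if doc_id in segment.docs:
            segment.deleted.add(doc_id)


def update(segments: list, doc_id: str, new_doc: dict) -> None:
    """Update = mark the old version deleted, then write the new version."""
    delete(segments, doc_id)
    segments.append(Segment({doc_id: new_doc}))


def merge(segments: list) -> list:
    """Combine segments into one, physically dropping deleted documents."""
    merged = {}
    for segment in segments:
        merged.update(segment.live_docs())
    return [Segment(merged)]
```

Until `merge` runs, the old version of an updated document still occupies space in its original segment; only the deletion marker keeps it out of search results.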
Underlying Lucene
Lucene is a Java library that provides the algorithms for building inverted indexes. By adding the Lucene JAR to a project, developers can use its API to create and query indexes.
Inverted Index
An inverted index maps terms to the list of document IDs containing those terms. Example tables illustrate how documents are tokenized and how terms map to document IDs. The index also stores term frequencies and positions, enabling efficient full‑text search.
All terms map to one or more documents.
Terms are stored in lexicographic order.
The example shown does not strictly follow lexicographic ordering.
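A minimal sketch of the structure described in this section, using a made-up three-document corpus and a whitespace tokenizer (real analyzers are considerably more involved), might look like this. The names `DOCS`, `build_inverted_index`, and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical corpus; the doc IDs and texts are made up for illustration.
DOCS = {
    1: "java is really fun",
    2: "java is so hard to learn",
    3: "j2ee is really awesome",
}


def build_inverted_index(docs: dict) -> dict:
    """Return {term: {doc_id: [positions]}} with terms in lexicographic order."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            postings[term].setdefault(doc_id, []).append(position)
    # Sorting the terms mirrors the ordered term dictionary described above.
    return {term: postings[term] for term in sorted(postings)}


INDEX = build_inverted_index(DOCS)


def search(term: str) -> list:
    """Look up one term and return the IDs of the documents that contain it."""
    return sorted(INDEX.get(term.lower(), {}))
```

Here `search("java")` returns `[1, 2]`, and the length of each position list gives the term frequency for that document. A lookup touches only the entry for the queried term, never the full text of every document.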
