How Elasticsearch Writes, Reads, and Searches Data: Deep Dive into ES Internals
This article explains Elasticsearch's write, read, and search mechanisms, the role of coordinating nodes, primary and replica shards, refresh and commit cycles, Lucene's inverted index, and how data becomes searchable in near‑real‑time.
Interview Questions
How does an ES write work under the hood? How does a query work? Can you describe the underlying Lucene library? Do you know what an inverted index is?
Interviewer Psychology Analysis
The interviewer wants to see whether you understand the basic principles of ES, because using ES essentially comes down to writing data and searching it. If you don't know what happens when a write or search request is issued, you are treating ES as a black box.
Interview Question Analysis
ES Write Process
The client picks a node and sends the request to it; that node acts as the coordinating node.
The coordinating node routes the document to the node holding the primary shard.
The primary shard processes the request and synchronizes data to replica nodes.
When the primary and all replicas have completed, the coordinating node returns the response to the client.
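The routing step above can be sketched in a few lines. This is a simplified illustration, not the Elasticsearch implementation: the function name `route_to_shard` is hypothetical, and `zlib.crc32` stands in for the Murmur3 hash that Elasticsearch actually uses. The idea it shows is that the routing value (the doc id by default) is hashed and taken modulo the number of primary shards.

```python
import zlib


def route_to_shard(routing: str, num_primary_shards: int) -> int:
    """Pick the primary shard number for a routing value (the doc id by default)."""
    # Assumption: crc32 is only a stand-in hash for this sketch;
    # Elasticsearch itself hashes the routing value with Murmur3.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ["1", "2", "42"]:
        print(doc_id, "-> shard", route_to_shard(doc_id, 5))
```

Because the result depends on the number of primary shards, the same doc id would land on a different shard if that number changed, which is why the primary shard count is fixed when an index is created.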
ES Read Process
A document can be read by doc id; the id is hashed to determine which shard holds the document.
The client sends the request to any node, which becomes the coordinating node.
The coordinating node hashes the doc id to find the shard, then routes the request to a node holding a copy of it, rotating round-robin among the primary and its replicas for load balancing.
The receiving node returns the document to the coordinating node.
The coordinating node returns the document to the client.
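The round-robin selection in the steps above could look roughly like the following. This is a toy model: the class `ShardCopies` and the node names are made up for illustration, and it only shows how successive reads of the same shard rotate across the primary and its replicas.

```python
import itertools


class ShardCopies:
    """All copies (primary plus replicas) of one shard, served in rotation."""

    def __init__(self, copies):
        # itertools.cycle yields the copies in order, forever.
        self._cycle = itertools.cycle(copies)

    def next_copy(self):
        """Return the copy that should serve the next read request."""
        return next(self._cycle)


# Hypothetical node names; one primary and two replicas of the same shard.
copies = ShardCopies(["node-1 (primary)", "node-2 (replica)", "node-3 (replica)"])
picks = [copies.next_copy() for _ in range(4)]
print(picks)  # the fourth read wraps around to the first copy again
```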
ES Search Process
ES excels at full-text search. For example, suppose the index holds these three documents:
Java is really fun
Java is so hard to learn
J2EE is really awesome
Searching for the keyword java returns the two documents that contain that term.
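The client sends the request to a coordinating node.
The coordinating node forwards the search request to one copy (primary or replica) of every shard of the index.
Query phase: each shard runs the search locally and returns the doc ids of its matches (with their sort values) to the coordinating node, which merges, sorts, and paginates them to produce the final result list.
Fetch phase: the coordinating node uses those doc ids to pull the actual documents from the shards that hold them and returns them to the client.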
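The two phases can be illustrated with a small in-memory simulation. Everything here is an assumption made for the sketch: the shard contents, the scores, and the helper names `query_phase` and `search` are invented, and a real shard computes scores from the query rather than storing them. What it shows is that shards first return only lightweight (score, doc id) entries, and full documents are fetched only for the final page.

```python
import heapq

# Each shard maps doc_id -> (score, source). Scores are precomputed here
# purely for illustration.
shards = [
    {"a": (0.9, {"title": "doc a"}), "b": (0.4, {"title": "doc b"})},
    {"c": (0.7, {"title": "doc c"})},
    {"d": (0.8, {"title": "doc d"}), "e": (0.1, {"title": "doc e"})},
]


def query_phase(shard_no, shard, size):
    """Return the shard's local top hits as (score, shard_no, doc_id) tuples."""
    hits = [(score, shard_no, doc_id) for doc_id, (score, _) in shard.items()]
    return heapq.nlargest(size, hits)


def search(shards, from_=0, size=2):
    # Query phase: every shard must return from_ + size candidates, because
    # the coordinating node cannot know in advance where the page's hits live.
    candidates = []
    for shard_no, shard in enumerate(shards):
        candidates.extend(query_phase(shard_no, shard, from_ + size))
    # Merge, sort by score descending, and cut out the requested page.
    page = sorted(candidates, reverse=True)[from_:from_ + size]
    # Fetch phase: pull full documents only for the hits on the page.
    return [(doc_id, shards[shard_no][doc_id][1]) for _, shard_no, doc_id in page]


results = search(shards, from_=0, size=2)
print(results)
```

Note that each shard has to return `from_ + size` candidates, so the amount of data merged on the coordinating node grows with the page offset; this is why deep pagination is expensive.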
Write requests go to the primary shard and are synchronized to all replica shards; read requests can be served by either primary or replica using round‑robin.
Underlying Write Mechanism
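Data is first written to an in-memory buffer and appended to the translog. While a document sits only in the buffer it is not searchable. Every second by default (or sooner if the buffer is nearly full), a refresh writes the buffer contents into a new segment file, which initially lives in the OS cache; once the segment is in the OS cache it can be searched, and the buffer is cleared. This one-second refresh cycle is why ES is described as near-real-time (NRT).
So roughly every second a new segment file is created, and every five seconds the translog is fsynced to disk. When the translog grows large, or after 30 minutes, a commit operation writes whatever is left in the buffer to a segment, fsyncs all segment files still in the OS cache to disk, and then clears the translog.
The commit writes a commit point to disk, a file that records which segment files the commit covers; the translog is then cleared and a new one is started.
This commit operation is what ES calls a flush. It runs automatically (every 30 minutes by default) and forces the data in the OS cache to be fsynced to disk; it can also be triggered manually through the flush API.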
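The translog ensures durability: segments that exist only in the buffer or the OS cache are lost if the machine crashes, so on restart ES replays the translog to rebuild them. When the translog itself is fsynced asynchronously every 5 seconds, up to 5 seconds of data may be lost; fsyncing the translog on every write (`index.translog.durability: request`, which is the default in recent ES versions) eliminates that loss at the cost of write performance.
In summary: data becomes searchable after a refresh (about a 1 s delay), and with asynchronous translog fsync up to 5 s of data that has not yet been persisted may be lost on a crash.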
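The buffer, translog, refresh, and flush cycle described above can be modelled with a toy class. This is a sketch under stated assumptions, not how ES is implemented: `ShardWritePath` and its methods are hypothetical names, documents are plain strings, and the crash model assumes the translog had already been fsynced when the machine went down.

```python
class ShardWritePath:
    """Toy model of one shard's buffer, translog, and segments."""

    def __init__(self):
        self.buffer = []     # in-memory buffer: not searchable yet
        self.translog = []   # operations recorded since the last commit
        self.segments = []   # searchable segments (OS cache or disk)
        self.committed = 0   # number of segments covered by the last commit point

    def index(self, doc):
        """A write goes to both the buffer and the translog."""
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment and clear the buffer."""
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Commit: refresh, 'fsync' all segments, record a commit point, clear translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self, word):
        """Only documents that have reached a segment are visible."""
        return [doc for seg in self.segments for doc in seg if word in doc]

    def crash_and_recover(self):
        """Lose everything that was not committed, then replay the translog."""
        self.segments = self.segments[:self.committed]
        self.buffer = list(self.translog)  # assumes the translog survived on disk
        self.refresh()


shard = ShardWritePath()
shard.index("java is fun")
print(shard.search("java"))   # nothing yet: the doc is only in the buffer
shard.refresh()
print(shard.search("java"))   # visible after the refresh
shard.crash_and_recover()
print(shard.search("java"))   # recovered by replaying the translog
```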
Delete/Update Mechanism
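Segment files are never modified in place. A delete writes a .del file that marks the doc as deleted; the doc stays in its segment but is filtered out of search results. An update marks the old doc as deleted and writes a new doc. Segment files are merged periodically, and it is during a merge that docs marked as deleted are physically removed.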
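The mark-then-merge behaviour can be sketched as follows. The `Segment` class and the helper functions are illustrative names only, and the `deleted` set merely stands in for the .del file; the point is that segments are treated as write-once and deleted docs disappear only when segments are merged.

```python
class Segment:
    """A write-once set of documents plus a set of deletion markers."""

    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> source, never modified after creation
        self.deleted = set()     # stands in for the .del file


def delete(segments, doc_id):
    """Mark the doc as deleted wherever it occurs; nothing is removed."""
    for seg in segments:
        if doc_id in seg.docs:
            seg.deleted.add(doc_id)


def update(segments, doc_id, new_source):
    """An update is a delete of the old version plus a write of a new doc."""
    delete(segments, doc_id)
    segments.append(Segment({doc_id: new_source}))


def live_docs(segments):
    """What a search would see: every doc not marked as deleted."""
    return {
        doc_id: source
        for seg in segments
        for doc_id, source in seg.docs.items()
        if doc_id not in seg.deleted
    }


def merge(segments):
    """Merging copies only live docs, physically dropping deleted ones."""
    return [Segment(live_docs(segments))]


segments = [Segment({"1": "v1", "2": "other"})]
update(segments, "1", "v2")
print(live_docs(segments))           # the old version of doc 1 is hidden
print(len(segments[0].docs))         # but it still occupies the old segment
segments = merge(segments)
print(segments[0].docs)              # after the merge only live docs remain
```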
Underlying Lucene
Lucene is a Java library that provides algorithms for building inverted indexes. Developers include the Lucene JAR and use its API.
Inverted Index
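An inverted index maps each term to the IDs of the documents that contain it, so a lookup starts from the word and arrives at the documents rather than the other way around. Besides the doc IDs, an inverted index also stores statistics such as term frequencies.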
Search engines use the inverted index to quickly retrieve documents matching a query term.
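A minimal inverted index can be built with a dictionary. This sketch is not Lucene's data structure: it splits on whitespace instead of using a real analyzer, the sample documents and function names are chosen for illustration, and it keeps only doc IDs and term frequencies.

```python
from collections import defaultdict

# Sample documents, keyed by doc id.
docs = {
    1: "java is really fun",
    2: "java is so hard to learn",
    3: "j2ee is really awesome",
}


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for real analysis.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, term):
    """Return the sorted ids of the docs containing the term."""
    return sorted(index.get(term.lower(), {}))


index = build_inverted_index(docs)
print(search(index, "java"))   # ids of the docs that contain "java"
print(search(index, "j2ee"))
```

Looking up a term is a single dictionary access regardless of how many documents exist, which is the property that makes term queries fast.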
