How Elasticsearch Stores and Retrieves Data: Inside Lucene’s Write‑Refresh‑Flush‑Merge Cycle
This article explains the fundamental architecture of Elasticsearch and its underlying Lucene engine, detailing the data model, index hierarchy, and the step‑by‑step write, refresh, flush, and merge processes that enable near‑real‑time search and data durability.
What are Elasticsearch and Lucene?
Elasticsearch is an open‑source search engine built on Apache Lucene. Lucene is a high‑performance search library that provides the core indexing and query capabilities, while Elasticsearch adds distribution across nodes, a RESTful API, near‑real‑time search, and a richer query DSL.
Relationship between Elasticsearch and Lucene
Elasticsearch is implemented directly on top of Lucene; it packages Lucene’s core libraries, extends them with additional features, and exposes them via RESTful endpoints. Other projects such as Solr also use Lucene.
Data model and storage hierarchy
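An Elasticsearch cluster consists of one or more nodes, and an index (e.g., a product or order search index) is split into shards that are distributed across those nodes.
Each node can host primary shards (P1, P2) and replica shards (R1, R2); a replica is never allocated to the same node as its primary.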
Each shard corresponds to a Lucene index stored on disk.
A Lucene index is a collection of segment files; each segment contains a set of documents.
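The shard layout described above is fixed by two index settings, `number_of_shards` and `number_of_replicas`. The sketch below only builds the JSON body that would be sent when creating an index (for example with `PUT /products`); the index name `products` and the helper names are illustrative assumptions, and nothing here contacts a cluster.

```python
import json


def index_settings(primaries: int, replicas_per_primary: int) -> dict:
    """Build the settings body for an index-creation request (hypothetical helper)."""
    if primaries < 1 or replicas_per_primary < 0:
        raise ValueError("need at least one primary and a non-negative replica count")
    return {
        "settings": {
            "number_of_shards": primaries,
            "number_of_replicas": replicas_per_primary,
        }
    }


def total_shard_copies(body: dict) -> int:
    """Each primary plus its replicas is one Lucene index on some node."""
    s = body["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


# Two primaries (P1, P2) with one replica each (R1, R2).
body = index_settings(primaries=2, replicas_per_primary=1)
print(json.dumps(body))
print(total_shard_copies(body))
```

With two primaries and one replica per primary, the cluster has four shard copies to place, which matches the P1, P2, R1, R2 layout.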
Lucene index structure
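Documents are not written to disk one at a time. Buffered documents are written together as a new, immutable segment, and the commit point records which segments currently make up the index.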
Search queries examine all existing segments.
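Because segments are immutable, a deleted document is only marked as deleted in a *.liv file; it stays in its segment and is filtered out of search results until a merge removes it. An update is handled as a delete of the old version plus a write of the new one.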
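A toy model can make the segment structure concrete. The sketch below is a simplification, not Lucene's on-disk format: `Segment`, `search`, and `delete` are hypothetical names, documents are plain dicts, and a list of booleans stands in for the live-docs information that Lucene keeps in *.liv files.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    docs: list                                # never modified after creation
    live: list = field(default_factory=list)  # toy stand-in for a .liv bitmap

    def __post_init__(self):
        if not self.live:
            self.live = [True] * len(self.docs)


def search(segments, term):
    """Visit every segment and skip documents marked as deleted."""
    hits = []
    for seg in segments:
        for doc, alive in zip(seg.docs, seg.live):
            if alive and term in doc["title"].lower().split():
                hits.append(doc["id"])
    return hits


def delete(segments, doc_id):
    """Mark a document as deleted without rewriting its segment."""
    for seg in segments:
        for i, doc in enumerate(seg.docs):
            if doc["id"] == doc_id:
                seg.live[i] = False


segments = [
    Segment([{"id": 1, "title": "red shoes"}, {"id": 2, "title": "blue shoes"}]),
    Segment([{"id": 3, "title": "red hat"}]),
]
before = search(segments, "red")
delete(segments, 1)
after = search(segments, "red")
print(before, after)
```

After the delete, document 1 no longer matches, yet the first segment still physically holds two documents; only the live flags changed.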
Document write‑path in Elasticsearch
New or updated documents follow the sequence: write → refresh → flush → merge.
Write phase
The document is added to an in‑memory indexing buffer, and the operation is appended to the transaction log (translog). At this point the document is not yet searchable.
Refresh phase
Periodically (by default every 1 second, configurable via index.refresh_interval) the buffered documents are written to a new segment that resides in the filesystem cache. The segment is opened for search, so the documents become visible to queries, but it has not yet been fsynced to disk and the translog remains unchanged.
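The visibility rule of the write and refresh phases can be sketched with a small in-memory model. `ToyShard` and its methods are hypothetical names chosen for illustration; the real engine builds an inverted index rather than a list of dicts, and refresh is normally triggered on a timer rather than called by hand.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []    # in-memory indexing buffer, not searchable
        self.translog = []  # operations recorded for recovery
        self.segments = []  # searchable segments, each a list of docs

    def index(self, doc):
        """Write phase: buffer the document and log the operation."""
        self.buffer.append(doc)
        self.translog.append(("index", doc))

    def refresh(self):
        """Refresh phase: turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()
        # The translog is deliberately left untouched here.

    def search_ids(self):
        return [doc["id"] for seg in self.segments for doc in seg]


shard = ToyShard()
shard.index({"id": 1, "title": "red shoes"})
visible_before = shard.search_ids()
shard.refresh()
visible_after = shard.search_ids()
print(visible_before, visible_after, len(shard.translog))
```

The document is invisible until `refresh()` runs, which is why a longer `index.refresh_interval` trades search freshness for fewer, larger segments.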
Flush phase
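Any remaining buffered documents are written to a segment, the segments in the filesystem cache are fsynced to disk, and a new commit point listing them is written. Because those operations are now durable in segments, the old translog entries are no longer needed and are discarded. After a restart, Elasticsearch loads the segments named in the last commit point and replays the translog to recover any operations that arrived after it.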
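The interplay between flush and the translog during recovery can be illustrated with a toy model. All names below are hypothetical, and the model assumes the translog itself survives a crash (in Elasticsearch this is controlled by `index.translog.durability`, which defaults to `request`). Losing the buffer and the cache-only segments stands in for a process crash.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []              # lost on crash
        self.cached_segments = []     # searchable but not fsynced, lost on crash
        self.committed_segments = []  # fsynced and listed in the commit point
        self.translog = []            # assumed durable on disk

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        """Persist all segments, then drop the translog entries they cover."""
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments.clear()
        self.translog.clear()

    def doc_ids(self):
        segs = self.committed_segments + self.cached_segments
        return sorted(doc["id"] for seg in segs for doc in seg)


def recover(committed_segments, translog):
    """Restart: reload committed segments, then replay the translog."""
    shard = ToyShard()
    shard.committed_segments = [list(seg) for seg in committed_segments]
    for doc in translog:
        shard.index(doc)
    shard.refresh()
    return shard


shard = ToyShard()
shard.index({"id": 1})
shard.flush()            # doc 1 is now in a committed segment
shard.index({"id": 2})
shard.refresh()          # doc 2 is searchable but only in the cache

# Simulated crash: only committed segments and the translog survive.
survivors = recover(shard.committed_segments, shard.translog)
print(survivors.doc_ids())
```

Document 2 was never flushed, yet it reappears after recovery because its operation was still in the translog.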
Merge phase
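In the background, small segments are merged into larger ones, which reduces the number of files each search must visit. Documents marked as deleted are not copied into the merged segment, so a merge is also the point at which deletes are physically removed; once the new segment is in use, the old segments and their *.liv files are deleted.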
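The effect of a merge on deleted documents can be sketched in a few lines. This toy `merge` function is a hypothetical illustration that combines every segment it is given; Lucene's actual merge policy chooses which segments to merge based on their sizes, which is not modelled here.

```python
def merge(segments):
    """Combine segments into one, copying only live documents.

    Each segment is a (docs, live) pair, where live[i] is False
    if docs[i] has been marked as deleted.
    """
    merged_docs = [
        doc
        for docs, live in segments
        for doc, alive in zip(docs, live)
        if alive
    ]
    # The new segment starts with no deletions.
    return merged_docs, [True] * len(merged_docs)


small_segments = [
    ([{"id": 1}, {"id": 2}], [True, False]),  # doc 2 was deleted
    ([{"id": 3}], [True]),
    ([{"id": 4}], [False]),                   # doc 4 was deleted
]
merged_docs, merged_live = merge(small_segments)
print([d["id"] for d in merged_docs], merged_live)
```

Three small segments become one, and the two deleted documents no longer occupy any space.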
Summary
The write‑refresh‑flush‑merge pipeline explains how Elasticsearch achieves near‑real‑time search, data durability, and efficient indexing: refresh makes documents searchable, the translog and flush make them durable, and merge keeps the segment count under control. Understanding these steps helps developers tune refresh intervals, manage translog size, and anticipate merge behavior for optimal performance.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
