
How Elasticsearch Stores and Retrieves Data: Inside Lucene’s Write‑Refresh‑Flush‑Merge Cycle

This article explains the fundamental architecture of Elasticsearch and its underlying Lucene engine, detailing the data model, index hierarchy, and the step‑by‑step write, refresh, flush, and merge processes that enable near‑real‑time search and data durability.

ITPUB

What are Elasticsearch and Lucene?

Elasticsearch is an open‑source search engine built on Apache Lucene. Lucene is a high‑performance search library that provides the core indexing and query capabilities, while Elasticsearch adds distribution across a cluster, a RESTful API, near‑real‑time search, and a richer query DSL.

Elasticsearch architecture diagram

Relationship between Elasticsearch and Lucene

Elasticsearch is implemented directly on top of Lucene; it packages Lucene’s core libraries, extends them with additional features, and exposes them via RESTful endpoints. Other projects such as Solr also use Lucene.

Elasticsearch and Lucene relationship diagram

Data model and storage hierarchy

An Elasticsearch index (e.g., a product or order search index) is split into shards, which are distributed across the nodes of a cluster.

Each node can host primary shards (P1, P2) and replica shards (R1, R2); a replica is never placed on the same node as its primary.

Each shard corresponds to a Lucene index stored on disk.

A Lucene index is a collection of segment files; each segment contains a set of documents.

Elasticsearch index, node, shard, and Lucene segment hierarchy
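
As a rough illustration of this hierarchy, the sketch below builds the kind of settings body one might send when creating an index and counts the resulting shard copies. The shard counts are illustrative assumptions and total_shard_copies is a hypothetical helper; number_of_shards and number_of_replicas are the standard index settings.

```python
import json

# Illustrative settings for a hypothetical product search index.
create_index_body = {
    "settings": {
        "number_of_shards": 2,    # primary shards: P1, P2
        "number_of_replicas": 1,  # one replica per primary: R1, R2
    }
}


def total_shard_copies(settings: dict) -> int:
    """Every primary and every replica is its own Lucene index on some node."""
    primaries = settings["number_of_shards"]
    replicas = settings["number_of_replicas"]
    return primaries * (1 + replicas)


print(json.dumps(create_index_body, indent=2))
print(total_shard_copies(create_index_body["settings"]))  # 4
```

With two primaries and one replica each, the cluster holds four Lucene indexes for this one Elasticsearch index.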

Lucene index structure

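Documents are not written to a segment one at a time. Lucene buffers new documents and writes each buffered batch out as a new, immutable segment. The commit point is a file that lists the segments known to be safely on disk.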

Search queries examine all existing segments.

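Because segments are immutable, a delete only marks the document in a per‑segment *.liv file, and an update is a delete followed by indexing a new version. Marked documents are filtered out of search results.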

Lucene segment and commit point illustration
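
The toy model below is one way to picture this structure. It is a simplification, not Lucene's actual on-disk format: Segment and search_index are hypothetical names, and a plain set of deleted IDs stands in for the *.liv file.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy stand-in for an immutable Lucene segment."""
    docs: dict                                 # doc_id -> text, fixed at creation
    deleted: set = field(default_factory=set)  # plays the role of the *.liv file

    def search(self, term: str) -> list:
        return [doc_id for doc_id, text in self.docs.items()
                if term in text.split() and doc_id not in self.deleted]


def search_index(commit_point: list, term: str) -> list:
    """A query visits every segment and concatenates the hits."""
    hits = []
    for segment in commit_point:
        hits.extend(segment.search(term))
    return hits


seg1 = Segment({"1": "red shoes", "2": "blue shoes"})
seg2 = Segment({"3": "red hat"})
commit_point = [seg1, seg2]

seg1.deleted.add("1")  # a delete only marks the document; segment data is untouched
print(search_index(commit_point, "red"))    # ['3']
print(search_index(commit_point, "shoes"))  # ['2']
```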

Document write‑path in Elasticsearch

New or updated documents follow the sequence: write → refresh → flush → merge.

Write phase

The document is first placed in an in‑memory indexing buffer, and the operation is appended to the translog (transaction log). At this point the document is not yet searchable.

Write phase diagram
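
A minimal sketch of the write phase, assuming a simplified in-memory model. ToyShard and its methods are hypothetical names for illustration, not Elasticsearch classes.

```python
class ToyShard:
    """Simplified model of one shard's write-side state."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not searchable)
        self.translog = []  # append-only log of operations
        self.segments = []  # searchable segments; stays empty until a refresh

    def index(self, doc_id, source):
        # A write touches both structures: the buffer and the translog.
        self.buffer.append((doc_id, source))
        self.translog.append({"op": "index", "id": doc_id, "source": source})

    def search_ids(self):
        # Searches only see segments, never the buffer.
        return [doc_id for seg in self.segments for doc_id, _ in seg]


shard = ToyShard()
shard.index("1", {"name": "red shoes"})
print(len(shard.buffer), len(shard.translog))  # 1 1
print(shard.search_ids())                      # []
```

The document is recorded twice, once for indexing and once for recovery, yet a search still returns nothing.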

Refresh phase

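Periodically (every 1 second by default, configurable via index.refresh_interval) the buffered documents are written to a new segment that resides in the filesystem cache. The segment is opened for search, so the documents become searchable, but it has not yet been fsynced to disk. The translog is left unchanged so that the operations can be replayed after a crash.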

Refresh phase diagram
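
The effect of a refresh can be sketched as follows, again with a toy model. refresh and searchable_ids are hypothetical helpers, and plain lists stand in for the buffer, the translog, and the segments.

```python
def refresh(buffer, segments):
    """Write buffered docs to a new searchable segment and return an empty buffer."""
    if buffer:  # an empty buffer produces no segment
        segments.append(list(buffer))
    return []


def searchable_ids(segments):
    return [doc_id for seg in segments for doc_id, _ in seg]


buffer = [("1", "red shoes"), ("2", "blue hat")]
translog = ["index 1", "index 2"]
segments = []

print(searchable_ids(segments))     # [] before the refresh
buffer = refresh(buffer, segments)
print(searchable_ids(segments))     # ['1', '2'] after the refresh
print(len(buffer), len(translog))   # 0 2
```

The buffer is emptied and the documents become visible, while the translog keeps both entries.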

Flush phase

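Segments are fsynced from the filesystem cache to disk, a new commit point is written, and the translog entries covered by that commit are discarded. After a restart, Elasticsearch reloads the segments listed in the last commit point and replays any operations still in the translog, so acknowledged writes are not lost.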

Flush phase diagram
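
To see why the commit point and the translog together allow recovery, consider this toy sketch. The state dictionary, flush, and recover are hypothetical stand-ins and greatly simplify what Elasticsearch does on disk.

```python
def flush(state):
    """Record the current segments as a commit point and start a fresh translog."""
    state["committed"] = [list(seg) for seg in state["segments"]]
    state["translog"] = []


def recover(state):
    """After a restart: reload committed segments, then replay the translog."""
    docs = [doc for seg in state["committed"] for doc in seg]
    docs.extend(state["translog"])
    return docs


state = {
    "segments": [["doc1", "doc2"]],  # searchable, but only in the filesystem cache
    "committed": [],                 # segments listed in the last commit point
    "translog": ["doc1", "doc2"],    # operations since the last commit
}

flush(state)
state["translog"].append("doc3")     # written after the flush, not yet committed
print(recover(state))                # ['doc1', 'doc2', 'doc3']
```

The first two documents come back from the committed segment, and the third is restored by replaying the translog.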

Merge phase

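Every refresh adds a segment, so small segments are merged in the background into larger ones to reduce the number of files and improve search performance. Documents marked as deleted are not copied into the merged segment; once the merge finishes, the old segments and their *.liv files are removed.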

Merge phase diagram
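
A merge can be pictured with the sketch below. It merges every segment at once for simplicity, whereas Lucene selects candidate segments through a merge policy. merge is a hypothetical helper, and each segment is modeled as a (docs, deleted_ids) pair.

```python
def merge(segments):
    """Combine segments into one, dropping documents marked as deleted."""
    merged_docs = {}
    for docs, deleted in segments:
        for doc_id, text in docs.items():
            if doc_id not in deleted:
                merged_docs[doc_id] = text
    return (merged_docs, set())  # the new segment starts with no deletions


small_segments = [
    ({"1": "red shoes", "2": "blue shoes"}, {"1"}),
    ({"3": "red hat"}, set()),
    ({"4": "green scarf"}, {"4"}),
]

merged_docs, merged_deleted = merge(small_segments)
print(merged_docs)     # {'2': 'blue shoes', '3': 'red hat'}
print(merged_deleted)  # set()
```

Three segments holding four documents collapse into one segment holding the two live documents.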

Summary

The write‑refresh‑flush‑merge pipeline explains how Elasticsearch provides near‑real‑time search (refresh), data durability (translog and flush), and a manageable number of segments (merge). Understanding these steps helps developers tune refresh intervals, manage translog size, and anticipate merge behavior for optimal performance.
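
As a starting point for such tuning, the sketch below assembles a settings body of the kind one might send to an index's _settings endpoint. The values are illustrative assumptions rather than recommendations; the setting names are standard Elasticsearch index settings.

```python
import json

# Illustrative values only; suitable numbers depend on the workload.
tuning_settings = {
    "index": {
        "refresh_interval": "30s",            # refresh less often during bulk loads
        "translog": {
            "durability": "request",          # fsync the translog before acknowledging a write
            "flush_threshold_size": "512mb",  # flush once the translog grows past this size
        },
    }
}

body = json.dumps(tuning_settings, indent=2)
print(body)
```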
