How Elasticsearch Stores and Retrieves Data: Inside Lucene’s Write‑Refresh‑Flush‑Merge Cycle

This article explains the fundamental architecture of Elasticsearch and its underlying Lucene engine, detailing the data model, index hierarchy, and the step‑by‑step write, refresh, flush, and merge processes that enable near‑real‑time search and data durability.

ITPUB

Sep 1, 2019

What are Elasticsearch and Lucene?

Elasticsearch is an open-source search engine built on Apache Lucene. Lucene is a high-performance search library that provides the core indexing and query capabilities, while Elasticsearch adds distribution across nodes, a RESTful API, near-real-time search, and a richer query DSL.

Relationship between Elasticsearch and Lucene

Elasticsearch is implemented directly on top of Lucene; it packages Lucene’s core libraries, extends them with additional features, and exposes them via RESTful endpoints. Other projects such as Solr also use Lucene.

Elasticsearch and Lucene relationship diagram

Data model and storage hierarchy

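An Elasticsearch cluster consists of multiple nodes; an index (e.g., a product or order search index) is split into shards that are distributed across those nodes.

Each node hosts one or more primary shards (P1, P2) and replica shards (R1, R2); a replica is never placed on the same node as its primary.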

Each shard corresponds to a Lucene index stored on disk.

A Lucene index is a collection of segment files; each segment contains a set of documents.

Elasticsearch index, node, shard, and Lucene segment hierarchy
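
The hierarchy above can be modeled with a few plain data structures. The sketch below is a toy under stated assumptions: the round-robin placement, the function name place_shards, and the node names are invented for illustration and are not Elasticsearch's real allocator. It only demonstrates that each shard is its own Lucene index and that a replica lives on a different node from its primary.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    name: str                                      # "P1" (primary) or "R1" (replica)
    segments: list = field(default_factory=list)   # one shard = one Lucene index


def place_shards(nodes, num_primaries):
    """Toy round-robin placement: one replica per primary, on the next node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to separate primary and replica")
    layout = {node: [] for node in nodes}
    for i in range(num_primaries):
        primary_node = nodes[i % len(nodes)]
        replica_node = nodes[(i + 1) % len(nodes)]  # always a different node
        layout[primary_node].append(Shard(f"P{i + 1}"))
        layout[replica_node].append(Shard(f"R{i + 1}"))
    return layout


layout = place_shards(["node-1", "node-2"], num_primaries=2)
for node, shards in layout.items():
    print(node, [shard.name for shard in shards])
```

With two nodes and two primaries, each node ends up holding one primary and the replica of the other primary, which matches the P1/P2 and R1/R2 layout described above.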

Lucene index structure

Newly indexed documents are written to a new segment in batches rather than one segment per document; a commit point records the list of segments that have been durably written to disk.

Search queries examine all existing segments.

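Segments are immutable, so a delete only marks the document in the segment's *.liv file, and marked documents are filtered out of search results.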

Lucene segment and commit point illustration
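
To make the segment behavior concrete, here is a minimal sketch in Python. The Segment class and search_all function are hypothetical names, and a plain set of deleted IDs stands in for the *.liv file (Lucene records which documents are live; the toy stores the inverse). It is not Lucene's implementation, only an illustration of searching every segment while skipping deleted documents.

```python
class Segment:
    """Toy segment: its documents are never modified after creation."""

    def __init__(self, docs):
        self.docs = dict(docs)     # doc_id -> text
        self.deleted = set()       # stand-in for the segment's .liv file

    def search(self, term):
        # Match whole words and skip anything marked as deleted.
        return [doc_id for doc_id, text in self.docs.items()
                if doc_id not in self.deleted and term in text.split()]


def search_all(segments, term):
    """A query visits every segment and combines the per-segment hits."""
    hits = []
    for segment in segments:
        hits.extend(segment.search(term))
    return hits


segments = [
    Segment({1: "red shoes", 2: "blue shoes"}),
    Segment({3: "red hat"}),
]
segments[0].deleted.add(1)         # delete doc 1 without rewriting the segment
print(search_all(segments, "red"))
```

Document 1 still occupies space in the first segment after the delete; it simply stops appearing in results.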

Document write‑path in Elasticsearch

New or updated documents follow the sequence: write → refresh → flush → merge.

Write phase

The document is first placed in an in‑memory buffer and a translog entry is recorded.
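
The reason for recording a translog entry alongside the buffer can be shown with a small sketch. Everything here is an assumption made for illustration: the ToyShard class, the JSON-lines file format, and the fsync on every operation are choices of the toy, not a description of Elasticsearch's actual translog format.

```python
import json
import os
import tempfile


class ToyShard:
    def __init__(self, translog_path):
        self.translog_path = translog_path
        self.buffer = []                       # in-memory indexing buffer

    def index(self, doc):
        # Buffer the document, then append the operation to the translog.
        self.buffer.append(doc)
        with open(self.translog_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(doc) + "\n")
            log.flush()
            os.fsync(log.fileno())             # force the entry onto disk

    def recover(self):
        # After a crash the buffer is gone; replay the translog to rebuild it.
        with open(self.translog_path, encoding="utf-8") as log:
            self.buffer = [json.loads(line) for line in log if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
shard = ToyShard(path)
shard.index({"id": 1, "title": "red shoes"})

restarted = ToyShard(path)                     # simulates a process restart
restarted.recover()
print(restarted.buffer)
```

The restarted shard starts with an empty buffer and recovers the document purely from the log file, which is the role the translog plays until the next flush.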

Refresh phase

Periodically (every 1 second by default, configurable via index.refresh_interval), the buffered documents are written to a new segment that resides in the filesystem cache. The segment is searchable at once even though it has not been fsynced to disk, and the translog is left unchanged so that those documents can still be recovered.
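
As an illustration of tuning this setting, the sketch below builds the request for Elasticsearch's update index settings API (PUT /<index>/_settings). The helper name refresh_interval_request and the index name products are hypothetical, and the function only constructs the request; it does not send anything.

```python
import json


def refresh_interval_request(index_name, interval):
    """Build (method, path, body) for an update-settings call; nothing is sent."""
    body = {"index": {"refresh_interval": interval}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)


method, path, body = refresh_interval_request("products", "30s")
print(method, path, body)
```

A longer interval such as 30s trades search freshness for less refresh work during heavy indexing, and a value of -1 disables periodic refresh altogether.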

Flush phase

Segments held in the filesystem cache are fsynced to disk, a new commit point listing them is written, and the translog is then cleared. This ensures durability and allows recovery after a restart.
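
The difference between refresh and flush is easiest to see as state transitions. The following is a toy model with hypothetical names (ToyShard, cached_segments, commit_point); plain Python lists stand in for the buffer, the translog, the filesystem cache, and the disk, so it illustrates the ordering of events rather than real I/O.

```python
class ToyShard:
    def __init__(self):
        self.buffer = []            # in memory, not yet searchable
        self.translog = []          # operation log kept for recovery
        self.cached_segments = []   # searchable, only in the filesystem cache
        self.commit_point = []      # segments safely on disk

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment; translog untouched.
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Persist all cached segments, record them, then clear the translog.
        self.refresh()
        self.commit_point.extend(self.cached_segments)
        self.cached_segments.clear()
        self.translog.clear()

    def searchable_docs(self):
        segments = self.commit_point + self.cached_segments
        return [doc for segment in segments for doc in segment]


shard = ToyShard()
shard.index("doc-1")
print(shard.searchable_docs(), shard.translog)   # not searchable yet
shard.refresh()
print(shard.searchable_docs(), shard.translog)   # searchable, translog intact
shard.flush()
print(shard.commit_point, shard.translog)        # committed, translog cleared
```

Only after flush is the translog safe to discard, because the commit point now covers every document it contained.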

Merge phase

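Small segments are merged in the background into larger ones to reduce the number of files and improve search performance. Documents marked as deleted are not copied into the merged segment, and once the merge completes the old segments and their *.liv files are removed.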
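
A merge can be sketched as copying the live documents of several small segments into one new segment. The function merge_segments, the dict layout of a segment, and the max_docs threshold are all assumptions made for this example; Lucene's real merge policy weighs segment sizes and deletes in a more involved way.

```python
def merge_segments(segments, max_docs=3):
    """Merge small segments into one, dropping documents marked as deleted.

    Each segment is a dict: {"docs": {doc_id: text}, "deleted": set_of_ids}.
    Segments with at most `max_docs` documents count as small (toy policy).
    """
    small = [s for s in segments if len(s["docs"]) <= max_docs]
    large = [s for s in segments if len(s["docs"]) > max_docs]
    if len(small) < 2:
        return segments                        # nothing worth merging
    merged_docs = {}
    for segment in small:
        for doc_id, text in segment["docs"].items():
            if doc_id not in segment["deleted"]:   # deleted docs are not copied
                merged_docs[doc_id] = text
    return large + [{"docs": merged_docs, "deleted": set()}]


segments = [
    {"docs": {1: "a", 2: "b"}, "deleted": {1}},
    {"docs": {3: "c"}, "deleted": set()},
    {"docs": {4: "d", 5: "e", 6: "f", 7: "g"}, "deleted": set()},
]
merged = merge_segments(segments)
print(len(merged), merged[-1]["docs"])
```

The two small segments collapse into one, and document 1, which was only marked as deleted before, is physically gone from the result.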

Summary

The write‑refresh‑flush‑merge pipeline explains how Elasticsearch guarantees near‑real‑time search, data durability, and efficient indexing. Understanding these steps helps developers tune refresh intervals, manage translog size, and anticipate merge behavior for optimal performance.
