Big Data 36 min read

How Elasticsearch Leverages Lucene’s Inverted Index for Real‑Time Distributed Search

This article explains the fundamentals of structured and unstructured data, introduces Lucene’s inverted index, and details how Elasticsearch builds on Lucene to provide distributed, near‑real‑time search with concepts such as clusters, shards, replicas, routing, and performance optimizations.

Efficient Ops

Dec 21, 2022

How Elasticsearch Leverages Lucene’s Inverted Index for Real‑Time Distributed Search

1. Data in Everyday Life

Data can be divided into structured data (tables stored in relational databases) and unstructured data (documents, HTML, images, audio, video, etc.). Structured data is searchable via SQL, while unstructured data requires full‑text search.

2. Introduction to Lucene

Lucene is an open‑source library that provides inverted‑index search capabilities. It is the core engine behind Solr and Elasticsearch.

Inverted indexing works by tokenizing each document into terms, building a term dictionary and a post list that maps each term to the documents containing it.

Key terms:

Term : the smallest searchable unit (a word in English, a token in Chinese).

Term Dictionary : the collection of all unique terms.

Post List : for each term, the list of document IDs and positions where it occurs.

Inverted File : the physical file storing the post lists.

3. Core Concepts of Elasticsearch (ES)

ES is a Java‑based distributed search and analytics engine built on top of Lucene. It provides a RESTful API that hides Lucene’s complexity.

Key characteristics:

Distributed real‑time document store where every field can be indexed and searched.

Scalable to hundreds of nodes and capable of handling petabytes of structured or unstructured data.

Cluster

A cluster consists of one or more nodes sharing the same cluster.name. Nodes can be master‑eligible, data, or coordinating nodes.

Discovery (Zen Discovery)

Nodes discover each other via unicast hosts defined in

discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

. The master election uses the node IDs sorted lexicographically. The setting discovery.zen.minimum_master_nodes defines the quorum required to avoid split‑brain scenarios.

Node Roles

node.master: true  // whether the node can be elected master

node.data: true    // whether the node stores data

Data nodes handle storage and heavy operations; master nodes manage cluster state and shard allocation.

Shards and Replicas

Each index is divided into a fixed number of primary shards (defined at index creation) and optional replica shards for high availability.

PUT /myIndex
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Writes go to the primary shard and are then replicated to its replicas. The routing formula determines the target shard:

shard = hash(routing) % number_of_primary_shards

Routing defaults to the document _id but can be customized.

Mapping

Mapping defines field types, analyzers, and storage options, similar to a database schema. Example:

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

Basic Usage

Start ES with bin/elasticsearch (default port 9200). Verify with curl http://localhost:9200/ which returns cluster information.

{
  "name": "U7fp3O9",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "-Rj8jGQvRIelGd9ckicUOA",
  "version": {
    "number": "6.8.1",
    "build_flavor": "default",
    "build_type": "zip",
    "build_hash": "1fad4e1",
    "build_date": "2019-06-18T13:16:52.517138Z",
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

Cluster Health

{
  "cluster_name": "lzj",
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 9,
  "active_shards": 9,
  "unassigned_shards": 5,
  "active_shards_percent_as_number": 64.28
}

Health colors: green (fully functional), yellow (primary shards OK, some replicas missing), red (some primary shards unavailable).

Write Path

Documents are first written to memory and a transaction log (translog). Periodic refresh creates a new segment visible to search; flush persists segments and clears the translog.

POST/_refresh   // refresh all indices

POST/nba/_refresh   // refresh a specific index

Refresh interval can be tuned, e.g., refresh_interval: "30s" or disabled with -1.

Segments and Merging

Segments are immutable on‑disk files. New data creates new segments; deletions are recorded in a .del file. Periodic background merges combine small segments into larger ones, reclaiming space and improving search performance.

Performance Optimizations

Use SSDs, RAID10/5, and multiple data paths; avoid remote mounts.

Choose appropriate shard count at index creation; avoid changing it later.

Prefer keyword over text for fields not needing full‑text search.

Disable doc_values on fields not used for sorting or aggregations.

Adjust index.refresh_interval and number_of_replicas during bulk indexing.

Use scroll for deep pagination instead of large from+size queries.

Set routing values to target specific shards when possible.

JVM Tuning

Set Xms and Xmx to the same value (typically ≤50% of physical RAM and ≤32 GB). Consider G1GC over CMS. Ensure sufficient heap for file‑system cache.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Elasticsearch Sharding Lucene replication inverted index Distributed Search

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.