How Elasticsearch Powers Real-Time Search: Core Concepts and Best Practices

This article provides a comprehensive overview of Elasticsearch, explaining its underlying Lucene technology, data modeling, cluster architecture, shard and replica mechanisms, indexing workflow, storage strategies, refresh and translog processes, as well as practical performance and JVM tuning tips for building scalable, near‑real‑time search solutions.

Su San Talks Tech

Apr 17, 2022

1. Data in Everyday Life

Search engines retrieve data, which falls broadly into two categories: structured data (tables stored in relational databases) and unstructured data (documents, emails, images, audio, video, etc.). Formats such as XML or HTML, which carry markup but no fixed schema, are often called semi‑structured.

2. Introduction to Lucene

Relational databases are poorly suited to searching unstructured text, so full‑text search engines are needed. Apache Lucene is an open‑source library that provides inverted index capabilities. Solr and Elasticsearch are both built on top of Lucene; Elasticsearch adds built‑in distribution and is easy to install.

Lucene creates an inverted index by tokenizing documents into terms and recording which documents contain each term.

Term         | Doc_1 | Doc_2 | Doc_3
-------------+-------+-------+------
Java         |   X   |       |
is           |   X   |   X   |   X
the          |   X   |   X   |   X
best         |   X   |   X   |   X
programming  |   X   |   X   |   X
language     |   X   |   X   |   X
PHP          |       |   X   |
Javascript   |       |       |   X

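Key terms: a Term is a token produced by analysis; the Term Dictionary is the sorted set of all terms; a Posting List (inverted list) records which documents contain a given term; the Inverted File is the on‑disk file that holds the posting lists.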
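The table above can be reproduced with a few lines of Python. This is only a minimal sketch: the three sample sentences are assumed from the table, the document IDs are illustrative, and the whitespace tokenizer stands in for the much richer analysis Lucene performs.

```python
from collections import defaultdict

# Sample documents assumed from the table above (IDs are illustrative).
docs = {
    "Doc_1": "Java is the best programming language",
    "Doc_2": "PHP is the best programming language",
    "Doc_3": "Javascript is the best programming language",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenizer; real analyzers also lowercase, stem, etc.
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
print(index["Java"])  # documents containing "Java"
print(index["best"])  # documents containing "best"
```

Looking up a term is then a dictionary access followed by reading its list of documents, which is why term queries stay fast even when the collection is large.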

3. Core Concepts of Elasticsearch (ES)

Elasticsearch is a Java‑based open‑source search engine that wraps Lucene and exposes a simple RESTful API. It is a distributed, near‑real‑time document store and analytics engine capable of handling petabyte‑scale structured and unstructured data.

Cluster

A cluster consists of one or more nodes that share the same cluster.name. In the 6.x releases used in this article, nodes discover each other via Zen Discovery (unicast by default). The elected master node manages index creation, shard allocation, and cluster state.

Node Roles

Master‑eligible node (node.master)

Data node (node.data)

Master nodes handle cluster coordination; data nodes store shards. Mixing roles can affect stability, so it is recommended to separate master‑eligible nodes from heavy data nodes.

Split‑Brain Prevention

To avoid electing more than one master, Elasticsearch requires a quorum of master‑eligible nodes, configured in 6.x and earlier through discovery.zen.minimum_master_nodes. The cluster can elect and keep a master only while a majority of master‑eligible nodes can reach each other.
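The usual rule for this setting is a strict majority: (number of master‑eligible nodes / 2) + 1, using integer division. A small sketch of that arithmetic follows; the function name is hypothetical and exists only for illustration.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Smallest strict majority of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for n in (1, 2, 3, 5):
    print(n, "master-eligible nodes -> quorum of", minimum_master_nodes(n))
```

With two master‑eligible nodes the quorum is also two, so losing either node leaves the cluster without a master; three master‑eligible nodes is the smallest layout that tolerates one failure.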

4. Shards and Replicas

Indexes are horizontally split into primary shards; each primary can have replica shards for high availability. The number of primary shards is fixed at index creation, while the number of replicas can be changed later.

PUT /my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Writes go to the primary shard and are then replicated to its replicas. Reads can be served by any shard copy.

5. Mapping

Mapping defines field types, analyzers, and storage options, similar to a database schema. Types include text (full‑text searchable) and keyword (exact match, suitable for sorting and aggregations). Mapping can be dynamic or explicit.

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "name":  {"type": "text"},
        "age":   {"type": "integer"},
        "created": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    }
  }
}

6. Basic Usage

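Download and unzip Elasticsearch, then start it with bin/elasticsearch. By default it listens for HTTP requests on port 9200; a GET request to http://localhost:9200 returns basic node information such as: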

{
  "name" : "U7fp3O9",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "-Rj8jGQvRIelGd9ckicUOA",
  "version" : {
    "number" : "6.8.1",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "1fad4e1",
    "build_date" : "2019-06-18T13:16:52.517138Z",
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Cluster Health

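Health status is green when all primary and replica shards are allocated, yellow when all primaries are allocated but at least one replica is not, and red when at least one primary shard is unallocated. A sample response from the cluster health API (GET /_cluster/health) on a single‑node cluster: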

{
  "cluster_name" : "wujiajian",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "unassigned_shards" : 5,
  "active_shards_percent_as_number" : 64.28571428571429
}
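The percentage in the response above follows from the two shard counts. A minimal sketch of the arithmetic, assuming no shards are initializing or relocating (the sample response shows no such fields); the function name is hypothetical.

```python
def active_shards_percent(active: int, unassigned: int) -> float:
    """Share of shard copies that are active, as a percentage."""
    total = active + unassigned
    # An empty cluster has nothing missing, so treat it as fully active.
    return 100.0 if total == 0 else 100.0 * active / total


# Values taken from the sample health response above.
pct = active_shards_percent(active=9, unassigned=5)
print(f"{pct:.2f}")  # 64.29
```

On a single node the unassigned shards are replicas: a replica is never placed on the same node as its primary, so the status stays yellow until a second data node joins or the replica count is set to 0.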

7. Internal Mechanisms

Write Path

Documents are routed to a primary shard using shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. The coordinating node forwards the request to the node that holds that primary shard; the primary indexes the document and then forwards the operation to its replicas.
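The routing formula can be sketched as follows. CRC32 is used here purely as a deterministic stand‑in from the standard library; it is an assumption for illustration and not the hash function Elasticsearch uses, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards."""
    # Stand-in hash (CRC32): deterministic, but NOT Elasticsearch's own hash.
    h = zlib.crc32(routing.encode("utf-8"))
    return h % number_of_primary_shards


for doc_id in ("user-1", "user-2", "user-3"):
    # The same ID can land on a different shard when the shard count changes.
    print(doc_id, "->", pick_shard(doc_id, 5), "of 5,", pick_shard(doc_id, 6), "of 6")
```

Because the shard count is part of the formula, changing it would send existing IDs to different shards, which is why the number of primary shards is fixed when the index is created.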

Storage Model

Indexes are stored as immutable segments. A new document is first added to an in‑memory buffer and appended to the translog; a refresh turns the buffer into a new searchable segment, and a flush fsyncs segments to disk. Because segments are immutable, deletions are recorded in .del files, and an update is a delete followed by an add.

Refresh and Flush

Refresh makes recent writes searchable (by default every second), which is why search is near‑real‑time rather than real‑time. Flush persists segments to disk and clears the translog; it is triggered when the translog exceeds 512 MB or after 30 minutes.
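The difference between refresh and flush can be illustrated with a toy model. This sketch assumes writes first land in an in‑memory buffer and in the translog; the class and its attributes are hypothetical and do not mirror how Lucene or Elasticsearch are implemented.

```python
class ToyShard:
    """Toy model of one shard's write path (illustration only)."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not yet searchable
        self.translog = []   # operations recorded since the last flush
        self.segments = []   # each segment is an immutable tuple of documents
        self.committed = 0   # number of segments treated as persisted

    def index(self, doc):
        # A write goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment; the translog is kept.
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        # Refresh, mark every segment as persisted, then clear the translog.
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def search(self):
        # Only documents that already live in a segment are visible.
        return [doc for seg in self.segments for doc in seg]


shard = ToyShard()
shard.index({"title": "hello"})
print(shard.search())        # empty: the document is not searchable yet
shard.refresh()
print(shard.search())        # the document is now visible
shard.flush()
print(len(shard.translog))   # 0: the translog is cleared after a flush
```

The model shows why a freshly indexed document can be missing from search results for up to one refresh interval, and why the translog is what protects writes between flushes.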

Segment Merging

Background merges combine small segments into larger ones, reclaiming space from deleted documents and reducing the number of file handles.
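As a rough illustration of what a merge achieves, the sketch below combines several small segments into one and drops documents that were marked as deleted. The data layout (tuples of dicts with an _id key, plus a set of deleted IDs) is an assumption made for the example, not Lucene's actual format.

```python
def merge_segments(segments, deleted_ids):
    """Combine segments into one, skipping documents marked as deleted."""
    merged = [
        doc
        for segment in segments
        for doc in segment
        if doc["_id"] not in deleted_ids
    ]
    return tuple(merged)  # the merged segment is immutable again


small_segments = [
    ({"_id": "1"}, {"_id": "2"}),
    ({"_id": "3"},),
]
merged = merge_segments(small_segments, deleted_ids={"2"})
print([doc["_id"] for doc in merged])  # ['1', '3']
```

After the merge, one segment replaces two, and the space held by the deleted document is reclaimed.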

8. Performance Optimization

Hardware

Use SSDs and RAID 0 for high I/O throughput.

Avoid remote network mounts (NFS, SMB).

Prefer local instance storage over cloud block storage when possible.

Index Internals

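Lucene stores terms in a sorted dictionary and keeps a compressed Finite State Transducer (FST) as an index over it. The FST maps term prefixes to blocks of the on‑disk dictionary, so a lookup reads only a small part of the dictionary while the index itself stays memory‑efficient.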

Configuration Tweaks

Use sequential, compressible IDs instead of random UUIDs.

Disable doc values on fields that are not used for sorting or aggregations.

Prefer keyword over text when full‑text search is unnecessary.

Raise index.refresh_interval during bulk indexing (e.g., set it to -1 to disable refresh), then restore it afterwards.

Set index.number_of_replicas to 0 during massive imports, then restore.

Use scroll APIs instead of deep pagination with from+size.

Limit mapping fields to those required for search, aggregation, or sorting.

Provide explicit routing values to target specific shards.
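The two bulk‑import tweaks above (refresh interval and replica count) amount to a pair of settings updates. The sketch below only builds the JSON bodies that would be sent to the index's _settings endpoint; no HTTP client is used, the function names are hypothetical, and the restore values (1s, 1 replica) are assumed from the defaults and examples earlier in this article.

```python
import json


def bulk_import_settings():
    """Settings to apply before a large import: no refresh, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to apply once the import has finished (assumed values)."""
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


# Each body would be sent as: PUT /my_index/_settings
print(json.dumps(bulk_import_settings(), indent=2))
print(json.dumps(restore_settings(), indent=2))
```

Restoring the replica count after the import lets Elasticsearch copy finished segments to the replicas once, instead of indexing every document on each copy.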

JVM Tuning

Set -Xms and -Xmx to the same value: no more than 50 % of physical RAM, and below 32 GB so the JVM can keep using compressed object pointers.

Consider G1GC over CMS for reduced stop‑the‑world pauses.

Ensure ample free memory for the operating system’s file‑system cache.
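The heap sizing rule can be written as a small calculation. This is a sketch under stated assumptions: the 31 GB ceiling is a conservative choice for staying below the 32 GB limit mentioned above, and both function names are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb: float, ceiling_gb: float = 31.0) -> float:
    """Half of physical RAM, capped at an assumed ceiling below 32 GB."""
    return min(physical_ram_gb / 2, ceiling_gb)


def jvm_heap_flags(physical_ram_gb: float) -> str:
    """Render matching -Xms/-Xmx flags, using whole gigabytes (at least 1)."""
    heap = max(1, int(recommended_heap_gb(physical_ram_gb)))
    return f"-Xms{heap}g -Xmx{heap}g"


for ram in (16, 64, 128):
    print(f"{ram} GB RAM -> {jvm_heap_flags(ram)}")
```

Whatever memory is not given to the heap remains available to the operating system's file‑system cache, which Lucene relies on for fast segment access.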

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
