Elasticsearch Overview: Core Concepts, Architecture, and Practical Usage

This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, node roles, shard and replica mechanisms, mapping, installation, health monitoring, indexing principles, storage strategies, refresh and translog handling, segment merging, performance tuning, and JVM optimization for large‑scale search applications.

IT Architects Alliance

Jul 14, 2022

Elasticsearch is an open-source, distributed, near-real-time search and analytics engine built on Apache Lucene, designed to handle both structured and unstructured data at petabyte scale.

Real-world data is either structured (for example, relational tables) or unstructured (documents, images, videos). Structured data is searched through a traditional database; unstructured data is searched with full-text search, which can be implemented either by sequentially scanning every document or by building an inverted index that maps each term to the documents containing it.
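
The difference between the two full-text approaches can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: the whitespace tokenizer and the sample documents are assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()

def sequential_scan(docs, term):
    """Inspect every document on every query."""
    return sorted(doc_id for doc_id, text in docs.items() if term in tokenize(text))

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def index_lookup(index, term):
    """Answer a query with one dictionary lookup instead of a full scan."""
    return sorted(index.get(term.lower(), set()))

# Hypothetical sample documents
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene provides an inverted index",
    3: "Relational tables hold structured data",
}
index = build_inverted_index(docs)
```

Both functions return the same document ids for a term; the inverted index pays the scanning cost once, at indexing time, rather than on every query.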

Lucene provides the core inverted-index functionality. Elasticsearch wraps Lucene behind a RESTful API and adds distributed operation, automatic node discovery (Zen Discovery in versions before 7.0), and cluster management.

Cluster and node roles – a cluster consists of one or more nodes that share the same cluster.name. A node can be master-eligible (node.master: true), a data node (node.data: true), or both, and any node can act as a coordinating node that routes client requests to the relevant shards and gathers the results.

Shard and replica model – an index is split into a fixed number of primary shards (e.g., 5), and each primary can have one or more replica shards. A document is routed to its shard by the formula shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. Because the result depends on number_of_primary_shards, that number is fixed when the index is created.
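
The routing formula can be sketched as follows. This is an illustration under one loud assumption: zlib.crc32 stands in for the hash function, whereas Elasticsearch uses its own Murmur3-based hash, so the shard numbers computed here will not match a real cluster. The function name and document ids are hypothetical.

```python
import zlib

def route_to_shard(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards.

    Assumption: crc32 is only a stand-in; Elasticsearch hashes the routing
    value with Murmur3, so real shard numbers differ.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

# Hypothetical document ids; routing defaults to the document _id
doc_ids = ["user-1", "user-2", "user-3", "user-4"]
placement_5 = {doc_id: route_to_shard(doc_id, 5) for doc_id in doc_ids}
placement_6 = {doc_id: route_to_shard(doc_id, 6) for doc_id in doc_ids}
```

Comparing placement_5 with placement_6 shows why the primary shard count is fixed: the same _id can map to a different shard once the modulus changes, so documents written under the old count would no longer be found where the formula points.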

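Mapping defines the type of each field (e.g., text, keyword, date) and can be static (declared explicitly) or dynamic (inferred when a new field first appears in a document). A correct mapping is what makes analysis, sorting, and aggregation behave as intended: a text field is analyzed for full-text search, while a keyword field is indexed as-is for exact matching, sorting, and aggregations.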

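Installation and basic usage – download and unzip the distribution, then start a node with bin/elasticsearch. The HTTP API listens on port 9200 by default, and curl http://localhost:9200/ returns basic node and cluster information. Index creation accepts settings for shards and replicas together with the mappings, e.g. (the _doc level under mappings is the 6.x mapping-type syntax; 7.0 and later omit it):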

PUT my_index
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {"type": "text"},
        "age": {"type": "integer"},
        "created": {"type": "date", "format": "strict_date_optional_time||epoch_millis"}
      }
    }
  }
}
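
The same request could be assembled from Python with only the standard library. The sketch below builds the request without sending it; the helper names are hypothetical, the base URL assumes a local node, and the body keeps the 6.x typed-mapping form used above.

```python
import json
import urllib.request

def build_create_index_request(base_url, index, shards, replicas, properties):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        "settings": {"number_of_shards": shards, "number_of_replicas": replicas},
        # 6.x typed-mapping form, matching the example above
        "mappings": {"_doc": {"properties": properties}},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def total_shard_copies(shards, replicas):
    """number_of_replicas applies per primary, so copies = primaries * (1 + replicas)."""
    return shards * (1 + replicas)

request = build_create_index_request(
    "http://localhost:9200",  # assumed local node
    "my_index",
    shards=5,
    replicas=1,
    properties={"title": {"type": "text"}, "age": {"type": "integer"}},
)
```

Passing the request to urllib.request.urlopen would send it to a running node. With 5 primaries and 1 replica each, total_shard_copies(5, 1) gives 10 shard copies for the cluster to place.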

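Cluster health can be checked via GET /_cluster/health, which reports one of three statuses: green (all primary and replica shards are allocated), yellow (all primaries are allocated but at least one replica is not), or red (at least one primary shard is unallocated).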
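
The status rule can be expressed as a small function. This is a simplified model for illustration, not the server's code: each shard copy is represented by a hypothetical dict with a primary flag and an assigned flag.

```python
def cluster_status(shards):
    """Derive green/yellow/red from a list of shard copies.

    Each entry is a dict such as {"primary": True, "assigned": True};
    this shape is an assumption made for the sketch.
    """
    if any(s["primary"] and not s["assigned"] for s in shards):
        return "red"      # some data is unavailable
    if any(not s["assigned"] for s in shards):
        return "yellow"   # all data available, but redundancy is reduced
    return "green"

# A single-node cluster with one replica configured: the replica has nowhere to go
single_node = [
    {"primary": True, "assigned": True},
    {"primary": False, "assigned": False},
]
```

A one-node cluster whose index has number_of_replicas set to 1 reports yellow for this reason: a replica is never placed on the same node as its primary.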

Write path – each document is added to an in-memory indexing buffer and appended to the translog. A periodic refresh (every second by default) turns the buffer into a new segment that searches can see. A flush performs a Lucene commit, which fsyncs the segments to disk, writes a commit point, and allows the translog to be truncated. Segments are immutable and are merged in the background to keep the segment count down.
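
A toy model makes the sequence easier to follow. The class below is purely illustrative: ToyShard and its methods are hypothetical names, plain Python lists stand in for Lucene structures, and nothing is written to disk.

```python
class ToyShard:
    """In-memory model of index -> refresh -> flush -> merge."""

    def __init__(self):
        self.buffer = []    # indexed documents that are not yet searchable
        self.translog = []  # operations recorded since the last flush
        self.segments = []  # immutable segments; only these are searched

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Model a commit: make everything a segment, then clear the translog."""
        self.refresh()
        self.translog = []

    def merge(self):
        """Combine all segments into one, as a background merge would."""
        if len(self.segments) > 1:
            self.segments = [tuple(d for seg in self.segments for d in seg)]

    def search(self, predicate):
        return [d for seg in self.segments for d in seg if predicate(d)]

shard = ToyShard()
shard.index("doc-a")
visible_before_refresh = shard.search(lambda d: True)  # buffer is not searched
shard.refresh()
visible_after_refresh = shard.search(lambda d: True)
```

The gap between visible_before_refresh and visible_after_refresh is the reason a document indexed a moment ago may not appear in search results until the next refresh.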

Performance optimization includes using SSDs or RAID-0, configuring multiple path.data directories, lengthening the refresh interval, disabling replicas during bulk indexing and restoring them afterwards, using routing so a query targets specific shards, replacing deep from/size pagination with scroll, and keeping the number of mapped fields small.
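
Two of these points lend themselves to a short sketch. The first pair of functions builds settings bodies that could be sent to PUT /my_index/_settings before and after a bulk load; index.refresh_interval and index.number_of_replicas are real index settings, while the helper names and restore values are assumptions. The last function is simple arithmetic showing why deep from/size pagination is expensive.

```python
def bulk_load_settings():
    """Settings to apply before a large bulk load."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

def restore_settings(refresh_interval="1s", replicas=1):
    """Settings to re-apply once the load has finished (values assumed)."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": replicas}}

def hits_collected(from_, size, shards):
    """Each shard must return its top (from + size) hits to the coordinating node."""
    return shards * (from_ + size)

# Page 1001 of 10 results on a 5-shard index
deep_page_cost = hits_collected(from_=10000, size=10, shards=5)
first_page_cost = hits_collected(from_=0, size=10, shards=5)
```

Here deep_page_cost is 50050 candidate hits sorted to return 10 documents, against 50 for the first page, which is the cost that scroll avoids for large result sets.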

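JVM tuning – set Xms and Xmx to the same value, no more than 50% of physical RAM and below about 32 GB so the JVM keeps using compressed object pointers; consider G1GC; leave the remaining memory to the operating system's filesystem cache, which Lucene relies on; and adjust the heap size to the workload.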
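
The heap rule can be written as a small calculation. This is a sketch of the guideline above, not an official sizing tool: the function names are hypothetical, and the 31 GB cap is an assumption chosen to stay safely under the roughly 32 GB limit.

```python
def recommended_heap_gb(ram_gb, cap_gb=31):
    """Half of physical RAM, capped (cap_gb is an assumed safety margin)."""
    return max(1, min(ram_gb // 2, cap_gb))

def jvm_heap_options(ram_gb):
    """Return identical -Xms and -Xmx flags so the heap never resizes."""
    heap = recommended_heap_gb(ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]

small_node = jvm_heap_options(16)   # half of RAM applies
large_node = jvm_heap_options(128)  # the cap applies
```

On the 128 GB machine the heap stops at the cap, and the remaining memory is left for the filesystem cache.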
