Big Data 37 min read

Elasticsearch Architecture Overview and Core Concepts

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.

Top Architect

Oct 19, 2022

Elasticsearch Architecture Overview and Core Concepts

1. Data in Real Life

Data can be divided into structured (e.g., relational tables) and unstructured (e.g., documents, images, videos). Searching these two categories leads to structured data search via relational databases and unstructured data search via full‑text techniques such as sequential scanning or inverted indexes.

2. About Lucene

Lucene is an open‑source library that provides the core inverted‑index functionality used by full‑text search engines like Solr and Elasticsearch. It creates a term dictionary and posting lists that map each term to the documents containing it.

Term          Doc_1    Doc_2   Doc_3
-------------------------------------
Java          |   X   |        |
is            |   X   |   X    |   X
the           |   X   |   X    |   X
best          |   X   |   X    |   X
programming   |   X   |   X    |   X
language      |   X   |   X    |   X
PHP           |       |   X    |   
Javascript    |       |        |   X
-------------------------------------

3. ES Core Concepts

Elasticsearch is a distributed, near‑real‑time search and analytics engine built on Lucene. It provides a RESTful API, automatic cluster management, and features such as master‑eligible nodes, data nodes, and coordinating nodes.

Cluster (Cluster)

A cluster consists of one or more nodes sharing the same cluster.name. Nodes discover each other via Zen Discovery (unicast or multicast) and elect a master node to manage metadata and shard allocation.

Node Roles

Nodes can be master‑eligible ( node.master: true) and/or data nodes ( node.data: true). Master nodes handle cluster state, while data nodes store shards and perform indexing/search operations. Coordinating nodes forward client requests to the appropriate shards.

Split‑Brain (Brain Split)

When network partitions cause multiple masters, a quorum setting ( discovery.zen.minimum_master_nodes) prevents split‑brain by requiring a minimum number of master‑eligible nodes to form a healthy cluster.

4. Shards and Replicas

Indices are divided into primary shards (default 5) and replica shards. Shards enable horizontal scaling; replicas provide high availability and load balancing. The number of replicas cannot exceed N‑1 where N is the number of data nodes.

5. Mapping

Mappings define field types (e.g., text, keyword, integer, date) and analysis behavior. Dynamic mapping infers types automatically, while explicit mapping gives precise control over indexing and search.

6. Basic Usage

Installation is a simple unzip; start the node with bin/elasticsearch. The REST API listens on port 9200. Cluster health can be checked via GET /_cluster/health, returning statuses green, yellow, or red.

7. Indexing Process

Documents are routed to a primary shard using shard = hash(routing) % number_of_primary_shards. The routing value defaults to the document _id but can be customized. The coordinating node forwards the request to the primary shard, which writes to the translog and in‑memory buffer.

Refresh and Flush

Every second (default) a refresh creates a new segment from the in‑memory buffer, making recent documents searchable. When the translog reaches 512 MB or 30 minutes, a flush writes the buffer to disk, creates a commit point, and clears the translog.

Segment Merging

Background merges combine small segments into larger ones, reclaiming space from deleted documents and reducing the number of file handles, which improves search performance.

8. Performance Optimization

Use SSDs or RAID‑0 for higher I/O throughput.

Avoid remote file systems (NFS, SMB) and overly large JVM heaps (>32 GB).

Prefer sequential IDs over random UUIDs for better compression.

Disable doc values on fields not used for sorting/aggregations.

Use keyword instead of text when full‑text analysis is unnecessary.

Adjust index.refresh_interval (e.g., 30s) or disable it ( -1) during bulk indexing.

Prefer the Scroll API over deep pagination.

Set appropriate routing values to target specific shards.

9. JVM Tuning

Set Xms and Xmx to the same value (no more than 50 % of physical RAM and ≤32 GB). Consider using the G1 garbage collector instead of CMS. Ensure ample OS file‑system cache for fast segment reads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Indexing Search Engine Elasticsearch cluster

Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.