Understanding Elasticsearch: Core Concepts, Architecture, and Performance Tips

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster discovery, node roles, shard and replica management, mapping, installation, health monitoring, indexing mechanics, storage strategies, refresh and translog processes, segment merging, and practical performance optimizations for production deployments.

IT Architects Alliance

Apr 10, 2022

Data in Everyday Life

Data can be broadly classified into structured data (e.g., relational tables) and unstructured data (e.g., documents, images, videos). Structured data is searchable via SQL, while unstructured data requires full‑text search techniques.

Lucene Basics

Lucene is an open‑source Java library that provides the core inverted‑index functionality used by many search engines. It is not a complete search engine by itself; higher‑level products such as Solr and Elasticsearch build on top of Lucene.

Inverted indexing works by tokenizing each document into terms and creating a term dictionary that maps each term to a posting list of documents where it appears. The following table illustrates a simple inverted index for three sample sentences:

Term          Doc_1  Doc_2  Doc_3
---------------------------------
Java          X
is            X      X      X
the           X      X      X
best          X      X      X
programming   X      X      X
language      X      X      X
PHP                  X
Javascript                  X
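
As a rough illustration, the table can be rebuilt in a few lines of Python. The three sentences below are an assumption reconstructed from the table (one language name followed by "is the best programming language"), and the whitespace tokenizer is a deliberate simplification of what a Lucene analyzer does.

```python
from collections import defaultdict

# Assumed sample sentences, reconstructed from the table above.
docs = {
    "Doc_1": "Java is the best programming language",
    "Doc_2": "PHP is the best programming language",
    "Doc_3": "Javascript is the best programming language",
}


def build_inverted_index(documents):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace tokenizer; real analyzers also lowercase,
        # strip punctuation, remove stop words, and so on.
        for term in text.split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(docs)
print(index["Java"])
print(index["best"])
```

Looking up a term is then a dictionary access that returns its posting list, which is why a full‑text query does not need to scan every document.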

Elasticsearch Core Concepts

Elasticsearch is a distributed, near‑real‑time search and analytics engine built in Java. It wraps Lucene to hide its complexity and exposes a RESTful API. Key characteristics include:

Distributed document store where every field can be indexed and searched.

Real‑time analytics capabilities.

Horizontal scalability to hundreds of nodes, handling petabyte‑scale data.

Cluster and Discovery

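A cluster consists of one or more nodes that share the same cluster.name. In Elasticsearch 6.x and earlier, nodes discover each other through the built‑in Zen Discovery module, which pings a unicast host list (discovery.zen.ping.unicast.hosts in elasticsearch.yml) to form the cluster and elect a master node. Elasticsearch 7.0 replaced Zen Discovery with a new cluster coordination layer configured through discovery.seed_hosts and cluster.initial_master_nodes.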

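To avoid split‑brain scenarios, Elasticsearch 6.x relies on a quorum defined by discovery.zen.minimum_master_nodes, which should be set to (number of master‑eligible nodes / 2) + 1. A master can be elected, and can remain master, only while it sees at least that many master‑eligible nodes, so the minority side of a network partition cannot elect a competing master. From 7.0 onward the setting is ignored because the cluster manages its voting quorum itself.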
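
The quorum arithmetic is easy to get wrong by one, so here is a minimal sketch of it. The function names are hypothetical helpers for illustration, not part of any Elasticsearch client.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Majority quorum: (master-eligible nodes // 2) + 1."""
    if master_eligible < 1:
        raise ValueError("a cluster needs at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(visible: int, master_eligible: int) -> bool:
    """True if one side of a partition sees enough master-eligible nodes."""
    return visible >= minimum_master_nodes(master_eligible)


for n in (1, 2, 3, 5):
    print(n, "master-eligible ->", minimum_master_nodes(n))

# A 3-node cluster split 2 / 1: only the larger side may elect a master.
print(can_elect_master(2, 3), can_elect_master(1, 3))
```

With two master‑eligible nodes the quorum is also two, so losing either node blocks elections; this is why three dedicated master‑eligible nodes is a common minimum.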

Node Roles

Each node can act as a master‑eligible node, a data node, or both (the default). Master‑eligible nodes participate in elections and, once elected, manage the cluster state; data nodes store shards and handle indexing and search workloads. In larger clusters it is recommended to run dedicated master‑eligible nodes on lightweight machines so that heavy indexing or search load cannot destabilize the master.

Split‑Brain Prevention

Split‑brain occurs when network partitions cause multiple masters to be elected, leading to data inconsistency. Configuring an appropriate quorum and separating master roles mitigates this risk.

Sharding and Replication

Indices are divided into primary shards; each primary can have multiple replica shards. Sharding enables horizontal scaling, while replicas provide high availability and increase read throughput. The number of primary shards is fixed at index creation because routing calculations depend on it:

shard = hash(routing) % number_of_primary_shards

Routing defaults to the document _id but can be customized.
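
The sketch below shows why the primary shard count cannot change without reindexing. It uses zlib.crc32 as a stand‑in hash purely for illustration (Elasticsearch uses a Murmur3 hash internally), and the document IDs are made up.

```python
import zlib


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards."""
    # crc32 is an assumed stand-in for the real hash function.
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


ids = [f"doc-{i}" for i in range(1000)]  # hypothetical document IDs

# The same routing value always lands on the same shard.
print(pick_shard("doc-42", 5) == pick_shard("doc-42", 5))

# Changing the shard count from 5 to 6 would relocate most documents,
# so lookups by _id would go to the wrong shard.
moved = sum(1 for i in ids if pick_shard(i, 5) != pick_shard(i, 6))
print(f"{moved} of {len(ids)} documents would map to a different shard")
```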

Mapping

Mapping defines how fields are stored and indexed. Elasticsearch supports dynamic mapping (automatic type detection) and explicit mapping (user‑defined field types). Common field types in ES 6.8 include:

text: analyzed full‑text fields.

keyword: exact‑value fields for filtering, sorting, and aggregations.

integer, date, etc.

Explicit mapping is preferred when precise control over analysis, indexing, and storage is required.
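
As an illustration, an explicit mapping for ES 6.8 might look like the request body built below. The field names, the shard counts, and the idea of sending it with PUT /articles are all hypothetical; the `_doc` mapping type level applies to 6.x and is omitted in 7.x and later.

```python
import json


def build_create_index_body(shards: int = 5, replicas: int = 1) -> dict:
    """Request body for creating an index with an explicit mapping (6.x style)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "_doc": {  # mapping type, required in 6.x
                "dynamic": "strict",  # reject fields that are not declared here
                "properties": {
                    "title": {"type": "text"},         # analyzed full text
                    "status": {"type": "keyword"},     # exact value
                    "views": {"type": "integer"},
                    "published_at": {"type": "date"},
                },
            }
        },
    }


body = build_create_index_body()
print(json.dumps(body, indent=2))
```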

Basic Usage

After downloading and extracting Elasticsearch, start it with bin/elasticsearch. The default HTTP port is 9200. A simple curl http://localhost:9200/ returns cluster information in JSON.

Cluster Health

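Cluster health can be queried via GET /_cluster/health, which returns a status of green, yellow, or red. Green means every primary and replica shard is allocated; yellow means all primaries are allocated but at least one replica is not; red means at least one primary shard is unallocated, so some data is unavailable.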
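
A small sketch of reading that endpoint with the Python standard library follows. The base URL assumes a local, unsecured node on the default port, and the sample response is a hand‑written fragment rather than real output.

```python
import json
from urllib.request import urlopen

STATUS_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries allocated, some replicas are not",
    "red": "at least one primary shard is unallocated",
}


def fetch_cluster_health(base_url: str = "http://localhost:9200") -> dict:
    """GET /_cluster/health and decode the JSON body (needs a running node)."""
    with urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def describe_health(health: dict) -> str:
    """Turn a health response into a one-line summary."""
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    return f"{status}: {STATUS_MEANING[status]} ({unassigned} unassigned shards)"


# Hand-written sample; call fetch_cluster_health() against a live cluster instead.
sample = {"status": "yellow", "unassigned_shards": 5}
print(describe_health(sample))
```

A single‑node cluster with the default one replica per shard reports yellow, because a replica is never allocated on the same node as its primary.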

Indexing Mechanics

Write Path

When a document is indexed, the coordinating node computes the target primary shard using the routing formula and forwards the request to the node that holds that primary. The primary adds the document to its in‑memory indexing buffer and appends the operation to its transaction log (translog), then forwards the operation to its replicas. The request is acknowledged to the client once the in‑sync replicas have responded.

Storage Model

Elasticsearch stores data on disk as immutable segments, each containing its own inverted index. New documents are first written to an in‑memory buffer; on each refresh (every 1 s by default), the buffer is written to a new segment in the filesystem cache, without an fsync, which makes the documents searchable but not yet durable.

Segments are periodically merged into larger segments to reclaim space from deleted documents and reduce the number of file handles. Merges run in the background and are resource‑controlled.

Transaction Log (Translog)

The translog records every operation that has not yet been made durable by a Lucene commit. When a flush occurs (by default when the translog reaches 512 MB, index.translog.flush_threshold_size; older releases also flushed every 30 minutes), the in‑memory buffer is written to a new segment, all segments are fsynced to disk in a commit, and the old translog is replaced by a new empty one. On restart, Elasticsearch replays the translog entries recorded since the last commit so that acknowledged writes are not lost.

Refresh and Flush

Refresh makes recent writes visible to search (near‑real‑time). Flush performs a Lucene commit: it fsyncs segments to disk and truncates the translog. Refresh is much cheaper than flush and, by default, runs every second on each shard; the interval can be changed or disabled via index.refresh_interval.
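
The interplay of buffer, translog, refresh, and flush can be sketched with a toy model. ToyShard is a hypothetical teaching class, not Elasticsearch code: it ignores fsync timing, deletes, and merges, and only tracks which data would be searchable or recoverable.

```python
class ToyShard:
    """Toy model of one shard's write path (illustrative only)."""

    def __init__(self):
        self.buffer = []     # in-memory indexing buffer, not searchable
        self.translog = []   # operations since the last commit
        self.segments = []   # immutable, searchable segments
        self.committed = 0   # segments made durable by the last flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Buffer becomes a new searchable segment (not yet durable).
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Commit: all segments become durable, translog is truncated.
        self.refresh()
        self.committed = len(self.segments)
        self.translog.clear()

    def search(self):
        return [doc for segment in self.segments for doc in segment]

    def crash_and_recover(self):
        # Only committed segments survive; the translog is replayed.
        survivor = ToyShard()
        survivor.segments = self.segments[: self.committed]
        survivor.committed = self.committed
        for doc in self.translog:
            survivor.index(doc)
        survivor.refresh()
        return survivor


shard = ToyShard()
shard.index("a")
print(shard.search())      # [] : indexed but not yet refreshed
shard.refresh()
print(shard.search())      # ['a']
shard.flush()
shard.index("b")           # only in buffer and translog
recovered = shard.crash_and_recover()
print(recovered.search())  # ['a', 'b']
```

Document "b" was never refreshed or flushed, yet it survives the simulated crash because its operation was still in the translog.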

Performance Optimizations

Storage Devices

Prefer SSDs; they dramatically improve I/O throughput.

Use RAID 0 or multiple data paths to stripe data across disks. RAID 0 provides no redundancy, so rely on replica shards for fault tolerance.

Avoid remote network mounts (NFS, SMB) and cloud block storage with high latency.

Internal Index Optimizations

Lucene stores terms in a sorted term dictionary and uses a compressed term index (an FST, or finite state transducer) to locate dictionary offsets efficiently. Posting lists are also compressed. The term index is compact enough to be held in memory, so a term lookup needs very few disk seeks.

Configuration Tweaks

Use sequential, monotonic IDs instead of random UUIDs to improve term dictionary compression.

Disable doc values on fields that are never used for sorting or aggregations.

Prefer keyword over text for exact‑match fields to avoid unnecessary analysis.

Increase index.refresh_interval (e.g., to 30 s) for bulk indexing, or set it to -1 to disable refresh entirely during massive imports, and restore the original value once the import finishes.

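Use _search/scroll (or search_after) instead of deep from+size pagination. Each shard must collect from+size hits and the coordinating node must merge them all, which is why index.max_result_window caps from+size at 10,000 by default.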

Limit the number of mapped fields to only those needed for search, aggregation, or sorting.

Specify routing values when possible to target specific shards and improve query locality.
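
Two of the tweaks above lend themselves to a short sketch: the settings body for changing the refresh interval around a bulk load, and the arithmetic behind the cost of deep pagination. Both helper names are hypothetical; the body is what would be sent with PUT /<index>/_settings.

```python
import json


def refresh_settings(interval: str) -> dict:
    """Index settings body; interval is e.g. "30s", "1s", or "-1" (disabled)."""
    return {"index": {"refresh_interval": interval}}


def coordinating_hits(from_: int, size: int, shards: int) -> int:
    """Upper bound on hits the coordinating node merges for one page."""
    # Every shard returns its own top (from + size) hits.
    return (from_ + size) * shards


# Before a bulk import: disable refresh. Afterwards: restore the default.
print(json.dumps(refresh_settings("-1")))
print(json.dumps(refresh_settings("1s")))

# Page 1 versus a page 10,000 hits deep, on a 5-shard index.
print(coordinating_hits(from_=0, size=10, shards=5))       # 50
print(coordinating_hits(from_=10_000, size=10, shards=5))  # 50050
```

The second number grows linearly with the page offset, whereas scroll and search_after keep the per‑request cost close to the page size.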

JVM Tuning

Set -Xms and -Xmx to the same value: no more than 50 % of physical RAM, and below roughly 32 GB so that the JVM can keep using compressed object pointers.

Consider using the G1 garbage collector instead of the default CMS, provided the JDK in use is one on which Elasticsearch supports G1.

Ensure ample free RAM for the operating system’s filesystem cache, as Elasticsearch heavily relies on it for fast reads.
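
The heap rule can be written down as a tiny calculation. The 31 GB ceiling used here is an assumed conservative value chosen to stay under the compressed‑pointer threshold; the exact cutoff depends on the JVM, and both function names are hypothetical.

```python
def recommended_heap_gb(physical_ram_gb: float, ceiling_gb: float = 31.0) -> float:
    """Half of physical RAM, capped below the ~32 GB compressed-oops limit."""
    return min(physical_ram_gb / 2, ceiling_gb)


def jvm_options(physical_ram_gb: float) -> list:
    """Matching -Xms/-Xmx flags, rounded down to whole gigabytes (minimum 1)."""
    heap = max(1, int(recommended_heap_gb(physical_ram_gb)))
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


print(jvm_options(16))   # ['-Xms8g', '-Xmx8g']
print(jvm_options(64))   # ['-Xms31g', '-Xmx31g']
print(jvm_options(256))  # ['-Xms31g', '-Xmx31g']
```

On the 256 GB machine the remaining memory is not wasted: it is left to the operating system's filesystem cache, which serves segment reads.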

By applying these storage, indexing, and JVM recommendations, Elasticsearch clusters can achieve higher throughput, lower latency, and better stability under production workloads.
