Elasticsearch Overview: Architecture, Core Concepts, and Performance Optimization
This article provides a comprehensive introduction to Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, node roles, shard and replica mechanisms, mapping, basic usage, health monitoring, indexing workflow, storage strategies, and practical performance tuning techniques.
Elasticsearch is an open‑source, distributed, near‑real‑time search and analytics engine built on Apache Lucene, which provides the core inverted‑index functionality required for full‑text search.
1. Data Types in Search – Real‑world data can be classified as structured (relational tables) or unstructured (documents, images, videos). Structured data is searchable via traditional SQL indexes, while unstructured data requires full‑text indexing and an inverted index.
2. Lucene Basics – Lucene creates an inverted index by tokenizing each document into terms, building a term dictionary and posting lists. This index is stored as immutable segments on disk, enabling fast read‑only queries.
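To make the term dictionary and posting lists concrete, here is a minimal sketch in Python. It is a toy model, not Lucene's implementation: the `tokenize`, `build_inverted_index`, and `search_all` helpers and the sample documents are hypothetical names chosen for illustration, and the tokenizer is far simpler than a real analyzer.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to a sorted posting list of the document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorted keys stand in for the term dictionary; sorted ID lists for posting lists.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search_all(index, query):
    """Return the IDs of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents keyed by document ID.
docs = {
    1: "Elasticsearch is built on Lucene",
    2: "Lucene stores an inverted index",
    3: "An inverted index maps terms to documents",
}
index = build_inverted_index(docs)
print(index["lucene"])
print(search_all(index, "inverted index"))
```

A lookup touches only the posting lists of the query terms, which is why a query does not need to scan every document.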
3. Elasticsearch Core Concepts
Cluster – A set of nodes sharing the same cluster.name. One node is elected master and manages the cluster state and shard allocation. In versions before 7.0, node discovery and master election are handled by the Zen Discovery module.
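Node Roles – Nodes can be master‑eligible (node.master: true) and/or data nodes (node.data: true); from version 7.9 onward the same roles are declared through the node.roles setting. Running dedicated master‑eligible nodes separately from data nodes improves stability, because heavy indexing and search load cannot starve cluster management.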
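Shards & Replicas – An index is split into a configurable number of primary shards; each primary can have multiple replica shards for fault tolerance and load balancing. A document is routed to a primary shard by the formula shard = hash(routing) % number_of_primary_shards, where routing defaults to the document _id. Because the result depends on number_of_primary_shards, that value is fixed when the index is created, whereas the number of replicas can be changed at any time.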
Mapping – Defines field types (e.g., text, keyword, integer, date) and analysis settings. Dynamic mapping infers types automatically, while explicit mapping gives precise control.
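The routing formula from the Shards & Replicas entry, shard = hash(routing) % number_of_primary_shards, can be sketched as follows. This is an illustration under stated assumptions: `route_to_shard` is a hypothetical helper, and `zlib.crc32` is only a stand‑in for the hash, not the function Elasticsearch uses internally.

```python
import zlib


def route_to_shard(routing, number_of_primary_shards):
    """Apply shard = hash(routing) % number_of_primary_shards.

    zlib.crc32 is a stand-in hash for illustration only; it is not the
    hash function Elasticsearch uses internally.
    """
    if number_of_primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs; routing defaults to the document _id.
doc_ids = [f"doc-{i}" for i in range(1000)]
with_5 = {d: route_to_shard(d, 5) for d in doc_ids}
with_6 = {d: route_to_shard(d, 6) for d in doc_ids}

# Count how many documents would be looked up on a different shard
# if the primary shard count changed from 5 to 6.
moved = sum(1 for d in doc_ids if with_5[d] != with_6[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

The count printed at the end suggests why the number of primary shards cannot simply be edited on a live index: most documents would no longer be found on the shard the formula points to.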
4. Basic Usage – Install by extracting the downloaded archive, start a node with bin/elasticsearch, and access the REST API on port 9200. Create an index with a PUT request whose body carries settings for number_of_shards and number_of_replicas along with the field mappings.
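As a rough sketch of the index‑creation request described above, the following Python builds the PUT request with the standard library. Several details are assumptions rather than anything stated in the article: the host `http://localhost:9200`, the index name `articles`, the field names, and the typeless `mappings.properties` layout used by version 7 and later (older versions expect a mapping type name at that level). The request is built but not sent, so the snippet runs without a live node.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, shards, replicas, properties):
    """Build (but do not send) a PUT request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        # Typeless mapping layout; assumes Elasticsearch 7 or later.
        "mappings": {"properties": properties},
    }
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index and fields, chosen only for illustration.
request = build_create_index_request(
    "http://localhost:9200",
    "articles",
    shards=3,
    replicas=1,
    properties={
        "title": {"type": "text"},
        "tags": {"type": "keyword"},
        "published": {"type": "date"},
    },
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it to a running node: urllib.request.urlopen(request)
```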
5. Cluster Health – GET /_cluster/health returns status (green, yellow, red) indicating shard allocation and replica availability.
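The three status values can be interpreted programmatically. The sketch below parses a health response and explains the status; `summarize_health` is a hypothetical helper and the sample payload is invented for illustration (a single node holding three primaries whose replicas have nowhere to go), showing only a subset of the fields a real response contains.

```python
import json

STATUS_MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, but at least one replica is not",
    "red": "at least one primary shard is not allocated",
}


def summarize_health(response_text):
    """Turn the JSON text of a cluster health response into a short summary."""
    health = json.loads(response_text)
    status = health["status"]
    if status not in STATUS_MEANING:
        raise ValueError(f"unexpected status: {status!r}")
    return {
        "status": status,
        "meaning": STATUS_MEANING[status],
        "needs_attention": status != "green",
        "unassigned_shards": health.get("unassigned_shards", 0),
    }


# Invented sample response; a real one would come from GET /_cluster/health.
sample = json.dumps({
    "cluster_name": "demo-cluster",
    "status": "yellow",
    "number_of_nodes": 1,
    "active_primary_shards": 3,
    "active_shards": 3,
    "unassigned_shards": 3,
})
summary = summarize_health(sample)
print(summary["status"], "-", summary["meaning"])
```

A yellow status on a single‑node cluster is expected whenever number_of_replicas is greater than zero, since a replica is never placed on the same node as its primary.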
6. Indexing Workflow – Each document is first added to an in‑memory buffer and appended to the translog; at this point it is durable but not yet searchable. A refresh (every 1 s by default) makes buffered documents searchable by writing them into a new segment in the filesystem cache. A flush (triggered when the translog reaches 512 MB or after 30 min) persists the segments to disk and clears the translog.
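The write path above can be modeled with a small toy class. This is a simplified sketch, not how Elasticsearch or Lucene store data: `ToyShard` and its attributes are hypothetical, and refresh and flush are called by hand rather than by a timer or a size threshold.

```python
class ToyShard:
    """Toy model of one shard's write path (buffer, translog, segments)."""

    def __init__(self):
        self.buffer = []           # in-memory buffer, not yet searchable
        self.translog = []         # durability log, cleared on flush
        self.cached_segments = []  # searchable segments in the filesystem cache
        self.disk_segments = []    # segments persisted to disk

    def index(self, doc):
        """Record the document in the translog and the in-memory buffer."""
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable, searchable segment."""
        if self.buffer:
            self.cached_segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Persist all segments and clear the translog."""
        self.refresh()
        self.disk_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def search(self, term):
        """Search segments only; buffered documents are invisible."""
        segments = self.disk_segments + self.cached_segments
        return [doc for segment in segments for doc in segment if term in doc]


shard = ToyShard()
shard.index("hello elasticsearch")
before_refresh = shard.search("hello")   # document is still only in the buffer
shard.refresh()
after_refresh = shard.search("hello")    # now served from a cached segment
translog_before_flush = len(shard.translog)
shard.flush()
print(before_refresh, after_refresh, translog_before_flush, len(shard.translog))
```

The gap between `index` and `refresh` is what "near‑real‑time" refers to: a document is safe in the translog immediately, but it becomes searchable only after the next refresh.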
7. Storage Mechanics – Segments are immutable; deletions are recorded in a .del file and cleaned up during background segment merges. Merges combine small segments into larger ones, reclaiming space and improving query performance.
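A toy sketch of how deletion markers and merges interact is shown below. The tuples stand in for immutable segments and the `deleted_ids` set stands in for the .del file; `live_docs` and `merge` are hypothetical helpers, and real merge policies select which segments to combine far more selectively than this.

```python
def live_docs(segments, deleted_ids):
    """Query-time view: skip documents flagged as deleted, leaving segments untouched."""
    return [
        (doc_id, text)
        for segment in segments
        for doc_id, text in segment
        if doc_id not in deleted_ids
    ]


def merge(segments, deleted_ids):
    """Write one new segment holding only live documents; old segments are then dropped."""
    return (tuple(live_docs(segments, deleted_ids)),)


# Three small immutable segments of (doc_id, text) pairs; invented sample data.
segments = (
    ((1, "alpha"), (2, "beta")),
    ((3, "gamma"),),
    ((4, "delta"), (5, "epsilon")),
)
deleted_ids = {2, 5}  # stands in for the .del file

stored_before = sum(len(segment) for segment in segments)
merged = merge(segments, deleted_ids)
stored_after = sum(len(segment) for segment in merged)
print(stored_before, stored_after, len(merged))
```

Before the merge, deleted documents still occupy space and must be filtered out on every query; after it, they are physically gone and the query has one segment to visit instead of three.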
8. Performance Optimization
• Use SSDs, or RAID 0 across multiple disks, for high I/O throughput.
• Configure multiple path.data directories to spread data across disks.
• During bulk indexing, raise index.refresh_interval (e.g., 30s) and temporarily set number_of_replicas to 0; restore both afterwards.
• Choose appropriate field types (use keyword instead of text when analysis is not needed).
• Avoid deep pagination with large from/size offsets; use the scroll API or search_after instead.
• Set the JVM heap with Xms equal to Xmx, at no more than 50 % of physical RAM, and consider G1GC.
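Two of the tips above translate directly into small payloads. The sketch below builds the settings bodies that could be sent to an index's _settings endpoint before and after a bulk load, and derives matching heap flags from the 50 % guideline. The helper names, the restore defaults (1 replica, 1s refresh), and the 32 GB example machine are assumptions made for illustration.

```python
import json


def bulk_load_settings(refresh_interval="30s"):
    """Settings to apply before a bulk load: slower refresh, no replicas."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": 0}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the bulk load has finished (assumed defaults)."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": replicas}}


def heap_flags(physical_ram_gb, fraction=0.5):
    """Return equal Xms/Xmx flags sized to a fraction of physical RAM."""
    if not 0 < fraction <= 0.5:
        raise ValueError("the guideline is at most 50% of physical RAM")
    heap_gb = max(1, int(physical_ram_gb * fraction))
    return [f"-Xms{heap_gb}g", f"-Xmx{heap_gb}g"]


# Each body would be sent with PUT /<index>/_settings.
print(json.dumps(bulk_load_settings()))
print(json.dumps(restore_settings()))
print(heap_flags(32))  # hypothetical machine with 32 GB of RAM
```

Keeping Xms and Xmx identical avoids heap resizing at runtime, and leaving the other half of RAM to the operating system gives the filesystem cache room to hold hot segments.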
By understanding these concepts and applying the recommended settings, developers can deploy a robust, scalable Elasticsearch cluster suitable for large‑scale search and analytics workloads.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
