Elasticsearch Architecture Overview and Core Concepts
This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.
1. Data in Real Life
Data can be divided into structured (e.g., relational tables) and unstructured (e.g., documents, images, videos). Searching these two categories leads to structured data search via relational databases and unstructured data search via full‑text techniques such as sequential scanning or inverted indexes.
2. About Lucene
Lucene is an open‑source library that provides the core inverted‑index functionality used by full‑text search engines like Solr and Elasticsearch. It creates a term dictionary and posting lists that map each term to the documents containing it.
Term Doc_1 Doc_2 Doc_3
-------------------------------------
Java | X | |
is | X | X | X
the | X | X | X
best | X | X | X
programming | X | X | X
language | X | X | X
PHP | | X |
Javascript | | | X
-------------------------------------3. ES Core Concepts
Elasticsearch is a distributed, near‑real‑time search and analytics engine built on Lucene. It provides a RESTful API, automatic cluster management, and features such as master‑eligible nodes, data nodes, and coordinating nodes.
Cluster (Cluster)
A cluster consists of one or more nodes sharing the same cluster.name. Nodes discover each other via Zen Discovery (unicast or multicast) and elect a master node to manage metadata and shard allocation.
Node Roles
Nodes can be master‑eligible ( node.master: true) and/or data nodes ( node.data: true). Master nodes handle cluster state, while data nodes store shards and perform indexing/search operations. Coordinating nodes forward client requests to the appropriate shards.
Split‑Brain (Brain Split)
When network partitions cause multiple masters, a quorum setting ( discovery.zen.minimum_master_nodes) prevents split‑brain by requiring a minimum number of master‑eligible nodes to form a healthy cluster.
4. Shards and Replicas
Indices are divided into primary shards (default 5) and replica shards. Shards enable horizontal scaling; replicas provide high availability and load balancing. The number of replicas cannot exceed N‑1 where N is the number of data nodes.
5. Mapping
Mappings define field types (e.g., text, keyword, integer, date) and analysis behavior. Dynamic mapping infers types automatically, while explicit mapping gives precise control over indexing and search.
6. Basic Usage
Installation is a simple unzip; start the node with bin/elasticsearch. The REST API listens on port 9200. Cluster health can be checked via GET /_cluster/health, returning statuses green, yellow, or red.
7. Indexing Process
Documents are routed to a primary shard using shard = hash(routing) % number_of_primary_shards. The routing value defaults to the document _id but can be customized. The coordinating node forwards the request to the primary shard, which writes to the translog and in‑memory buffer.
Refresh and Flush
Every second (default) a refresh creates a new segment from the in‑memory buffer, making recent documents searchable. When the translog reaches 512 MB or 30 minutes, a flush writes the buffer to disk, creates a commit point, and clears the translog.
Segment Merging
Background merges combine small segments into larger ones, reclaiming space from deleted documents and reducing the number of file handles, which improves search performance.
8. Performance Optimization
Use SSDs or RAID‑0 for higher I/O throughput.
Avoid remote file systems (NFS, SMB) and overly large JVM heaps (>32 GB).
Prefer sequential IDs over random UUIDs for better compression.
Disable doc values on fields not used for sorting/aggregations.
Use keyword instead of text when full‑text analysis is unnecessary.
Adjust index.refresh_interval (e.g., 30s) or disable it ( -1) during bulk indexing.
Prefer the Scroll API over deep pagination.
Set appropriate routing values to target specific shards.
9. JVM Tuning
Set Xms and Xmx to the same value (no more than 50 % of physical RAM and ≤32 GB). Consider using the G1 garbage collector instead of CMS. Ensure ample OS file‑system cache for fast segment reads.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
