Understanding Elasticsearch Cluster Architecture and Distributed Data System Design
This article explains the architecture of Elasticsearch clusters. It covers core concepts (nodes, indices, shards, and replicas), the indexing flow, mixed and tiered deployment models, and data‑layer storage, and it compares these designs with other distributed data system architectures.
Distributed systems come in many types, and their design varies widely; this article focuses on distributed data systems such as storage, search, and analytics, using Elasticsearch as a concrete example.
Elasticsearch Overview – Elasticsearch is a popular open‑source search and analytics engine widely used for search, JSON document storage, and time‑series data analysis. Key concepts include:
Node: a running Elasticsearch instance (typically a process on a machine).
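Index: a logical collection of documents, made up of a mapping plus the inverted and forward index files.
Shard: a partition of an index, hosted on a node; each shard is either a primary or a replica.
Replica: a copy of a primary shard. A shard can have zero or more replicas, and their data is kept strongly or eventually consistent with the primary.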
Indexing Process – When a document is indexed, it is routed to its primary shard and indexed there, then replicated to that shard's replicas before the operation returns success.
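The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Elasticsearch's implementation: Elasticsearch picks the shard by hashing the routing value (the document ID by default) modulo the number of primary shards, but `zlib.crc32` below is only a stand-in for its hash function, and `ShardCopy`, `ShardGroup`, and `index_document` are hypothetical names introduced for this example.

```python
import zlib
from dataclasses import dataclass, field


def route_to_shard(routing_key: str, num_primary_shards: int) -> int:
    """Pick the primary shard for a document: hash(routing) % shard count.

    CRC32 is a stand-in hash for illustration only.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


@dataclass
class ShardCopy:
    """One physical copy of a shard (hypothetical in-memory model)."""
    docs: dict = field(default_factory=dict)


@dataclass
class ShardGroup:
    """A primary shard together with its replica copies."""
    primary: ShardCopy = field(default_factory=ShardCopy)
    replicas: list = field(default_factory=list)


def index_document(groups: list, doc_id: str, doc: dict) -> int:
    """Write to the primary first, then to every replica, then report success."""
    shard_id = route_to_shard(doc_id, len(groups))
    group = groups[shard_id]
    group.primary.docs[doc_id] = doc      # 1. index on the primary
    for replica in group.replicas:        # 2. replicate to each replica
        replica.docs[doc_id] = doc
    return shard_id                       # 3. only now does the call return
```

Because the shard is derived from the primary shard count, changing that count would send existing document IDs to different shards, which is one reason the number of primaries is normally fixed when an index is created.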
Role Deployment Models
1. Mixed deployment (default): Data and Transport roles coexist on the same node. This simplifies setup, but the two roles contend for the same resources and the cluster runs into connection‑count limits as it grows.
2. Tiered deployment: Separate Transport nodes handle request routing and result merging, while dedicated Data nodes store and process data. This improves isolation and scalability and enables hot upgrades.
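The division of labour in the tiered model can be illustrated with a small scatter/gather sketch. Everything here is an assumption made for illustration: `search_shard` plays the Data node, `coordinate_search` plays the Transport node, and the term-count "score" is a toy stand-in for real relevance scoring.

```python
import heapq


def search_shard(shard_docs: dict, term: str) -> list:
    """Data-node side: score only the documents held in this shard."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term.lower())  # toy score: term count
        if score > 0:
            hits.append((score, doc_id))
    return hits


def coordinate_search(shards: list, term: str, size: int) -> list:
    """Transport-node side: fan the query out, then merge into a global top-N."""
    all_hits = []
    for shard_docs in shards:              # scatter: one request per shard
        all_hits.extend(search_shard(shard_docs, term))
    return heapq.nlargest(size, all_hits)  # gather: merge and keep the best hits


shards = [
    {"a": "fast search engine", "b": "search search"},
    {"c": "slow disk"},
    {"d": "search"},
]
top_hits = coordinate_search(shards, "search", size=2)
```

The merge step holds hits from every shard in memory at once, which is why putting it on dedicated nodes keeps that cost away from the nodes doing disk-heavy work.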
Data‑Layer Architecture – Elasticsearch stores index data and metadata on the local file system and supports several store types for reading index files (niofs, mmapfs, etc.). Replicas provide fault tolerance, improve read throughput, and protect against data loss.
Replica Benefits and Drawbacks – Replicas increase availability, reliability, and query capacity. In exchange, every replica adds storage cost and write latency, and scaling out is slow because each new replica needs a full copy of the shard's data.
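The storage side of this trade-off is simple arithmetic: each primary is stored once plus once per replica. A minimal sketch, with hypothetical function names and example figures:

```python
def total_shard_copies(primary_shards: int, replicas_per_primary: int) -> int:
    """Number of shard copies the cluster must host."""
    return primary_shards * (1 + replicas_per_primary)


def total_storage_gb(primary_data_gb: float, replicas_per_primary: int) -> float:
    """Disk space needed once every replica holds a full copy of the data."""
    return primary_data_gb * (1 + replicas_per_primary)


# Example: 5 primaries holding 100 GB in total, with 2 replicas each.
copies = total_shard_copies(5, 2)     # 15 shard copies
storage = total_storage_gb(100, 2)    # 300 GB on disk
```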
Other Distributed Data System Architectures
1. Local‑FS based systems – Shards and replicas reside on each node’s disk; node failures require replica promotion and data copy, which can be time‑consuming.
2. Shared‑FS (distributed file system) based systems – Compute and storage are separated; shards reference files on a shared storage layer (e.g., HDFS), allowing independent scaling of compute and storage and faster recovery, though network access may be slower.
Both approaches have trade‑offs; the choice depends on workload characteristics and operational requirements.
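The recovery difference between the two models can be made concrete with a toy simulation. This is a simplified model under stated assumptions, not how any particular system implements recovery: shard sizes, node names, and both functions are hypothetical, and real recoveries also involve replica promotion, throttling, and catch-up of recent writes.

```python
def data_to_copy_local_fs(shard_sizes: dict, assignments: dict, failed_node: str) -> int:
    """Local-FS model: every shard copy lost with the node must be re-copied."""
    return sum(
        shard_sizes[shard]
        for shard, node in assignments.items()
        if node == failed_node
    )


def reassign_shared_fs(assignments: dict, failed_node: str, healthy_nodes: list) -> dict:
    """Shared-FS model: only ownership moves; files stay on shared storage."""
    new_assignments = dict(assignments)
    orphaned = sorted(s for s, n in assignments.items() if n == failed_node)
    for i, shard in enumerate(orphaned):
        # Spread orphaned shards round-robin over the surviving compute nodes.
        new_assignments[shard] = healthy_nodes[i % len(healthy_nodes)]
    return new_assignments


shard_sizes = {"s0": 40, "s1": 25, "s2": 10}          # sizes in GB (made up)
assignments = {"s0": "n1", "s1": "n1", "s2": "n2"}    # shard -> owning node

gb_to_copy = data_to_copy_local_fs(shard_sizes, assignments, "n1")
after_failover = reassign_shared_fs(assignments, "n1", ["n2", "n3"])
```

In the local-FS case recovery time grows with the amount of data on the failed node, while in the shared-FS case it is a metadata update whose cost does not depend on shard size.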
Conclusion – Different distributed data system architectures each have strengths and weaknesses; understanding Elasticsearch’s cluster and data‑layer designs helps inform decisions when building or evaluating distributed storage, search, or analytics solutions.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
