Mastering Elasticsearch Cluster Planning, Configuration, and Monitoring
This article, based on Xu Peng’s Gdevops 2017 talk, details the rationale for choosing Elasticsearch, outlines the overall architecture, provides step‑by‑step OS, JVM, and index parameter settings, and explains comprehensive monitoring strategies to ensure high‑availability and performance of large‑scale ES clusters.
1. Overall Architecture
Elasticsearch is chosen as the search engine because it enables fast, scalable querying of massive datasets. The architecture separates data ingestion, processing, and storage using Kafka for decoupling, ETL pipelines, and tiered storage: cold data in HDFS, warm data in databases or caches, and hot data directly indexed in Elasticsearch.
2. Cluster Planning
The cluster consists of three layers: a query entry node (no data), data nodes that store and search the indices, and master nodes that manage metadata such as node information and index settings. The diagram (ES 5.x) also includes an optional ingest node for preprocessing documents before indexing.
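As a minimal sketch of how these layers translate into configuration, the snippet below renders the ES 5.x role flags (node.master, node.data, node.ingest) for each node type. The role labels used as dictionary keys are names chosen for this example, not terms from the talk, and dedicating each node to a single role is an assumption of the sketch.

```python
# Sketch: elasticsearch.yml role flags per node type on ES 5.x.
# The keys ("coordinating", "data", ...) are labels invented for this example.
ROLE_SETTINGS = {
    # Query entry node: holds no data, only routes requests and merges results.
    "coordinating": {"node.master": False, "node.data": False, "node.ingest": False},
    # Data node: stores shards and executes searches.
    "data": {"node.master": False, "node.data": True, "node.ingest": False},
    # Master node: manages cluster metadata only.
    "master": {"node.master": True, "node.data": False, "node.ingest": False},
    # Optional ingest node: preprocesses documents before indexing.
    "ingest": {"node.master": False, "node.data": False, "node.ingest": True},
}


def render_yaml(role: str) -> str:
    """Return the elasticsearch.yml lines for one node role."""
    settings = ROLE_SETTINGS[role]
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in settings.items())


if __name__ == "__main__":
    for role in ROLE_SETTINGS:
        print(f"# {role} node")
        print(render_yaml(role))
        print()
```

Keeping the query entry and master nodes free of data means a heavy query or a long GC pause on a data node is less likely to disturb request routing or cluster state management.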
3. Cluster Configuration
3.1 OS Parameter Settings
Key Linux settings include raising the maximum number of open files to 65535 and tuning virtual memory parameters. Because Elasticsearch uses memory‑mapped files, vm.max_map_count must be raised. The related vm.dirty_background_ratio and vm.dirty_ratio settings control when dirty pages are flushed to disk; tuning them avoids long write stalls, which affect the node much like Java GC pauses do.
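Swapping should be avoided: either disable swap outright or set vm.swappiness to a minimal value. A value of 1 is generally safer than 0, because on some kernel versions a swappiness of 0 can trigger the OOM killer, whereas 1 still lets the kernel swap in an emergency.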
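The checks described above can be sketched as a small validation helper. The 65535 threshold comes from the text; 262144 is the minimum vm.max_map_count that Elasticsearch's bootstrap checks expect; the swappiness limit of 1 and the function names are assumptions of this example.

```python
# Sketch: validate OS settings before starting an Elasticsearch node.
MIN_OPEN_FILES = 65535      # from the text
MIN_MAX_MAP_COUNT = 262144  # minimum expected by Elasticsearch's bootstrap check
MAX_SWAPPINESS = 1          # assumption: treat anything above 1 as too eager to swap


def check_os_settings(open_files: int, max_map_count: int, swappiness: int) -> list:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if open_files < MIN_OPEN_FILES:
        problems.append(f"open files limit {open_files} < {MIN_OPEN_FILES}")
    if max_map_count < MIN_MAX_MAP_COUNT:
        problems.append(f"vm.max_map_count {max_map_count} < {MIN_MAX_MAP_COUNT}")
    if swappiness > MAX_SWAPPINESS:
        problems.append(f"vm.swappiness {swappiness} > {MAX_SWAPPINESS}")
    return problems


def read_live_values() -> dict:
    """Read the current values on a Linux host.

    Note: the open-files limit is the one of the *current* process, which may
    differ from the limit of the user that runs Elasticsearch.
    """
    import resource  # Unix-only module, imported lazily

    def read_int(path: str) -> int:
        with open(path) as handle:
            return int(handle.read().strip())

    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "open_files": soft_limit,
        "max_map_count": read_int("/proc/sys/vm/max_map_count"),
        "swappiness": read_int("/proc/sys/vm/swappiness"),
    }
```

On a Linux host, check_os_settings(**read_live_values()) returns the list of settings that still need attention.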
3.2 Elasticsearch JVM Settings
Regardless of physical RAM, keep the Elasticsearch JVM heap below 32 GB so that the JVM can keep using compressed object pointers; above that threshold object pointers grow to a full 64 bits and part of the extra heap is wasted. Enable -XX:+ExitOnOutOfMemoryError (requires JDK 1.8.0_92 or newer) so the process terminates cleanly on OOM, allowing external monitors to restart it.
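A minimal sketch of these rules as a jvm.options generator follows. The exit flag is written in its standard boolean form, -XX:+ExitOnOutOfMemoryError, and the OnOutOfMemoryError variant is used for JDK 8 builds older than update 92. Giving the heap half of physical RAM, capping it at 31 GB, and setting -Xms equal to -Xmx are assumptions of this example (common practice, not values stated in the talk), as are the function names.

```python
# Sketch: derive heap and OOM-handling flags for an Elasticsearch node.
HEAP_CAP_GB = 31  # assumption: stay safely below the 32 GB compressed-pointer threshold


def heap_size_gb(physical_ram_gb: int) -> int:
    """Half of physical RAM (leaving the rest to the page cache), capped below 32 GB."""
    return max(1, min(physical_ram_gb // 2, HEAP_CAP_GB))


def oom_option(java8_update: int) -> str:
    """Pick the OOM flag by JDK 8 update number."""
    if java8_update >= 92:
        # Available since JDK 1.8.0_92: the JVM exits on the first OutOfMemoryError.
        return "-XX:+ExitOnOutOfMemoryError"
    # Older JDKs: run a command instead; %p expands to the JVM's process id.
    return '-XX:OnOutOfMemoryError="kill -9 %p"'


def jvm_options(physical_ram_gb: int, java8_update: int) -> list:
    """Return the jvm.options lines for the given machine."""
    heap = heap_size_gb(physical_ram_gb)
    # Equal -Xms and -Xmx avoid heap resizing at runtime.
    return [f"-Xms{heap}g", f"-Xmx{heap}g", oom_option(java8_update)]


if __name__ == "__main__":
    for line in jvm_options(physical_ram_gb=128, java8_update=131):
        print(line)
```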
When upgrading the JDK is not possible, add the following option to the JVM launch parameters to kill the process on OOM:
-XX:OnOutOfMemoryError="kill -9 %p"
3.3 Index Parameter Settings
Important index‑level settings include:
refresh_interval : controls how quickly newly indexed documents become searchable; shorter intervals increase I/O, so for heavy bulk loads a larger value (e.g., 90‑100 s) is recommended.
number_of_shards : set based on expected data volume; cannot be changed after index creation, so plan ahead.
number_of_replicas : can be set to 0 during bulk ingestion and increased later.
merge scheduler : set index.merge.scheduler.max_thread_count according to storage type (1 for spinning disks; the default, which is derived from the CPU count, suits SSDs).
cluster.routing.allocation.balance.shard : a cluster‑level weight (default 0.45); raising it strengthens the tendency to equalize the number of shards across nodes.
Segment merging is essential: too many small segments waste file handles and degrade query performance. For large batches, the translog flush threshold (index.translog.flush_threshold_size) can also be raised.
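The settings above can be sketched as request bodies for index creation and for the settings update applied once a bulk load finishes. The concrete values (90s refresh, one replica and a 1s refresh afterwards) and the function names are illustrative assumptions; only the setting keys are real Elasticsearch settings.

```python
import json


def create_index_body(shards: int, spinning_disks: bool) -> dict:
    """Body for index creation, tuned for an initial bulk load."""
    settings = {
        # Fixed at creation time, so size it for the expected data volume.
        "index.number_of_shards": shards,
        # No replicas while loading; they are added back afterwards.
        "index.number_of_replicas": 0,
        # Refresh rarely during the load to cut I/O.
        "index.refresh_interval": "90s",
    }
    if spinning_disks:
        # One merge thread avoids competing seeks on spinning disks.
        settings["index.merge.scheduler.max_thread_count"] = 1
    return {"settings": settings}


def after_bulk_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Body for the index settings update once the bulk load is done."""
    return {
        "index.number_of_replicas": replicas,
        "index.refresh_interval": refresh_interval,
    }


if __name__ == "__main__":
    print(json.dumps(create_index_body(shards=6, spinning_disks=True), indent=2))
    print(json.dumps(after_bulk_settings(), indent=2))
```

The first body would be sent when creating the index and the second to the index's _settings endpoint after ingestion; both are plain dictionaries, so they can be passed to any HTTP client.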
Dynamic templates can map short strings (< 10 KB) as keyword. Very large fields that are only stored (no search) should be defined as type: object, enabled: false to avoid parsing.
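A hedged sketch of such a mapping is shown below. It approximates the "short strings" rule with ignore_above, which is an assumption of this example: ignore_above counts characters rather than bytes, and longer values are kept in _source but not indexed. The field name raw_payload and the template name are hypothetical.

```python
import json

# Roughly 10 KB for single-byte text; ignore_above is measured in characters,
# so text made mostly of 4-byte UTF-8 characters would need a lower value.
SHORT_STRING_LIMIT = 10240


def build_mapping(large_field: str = "raw_payload") -> dict:
    """Mapping definition: strings become keyword, one large field is stored unparsed."""
    return {
        "dynamic_templates": [
            {
                "strings_as_keyword": {  # hypothetical template name
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "keyword",
                        # Values longer than this are not indexed.
                        "ignore_above": SHORT_STRING_LIMIT,
                    },
                }
            }
        ],
        "properties": {
            # enabled: false keeps the JSON in _source but skips parsing and indexing.
            large_field: {"type": "object", "enabled": False}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(), indent=2))
```

On ES 5.x this definition would sit under a mapping type name inside the index's mappings section.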
4. Cluster Monitoring
4.1 Monitoring Content
Effective monitoring covers both OS metrics (CPU, memory) and Elasticsearch‑specific metrics (shard distribution, field data memory, index size, query load). Uneven shard allocation or excessive field count can cause performance bottlenecks.
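As an illustration of one such check, the sketch below flags nodes that hold noticeably more shards than the cluster average. The per‑node counts are assumed to have been collected already (for example from the _cat/allocation API); the 20% tolerance and the function name are choices made for this example, not part of the team's dashboard.

```python
from statistics import mean


def overloaded_nodes(shards_per_node: dict, tolerance: float = 0.2) -> list:
    """Return nodes whose shard count exceeds the average by more than `tolerance`."""
    counts = list(shards_per_node.values())
    if not counts:
        return []
    threshold = mean(counts) * (1 + tolerance)
    return sorted(node for node, count in shards_per_node.items() if count > threshold)


if __name__ == "__main__":
    # Hypothetical shard counts per node.
    sample = {"node-1": 10, "node-2": 10, "node-3": 22}
    print(overloaded_nodes(sample))
```

The same pattern applies to other per‑node metrics mentioned above, such as field data memory or query load.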
4.2 Monitoring Tools
The team built a custom dashboard called eyeones because the built‑in X‑Pack (ES 5.x) and Marvel (ES 1.x) lacked many useful metrics. The dashboard displays per‑node load, memory usage, number of indices, query rate, and shard recovery status.
Detailed index‑level metrics are also available; clicking an index reveals its specific statistics. The monitoring source code is hosted on GitHub for community contributions.
dbaplus Community
