
Managing Massive Elasticsearch Clusters: Lessons from a 120‑Node Deployment

This article shares practical insights on operating large‑scale Elasticsearch clusters for log analysis, covering use cases, essential tools, hardware choices, node role separation, shard management, hot‑cold data strategies, version upgrades, and key monitoring metrics to ensure stability and performance.

Efficient Ops

1. Essential Tools

From the start, use a configuration‑management tool (e.g., Puppet, Chef, Ansible) for cluster deployment, bulk configuration changes, version upgrades, and node restarts; we use Ansible Playbooks. The Sense plugin (now the Console in Kibana's Dev Tools) offers a convenient REST console with syntax hints and auto‑completion.

2. Hardware Configuration

Our servers are equipped with 32 vCPU and 128 GB RAM. Most machines use 12 × 4 TB SATA disks in RAID‑0, while a subset uses 6 × 800 GB SSDs in RAID‑0 to separate hot and cold data.

3. Cluster Management

Separate node roles (master, client, data) to avoid performance bottlenecks and simplify failure recovery.
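As a rough illustration of the role split, the sketch below renders the `elasticsearch.yml` flags for each dedicated role. It assumes the `node.master` / `node.data` boolean settings used on the 2.x and 5.x lines; the `ROLE_FLAGS` table and `render_role_config` helper are hypothetical names, not part of the original deployment.

```python
# Role flags as written in elasticsearch.yml on the 2.x/5.x line.
# The role names follow the article; the helper itself is hypothetical.
ROLE_FLAGS = {
    "master": {"node.master": True, "node.data": False},
    "client": {"node.master": False, "node.data": False},
    "data": {"node.master": False, "node.data": True},
}


def render_role_config(role: str) -> str:
    """Return the elasticsearch.yml lines for one dedicated role."""
    flags = ROLE_FLAGS[role]
    return "\n".join(f"{key}: {str(value).lower()}" for key, value in flags.items())


if __name__ == "__main__":
    for role in ROLE_FLAGS:
        print(f"# {role} node")
        print(render_role_config(role))
```

A client node holds no data and is not master‑eligible, so it only routes requests and merges results; that is why both flags are `false` for it.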

Control shard count and thread‑pool sizes to prevent excessive concurrency and heap pressure.
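One way to keep shard placement under control is to fix the shard count per index and cap how many of its shards may land on a single node. The sketch below is only an assumption about how such settings could be derived: `shard_settings` is a hypothetical helper and the numbers in the usage are made up, while `index.number_of_shards`, `index.number_of_replicas`, and `index.routing.allocation.total_shards_per_node` are standard index settings.

```python
import math


def shard_settings(primaries: int, replicas: int, data_nodes: int) -> dict:
    """Build index settings that spread one index's shards evenly over the data nodes."""
    if data_nodes < 1:
        raise ValueError("data_nodes must be at least 1")
    total_shards = primaries * (1 + replicas)  # primaries plus all replica copies
    per_node_cap = math.ceil(total_shards / data_nodes)
    return {
        "index.number_of_shards": primaries,
        "index.number_of_replicas": replicas,
        "index.routing.allocation.total_shards_per_node": per_node_cap,
    }


if __name__ == "__main__":
    # Illustrative numbers only: 20 primaries, 1 replica, 16 data nodes.
    print(shard_settings(20, 1, 16))
```

A cap this tight leaves little slack: if a node drops out, some shards may stay unassigned until it returns, so a slightly higher cap can be the safer choice.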

Separate hot and cold data: tag dedicated cold nodes with a custom attribute and route cold indices to them through index‑level allocation settings. On each cold node, run multiple ES instances with a limited heap (e.g., 31 GB each) so that large cold indices can stay open while memory is conserved.
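The routing step can be sketched as follows. This assumes daily indices named `logstash-YYYY.MM.dd` and a custom node attribute called `box_type` (set as `node.attr.box_type: cold` on 5.x, or `node.box_type: cold` on 2.x); both the naming scheme and the attribute name are illustrative assumptions, and `cold_indices` is a hypothetical helper. The resulting settings body would be sent with `PUT /<index>/_settings`.

```python
from datetime import date, datetime, timedelta

# Index-level allocation filter; "box_type" is an assumed attribute name.
COLD_SETTINGS = {"index.routing.allocation.require.box_type": "cold"}


def cold_indices(index_names, today, hot_days=7, prefix="logstash-"):
    """Return the daily indices that are older than the hot window."""
    cutoff = today - timedelta(days=hot_days)
    selected = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a daily log index
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = ["logstash-2017.01.01", "logstash-2017.01.09", ".kibana"]
    for index in cold_indices(names, today=date(2017, 1, 10)):
        print(f"PUT /{index}/_settings", COLD_SETTINGS)
```

Once the setting is applied, Elasticsearch relocates the index's shards to nodes carrying the matching attribute on its own; no manual shard moves are needed.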

Group shards of different data volumes onto distinct node groups using index routing and node attributes, ensuring more balanced resource utilization.

Regularly force‑merge shards so each shard ends up as a single segment, reducing heap consumption and speeding up terms aggregations.
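A scheduled job for this could build the request path as in the minimal sketch below. It assumes daily indices named `logstash-YYYY.MM.dd` and targets the previous day's index, which no longer receives writes; `force_merge_path` is a hypothetical helper. The `_forcemerge` endpoint with `max_num_segments=1` exists from 2.1 onward (earlier releases used `_optimize`).

```python
from datetime import date, timedelta


def force_merge_path(today, prefix="logstash-"):
    """Return the POST path that merges yesterday's index down to one segment per shard."""
    yesterday = today - timedelta(days=1)
    return f"/{prefix}{yesterday:%Y.%m.%d}/_forcemerge?max_num_segments=1"


if __name__ == "__main__":
    print("POST", force_merge_path(date.today()))
```

Force‑merging is I/O heavy, so it is usually scheduled off‑peak and only for indices that are no longer being written to.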

4. Version Choice

We ran version 2.4 for a long time. Version 5.0 introduces stricter bootstrap checks, better indexing performance, more compact numeric data structures, instant aggregation caching, and stronger circuit‑breaker protections. After upgrading, we reported three issues (one was fixed in 5.0.2) and applied workarounds to restore stability.

5. Monitoring

Use X‑Pack or the native stats API with your preferred monitoring stack. Critical metrics include thread‑pool activity (active/queued/rejected), JVM heap usage and old‑GC frequency, segment memory size and count, and HTTP access logs for user activity. Recording these helps identify bottlenecks and plan capacity expansions.
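For the native stats route, the metrics listed above can be pulled out of a `GET /_nodes/stats` response roughly as shown below. The field paths follow the node stats response format; the `SAMPLE` payload is hand‑written for illustration (not real cluster output), and `summarize_node` / `summarize_cluster` are hypothetical helper names.

```python
def summarize_node(stats: dict) -> dict:
    """Extract the key health metrics from one node's stats entry."""
    pools = stats["thread_pool"]
    segments = stats["indices"]["segments"]
    return {
        # Only pools that have rejected work are worth alerting on.
        "rejected": {name: p["rejected"] for name, p in pools.items() if p["rejected"] > 0},
        "heap_used_percent": stats["jvm"]["mem"]["heap_used_percent"],
        "old_gc_count": stats["jvm"]["gc"]["collectors"]["old"]["collection_count"],
        "segment_count": segments["count"],
        "segment_memory_bytes": segments["memory_in_bytes"],
    }


def summarize_cluster(response: dict) -> dict:
    """Map node name to its summary for a whole _nodes/stats response."""
    return {node["name"]: summarize_node(node) for node in response["nodes"].values()}


# Hand-written sample shaped like a _nodes/stats response (illustrative values only).
SAMPLE = {
    "nodes": {
        "abc123": {
            "name": "data-01",
            "thread_pool": {
                "bulk": {"active": 8, "queue": 50, "rejected": 12},
                "search": {"active": 2, "queue": 0, "rejected": 0},
            },
            "jvm": {
                "mem": {"heap_used_percent": 71},
                "gc": {"collectors": {"old": {"collection_count": 4}}},
            },
            "indices": {"segments": {"count": 930, "memory_in_bytes": 2147483648}},
        }
    }
}

if __name__ == "__main__":
    print(summarize_cluster(SAMPLE))
```

The counters `rejected` and `collection_count` are cumulative since node start, so a collector would normally store them over time and alert on the rate of change rather than the absolute value.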

Finally, continuously experiment, consult official documentation, and search for similar problems online; developers with programming experience can also dive into the source code for deeper understanding.

