
Running a 400+ Node Elasticsearch Cluster: Architecture, Scaling, and Performance Tuning

Meltwater’s media‑monitoring platform runs a custom Elasticsearch 1.7.6 cluster of over 400 nodes on AWS. It holds about 200 TB of primary data, indexes over 3 million editorial articles and nearly 100 million social posts per day at peak, and serves thousands of complex queries per minute. The team achieves this through careful shard design, master‑node configuration, extensive performance tuning, and automated provisioning.

vivo Internet Technology

Meltwater processes millions of social media posts each day, requiring a storage and retrieval solution that can handle that scale. Since version 0.11.X the company has been a loyal Elasticsearch user and, after several iterations, settled on a custom Elasticsearch 1.7.6 stack to power its media‑monitoring platform.

The platform indexes both news articles and public social‑media posts (Facebook, Instagram, blogs, Weibo). Data is collected via a hybrid API, lightly processed, and then made searchable in Elasticsearch. The article shares lessons learned, tuning tips, and pitfalls to avoid.

Data volume: During peak periods the system indexes over 3 million editorial articles and nearly 100 million social posts per day. Historical editorial data is retained from 2009, while social data is kept for about 15 months, consuming roughly 200 TB of primary shard data and 600 TB of replica data.

Query load: The service receives about 3 000 requests per minute. Queries can be complex, e.g., a Lucene‑style query such as Tesla AND "Elon Musk" NOT (SpaceX OR PayPal). The longest queries span more than 60 pages, and the cluster must return large result sets quickly.
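As a rough illustration of what such a request might look like, the sketch below wraps the example query in an Elasticsearch query_string query. The source does not say which query type or parameters Meltwater actually uses, so the build_search_body helper and the size value are assumptions for illustration only.

```python
import json


def build_search_body(lucene_query, size=10):
    """Wrap a Lucene-style query string in a query_string search body.

    Hypothetical helper: the article does not describe the real request shape.
    """
    return {
        "size": size,
        "query": {"query_string": {"query": lucene_query}},
    }


body = build_search_body('Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)')
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the JSON body of a search request against the relevant indices.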

Version choice: The team runs a custom build of Elasticsearch 1.7.6 with roaring bitsets back‑ported from Lucene 5. Upgrading is difficult because rolling upgrades across major versions are only supported from ES 5 to 6, so major upgrades require a full cluster restart, which the team accepts at a cost of 30‑60 minutes of downtime.

Node configuration: Since June 2017 the primary cluster runs on AWS i3.2xlarge instances. Three master‑eligible nodes are spread across three availability zones with discovery.zen.minimum_master_nodes=2 to avoid split‑brain scenarios. About 430 data nodes are deployed, each with 64 GB RAM (26 GB heap) and ext4 disks.
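The value of 2 follows the usual quorum rule for zen discovery: a majority of the master‑eligible nodes. A minimal sketch of that arithmetic (the function name is chosen here for illustration and is not part of Elasticsearch):

```python
def minimum_master_nodes(master_eligible):
    """Return the quorum size (strict majority) for a number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


print(minimum_master_nodes(3))  # 2, matching the setting described above
```

With three master‑eligible nodes in three zones, losing any single zone still leaves two nodes, which is enough to elect a master.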

Index design: Time‑based indices are used, similar to the ELK stack. Each day’s data is stored in separate indices, resulting in close to 40 000 shards. This high shard count puts pressure on the master node during index deletions, because the cluster state is about 100 MB (about 3 MB compressed) and must be propagated to all nodes.
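One practical consequence of daily indices is that a query for a given date range only needs to touch the indices for those days. The sketch below generates the index names for an inclusive date range. The "social" prefix and the Logstash‑style date suffix are assumptions for illustration; the source does not state Meltwater's naming scheme.

```python
from datetime import date, timedelta


def daily_indices(prefix, start, end):
    """Return one index name per day in [start, end], inclusive.

    The naming pattern (prefix-YYYY.MM.DD) is hypothetical.
    """
    if end < start:
        raise ValueError("end must not be before start")
    days = (end - start).days + 1
    return [
        "{}-{}".format(prefix, (start + timedelta(days=i)).strftime("%Y.%m.%d"))
        for i in range(days)
    ]


names = daily_indices("social", date(2017, 6, 29), date(2017, 7, 2))
print(names)
# ['social-2017.06.29', 'social-2017.06.30', 'social-2017.07.01', 'social-2017.07.02']
```

A search limited to these names avoids fanning out to shards that cannot contain matching documents.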

Performance tuning includes:

Restrict search scope to relevant time ranges and avoid range filters on large index sets.

Prefer trailing wildcards (prefix matches such as tesl*) over leading wildcards (such as *esla), which are far more expensive to evaluate.

Monitor CPU, I/O wait, and GC statistics; avoid tuning GC parameters directly.

Consider alternative JVMs (e.g., Azul Zing) for higher throughput.

Leverage Elasticsearch and Lucene caches via filtered queries.

Detect hot nodes and rebalance shards using allocation filtering and cluster rerouting.

Build realistic test environments with representative data and replay queries.

Enable profilers (Java Mission Control, VisualVM) to pinpoint bottlenecks.

Iteratively rewrite queries or modify query types to reduce CPU and memory usage.
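The hot‑node step in the list above can be sketched as follows. This is a simplified illustration, not the team's actual tooling: the median‑based threshold, the node names, the CPU figures, and the index name are all made up. The second helper builds a request body in the shape accepted by the cluster reroute API's move command.

```python
from statistics import median


def find_hot_nodes(cpu_by_node, factor=1.5):
    """Return nodes whose CPU usage exceeds factor times the cluster median."""
    if not cpu_by_node:
        return []
    threshold = factor * median(cpu_by_node.values())
    return sorted(node for node, cpu in cpu_by_node.items() if cpu > threshold)


def move_command(index, shard, from_node, to_node):
    """Build a cluster reroute body that moves one shard between two nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Hypothetical CPU percentages per data node.
cpu = {"data-1": 35, "data-2": 38, "data-3": 36, "data-4": 37, "data-5": 95}
hot = find_hot_nodes(cpu)
print(hot)  # ['data-5']

# Move one shard off the hot node to the least loaded node.
coolest = min(cpu, key=cpu.get)
reroute_body = move_command("social-2017.07.01", 0, hot[0], coolest)
print(reroute_body)
```

A median is used here rather than a mean so that one very hot node does not raise the threshold it is compared against.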

The team also uses Terraform for provisioning, Puppet for configuration management, and an AWS plugin to set cluster.routing.allocation.awareness.attributes=aws_availability_zone, ensuring replicas are spread across zones.
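A simple way to check that allocation awareness is having the intended effect is to look for shards whose copies all sit in one zone. The sketch below assumes the shard‑to‑zone mapping has already been gathered into tuples; the index and zone names are hypothetical, and the source does not describe how the team verifies this.

```python
from collections import defaultdict


def shards_in_single_zone(copies):
    """Return (index, shard) pairs with several copies that all share one zone.

    copies: iterable of (index, shard_number, zone) tuples, one per shard copy.
    """
    zones = defaultdict(set)
    counts = defaultdict(int)
    for index, shard, zone in copies:
        zones[(index, shard)].add(zone)
        counts[(index, shard)] += 1
    return sorted(
        key for key in zones if counts[key] > 1 and len(zones[key]) == 1
    )


# Hypothetical allocation: shard 0 is spread across zones, shard 1 is not.
copies = [
    ("news-2017.07.01", 0, "zone-a"),
    ("news-2017.07.01", 0, "zone-b"),
    ("news-2017.07.01", 1, "zone-a"),
    ("news-2017.07.01", 1, "zone-a"),
]
print(shards_in_single_zone(copies))  # [('news-2017.07.01', 1)]
```

Any shard reported here would lose all of its copies if that one zone became unavailable.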

Overall, the article demonstrates how careful architecture, diligent monitoring, and systematic performance tuning enable a 400‑plus node Elasticsearch cluster to serve massive, complex search workloads reliably.

