
Running a 400+ Node Elasticsearch Cluster: Architecture, Scaling, and Performance Tuning

Meltwater details how it processes millions of daily media posts using a custom‑tuned Elasticsearch 1.7.6 cluster of over 400 nodes on AWS, covering data volume, query complexity, node configuration, indexing strategy, performance optimizations, and lessons learned for large‑scale search deployments.

Architecture Digest

Meltwater processes millions of posts daily and uses Elasticsearch to store and search media data such as news articles, public Facebook and Instagram posts, blogs, and Weibo.

Meltwater has been a loyal Elasticsearch user since version 0.11.X; the article shares lessons learned, tuning tips, and pitfalls.

Data volume: peak indexing of over 3 million editorial articles and nearly 100 million social posts per day, retaining editorial data since 2009 and social data for ~15 months, consuming ~200 TB primary and ~600 TB replica storage.

Query load: about 3,000 requests per minute through a "search-service", with complex Lucene-style queries such as Tesla AND "Elon Musk" NOT (SpaceX OR PayPal), some spanning more than 60 pages.
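A boolean query of this shape maps naturally onto Elasticsearch's query_string API, which hands the string to Lucene's query parser. The sketch below shows what such a request body could look like; the field name and paging values are illustrative, not taken from Meltwater's service.

```python
# Sketch of a Lucene-style boolean query expressed as an Elasticsearch
# query_string request body. Field name and paging values are illustrative.
query_body = {
    "query": {
        "query_string": {
            # AND/NOT/OR operators, quoted phrases, and grouping are
            # handled by Lucene's query parser, as in the example above.
            "query": 'Tesla AND "Elon Musk" NOT (SpaceX OR PayPal)',
            "default_field": "content",
        }
    },
    "from": 0,
    "size": 50,  # deep paging ("more than 60 pages") multiplies this cost
}
```

Deep pagination is expensive because each page requires the cluster to score and sort all preceding hits on every shard before discarding them.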

Cluster version: a custom build of Elasticsearch 1.7.6 back‑ported with roaring bitsets from Lucene 5; upgrade to newer major versions is difficult due to full‑cluster restarts and limited rollback.

Node configuration: running on AWS i3.2xlarge instances across three availability zones, three master‑eligible nodes with discovery.zen.minimum_master_nodes=2 , 430 data nodes, 26 GB heap per node, and using Terraform and Puppet for provisioning.
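The node topology described above could be expressed in elasticsearch.yml roughly as follows. This is a sketch in 1.x-era syntax; the values come from the digest, not from Meltwater's actual configuration, and the availability-zone value is a placeholder.

```yaml
# Sketch of elasticsearch.yml for a data node (Elasticsearch 1.x syntax;
# values inferred from the article, not copied from Meltwater's config).
node.master: false          # 430 data nodes; 3 separate master-eligible nodes
node.data: true
discovery.zen.minimum_master_nodes: 2   # quorum of the 3 master-eligible nodes
cluster.routing.allocation.awareness.attributes: aws_availability_zone
node.aws_availability_zone: us-east-1a  # per-instance value, e.g. set by Puppet
# Heap is configured outside this file in 1.x, e.g. ES_HEAP_SIZE=26g
```

With a quorum of 2 out of 3 master-eligible nodes, the cluster avoids split-brain even if one availability zone is lost.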

Indexing strategy: time‑based indices similar to the ELK stack, resulting in ~40 k shards; cluster state size ~100 MB (compressed to 3 MB) but can generate >1 GB of traffic during index deletions.
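The ~40k shard figure is plausible from back-of-the-envelope arithmetic on daily indices. The breakdown below is an illustrative assumption (the article only states the total and the storage figures); the shards-per-index value is invented for the sketch.

```python
# Back-of-the-envelope shard arithmetic for time-based indices.
# Only the ~40k total and the storage figures come from the article;
# retention_days and primaries_per_index are illustrative assumptions.
replica_ratio = 600 / 200            # ~600 TB replica vs ~200 TB primary
retention_days = 15 * 30             # ~15 months of daily social indices
primaries_per_index = 22             # assumed number_of_shards per daily index

total_shards = int(retention_days * primaries_per_index * (1 + replica_ratio))
print(total_shards)  # → 39600, on the order of the ~40k in the text
```

Every one of those shards contributes routing entries to the cluster state, which is why the state reaches ~100 MB and why bulk index deletions can generate over 1 GB of state traffic.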

Performance optimizations include limiting search scope, avoiding leading wildcards, monitoring CPU/GC metrics, using external caches, shard allocation awareness (aws_availability_zone), and custom plugins that add wildcard support in phrase queries and replace match‑all with “*”.
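"Limiting search scope" with time-based indices typically means targeting only the daily indices that cover the query's date range instead of searching everything. A minimal sketch, assuming a hypothetical `social-YYYY-MM-DD` naming scheme (the article does not give the actual index names):

```python
from datetime import date, timedelta

def indices_for_range(start: date, end: date, prefix: str = "social-") -> list[str]:
    """Return the daily index names covering [start, end], so a search
    only touches the shards it needs. The naming scheme is illustrative."""
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(d)).isoformat()}" for d in range(days + 1)]

# A three-day query hits 3 indices instead of all ~450 retained days.
targets = indices_for_range(date(2017, 1, 1), date(2017, 1, 3))
```

The same idea underpins shard allocation awareness: by tagging nodes with aws_availability_zone, Elasticsearch keeps a primary and its replicas in different zones, so a zone outage never loses all copies of a shard.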

Testing methodology involves replaying production queries on a laptop or dedicated test cluster, profiling with Java Mission Control or VisualVM, and iteratively adjusting queries or Lucene code to reduce CPU and memory usage.
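A replay harness of the kind described can be reduced to a small loop that times each captured query against a target cluster. This sketch injects the transport as a callable (the function name and shape are assumptions, not Meltwater's tooling), so the same harness works against a laptop node or a dedicated test cluster:

```python
import time
from typing import Callable, Iterable

def replay_queries(queries: Iterable[str],
                   send: Callable[[str], None]) -> list[float]:
    """Replay captured production queries through `send` (e.g. an HTTP POST
    to a test cluster's _search endpoint) and record per-query latency in
    seconds, for comparison before and after a tuning change."""
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        send(q)                                   # transport is injected
        latencies.append(time.perf_counter() - t0)
    return latencies
```

While the replay runs, the JVM on the target node can be attached to with Java Mission Control or VisualVM to see where CPU time and allocations go, then the query shape or Lucene code is adjusted and the run repeated.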

The author invites readers who have migrated to Elasticsearch 2.X to share their experiences.

Tags: big data, Elasticsearch, Performance Tuning, Lucene, AWS, Cluster Scaling
Written by Architecture Digest

Architecture Digest focuses on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.