How to Scale Elasticsearch for PB‑Level Game Logs: Real‑World Strategies & Lessons
This article walks through a mid‑size gaming company's journey of deploying, tuning, and scaling an Elasticsearch cluster for massive log volumes, covering hot‑cold node architecture, ILM policies, shard management, Logstash‑Kafka optimization, emergency expansions, and the promise of searchable snapshots to achieve petabyte‑scale storage with cost efficiency.
Background
A mid‑size internet company’s game business uses Tencent Cloud Elasticsearch with an ELK stack to store massive logs (peak 1M QPS). After several rounds of optimization the cluster became stable, with fewer read/write errors and lower cost.
Scenario 1: First Contact
Solution architect Bellen meets the client and discusses a hot‑cold node architecture, ILM, snapshots to COS, and an API for querying cold data without a full restore, then provides a per‑node sizing suggestion (up to 6 TB disk, 8 CPU / 32 GB RAM, ~20 k write QPS per node).
Scenario 2: Cluster Under Pressure
After a few days the ES cluster cannot keep up with Logstash‑Kafka ingestion. CPU, load, and JVM heap usage are high, causing frequent GC and node flapping. The team expands nodes vertically (32 CPU / 64 GB RAM) and horizontally, adjusts shard count, and monitors write throughput.
Storage capacity: account for replica count, data growth, segment merges, and OS overhead; reserve ~50 % free space, so total capacity ≈ 4× the raw data volume.
Compute resources: a 2 CPU / 8 GB node supports ~5 k write QPS; throughput scales roughly linearly with node count.
Shard and index sizing: keep each shard at 30‑50 GB, limit shards per node to 20‑30 per GB of heap, and keep total cluster shards below 30 k.
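The three rules above can be turned into a small capacity planner. This is a sketch of that arithmetic only; the per-node defaults (6 TB disk, 8 CPU, 16 GB heap) are illustrative assumptions, not the article's exact specs:

```python
import math

def plan_cluster(raw_tb_per_day, retention_days, write_qps,
                 node_disk_tb=6, node_cpus=8, node_heap_gb=16):
    """Rough cluster sizing following the three evaluation rules above."""
    # Storage: replicas + growth + segment merges + ~50% free space => ~4x raw data.
    total_storage_tb = raw_tb_per_day * retention_days * 4
    nodes_for_storage = math.ceil(total_storage_tb / node_disk_tb)
    # Compute: ~5k write QPS per 2 CPU / 8 GB, scaling linearly.
    qps_per_node = 5_000 * (node_cpus / 2)
    nodes_for_writes = math.ceil(write_qps / qps_per_node)
    nodes = max(nodes_for_storage, nodes_for_writes)
    # Shards: 20-30 per GB of heap per node, and below 30k cluster-wide.
    max_cluster_shards = min(nodes * 20 * node_heap_gb, 30_000)
    return {"total_storage_tb": total_storage_tb,
            "nodes_for_storage": nodes_for_storage,
            "nodes_for_writes": nodes_for_writes,
            "nodes": nodes,
            "max_cluster_shards": max_cluster_shards}
```

For example, 5 TB/day of raw logs kept 30 days at 1M write QPS needs 600 TB of capacity and storage, not write throughput, becomes the binding constraint.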
Scenario 3: Logstash‑Kafka Tuning
Issues: adding more Logstash instances does not increase consumption linearly, and topics/partitions are unevenly distributed across consumers.
Increase Kafka topic partitions.
Group Logstash consumers per heavy topic.
Match total consumer_threads to the total partition count (e.g., 3 Logstash instances on a 24‑partition topic → consumer_threads = 8 each).
Upgrade Logstash from 5.6.4 to 6.8 to avoid a bug where oversized messages cause crashes.
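The consumer_threads rule above is simple division, but it is easy to get wrong when partition counts change. A minimal helper for computing the per-instance setting of the Logstash kafka input:

```python
def consumer_threads_per_instance(total_partitions, logstash_instances):
    """Split a topic's partitions evenly across Logstash instances so the
    total consumer_threads equals the partition count: extra threads would
    sit idle, and fewer would leave partitions sharing a thread."""
    if total_partitions % logstash_instances:
        raise ValueError("partition count should divide evenly across instances")
    return total_partitions // logstash_instances
```

The raised error is deliberate: an uneven split means some instances carry more partitions than others, which recreates the skew the tuning is meant to remove.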
Scenario 4: Disk Full – Emergency Expansion
During a traffic surge daily ingest grew to 20 TB, pushing disk usage to 80 %. The team added warm nodes with SATA disks and used ILM to move old indices, increasing total capacity to 780 TB.
When switching from cloud SSD to local SSD/SATA, they added new nodes with local disks, migrated data, and removed old nodes to avoid service interruption.
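A hot‑to‑warm migration like this is driven by an ILM policy. Below is a sketch of such a policy body; the phase timings, rollover size, and the `box_type` node attribute are illustrative assumptions and should match the cluster's actual hot/warm node tagging:

```python
import json

def hot_warm_delete_policy(warm_after="3d", delete_after="30d",
                           rollover_size="50gb", replicas_in_warm=1):
    """Build an ILM policy: roll over on the hot nodes, relocate to
    warm (SATA) nodes after warm_after, delete after delete_after."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": rollover_size, "max_age": "1d"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        # Reallocate shards to nodes tagged box_type=warm.
                        "allocate": {"require": {"box_type": "warm"},
                                     "number_of_replicas": replicas_in_warm}
                    }
                },
                "delete": {"min_age": delete_after,
                           "actions": {"delete": {}}}
            }
        }
    }

# The JSON body would be PUT to /_ilm/policy/<policy_name>.
print(json.dumps(hot_warm_delete_policy(), indent=2))
```

Keeping the delete phase in the same policy ensures old indices leave the warm tier automatically instead of filling the SATA disks again.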
Scenario 5: 100 k Shards Issue
After the migration, the cluster went yellow with many unassigned shards, and the master nodes failed under heap pressure from excessive shard metadata. Fixes: increase the master heap, reduce replica counts, shrink old indices, and tune shard‑allocation settings.
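The shard‑count fixes above map to a handful of REST calls. This sketch only builds the request payloads (the index names are illustrative assumptions); issuing them against a cluster is left to whatever HTTP client the team uses:

```python
def shard_relief_requests(old_pattern="gamelog-2020.12.*",
                          shrink_index="gamelog-2020.12.01"):
    """Return (method, path, body) tuples for reducing shard pressure."""
    return [
        # 1. Drop replicas on old indices to halve their shard count.
        ("PUT", f"/{old_pattern}/_settings",
         {"index": {"number_of_replicas": 0}}),
        # 2. Shrink a large old index down to one primary shard. The index
        #    must first be made read-only and relocated onto a single node.
        ("POST", f"/{shrink_index}/_shrink/{shrink_index}-shrunk",
         {"settings": {"index.number_of_shards": 1}}),
        # 3. Throttle concurrent shard recoveries per node so reallocation
        #    does not overwhelm the masters.
        ("PUT", "/_cluster/settings",
         {"transient": {
             "cluster.routing.allocation.node_concurrent_recoveries": 2}}),
    ]
```

The ordering matters in practice: dropping replicas first removes the most shard metadata with the least data movement.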
Scenario 6: ILM Pitfalls
An ILM policy's shrink action applies only to newly created indices; existing indices need their settings changed manually. Combining shrink with warm‑phase migration can trigger bugs that leave shards unassigned.
Scenario 7: Custom SLM
To keep total shard count below 100 k, the team proposes:
Cold‑backup old indices to COS, then set replica = 0.
Daily snapshot scripts using Tencent Cloud SCF.
Adjust ILM to set replica = 0 in warm phase, avoiding shrink failures.
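The daily SCF snapshot job boils down to one snapshot‑create call per day. A sketch of building that request; the repository name and the `gamelog-YYYY.MM.DD*` index naming pattern are assumptions for illustration:

```python
from datetime import date

def daily_snapshot_request(day, repo="cos_backup"):
    """Build the snapshot-create call a daily scheduled job might issue,
    snapshotting only that day's indices into the COS-backed repository."""
    stamp = day.strftime("%Y.%m.%d")
    path = f"/_snapshot/{repo}/snapshot-{stamp}?wait_for_completion=false"
    body = {"indices": f"gamelog-{stamp}*",     # that day's indices only
            "include_global_state": False}      # data, not cluster state
    return "PUT", path, body
```

With `wait_for_completion=false` the call returns immediately, which suits a short‑lived serverless function; completion can be polled separately before replicas are dropped.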
Scenario 8: Searchable Snapshots
Searchable Snapshots allow cold data stored in COS/S3 to be queried on‑demand, reducing cluster size and cost while meeting latency requirements for log analysis.
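Querying cold data this way works by mounting a snapshotted index back into the cluster via the `_mount` API (available from Elasticsearch 7.10). A payload sketch, with the `shared_cache` storage option corresponding to the cheaper frozen‑tier style mount:

```python
def mount_snapshot_request(repo, snapshot, index, storage="shared_cache"):
    """Build the searchable-snapshot mount call for one snapshotted index.
    The mounted index is renamed so it cannot collide with a live index."""
    path = f"/_snapshot/{repo}/{snapshot}/_mount?storage={storage}"
    body = {"index": index,
            "renamed_index": f"restored-{index}"}
    return "POST", path, body
```

Because the data stays in COS/S3 and only a local cache is kept, the cluster no longer needs warm nodes sized for the full cold dataset.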
Conclusion
Perform thorough capacity and node‑spec evaluation before launching a new cluster.
Control overall shard count through ILM, shrink, and replica management to ensure stability.
Monitor and adopt emerging features such as Searchable Snapshots for long‑term cost efficiency.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.