How to Scale Elasticsearch for PB‑Level Game Logs: Real‑World Strategies & Lessons
This article walks through a mid‑size gaming company's journey of deploying, tuning, and scaling an Elasticsearch cluster for massive log volumes, covering hot‑cold node architecture, ILM policies, shard management, Logstash‑Kafka optimization, emergency expansions, and the promise of searchable snapshots to achieve petabyte‑scale storage with cost efficiency.
Background
A mid‑size internet company’s game business uses Tencent Cloud Elasticsearch with an ELK stack to store massive logs (peak 1M QPS). After several rounds of optimization the cluster became stable, with fewer read/write errors and lower cost.
Scenario 1: First Contact
Solution architect Bellen meets the client and discusses a hot‑cold node architecture, ILM, snapshots to COS, and an API for querying cold data without a full restore, then provides a per‑node sizing suggestion (up to 6 TB disk, 8 CPU / 32 GB RAM, ~20 k write QPS per node).
Scenario 2: Cluster Under Pressure
After a few days the ES cluster cannot keep up with Logstash‑Kafka ingestion. CPU, load, and JVM heap usage are high, causing frequent GC and node flapping. The team expands nodes vertically (32 CPU / 64 GB RAM) and horizontally, adjusts shard count, and monitors write throughput.
Storage capacity: account for replica count, data growth, segment merges, and OS overhead; reserve ~50 % free space, so total capacity ≈ 4× the raw data volume.
Compute resources: a 2 CPU / 8 GB node supports ~5 k write QPS; throughput scales roughly linearly with node count.
Shard and index sizing: keep each shard at 30‑50 GB, limit shards per node to 20‑30 per GB of heap, and keep total cluster shards below 30 k.
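The three rules above can be turned into a small capacity planner. This is a sketch of that arithmetic only; the per-node defaults (6 TB disk, 8 CPU, 16 GB heap) are illustrative assumptions, not the article's exact specs:

```python
import math

def plan_cluster(raw_tb_per_day, retention_days, write_qps,
                 node_disk_tb=6, node_cpus=8, node_heap_gb=16):
    """Rough cluster sizing following the three evaluation rules above."""
    # Storage: replicas + growth + segment merges + ~50% free space => ~4x raw data.
    total_storage_tb = raw_tb_per_day * retention_days * 4
    nodes_for_storage = math.ceil(total_storage_tb / node_disk_tb)
    # Compute: ~5k write QPS per 2 CPU / 8 GB, scaling linearly.
    qps_per_node = 5_000 * (node_cpus / 2)
    nodes_for_writes = math.ceil(write_qps / qps_per_node)
    nodes = max(nodes_for_storage, nodes_for_writes)
    # Shards: 20-30 per GB of heap per node, and below 30k cluster-wide.
    max_cluster_shards = min(nodes * 20 * node_heap_gb, 30_000)
    return {"total_storage_tb": total_storage_tb,
            "nodes_for_storage": nodes_for_storage,
            "nodes_for_writes": nodes_for_writes,
            "nodes": nodes,
            "max_cluster_shards": max_cluster_shards}
```

For example, 5 TB/day of raw logs kept 30 days at 1M write QPS needs 600 TB of capacity and storage, not write throughput, becomes the binding constraint.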
Scenario 3: Logstash‑Kafka Tuning
Issues: adding more Logstash instances does not increase consumption linearly, and topics/partitions are unevenly distributed across consumers.
Increase Kafka topic partitions.
Group Logstash consumers per heavy topic.
Match total consumer_threads to the total partition count (e.g., 3 Logstash instances on a 24‑partition topic → consumer_threads = 8 each).
Upgrade Logstash from 5.6.4 to 6.8 to avoid a bug where oversized messages cause crashes.
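The consumer_threads rule above is simple division, but it is easy to get wrong when partition counts change. A minimal helper for computing the per-instance setting of the Logstash kafka input:

```python
def consumer_threads_per_instance(total_partitions, logstash_instances):
    """Split a topic's partitions evenly across Logstash instances so the
    total consumer_threads equals the partition count: extra threads would
    sit idle, and fewer would leave partitions sharing a thread."""
    if total_partitions % logstash_instances:
        raise ValueError("partition count should divide evenly across instances")
    return total_partitions // logstash_instances
```

The raised error is deliberate: an uneven split means some instances carry more partitions than others, which recreates the skew the tuning is meant to remove.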
Scenario 4: Disk Full – Emergency Expansion
During a traffic surge daily ingest grew to 20 TB, pushing disk usage to 80 %. The team added warm nodes with SATA disks and used ILM to move old indices, increasing total capacity to 780 TB.
When switching from cloud SSD to local SSD/SATA, they added new nodes with local disks, migrated data, and removed old nodes to avoid service interruption.
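A hot‑to‑warm migration like this is driven by an ILM policy. Below is a sketch of such a policy body; the phase timings, rollover size, and the `box_type` node attribute are illustrative assumptions and should match the cluster's actual hot/warm node tagging:

```python
import json

def hot_warm_delete_policy(warm_after="3d", delete_after="30d",
                           rollover_size="50gb", replicas_in_warm=1):
    """Build an ILM policy: roll over on the hot nodes, relocate to
    warm (SATA) nodes after warm_after, delete after delete_after."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": rollover_size, "max_age": "1d"}
                    }
                },
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        # Reallocate shards to nodes tagged box_type=warm.
                        "allocate": {"require": {"box_type": "warm"},
                                     "number_of_replicas": replicas_in_warm}
                    }
                },
                "delete": {"min_age": delete_after,
                           "actions": {"delete": {}}}
            }
        }
    }

# The JSON body would be PUT to /_ilm/policy/<policy_name>.
print(json.dumps(hot_warm_delete_policy(), indent=2))
```

Keeping the delete phase in the same policy ensures old indices leave the warm tier automatically instead of filling the SATA disks again.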
Scenario 5: 100 k Shards Issue
After the migration, the cluster went yellow with many unassigned shards, and the master nodes failed under heap pressure from excessive shard metadata. Fixes: increase the master heap, reduce replica counts, shrink old indices, and tune shard‑allocation settings.
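The shard‑count fixes above map to a handful of REST calls. This sketch only builds the request payloads (the index names are illustrative assumptions); issuing them against a cluster is left to whatever HTTP client the team uses:

```python
def shard_relief_requests(old_pattern="gamelog-2020.12.*",
                          shrink_index="gamelog-2020.12.01"):
    """Return (method, path, body) tuples for reducing shard pressure."""
    return [
        # 1. Drop replicas on old indices to halve their shard count.
        ("PUT", f"/{old_pattern}/_settings",
         {"index": {"number_of_replicas": 0}}),
        # 2. Shrink a large old index down to one primary shard. The index
        #    must first be made read-only and relocated onto a single node.
        ("POST", f"/{shrink_index}/_shrink/{shrink_index}-shrunk",
         {"settings": {"index.number_of_shards": 1}}),
        # 3. Throttle concurrent shard recoveries per node so reallocation
        #    does not overwhelm the masters.
        ("PUT", "/_cluster/settings",
         {"transient": {
             "cluster.routing.allocation.node_concurrent_recoveries": 2}}),
    ]
```

The ordering matters in practice: dropping replicas first removes the most shard metadata with the least data movement.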
Scenario 6: ILM Pitfalls
An ILM policy's shrink action applies only to newly created indices; existing indices need their settings changed manually. Combining shrink with warm‑phase migration can trigger bugs that leave shards unassigned.
Scenario 7: Custom SLM
To keep total shard count below 100 k, the team proposes:
Cold‑backup old indices to COS, then set replica = 0.
Daily snapshot scripts using Tencent Cloud SCF.
Adjust ILM to set replica = 0 in warm phase, avoiding shrink failures.
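The daily SCF snapshot job boils down to one snapshot‑create call per day. A sketch of building that request; the repository name and the `gamelog-YYYY.MM.DD*` index naming pattern are assumptions for illustration:

```python
from datetime import date

def daily_snapshot_request(day, repo="cos_backup"):
    """Build the snapshot-create call a daily scheduled job might issue,
    snapshotting only that day's indices into the COS-backed repository."""
    stamp = day.strftime("%Y.%m.%d")
    path = f"/_snapshot/{repo}/snapshot-{stamp}?wait_for_completion=false"
    body = {"indices": f"gamelog-{stamp}*",     # that day's indices only
            "include_global_state": False}      # data, not cluster state
    return "PUT", path, body
```

With `wait_for_completion=false` the call returns immediately, which suits a short‑lived serverless function; completion can be polled separately before replicas are dropped.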
Scenario 8: Searchable Snapshots
Searchable Snapshots allow cold data stored in COS/S3 to be queried on‑demand, reducing cluster size and cost while meeting latency requirements for log analysis.
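Querying cold data this way works by mounting a snapshotted index back into the cluster via the `_mount` API (available from Elasticsearch 7.10). A payload sketch, with the `shared_cache` storage option corresponding to the cheaper frozen‑tier style mount:

```python
def mount_snapshot_request(repo, snapshot, index, storage="shared_cache"):
    """Build the searchable-snapshot mount call for one snapshotted index.
    The mounted index is renamed so it cannot collide with a live index."""
    path = f"/_snapshot/{repo}/{snapshot}/_mount?storage={storage}"
    body = {"index": index,
            "renamed_index": f"restored-{index}"}
    return "POST", path, body
```

Because the data stays in COS/S3 and only a local cache is kept, the cluster no longer needs warm nodes sized for the full cold dataset.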
Conclusion
Perform thorough capacity and node‑spec evaluation before launching a new cluster.
Control overall shard count through ILM, shrink, and replica management to ensure stability.
Monitor and adopt emerging features such as Searchable Snapshots for long‑term cost efficiency.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.