
Case Study: Optimizing Tencent Cloud Elasticsearch for High‑Volume Game Log Analytics

To handle a gaming company's million‑QPS log stream, the team built a hot‑cold Tencent Cloud Elasticsearch cluster with ILM‑driven tiering, scaled CPU/heap, reduced shard count via shrink and replica tweaks, tuned Logstash‑Kafka pipelines, and employed COS snapshots and searchable snapshots, achieving stable performance and lower cost.

Tencent Cloud Developer

Background: A mid‑size internet gaming company uses Tencent Cloud Elasticsearch (ELK stack) to store massive game logs. The write‑peak reaches 1 million QPS, and the cluster experiences frequent read/write anomalies and high costs.

Initial engagement: The solution architect (bellen) meets the client’s operations leader, who demands a storage solution for a year’s worth of logs (~3 PB, 10 TB per day) that keeps costs low while maintaining high availability.

Proposed hot‑cold architecture: Use hot nodes with SSD cloud disks and warm nodes with SATA disks, leveraging Elasticsearch’s ILM (Index Lifecycle Management) to move older indices from hot to warm tiers. Snapshots are taken to COS object storage for long‑term retention.
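An ILM policy implementing this tiering could look roughly like the following sketch. The policy name `game-logs` and the `temperature` node attribute are placeholders (the actual attribute name depends on how the hot/warm nodes are tagged); the 360 h warm threshold and shrink-to-5 target come from the tuning described later in this article.

```
PUT _ilm/policy/game-logs
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "360h",
        "actions": {
          "allocate": { "require": { "temperature": "warm" } },
          "shrink": { "number_of_shards": 5 }
        }
      }
    }
  }
}
```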

Performance bottleneck: After deploying a 10‑node cluster (8 CPU × 32 GB each), the Logstash instances consuming from Kafka cannot keep up. Monitoring shows high CPU, JVM heap usage above 90 %, and frequent GC, causing node flapping.

Root‑cause analysis reveals that the pipeline Filebeat → Kafka → Logstash → Elasticsearch, with 20 Logstash instances (batch size 5000), overwhelms the ES write path. The cluster can theoretically handle 100 k QPS, but the actual write load exceeds this due to excessive shard count and sub‑optimal node sizing.

Scaling actions: Vertically scale nodes to 32 CPU × 64 GB, then horizontally add nodes, adjusting shard count accordingly. Provide a sizing guideline: 2 CPU × 8 GB supports ~5 k QPS; 8 CPU × 32 GB supports ~20 k QPS per node.
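The linear sizing rule above can be turned into a quick back-of-envelope calculator. This is a sketch, not a Tencent Cloud tool: the ~2,500 QPS-per-vCPU constant is derived from the guideline figures (2 CPU × 8 GB → ~5 k QPS, 8 CPU × 32 GB → ~20 k QPS), and `nodes_needed` is a hypothetical helper.

```python
import math

# Rough per-vCPU write throughput implied by the article's guideline:
# 2 vCPU -> ~5,000 QPS and 8 vCPU -> ~20,000 QPS, i.e. ~2,500 QPS per vCPU.
QPS_PER_VCPU = 2500

def nodes_needed(target_qps: int, vcpus_per_node: int) -> int:
    """Estimate node count for a target write QPS, assuming linear scaling."""
    per_node = vcpus_per_node * QPS_PER_VCPU
    return math.ceil(target_qps / per_node)

# A 1M-QPS write peak on 32 vCPU x 64 GB nodes:
print(nodes_needed(1_000_000, 32))  # -> 13
```

Real throughput depends heavily on document size, mapping complexity, and shard layout, so such estimates should be validated with a load test before committing to a cluster size.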

Evaluation checklist for new ES deployments:

Storage capacity – account for replicas, data expansion, merge overhead, OS usage, and keep 50 % free space (total storage ≈ 4 × raw data).

Compute resources – estimate write throughput per node and scale linearly.

Index and shard sizing – keep shard size 30‑50 GB, limit total shards per node (20‑30 per 1 GB heap) and overall cluster shards (<30 k).
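The checklist arithmetic can be sketched in a few lines. The helper names are hypothetical; the constants (4× storage multiplier, 30–50 GB shard target, ~20 shards per GB of heap) come straight from the checklist above.

```python
import math

def storage_needed_tb(raw_tb: float) -> float:
    """Total storage ~= 4x raw data: one replica doubles it, and expansion,
    merge overhead, OS usage, and ~50% free headroom double it again."""
    return raw_tb * 4

def primary_shards(index_size_gb: float, target_shard_gb: float = 40) -> int:
    """Primary shard count that keeps each shard in the 30-50 GB range."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

def max_shards_per_node(heap_gb: float, per_gb_heap: int = 20) -> int:
    """Conservative shard ceiling per node at 20-30 shards per GB of heap."""
    return int(heap_gb * per_gb_heap)

print(storage_needed_tb(10))    # 10 TB/day raw logs -> 40 TB/day of storage
print(primary_shards(400))      # a 400 GB daily index -> 10 primaries
print(max_shards_per_node(30))  # a 30 GB heap -> ~600 shards
```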

Logstash‑Kafka tuning:

Increase Kafka topic partitions to improve parallel consumption.

Group logstash instances per high‑traffic topic to avoid resource contention.

Match the total consumer_threads across instances to the number of partitions (e.g., 24 partitions consumed by 3 Logstash processes → consumer_threads of 8 each).

Upgrade Logstash from 5.6.4 to 6.8 to fix the “message size larger than fetch size” crash, which surfaces as a Kafka consumer error:

“…whose size is larger than the fetch size 4194304 and hence cannot be returned. Increase the fetch size on the client (using max.partition.fetch.bytes), or decrease the maximum message size the broker will allow.”
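The Kafka-input side of these tuning points might look like the following Logstash configuration sketch. Broker addresses, the topic, and the group ID are placeholders; `max_partition_fetch_bytes` is the client-side knob corresponding to the fetch-size error above, shown here raised past the 4 MB default.

```
input {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topics            => ["game-logs-high-traffic"]
    group_id          => "logstash-game-logs"
    consumer_threads  => 8                    # 24 partitions / 3 Logstash processes
    max_partition_fetch_bytes => "8388608"    # 8 MB, above the 4194304 default
  }
}
```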

Cold‑storage migration issue: After switching from cloud SSD to local SSD nodes, a massive shard relocation (≈6500 relocating_shards) caused write QPS to drop from 500 k to 10 k. Investigation steps:

GET _cluster/health – cluster green but many relocating_shards and pending tasks.

GET _cat/pending_tasks?v – many urgent “shard‑started” tasks block index creation.

GET _cluster/settings – cluster.routing.allocation.node_concurrent_recoveries set to 50 (default 2), causing overload during node‑to‑node migration.

PUT _cluster/settings – reset the parameter to 2.

Remove exclude allocation settings to stop new migrations.

Increase indices.recovery.max_bytes_per_sec to accelerate shard recovery.

Pre‑create next‑hour indices to avoid index‑creation latency during peak writes.
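The remediation steps above map onto a handful of settings calls along these lines (the recovery bandwidth value and the pre-created index name are illustrative):

```
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.allocation.exclude._name": null,
    "indices.recovery.max_bytes_per_sec": "200mb"
  }
}

PUT logs-2024-01-15-13
{
  "settings": { "number_of_shards": 60, "number_of_replicas": 1 }
}
```

Setting `exclude._name` to `null` clears the exclusion that was driving the mass migration, while the lowered `node_concurrent_recoveries` keeps any remaining relocations from starving the write path.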

Result: After these adjustments, the cluster stabilizes and write throughput recovers.

Shard‑count explosion: Hourly index creation (60 primary shards, 1 replica) adds 2,880 shards per day (60 × 2 × 24), pushing the cluster past 100 k shards after a few months. Recommendations to keep shard count below 80 k:

Enable ILM warm‑phase shrink (60 → 5 shards) to reduce shard count 12×.

Increase index interval (e.g., every 2 hours) to lower daily shard creation.

Set replica count to 0 for old indices after snapshotting (reduces storage and shard count by ~50 %).

Close the oldest indices only if business permits (not acceptable here).
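The combined effect of these levers is simple arithmetic; this sketch uses the article's numbers (60 primaries, hourly indices) and a hypothetical helper:

```python
def shards_per_day(primaries: int, replicas: int, indices_per_day: int) -> int:
    """Shards created per day: primaries x copies x indices."""
    return primaries * (1 + replicas) * indices_per_day

baseline     = shards_per_day(60, 1, 24)  # hourly indices, 1 replica
two_hourly   = shards_per_day(60, 1, 12)  # wider index interval
after_shrink = shards_per_day(5, 1, 24)   # warm-phase shrink 60 -> 5
no_replica   = shards_per_day(5, 0, 24)   # replicas dropped after snapshot

print(baseline, two_hourly, after_shrink, no_replica)  # 2880 1440 240 120
```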

Cold‑snapshot workflow: Use daily snapshots to COS, then set replica count to 0. Implemented via Tencent Cloud SCF functions that create day‑wise snapshots, poll for SUCCESS, and record progress either in a file or a temporary index.
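The SCF function's create-then-poll loop could be sketched roughly as follows. The endpoint, the `cos_backup` repository name, the index pattern, and the helper names are all assumptions; the snapshot create and status APIs are the standard Elasticsearch ones.

```python
import json
import time
import urllib.request
from datetime import date

ES = "http://example-es:9200"   # hypothetical cluster endpoint
REPO = "cos_backup"             # hypothetical COS snapshot repository

def snapshot_name(day: date) -> str:
    """Day-wise snapshot name, e.g. snapshot-2024-01-15."""
    return f"snapshot-{day.isoformat()}"

def create_and_wait(day: date, poll_secs: int = 60) -> None:
    """Snapshot that day's indices to COS, then poll until state is SUCCESS."""
    name = snapshot_name(day)
    body = json.dumps({"indices": f"logs-{day.isoformat()}-*"}).encode()
    req = urllib.request.Request(
        f"{ES}/_snapshot/{REPO}/{name}", data=body, method="PUT",
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
    while True:
        with urllib.request.urlopen(f"{ES}/_snapshot/{REPO}/{name}") as resp:
            state = json.loads(resp.read())["snapshots"][0]["state"]
        if state == "SUCCESS":
            return
        time.sleep(poll_secs)
```

Because SCF invocations are time-limited, the production version records progress (in a file or a temporary index, as the article notes) so a later invocation can resume polling rather than blocking in one call.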

ILM adjustments:

Warm phase – migrate indices older than 360 h (15 days) to warm nodes, keep 1 replica.

Shrink phase – reduce shards to 5 after migration.

Cold phase – set replicas to 0 (client rejected, so keep in warm phase).

Bug in ES 6.8: Shrink combined with warm‑phase migration can leave unassigned shards due to replica placement constraints. Workaround: script to reset index.routing.allocation.require settings for affected indices.
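The workaround amounts to clearing the stale allocation requirement on each affected index, along these lines (the index name is illustrative; ILM's shrink action is what pins the index to a single node via `require._id`):

```
PUT logs-2024-01-01-00-shrink/_settings
{
  "index.routing.allocation.require._id": null
}
```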

Searchable Snapshots: To avoid keeping cold data on hot clusters, use Elasticsearch’s searchable snapshots feature. Snapshots are mounted as read‑only indices, allowing on‑demand queries with acceptable latency for log analytics.
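Mounting a snapshot as a searchable index uses the `_mount` API, roughly as follows (repository, snapshot, and index names are illustrative; note that searchable snapshots require Elasticsearch 7.10+ and an appropriate license, i.e. a newer version than the 6.8 cluster in this case study):

```
POST _snapshot/cos_backup/snapshot-2024-01-15/_mount
{
  "index": "logs-2024-01-15"
}
```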

Final recommendations:

Perform thorough capacity planning before cluster launch.

Control total shard count via ILM shrink, index interval adjustment, and replica reduction.

Monitor and tune cluster routing settings during large migrations.

Adopt searchable snapshots for long‑term cold data access.

Reference: Searchable Snapshots API – https://www.elastic.co/guide/en/elasticsearch/reference/master/searchable-snapshots-apis.html

Tags: Big Data, Elasticsearch, Kafka, Cluster Scaling, ILM, Searchable Snapshots, Logstash
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
