Cloud Native 11 min read

Prometheus Monitoring Practices for Tencent Happy Dou Dizhu Game

Tencent transformed its popular Happy Dou Dizhu game’s monitoring by migrating to Tencent Cloud Managed Prometheus and Grafana, unifying metric naming, consolidating ServiceMonitors, defining dashboards as code, and avoiding high‑cardinality labels, which cut labor costs by over 30% and greatly improved operational efficiency.

Tencent Cloud Developer
Tencent Cloud Developer
Tencent Cloud Developer
Prometheus Monitoring Practices for Tencent Happy Dou Dizhu Game

The article presents a detailed case study of how Tencent's popular online card game "Happy Dou Dizhu" is monitored and operated using cloud‑native technologies.

Background : Happy Dou Dizhu, launched in 2008, serves over 100 million users with a distributed micro‑service architecture that handles millions of daily active users. The scale and complexity of the game’s backend created significant monitoring challenges.

Practice Background : After a multi‑year cloud‑native transformation, the game’s services were fully deployed on Tencent Cloud Container Service (TKE). The legacy monitoring solution could no longer meet the requirements, so the team adopted Tencent Cloud Managed Prometheus (TMP) as the primary monitoring system.

Metric Reporting : All services report a unified metric name (e.g., hlsvr_business_trigger_count ) with labels to differentiate dimensions. This simplifies dashboard creation and maintenance.

Metric Collection : A single ServiceMonitor is configured to scrape metrics from all services, reducing the number of Prometheus scrape jobs and CPU load.

Dashboard Management : Two approaches were evaluated for Grafana dashboards:

"Grafana as code" – dashboards are defined in YAML and deployed via Helm, enabling version control and batch updates.

Direct UI editing – more intuitive but less suitable for large‑scale changes.

The team chose a hybrid method: a unified dashboard defined as code for common metrics, with occasional UI tweaks for special cases.

Unified Dashboard : By fixing metric names, a single dashboard with a service‑selection dropdown can monitor any game service, reducing the need for multiple bespoke dashboards.

Alert Panels : A unified monitoring dashboard is built using Grafana’s Explore feature to generate PromQL queries, and alert panels are created via the Panel Library. Specific alerts can still be maintained in separate dashboards when needed.

Overall Impact : The cloud‑native monitoring solution led to smoother operation of the game backend and a measurable increase in development‑operations efficiency, saving more than 30% of labor costs per month.

Core Highlights :

Dimension aggregation using label‑based data structures provides powerful, fine‑grained analysis.

PromQL offers a highly expressive query language for time‑series data.

Time‑Series Database Design Insights : The article lists key design principles such as multi‑level caching, sequential writes with compaction, SSD usage, multi‑level indexing, mmap, ordered processing, compression, backup, sharding, and write‑ahead logging.

Why Tencent Cloud Prometheus? The managed service integrates tightly with Grafana and TKE, offers high availability, and eliminates the operational overhead of self‑hosting Prometheus.

Pitfalls & Lessons Learned :

Creating a separate ServiceMonitor per service caused high CPU usage.

Using a high‑cardinality label (e.g., full URLs) led to performance degradation.

Repeated panels for thousands of label values caused UI lag.

Solutions included consolidating ServiceMonitors, avoiding high‑cardinality labels, and adjusting API calls.

Conclusion : Prometheus + Grafana has become the de‑facto observability standard in the cloud‑native era. The case study demonstrates practical steps, benefits, and common traps when adopting these tools for large‑scale game operations.

monitoringcloud nativeKubernetesPrometheusGrafanaTencent CloudGame Operations
Tencent Cloud Developer
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.