How Cloud‑Native Transforms Game Operations: Insights from Tencent’s DataMore Platform
This article details how Tencent's IEG Value‑Added Services team migrated a massive game data‑marketing platform to a cloud‑native architecture. It covers the business scenario, the cloud‑native developer platform, the challenges of transforming operations, and technical practices including asset management, orchestration, dynamic scheduling, monitoring, tracing, chaos engineering, and CI/CD, then closes with the resulting gains in cost, stability, efficiency, and business enablement.
Preface
With the company’s strategy of migrating self‑developed services to the cloud in full swing, the IEG Value‑Added Services department has switched one‑third of its traffic to the cloud, where it handles over a hundred billion daily page views. The service relies heavily on cloud‑native applications and architectures. This article shares the team’s transformation ideas, methods, and practices for other teams facing similar challenges.
1. Business Service Scenario
DataMore is a one‑stop intelligent game operation platform. It runs real‑time and offline computation over game log data to deliver data‑driven marketing solutions that are low‑cost, quick to launch, fine‑grained, independent of game version releases, and backed by strong computational capabilities.
It offers various operation schemes across game lifecycles—acquisition, activation, churn, retention, payment, and propagation—such as friend invitations, task centers, group buying, low‑activity interventions, and churn recall, supporting many game categories and handling over 200 billion daily PVs and peak QPS exceeding 200 k.
2. Cloud‑Native Developer Platform
The term “cloud‑native” was first introduced by Matt Stine in 2013. The CNCF, founded in 2015, later defined it as a set of technologies (containers, service mesh, microservices, immutable infrastructure, declarative APIs) that enable elastic, resilient, and observable applications across public, private, and hybrid clouds.
The core of cloud‑native is the “four‑piece set”: DevOps, continuous delivery, microservices, and containers. Cloud platforms also provide rich PaaS components (databases, caches, middleware, storage, CDN) and support seamless multi‑cloud migration.
To meet cloud‑native data‑marketing needs, the department launched the ODP (Odd Point) platform, a one‑stop development‑operations platform built on microservices. It integrates components such as TKE, Blue Shield, QCI, and Envoy to close the full DevOps service loop of continuous integration (CI), continuous delivery (CD), and continuous operations (CO).
3. Cloud‑Native Operations Transformation, Challenges, Goals, and Practices
3.1 Transformation Mindset
Operations are evolving from traditional hardware‑centric tasks to service‑oriented roles in a cloud‑native era, akin to assembling a car from modular components and providing supporting services like roads and fuel stations.
Key viewpoints:
Operations will not disappear, but the traditional value chain will change.
Transitioning to SRE is a viable path.
3.2 Cultural Practices
Form virtual feature teams (FT) that merge development and operations.
Joint meetings, tech sharing, and incident reviews with full FT participation.
Early involvement of operations in project architecture discussions (“shift‑left”).
Collect feedback from all parties to drive continuous improvement.
3.3 Technical Practices
The transition moves from traditional operations to higher‑order cloud‑native capabilities, focusing on full‑chain quality monitoring, intelligent resource scheduling, fault pre‑warning and root‑cause analysis, high‑availability across stages, rapid resource delivery, multi‑cloud orchestration, and seamless business migration.
3.4 Cloud‑Native Asset Management
Traditional CMDBs manage only server information, but cloud‑native environments require flexible asset models for resources such as CLB, CDB, COS, CKafka, etc. By leveraging the HeTu metadata system, assets are modeled, related, and managed dynamically.
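As a rough illustration of what a flexible asset model might look like, the sketch below treats the asset type as data and stores relations as separate triples, so a new resource kind needs no schema change. The class names, relation names, and sample assets are hypothetical; this is not the HeTu API, which the article does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """A generic asset: the type is a plain field, not a hard-coded table."""
    asset_id: str
    asset_type: str                      # e.g. "CLB", "CDB", "COS", "CKafka"
    attributes: dict = field(default_factory=dict)


class AssetRegistry:
    """Hypothetical in-memory registry used only for illustration."""

    def __init__(self):
        self.assets = {}
        self.relations = []              # (source_id, relation, target_id)

    def add(self, asset):
        self.assets[asset.asset_id] = asset

    def relate(self, source_id, relation, target_id):
        # Relations may only connect assets that are already registered.
        if source_id not in self.assets or target_id not in self.assets:
            raise KeyError("both assets must be registered first")
        self.relations.append((source_id, relation, target_id))

    def related(self, source_id, relation=None):
        # Return assets that source_id points to, optionally filtered by relation.
        return [
            self.assets[target]
            for source, rel, target in self.relations
            if source == source_id and (relation is None or rel == relation)
        ]


registry = AssetRegistry()
registry.add(Asset("clb-1", "CLB", {"vip": "10.0.0.1"}))
registry.add(Asset("svc-invite", "Service", {"owner": "datamore"}))
registry.add(Asset("cdb-1", "CDB", {"engine": "mysql"}))
registry.relate("clb-1", "routes_to", "svc-invite")
registry.relate("svc-invite", "reads_from", "cdb-1")
```

Calling `registry.related("svc-invite")` returns the database asset the service depends on, which is the kind of relationship a server‑only CMDB cannot express.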
3.5 Cloud‑Native Orchestration
Orchestration is divided into three layers: infrastructure (using Terraform for multi‑cloud resources), Kubernetes (Helm/YAML), and job orchestration (Blue Shield job platform). The Blue Shield standard operations engine links these layers, enabling end‑to‑end automated workflows.
3.6 DataOps for Operations
3.6.1 TKE Dynamic Scheduling
Workloads often reserve more resources than they need, which leads to low utilization. The team collects current CPU usage via hpa‑metrics‑server and stores it; a predictive model built on that history lets the scheduler adjust replica counts dynamically and release idle resources.
Result: CPU utilization increased from 15 % to 28 % in a 10 k‑core cluster.
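The article does not say which predictive model is used, so the sketch below substitutes a naive moving average as an assumed stand‑in and feeds it into the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current replicas × current utilization / target utilization)). The function names, the 30 % target, and the sample history are illustrative assumptions.

```python
import math


def forecast_utilization(samples, window=3):
    """Naive forecast: mean of the last `window` samples (assumed stand-in model)."""
    recent = samples[-window:]
    return sum(recent) / len(recent)


def desired_replicas(current_replicas, predicted_util, target_util,
                     min_replicas=1, max_replicas=100):
    """HPA-style proportional rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * predicted_util / target_util)
    return max(min_replicas, min(max_replicas, raw))


# CPU utilization in percent, as might be stored from a metrics server.
cpu_history = [18, 15, 14, 16, 15]
predicted = forecast_utilization(cpu_history)          # mean of 14, 16, 15
replicas = desired_replicas(current_replicas=20,
                            predicted_util=predicted,
                            target_util=30)
```

With these sample numbers the forecast is 15 % and the workload shrinks from 20 replicas to 10, freeing the idle half for other workloads.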
3.6.2 TBDS Dynamic Scheduling
TBDS data‑warehouse clusters showed average CPU utilization around 55 %. A two‑step allocation model (baseline linear regression with dynamic upper/lower bounds) predicts required cores, then distributes the fixed total cores proportionally.
Result: Utilization rose to 79.5 %, cost fell by one‑third, and average task time dropped from 27.5 min to 18.1 min (about 34 % less time per task, equivalent to a roughly 52 % speedup).
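The two‑step allocation can be sketched as follows, under stated assumptions: step one fits a least‑squares line to each consumer's historical core usage and clamps the forecast to bounds derived from that history, and step two scales the forecasts so they sum to the fixed core budget. The unit being allocated (here, named job groups), the 20 % margin rule, and all sample numbers are hypothetical; the article does not specify them.

```python
def linear_forecast(history):
    """Least-squares line through (0, y0), (1, y1), ...; value at the next step."""
    n = len(history)                      # requires at least two points
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    return mean_y + slope * (n - mean_x)


def bounded_demand(history, margin=0.2):
    """Step 1: baseline forecast clamped to history-derived bounds (assumed rule)."""
    lower = min(history) * (1 - margin)
    upper = max(history) * (1 + margin)
    return max(lower, min(upper, linear_forecast(history)))


def allocate(histories, total_cores):
    """Step 2: split a fixed core budget in proportion to predicted demand."""
    demand = {name: bounded_demand(h) for name, h in histories.items()}
    total_demand = sum(demand.values())
    return {name: total_cores * d / total_demand for name, d in demand.items()}


# Hypothetical job groups with their recent core usage.
histories = {"etl": [100, 120, 140], "report": [60, 60, 60]}
plan = allocate(histories, total_cores=440)
```

In this example the growing `etl` group is forecast at 160 cores and the flat `report` group at 60, so the 440‑core budget is split roughly 320 to 120. The clamp keeps a single outlier from pulling most of the cluster toward one group.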
3.7 Cloud‑Native Application Monitoring
Standalone Prometheus was replaced by a Thanos‑based setup for distributed, highly available monitoring and long‑term storage. Thanos aggregates data from multiple Prometheus instances through sidecars, supports object‑storage back‑ends, and feeds unified dashboards in Grafana.
The platform automatically collects service QPS, latency, success rate, and resource usage without code changes, greatly reducing development overhead.
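To make the collected metrics concrete, here is a minimal sketch of how QPS, success rate, and average latency could be derived from request records captured outside the application, for example by a proxy. The record layout, the rule that any status below 500 counts as a success, and the service names are assumptions for illustration, not the platform's actual pipeline.

```python
from collections import defaultdict


def summarize(records, window_seconds):
    """Aggregate per-service QPS, success rate and mean latency.

    Each record is (service, http_status, latency_ms).
    """
    grouped = defaultdict(list)
    for service, status, latency_ms in records:
        grouped[service].append((status, latency_ms))

    summary = {}
    for service, rows in grouped.items():
        total = len(rows)
        ok = sum(1 for status, _ in rows if status < 500)   # assumed success rule
        summary[service] = {
            "qps": total / window_seconds,
            "success_rate": ok / total,
            "avg_latency_ms": sum(latency for _, latency in rows) / total,
        }
    return summary


records = [
    ("invite", 200, 12),
    ("invite", 200, 18),
    ("invite", 503, 30),
    ("task", 200, 8),
]
stats = summarize(records, window_seconds=2)
```

Because the records come from the traffic path rather than from instrumented code, a new service gets these figures without any change to its source.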
3.8 Full‑Chain Tracing (Data & Business Lineage)
By constructing combined data and business lineage graphs, the team can visualize impact scopes, perform fault back‑tracing, root‑cause analysis, and capacity planning, especially when integrated with monitoring metrics.
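Both impact analysis and fault back‑tracing reduce to graph traversal over the lineage: walking downstream edges gives the impact scope of a node, and walking the reversed edges gives the upstream candidates for a root cause. The sketch below uses a breadth‑first search over a small hypothetical lineage; the node names are invented for illustration.

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`, excluding it."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that traversal walks upstream instead of downstream."""
    flipped = {}
    for source, targets in graph.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


# Hypothetical lineage: data tables feed models, which feed business campaigns.
lineage = {
    "game_log": ["login_table"],
    "login_table": ["churn_model", "activity_report"],
    "churn_model": ["churn_recall_campaign"],
}

impact = reachable(lineage, "login_table")                        # downstream scope
suspects = reachable(reverse(lineage), "churn_recall_campaign")   # upstream causes
```

Joining each node in `suspects` with its monitoring metrics would then narrow the list to the nodes that actually misbehaved at the time of the fault.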
3.9 Chaos Engineering in Cloud‑Native
The team built a chaos‑engineering platform (OTeam) to inject CPU, memory, I/O, and status faults into services, observing automatic scaling and resilience without affecting user experience.
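The idea of injecting a fault and checking that users are unaffected can be shown with a small in‑process analogue. The sketch below injects status faults (HTTP 503) into a fraction of calls and measures the success rate a retrying client still observes. The wrapper, the retry policy, and the 30 % fault rate are hypothetical; the team's platform injects faults at the infrastructure level and its interface is not described in the article.

```python
import random


def with_status_faults(handler, fault_rate, rng):
    """Wrap a handler so a fraction of calls return HTTP 503 instead of running."""
    def wrapped(request):
        if rng.random() < fault_rate:
            return 503                   # injected fault
        return handler(request)
    return wrapped


def call_with_retry(handler, request, attempts=3):
    """Simple client-side resilience: retry on 5xx up to `attempts` times."""
    status = 503
    for _ in range(attempts):
        status = handler(request)
        if status < 500:
            break
    return status


def healthy_handler(request):
    return 200


rng = random.Random(42)                  # seeded so the experiment is repeatable
faulty = with_status_faults(healthy_handler, fault_rate=0.3, rng=rng)
results = [call_with_retry(faulty, i) for i in range(1000)]
observed_success = results.count(200) / len(results)
```

With a 30 % fault rate and three attempts, the expected success rate is 1 − 0.3³, about 97 %, so the experiment shows how much injected failure the retry policy absorbs before users would notice.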
3.10 DevOps Continuous Integration & Delivery
The ODP platform provides CI pipelines (supporting Blue Shield, QCI, Jenkins, etc.) with multiple trigger modes (manual, automatic via webhook, pipeline, fast‑release). It supports building public or custom images, injecting environment variables, custom start commands, health checks (liveness/readiness), and tag‑based deployments.
4. Business Benefits of Cloud Migration
4.1 Cost
Cloud products are ready‑to‑use, pay‑as‑you‑go, and managed (e.g., TKE master and etcd), which reduces operational labor and lets the team focus on core business.
4.2 Stability
Elastic scaling, automatic pod migration, and containerized environments improve fault tolerance and eliminate environment inconsistencies.
4.3 Efficiency
End‑to‑end CI/CD reduces release cycles from hours to minutes, and routine tasks like scaling or fault handling are simplified.
4.4 Business Enablement
Access to over 20 cloud products (TKE, CFS, COS, CKafka, VOD, etc.) accelerates feature development and reduces technical costs.
5. Conclusion
Cloud‑native brings both challenges and opportunities for operations. By embracing data‑driven automation, orchestration, DevOps, and AIOps, the team continuously improves service quality and user experience, and the transformation journey remains ongoing.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.