
Tencent Game Big Data Analysis Engine: Architecture, Practices, and Future Plans

This article presents Tencent's game big‑data analysis platform, detailing its background, the architecture of the iData engine—including offline multi‑dimensional analysis (TGMars), online portrait analysis (TGFace), and real‑time multi‑dimensional analysis (TGDruid)—application scenarios, performance insights, and future ecosystem and open‑source plans.


Introduction

The big‑data analysis platform is a core component of big‑data applications. Traditional BI tools and databases suffer from low processing efficiency and limited scalability. iData, Tencent's game big‑data analysis system, addresses these gaps by combining iDataCharts for visualization with iDataEngine for analysis.

Article Outline

The article covers three parts: (1) the background of Tencent game big‑data analysis, (2) practical experience with the big‑data analysis engine, and (3) a summary and future planning.

01. Tencent Game Big Data Analysis Background

Tencent's game portfolio has grown rapidly: over 110 PC games (e.g., League of Legends, DNF), 390 mobile games (e.g., Honor of Kings), and numerous mini‑games. New data exceeds 300 TB per day, amounting to more than 2 × 10⁹ records spread across roughly 1 300 tables and 430 fields. Each game has its own complex data model, requiring fine‑grained operations and rapid analytical support.

The data ecosystem consists of a data lake (cloud storage, MySQL, PostgreSQL, Hadoop, etc.) and a service layer that includes visualization, the data analysis engine, and an engineering experimentation engine for AI research.

02. Big Data Analysis Engine Practice

The engine provides four capabilities:

Offline multi‑dimensional analysis: custom metrics, multi‑dimensional extraction, cross‑analysis, user segmentation.

Portrait analysis: user profiles, funnel analysis, drill‑down, pivot analysis.

Tracking analysis: multi‑dimensional, concurrent, and real‑time tracking.

Real‑time multi‑dimensional analysis: multi‑dimensional aggregation, real‑time deduplication, probing, and prediction.

Application Scenarios

Typical use cases include defining behavior metrics (e.g., active users, payments, match counts), segmenting users (e.g., high‑value “big‑R” users, highly active users), performing churn analysis, creating user portraits (age, gender, region), launching targeted marketing campaigns, and conducting real‑time monitoring of user activity.

Data sources include real‑time streams (Kafka, Pulsar) and offline stores (HDFS, RDS, cloud DB). The processing pipeline creates metric libraries for multi‑dimensional extraction, stores user tags, and feeds results to front‑end visualizations.
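To make the "metric library" idea concrete, here is a minimal sketch of how a custom metric could be declared once and evaluated over an offline event batch before feeding tags and front‑end charts. All names and the event schema are illustrative assumptions, not the platform's actual API:

```python
# Hypothetical metric-library entries: each metric is declared as a function
# over a batch of events, so the same definitions can serve extraction,
# tagging, and visualization. Field names are assumptions for illustration.

METRICS = {
    # Distinct users seen in the batch (a "daily active users" style metric).
    "daily_active_users": lambda events: len({e["user_id"] for e in events}),
    # Sum of payment amounts; events without an amount contribute zero.
    "total_payment": lambda events: sum(e.get("amount", 0) for e in events),
}

events = [
    {"user_id": 1, "amount": 30},
    {"user_id": 2},
    {"user_id": 1},
]

results = {name: fn(events) for name, fn in METRICS.items()}
print(results)  # -> {'daily_active_users': 2, 'total_payment': 30}
```

Declaring metrics as data rather than code scattered through the pipeline is what makes "custom metrics" per game tractable at this scale.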

1. Offline Multi‑Dimensional Analysis – TGMars

TGMars uses a preprocessing mechanism combined with a single‑shard storage‑compute binding. Optimizations include pre‑sorting and deduplication, localized shard storage (no shuffle), single‑pass user sorting, and MPP parallel processing. Bitmap indexes accelerate hot‑spot calculations, and materialized views (monthly/yearly) reduce full scans. Spark‑SQL is deeply customized via DataSourceV2 to push down filters and perform local file loading.
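The preprocessing ideas above (pre-sorting, deduplication, bitmap indexes over hot dimensions) can be sketched in miniature. This is a hedged illustration, not TGMars itself; the record layout and function names are assumptions, and Python integers stand in for real compressed bitmaps:

```python
# Sketch of TGMars-style shard preprocessing: rows are deduplicated and
# pre-sorted by user ID so each user's data is contiguous (single-pass
# scans, no shuffle), and a bitmap index is built for a hot dimension so
# later filters avoid full scans. Names are illustrative assumptions.

def preprocess_shard(records):
    """Dedup and sort (user_id, channel) rows, then build a bitmap index."""
    rows = sorted(set(records), key=lambda r: r[0])

    # One bitmap per channel value; bit i set means row i matches.
    # Arbitrary-precision Python ints play the role of compressed bitmaps.
    index = {}
    for i, (_, channel) in enumerate(rows):
        index[channel] = index.get(channel, 0) | (1 << i)
    return rows, index

def filter_rows(rows, index, channel):
    """Answer a 'channel = X' filter from the bitmap without rescanning."""
    bm = index.get(channel, 0)
    return [rows[i] for i in range(len(rows)) if bm >> i & 1]

rows, idx = preprocess_shard([(3, "ios"), (1, "android"), (3, "ios"), (2, "ios")])
print(filter_rows(rows, idx, "ios"))  # -> [(2, 'ios'), (3, 'ios')]
```

A production engine would use compressed bitmaps (e.g., roaring-style) and columnar files, but the access pattern, filter via bitwise operations and then touch only matching rows, is the same.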

2. Online Portrait Analysis – TGFace

For a cohort of 50 million newly registered users, TGFace uses TGMars to extract raw user packages, stores them on data nodes as columnar files, and builds bitmap indexes over attributes (gender, age, region). SQL generated from front‑end filters is parsed, optimized, compiled to machine code, and executed as a DAG. Measured performance on a 3‑node, 24‑core, 64 GB setup: 1 × 10⁸ records across 6 dimensions processed in 1.25 s (drill‑down), 2.7 s (portrait), and 3.4 s (pivot).
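The core of this portrait workflow is that a front-end filter reduces to bitwise ANDs over attribute bitmaps plus a popcount. A minimal sketch, with assumed attribute names and plain Python ints in place of the engine's compiled execution over columnar files:

```python
# Hedged sketch of TGFace-style portrait filtering: each distinct attribute
# value maps to a bitmap over user slots (bit i = user i matches), so a
# filter like "gender = f AND region = EU" is one AND plus a popcount.

def build_bitmaps(users, attr):
    """One bitmap per distinct value of `attr` across the user list."""
    bitmaps = {}
    for i, u in enumerate(users):
        bitmaps[u[attr]] = bitmaps.get(u[attr], 0) | (1 << i)
    return bitmaps

users = [
    {"gender": "f", "region": "EU"},
    {"gender": "m", "region": "EU"},
    {"gender": "f", "region": "NA"},
    {"gender": "f", "region": "EU"},
]
gender = build_bitmaps(users, "gender")
region = build_bitmaps(users, "region")

# "gender = f AND region = EU": intersect bitmaps, then count set bits.
match = gender["f"] & region["EU"]
print(bin(match).count("1"))  # -> 2 matching users
```

Because every predicate becomes a constant-time bitmap operation independent of row width, drill-down and pivot queries stay interactive even at the 10⁸-record scale quoted above.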

3. Real‑Time Multi‑Dimensional Analysis – TGDruid

Game servers push logs to Kafka/Pulsar; Storm/Flink perform real‑time ETL; filtered streams are ingested into Druid for in‑memory computation. Only the most recent two days of segments are kept in memory; older data is persisted to MySQL for reporting. Configuration‑driven ETL allows task launch within five minutes. Optimizations include time‑based partitioning, error detection for dimension mis‑configuration, and self‑service data back‑fill. A Prophet‑based module provides real‑time forecasting.
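The "configuration-driven ETL" step can be illustrated with a small sketch: a declarative task config selects fields, filters events, and buckets them into time-based partitions (segments), in the spirit of a Druid ingestion spec. The config keys, field names, and hourly granularity here are all assumptions:

```python
# Minimal configuration-driven real-time ETL sketch: a task config declares
# which events to keep, which fields to forward, and the segment granularity.
# Launching a new task then means writing a config, not writing code.

from datetime import datetime, timezone

TASK_CONFIG = {
    "keep_fields": ["ts", "user_id", "amount"],
    "filter": {"event": "payment"},   # only ingest payment events
    "segment_granularity_s": 3600,    # hourly time partitions
}

def etl(event, cfg=TASK_CONFIG):
    """Return (segment_key, trimmed_event), or None if filtered out."""
    for k, v in cfg["filter"].items():
        if event.get(k) != v:
            return None
    # Time-based partitioning: floor the timestamp to the segment boundary.
    bucket = event["ts"] - event["ts"] % cfg["segment_granularity_s"]
    key = datetime.fromtimestamp(bucket, tz=timezone.utc).isoformat()
    return key, {f: event[f] for f in cfg["keep_fields"]}

print(etl({"ts": 1700003000, "event": "payment", "user_id": 7, "amount": 6}))
print(etl({"ts": 1700003000, "event": "login", "user_id": 7}))  # -> None
```

Keeping the pipeline declarative is what makes the quoted five-minute task launch plausible, and it gives a natural hook for validating dimension configuration before bad data reaches the store.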

03. Summary and Future Planning

The roadmap aims to further ecosystem‑ize the three engines, open‑source the stack, enhance scientific analysis, and integrate Jupyter‑based data labs for experimental workflows.

Thank you for reading.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
