
Tencent PCG Real‑Time Data Warehouse and Operations Architecture Overview

This article presents Tencent's PCG data platform evolution, detailing the challenges of integrating multiple business groups, the design of a unified big‑data architecture, real‑time and batch processing pipelines, MQ and ATTA systems, and comprehensive operational practices for reliability and scalability.

DataFunSummit

01 Tencent PCG – Background and Challenges

PCG brings together short‑video, news‑feed, social‑communication, and tool products that came from different business groups, so data standards, quality, and usability were inconsistent across them. The team first mapped all big‑data scenarios, defined a top‑level architecture, and assigned responsibilities to individual teams, while centralizing metadata and permission management.

02 MQ Architecture Refactoring

Traditional Kafka clusters ran into capacity limits and fault‑tolerance problems. The refactored design abstracts the controller nodes, organizes multiple clusters into groups, fails over between clusters with a 10‑15 s migration, scales elastically, and raises topic capacity into the millions through multi‑region deployment, while preserving Kafka's low‑latency disk I/O.
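The cluster-group idea can be illustrated with a minimal Python sketch. It is not the PCG implementation: `Cluster`, `ClusterGroup`, and the "fewest topics wins" placement rule are hypothetical names and assumptions chosen only to show how a control plane sitting above several clusters could reassign topics when one cluster fails and absorb new clusters when scaling out.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    healthy: bool = True


@dataclass
class ClusterGroup:
    """Hypothetical control-plane view of several Kafka clusters in one group."""
    clusters: list[Cluster]
    # topic name -> name of the cluster currently serving it
    assignments: dict[str, str] = field(default_factory=dict)

    def _load(self, cluster_name: str) -> int:
        return sum(1 for owner in self.assignments.values() if owner == cluster_name)

    def assign(self, topic: str) -> str:
        """Place a topic on the healthy cluster that serves the fewest topics."""
        candidates = [c for c in self.clusters if c.healthy]
        if not candidates:
            raise RuntimeError("no healthy cluster left in the group")
        target = min(candidates, key=lambda c: self._load(c.name))
        self.assignments[topic] = target.name
        return target.name

    def add_cluster(self, name: str) -> None:
        """Elastic scale-out: new topics can land on the new cluster immediately."""
        self.clusters.append(Cluster(name))

    def fail_over(self, failed: str) -> dict[str, str]:
        """Mark a cluster unhealthy and move its topics to healthy peers."""
        for c in self.clusters:
            if c.name == failed:
                c.healthy = False
        moved = {}
        for topic, owner in list(self.assignments.items()):
            if owner == failed:
                moved[topic] = self.assign(topic)
        return moved


group = ClusterGroup([Cluster("kafka-a"), Cluster("kafka-b")])
for t in ("feed_click", "video_play", "news_expose"):
    group.assign(t)
moved = group.fail_over("kafka-a")
```

In this sketch clients would look up a topic's current cluster through the group rather than hard-coding broker addresses, which is what makes a migration invisible to them; how the real system achieves that is not described in the source.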

03 ATTA Log Pipeline System

ATTA provides SDKs in multiple languages (C++, Java, Go, Python), integrated as default plugins in the RPC frameworks, so services can ingest logs quickly and reliably across regions. Logs are first stored locally and then distributed to MQ, query services, and offline warehouses, which keeps latency low and availability high.
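The "store locally, then distribute" pattern can be sketched as follows. This is an illustration only and does not use the ATTA SDK: `LocalSpoolAgent`, the in-memory `deque` standing in for a local on-disk spool, and the callable sinks are all assumptions.

```python
from collections import deque
from typing import Callable

Sink = Callable[[dict], None]


class LocalSpoolAgent:
    """Hypothetical agent: accept logs locally, fan them out to sinks later."""

    def __init__(self, sinks: dict[str, Sink]):
        self.spool: deque = deque()  # stands in for a local on-disk spool
        self.sinks = sinks           # e.g. MQ, query service, offline warehouse

    def log(self, record: dict) -> None:
        # The caller returns as soon as the record is stored locally.
        self.spool.append(record)

    def flush(self) -> int:
        """Deliver spooled records in order; stop at the first sink failure."""
        delivered = 0
        while self.spool:
            record = self.spool[0]
            try:
                for sink in self.sinks.values():
                    sink(record)
            except ConnectionError:
                # Keep the record for the next flush. Sinks that already
                # accepted it will see it again, so delivery is at-least-once.
                break
            self.spool.popleft()
            delivered += 1
        return delivered


mq, offline = [], []
agent = LocalSpoolAgent({"mq": mq.append, "offline": offline.append})
agent.log({"msg": "a"})
agent.log({"msg": "b"})
delivered = agent.flush()
```

Because a record leaves the spool only after every sink has accepted it, a downstream outage delays delivery instead of dropping data, at the cost of possible duplicates that a later stage has to detect.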

04 Real‑Time Data Warehouse – Stream‑Batch Unified Architecture

To reduce duplicated effort, a unified platform built on Flink processes both streaming and batch data from a single codebase. Processed data is written to both the real‑time and offline warehouses, which supports flexible analytics and reporting as well as rapid re‑processing when needed.
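The benefit of a single codebase can be shown without any Flink API at all. The sketch below is plain Python under stated assumptions: the `enrich` logic, the `user_id` and `duration_s` fields, and the 30-second threshold are invented for illustration. The point is only that one transformation is written once and driven by either a bounded input (batch) or an unbounded iterator (streaming).

```python
from typing import Callable, Iterable, Iterator


def enrich(events: Iterable[dict]) -> Iterator[dict]:
    """Shared business logic: drop invalid events and derive one field."""
    for event in events:
        if event.get("user_id") is None:
            continue  # invalid record, skipped in both modes
        yield {**event, "is_long_view": event["duration_s"] >= 30}


def run_batch(events: list[dict]) -> list[dict]:
    """Batch mode: the input is bounded, so the result is materialized at once."""
    return list(enrich(events))


def run_streaming(source: Iterator[dict], sink: Callable[[dict], None]) -> None:
    """Streaming mode: each result is emitted as soon as its input arrives."""
    for out in enrich(source):
        sink(out)


events = [
    {"user_id": 1, "duration_s": 45},
    {"user_id": None, "duration_s": 5},
    {"user_id": 2, "duration_s": 10},
]
batch_out = run_batch(events)
stream_out: list[dict] = []
run_streaming(iter(events), stream_out.append)
```

Since both modes call the same `enrich` function, re-processing historical data in batch yields the same results as the real-time path, which is what makes rapid re-processing safe.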

05 Defining System Operability Goals

The team defines five operability levels, each with clear error‑budget thresholds, SLA/SLO definitions, and quality gates. Higher levels require stricter reliability, automated testing, and controlled releases, so that platform maturity matches business needs.
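The arithmetic behind an error budget is simple enough to sketch. The source does not give PCG's actual thresholds, so the 99.9% SLO, the 30-day window, and the `release_allowed` gate below are assumptions used purely as an example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed unavailability, in minutes, for an availability SLO over a window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_consumed(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    return downtime_minutes / error_budget_minutes(slo, window_days)


def release_allowed(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Hypothetical quality gate: block releases once the budget is exhausted."""
    return budget_consumed(slo, downtime_minutes, window_days) < 1.0


budget = error_budget_minutes(0.999)      # roughly 43.2 minutes per 30 days
consumed = budget_consumed(0.999, 21.6)   # roughly half of the budget
```

A stricter level would simply use a higher SLO, which shrinks the budget: at 99.99% the same 30-day window allows only about 4.3 minutes of unavailability.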

06 Full Lifecycle Management

Quality assurance follows a lifecycle: design (architecture, monitoring, OPS API), development (code quality, unit‑test thresholds), release (pre‑release mirroring, chaos engineering, gray rollout), and continuous operation (automated incident response, MTTR/MTBF tracking).
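MTTR and MTBF tracking can be made concrete with a short sketch. It uses the common definitions (MTTR as average repair time, MTBF as total uptime divided by the number of failures); the incident timestamps and the observation window are made-up sample data, and the source does not say how the team computes these figures.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (start of outage, end of outage)


def total_downtime(incidents: list[Incident]) -> timedelta:
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to repair: average outage duration."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window_start: datetime, window_end: datetime) -> timedelta:
    """Mean time between failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 11)  # a 10-day observation window
incidents = [
    (datetime(2024, 1, 2, 10), datetime(2024, 1, 2, 11)),  # 1 hour
    (datetime(2024, 1, 7, 8), datetime(2024, 1, 7, 11)),   # 3 hours
]
repair = mttr(incidents)
between = mtbf(incidents, window_start, window_end)
```

Both functions assume at least one incident in the window; a period with no incidents needs separate handling because the averages are undefined.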

07 Comprehensive Monitoring

The system combines custom monitoring, metrics, and logs. It provides near‑exactly‑once guarantees where feasible and uses data auditing and coloring to detect loss or duplication, giving minute‑level visibility across the data flow.
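One way to picture minute-level auditing is to reconcile what an upstream stage sent against what a downstream stage received. The sketch below assumes every record carries a unique ID and an event timestamp in seconds; the `audit` function and its report shape are hypothetical and not taken from the PCG system.

```python
from collections import Counter
from typing import Iterable


def minute_bucket(ts: int) -> int:
    """Round a Unix timestamp in seconds down to the start of its minute."""
    return ts - ts % 60


def audit(upstream: Iterable[tuple[str, int]], downstream_ids: Iterable[str]) -> dict:
    """Compare sent records with received IDs and report anomalies per minute.

    upstream: (record_id, event_ts) pairs, record IDs assumed unique.
    downstream_ids: IDs observed at the downstream stage.
    Returns {minute_start: {"lost": n, "duplicated": n}} for anomalous minutes only.
    """
    received = Counter(downstream_ids)
    report: dict = {}
    for record_id, ts in upstream:
        copies = received[record_id]
        if copies == 1:
            continue  # delivered exactly once, nothing to report
        bucket = report.setdefault(minute_bucket(ts), {"lost": 0, "duplicated": 0})
        if copies == 0:
            bucket["lost"] += 1
        else:
            bucket["duplicated"] += copies - 1
    return report


sent = [("a", 0), ("b", 10), ("c", 65), ("d", 70)]
seen = ["a", "c", "c", "d"]
report = audit(sent, seen)
```

Here record `b` never arrived and record `c` arrived twice, so the first minute reports one loss and the second minute reports one duplicate. A production audit would more likely compare per-minute counters than full ID sets, trading precision for much lower cost.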

08 Digital Operations Capability

A self‑service, automated operations framework aggregates metric and log data into a data warehouse, which feeds BI dashboards and drives automated fault diagnosis and remediation. Root‑cause analysis and prediction are improved continuously.

09 Simplified Incident Handling

Standardized CI/CD/CO pipelines, clear SOPs, and automated root‑cause identification reduce manual effort. Incident severity is tied to error‑budget consumption: most issues are remediated automatically, and critical failures are escalated to on‑call engineers.
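Tying severity to error-budget consumption could look like the sketch below. The severity names and the 10% and 50% thresholds are hypothetical, since the source does not state the real cut-offs; only the routing rule (automate most issues, page on-call for critical ones) comes from the text.

```python
def classify(consumed_fraction: float) -> str:
    """Map the share of error budget burned by one incident to a severity.

    Thresholds are illustrative assumptions, not the team's actual values.
    """
    if consumed_fraction < 0:
        raise ValueError("consumed_fraction cannot be negative")
    if consumed_fraction >= 0.5:
        return "critical"
    if consumed_fraction >= 0.1:
        return "major"
    return "minor"


def route(severity: str) -> str:
    """Critical incidents go to on-call; everything else is auto-remediated."""
    return "page_on_call" if severity == "critical" else "auto_remediate"


# An incident that burned 5% of the budget versus one that burned 60%.
small = route(classify(0.05))
large = route(classify(0.60))
```

Expressing severity as a fraction of the budget means the same outage is judged more harshly on a service with a stricter SLO, without maintaining a separate rule set per service.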

10 Providing Full‑Stack, High‑Quality Data Services

The roadmap includes expanding open‑source collaboration for foundational components, enhancing the data development lifecycle, and delivering secure, compliant SaaS data applications to thousands of users.

11 Q&A Highlights

The platform uses a self‑developed monitoring system rather than Prometheus, provides ATTA SDKs in multiple languages, and deploys models in a hybrid way (offline models on cloud disks with P2P distribution, real‑time models in distributed caches such as Redis) to balance latency and scalability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
