
Tencent News Massive Log Processing Architecture and Data Applications

The article presents Tencent News' comprehensive massive log processing solution, covering background, overall architecture, data collection, real-time and offline computation layers, data quality assurance, and practical examples such as Flink CDC for database synchronization, illustrating how large‑scale data is managed and applied.

DataFunSummit

Introduction With the explosion of information in the mobile‑Internet era, Tencent News faces massive data volumes and diverse business scenarios. The discussion focuses on how to process this data and put it to work for the business.

1. Background The Tencent News client, sports, and news plug‑ins generate large amounts of advertising and user‑behavior data. This data is high in volume, spans many business lines, and is used for reporting, model training, and product decisions.

2. Overall Log Processing Architecture

2.1 Collection Layer Data is collected via the internal "Datatong" reporting service, which standardizes event collection from client SDKs, PC, H5, and backend servers.
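
The idea of standardized event collection can be illustrated with a small validation step. This is only a sketch: the article does not describe Datatong's actual schema or API, so the `Event` fields, the `ALLOWED_SOURCES` set, and the `normalize` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical source labels, loosely based on the platforms named in the text.
ALLOWED_SOURCES = {"client_sdk", "pc", "h5", "backend"}


@dataclass(frozen=True)
class Event:
    """A hypothetical unified event envelope; field names are illustrative."""
    event_id: str
    source: str
    timestamp_ms: int
    params: Dict[str, Any] = field(default_factory=dict)


def normalize(raw: Dict[str, Any]) -> Event:
    """Validate a raw report and convert it into the unified envelope."""
    required = ("event_id", "source", "timestamp_ms")
    missing = [k for k in required if k not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")

    # Normalize casing so "H5" and "h5" are treated as the same source.
    source = str(raw["source"]).lower()
    if source not in ALLOWED_SOURCES:
        raise ValueError(f"unknown source: {source}")

    return Event(
        event_id=str(raw["event_id"]),
        source=source,
        timestamp_ms=int(raw["timestamp_ms"]),
        params=dict(raw.get("params", {})),
    )
```

Rejecting malformed reports at the point of collection, rather than in every downstream job, is one way a shared reporting service can keep event formats consistent across platforms.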

2.2 Computation Layer Both real‑time and offline calculations are performed. Offline uses TDW (Hive tables) and HDFS; real‑time leverages Oceanus and Datahub. The design addresses changing requirements, code complexity, high availability, low‑latency ingestion, and data reuse. Message middleware (Tube, CDMQ) transports data between layers.

2.3 Storage Layer The storage layer includes Impala, ClickHouse, MySQL, Redis, and other components, which serve reporting, data exploration, and downstream applications.

3. Data Reporting Data sources are classified into four categories: client (SDK reporting via Datatong), backend server logs (reported to Tdbank and then to TDW and Tube), DB synchronization (MySQL binlog captured by Flink CDC), and configuration files (offline synced to TDW).

4. Real‑time Computation Framework A Lambda architecture is adopted, with processing shared between the ODS and DWD layers. Real‑time components include a storage/access layer (message middleware), a DWD layer (which reduces repeated consumption of ODS data), a computation layer (ETL, feature extraction), and data‑warehouse storage (TDW, HDFS, Impala).

5. Offline Computation Framework Four layers are defined: ODS (raw data), DWD (detailed cleaned data), DWS (light aggregation per business or user), and ADS (application layer for final results, e.g., reports, Redis, ClickHouse).
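
The four layers can be sketched as a chain of small transformations. This is a toy, in-memory illustration only; the production jobs run on TDW/Hive, and the row fields (`user_id`, `event`, `day`) and function names here are assumptions made for the example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def ods_to_dwd(ods_rows: Iterable[Dict[str, Any]]) -> List[Dict[str, str]]:
    """DWD: drop malformed raw rows and normalize the remaining fields."""
    dwd = []
    for row in ods_rows:
        if row.get("user_id") is None or not row.get("event"):
            continue  # discard rows that cannot be attributed or classified
        dwd.append({
            "user_id": str(row["user_id"]),
            "event": str(row["event"]).strip().lower(),
            "day": str(row.get("day", "unknown")),
        })
    return dwd


def dwd_to_dws(dwd_rows: Iterable[Dict[str, str]]) -> Dict[Tuple[str, str, str], int]:
    """DWS: light aggregation, here event counts per (day, user, event)."""
    counts: Dict[Tuple[str, str, str], int] = defaultdict(int)
    for row in dwd_rows:
        counts[(row["day"], row["user_id"], row["event"])] += 1
    return dict(counts)


def dws_to_ads(dws: Dict[Tuple[str, str, str], int]) -> Dict[Tuple[str, str], int]:
    """ADS: report-ready result, here daily totals per event."""
    report: Dict[Tuple[str, str], int] = defaultdict(int)
    for (day, _user, event), n in dws.items():
        report[(day, event)] += n
    return dict(report)
```

Because the ADS result is computed from DWS rather than from raw ODS rows, a second report over the same per-user counts could reuse the DWS output without repeating the cleaning step.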

6. Data Quality and Link Assurance Both the real‑time and offline pipelines include monitoring, SLA guarantees, exception handling, tiered alerts, and automated recovery. Real‑time Flink jobs wrap processing in try‑catch blocks and aggregate alerts through message middleware, with notifications sent via WeCom (Enterprise WeChat).
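
The pattern of catching per-record failures and aggregating them into fewer alerts might look roughly like the sketch below. It is not the team's implementation: an in-memory counter stands in for the message middleware, no notification is actually sent, and `AlertAggregator`, `process_safely`, and the `threshold` parameter are hypothetical names.

```python
from collections import Counter
from typing import Any, Callable, Iterable, List


class AlertAggregator:
    """Counts failures per (job, error type) and emits one summary per key."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, job: str, exc: Exception) -> None:
        self.counts[(job, type(exc).__name__)] += 1

    def flush(self) -> List[str]:
        """Return summaries for keys at or above the threshold, then reset."""
        alerts = [
            f"[{job}] {err} x{n}"
            for (job, err), n in sorted(self.counts.items())
            if n >= self.threshold
        ]
        self.counts.clear()
        return alerts


def process_safely(
    job: str,
    records: Iterable[Any],
    fn: Callable[[Any], Any],
    aggregator: AlertAggregator,
) -> List[Any]:
    """Apply fn to each record; a bad record is counted, not fatal to the job."""
    out = []
    for rec in records:
        try:
            out.append(fn(rec))
        except Exception as exc:  # broad on purpose: keep the stream running
            aggregator.record(job, exc)
    return out
```

Calling `flush()` on a timer would turn a burst of identical exceptions into a single message per interval, which is the point of aggregating before notifying.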

7. Summary Standardized reporting via Datatong, unified event schemas, and layered data‑warehouse management improve data governance, reduce duplication, and enhance cross‑team collaboration.

8. Data Application Examples

8.1 Flink CDC – DB Synchronization Flink CDC captures MySQL changes (using Debezium and Kafka) and streams them to downstream systems for real‑time dimension table updates and ranking calculations.
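
To make the real-time dimension table update concrete, the sketch below applies change events to an in-memory table. It assumes events shaped like the Debezium change envelope (`op` of `c`, `r`, `u`, or `d`, with `before` and `after` row images); the `apply_change` function, the `id` key, and the sample columns are hypothetical.

```python
from typing import Any, Dict


def apply_change(dim: Dict[Any, Dict[str, Any]], event: Dict[str, Any], key: str = "id") -> None:
    """Apply one change event to an in-memory dimension table keyed by `key`."""
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, and update all carry the new row in "after"
        row = event["after"]
        dim[row[key]] = row
    elif op == "d":
        # delete carries the removed row in "before"
        dim.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


# Usage: replay a short change stream against an empty table.
dim_table: Dict[Any, Dict[str, Any]] = {}
stream = [
    {"op": "c", "before": None, "after": {"id": 1, "channel": "sports"}},
    {"op": "c", "before": None, "after": {"id": 2, "channel": "finance"}},
    {"op": "u", "before": {"id": 1, "channel": "sports"}, "after": {"id": 1, "channel": "nba"}},
    {"op": "d", "before": {"id": 2, "channel": "finance"}, "after": None},
]
for ev in stream:
    apply_change(dim_table, ev)
```

Applying events in order keeps the lookup table in step with the source database without periodic full reloads.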

8.2 Implementation Details Two modes are available: SQL‑based and custom deserialization. The custom approach offers greater flexibility by implementing deserialization interfaces and handling SourceRecord data.
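
The custom deserialization mode can be illustrated in miniature. In a real job this logic would live in a Java class implementing the connector's deserialization interface and reading the `SourceRecord`; the Python sketch below only mimics the transformation on a plain dictionary. The input layout follows the Debezium change envelope, while `deserialize`, `OP_NAMES`, and the output field names are assumptions made for the example.

```python
import json
from typing import Any, Dict, Optional

# Readable names for the Debezium operation codes.
OP_NAMES = {"c": "insert", "r": "snapshot", "u": "update", "d": "delete"}


def deserialize(record: Dict[str, Any]) -> Optional[str]:
    """Flatten a change record into a compact JSON string for downstream jobs.

    Returns None for records without a recognized operation code
    (for example, non-data messages), so callers can skip them.
    """
    op = record.get("op")
    if op not in OP_NAMES:
        return None

    source = record.get("source") or {}
    # Deletes only have a "before" image; everything else uses "after".
    row = record.get("before") if op == "d" else record.get("after")

    return json.dumps(
        {
            "db": source.get("db"),
            "table": source.get("table"),
            "op": OP_NAMES[op],
            "ts_ms": record.get("ts_ms"),
            "data": row,
        },
        sort_keys=True,
    )
```

Compared with the SQL-based mode, a hand-written deserializer like this decides exactly which fields reach downstream consumers, which is where the extra flexibility comes from.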

Conclusion The presented architecture demonstrates how Tencent News achieves scalable, reliable, and high‑performance processing of massive logs for both real‑time and offline analytics.
