Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Tencent Advertising Technology

Dec 27, 2022

Background: Tencent advertising logs are divided into real‑time and offline streams; real‑time logs are consumed via message queues, while offline logs are stored for minute‑level, hour‑level, and ad‑hoc analysis.

The existing architecture consists of a Subscriber (Spark Streaming jobs that write to HDFS), an hourly Spark batch job that merges logs, and a custom columnar “dragon” format based on Parquet. This design leads to heterogeneous log formats, storage redundancy, poor usability, high resource consumption, and weak governance.

To address these problems, a lake‑warehouse solution built on Apache Iceberg was designed. The schema introduces three‑level partitioning (hour, traffic site set, ad‑placement set) and stores logs uniformly as Parquet files, enabling SQL/Spark access without worrying about underlying formats.
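The three‑level layout can be illustrated with a minimal Python sketch. The field names (`event_ts`, `site_set`, `placement_set`), the hour format, and the path layout are assumptions made for illustration; the article does not give the actual schema or how site sets and placement sets are defined.

```python
from datetime import datetime, timezone


def partition_key(event_ts: int, site_set: str, placement_set: str) -> tuple:
    """Map a log record to a hypothetical (hour, site set, placement set) partition."""
    # Level 1: truncate the event time (epoch seconds, UTC) to the hour.
    hour = datetime.fromtimestamp(event_ts, tz=timezone.utc).strftime("%Y%m%d%H")
    # Levels 2 and 3: coarse groupings of traffic sites and ad placements.
    return (hour, site_set, placement_set)


def partition_path(key: tuple) -> str:
    """Render the key as a directory-style path (illustrative layout only)."""
    hour, site_set, placement_set = key
    return f"hour={hour}/site_set={site_set}/placement_set={placement_set}"


def prune(partitions: list, hour: str, site_set: str) -> list:
    """Keep only the partitions a query on (hour, site_set) has to read."""
    return [p for p in partitions if p[0] == hour and p[1] == site_set]


key = partition_key(1672131725, "site_a", "placement_x")
partitions = [
    key,
    partition_key(1672131725, "site_b", "placement_x"),
    partition_key(1672135400, "site_a", "placement_y"),
]
selected = prune(partitions, "2022122709", "site_a")
```

A query that filters on hour and site set only touches the matching partitions, which is the point of partitioning on the dimensions that queries filter on most often.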

On the offline side, hourly Spark ingest jobs overwrite whole partitions, so rerunning a job is idempotent. The jobs also leverage Iceberg’s column‑level metrics for partition and filter pruning, which reduces file count and memory usage. A “commit‑by‑manifest” option (spark.sql.iceberg.write.commit-by-manifest = true) lowers driver memory pressure.
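The two ideas in this paragraph, idempotent partition overwrite and pruning by column metrics, can be sketched with a small in‑memory model. This is not Iceberg's API: the `Table` and `DataFile` classes and the `ad_id` column are hypothetical stand‑ins that only mimic the behavior described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str   # hour partition, e.g. "2022122709"
    min_ad_id: int   # per-file lower bound, as kept in column-level metrics
    max_ad_id: int   # per-file upper bound


class Table:
    """Toy table: a flat list of data files (assumption, not Iceberg's model)."""

    def __init__(self):
        self.files = []

    def overwrite_partition(self, partition, new_files):
        # Replace every file of the partition in one step, so a rerun of the
        # same hourly job yields the same table state (idempotent).
        kept = [f for f in self.files if f.partition != partition]
        self.files = kept + list(new_files)

    def plan_scan(self, partition, ad_id):
        # Partition pruning first, then skip files whose [min, max] range
        # cannot contain the requested value.
        return [
            f for f in self.files
            if f.partition == partition and f.min_ad_id <= ad_id <= f.max_ad_id
        ]


hour_files = [
    DataFile("a.parquet", "2022122709", 1, 1000),
    DataFile("b.parquet", "2022122709", 1001, 2000),
]
table = Table()
table.overwrite_partition("2022122709", hour_files)
table.overwrite_partition("2022122709", hour_files)  # rerun of the same job
scan = table.plan_scan("2022122709", ad_id=1500)
```

After the rerun the table still holds two files, and the scan for `ad_id=1500` only needs `b.parquet`.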

Real‑time enhancements introduce a Flink‑based minute‑level ingest pipeline with exactly‑once semantics, while retaining the original hourly Spark jobs for back‑fill and repair.
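One common way to obtain exactly‑once results in a checkpointed streaming sink is to tie each commit to a checkpoint ID and ignore IDs that were already committed. The sketch below simulates that idea in plain Python. The class and method names are hypothetical, and the article does not describe how the Tencent pipeline implements its guarantee, so this should be read as an illustration of the general technique only.

```python
class LakeTable:
    """Stand-in for the lake table: an append-only list of snapshots."""

    def __init__(self):
        self.snapshots = []                  # (checkpoint_id, files) per commit
        self.max_committed_checkpoint = -1   # stored alongside the table state

    def commit(self, checkpoint_id, files):
        # A replayed commit carries an ID that is already covered, so skip it.
        if checkpoint_id <= self.max_committed_checkpoint:
            return False
        self.snapshots.append((checkpoint_id, tuple(files)))
        self.max_committed_checkpoint = checkpoint_id
        return True


class Committer:
    """Collects files per checkpoint and commits them once it completes."""

    def __init__(self, table):
        self.table = table
        # In a real job this map would live in checkpointed operator state.
        # It is never cleared here so that a replay can be simulated.
        self.pending = {}

    def snapshot_state(self, checkpoint_id, files):
        self.pending[checkpoint_id] = list(files)

    def notify_checkpoint_complete(self, checkpoint_id):
        # Commit every pending checkpoint up to the completed one, in order.
        for cid in sorted(c for c in self.pending if c <= checkpoint_id):
            self.table.commit(cid, self.pending[cid])


table = LakeTable()
committer = Committer(table)
committer.snapshot_state(1, ["minute-00.parquet"])
committer.snapshot_state(2, ["minute-01.parquet"])
committer.notify_checkpoint_complete(2)
committer.notify_checkpoint_complete(2)  # simulated replay after recovery
```

The replayed notification produces no additional snapshots, so each minute of data appears in the table exactly once.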

The lake‑warehouse brings atomic commits, unified storage, higher compression via columnar Parquet, flexible schema and partition evolution, and multi‑engine support (Spark, Flink, Presto, StarRocks). Additional optimizations include task‑plan reduction, read‑split size tuning, dynamic partition pruning (spark.sql.iceberg.enable-dynamic-partition-pruning = true), and vectorized reads for complex types.
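To see why read‑split size tuning affects the task plan, the following sketch slices large files into chunks of at most a target size and then packs consecutive chunks into tasks. It is a simplified model written for this summary, not Iceberg's planner; in Iceberg itself the target is controlled by the `read.split.target-size` table property. The file sizes and targets below are made‑up values.

```python
def plan_tasks(file_sizes_mb, target_mb):
    """Return a list of tasks, each a list of chunk sizes in MB."""
    # Step 1: slice files larger than the target into target-sized chunks.
    chunks = []
    for size in file_sizes_mb:
        while size > target_mb:
            chunks.append(target_mb)
            size -= target_mb
        if size > 0:
            chunks.append(size)

    # Step 2: greedily pack consecutive chunks into tasks up to the target.
    tasks, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > target_mb:
            tasks.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk
    if current:
        tasks.append(current)
    return tasks


files = [512, 10, 20, 30]             # hypothetical file sizes in MB
small_target = plan_tasks(files, 128)
large_target = plan_tasks(files, 256)
```

With these inputs a 128 MB target yields five tasks and a 256 MB target yields three. A larger target means fewer, heavier tasks, while a smaller one gives more parallelism at the cost of scheduling overhead.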

Operational benefits include roughly 50% storage savings, a roughly 40% reduction in compute cost, simplified data access, and integrated security and auditing. The lake also provides a foundation for future features such as full Flink real‑time ingestion, async I/O acceleration, index‑based query acceleration, and column‑level lifecycle management.
