
Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Tencent Advertising Technology

Background: Tencent advertising logs are divided into real‑time and offline streams; real‑time logs are consumed via message queues, while offline logs are stored for minute‑level, hour‑level, and ad‑hoc analysis.

Existing architecture includes Subscriber (Spark Streaming to HDFS), log merging (hourly Spark batch), and a custom columnar “dragon” format based on Parquet, leading to issues such as heterogeneous log formats, storage redundancy, poor usability, high resource consumption, and weak governance.

To address these problems, a lake‑warehouse solution built on Apache Iceberg was designed. The schema introduces three‑level partitioning (hour, traffic site set, ad‑placement set) and stores logs uniformly as Parquet files, enabling SQL/Spark access without worrying about underlying formats.
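The three‑level layout above can be sketched as an Iceberg table definition in Spark SQL. This is an illustrative sketch only — the table and column names (ad_logs.request_log, site_set, placement_set, etc.) are assumptions, not taken from the article; only the partitioning scheme (hour, traffic site set, ad‑placement set) comes from the text.

```sql
-- Hypothetical DDL sketch of the three-level partitioned log table.
-- hours(log_time) is Iceberg's built-in hourly partition transform;
-- site_set and placement_set are identity partitions.
CREATE TABLE ad_logs.request_log (
    log_time      TIMESTAMP,
    site_set      INT,
    placement_set INT,
    request_id    STRING,
    payload       STRING
)
USING iceberg
PARTITIONED BY (hours(log_time), site_set, placement_set);
```

Because the partition transforms live in table metadata rather than in directory names, readers can filter on log_time, site_set, or placement_set with plain SQL and let Iceberg prune partitions, without knowing the physical layout.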

Offline improvements add hourly Spark ingest jobs that overwrite partitions, ensuring idempotency, and leverage Iceberg’s column‑level metrics for partition and filter pruning, reducing the number of files scanned and memory usage. A “commit‑by‑manifest” option (spark.sql.iceberg.write.commit-by-manifest = true) lowers driver memory pressure during commits.
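The idempotent hourly overwrite might look like the following Spark SQL sketch. The staging table name and column list are hypothetical; the commit‑by‑manifest setting is quoted from the article, and dynamic partition‑overwrite mode is a standard Spark setting assumed here so that re‑running an hour replaces only that hour's partitions.

```sql
-- Replace only the partitions touched by the insert, not the whole table,
-- so re-running the job for a given hour is idempotent.
SET spark.sql.sources.partitionOverwriteMode = dynamic;
-- Commit via manifests to lower driver memory pressure (per the article).
SET spark.sql.iceberg.write.commit-by-manifest = true;

INSERT OVERWRITE ad_logs.request_log
SELECT log_time, site_set, placement_set, request_id, payload
FROM staging_hourly_logs
WHERE log_time >= TIMESTAMP '2024-01-01 10:00:00'
  AND log_time <  TIMESTAMP '2024-01-01 11:00:00';
```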

Real‑time enhancements introduce a Flink‑based minute‑level ingest pipeline with exactly‑once semantics, while retaining the original hourly Spark jobs for back‑fill and repair.
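A minute‑level Flink ingest of this kind is typically expressed as a streaming INSERT whose Iceberg commits ride on Flink checkpoints, which is what gives exactly‑once delivery into the table. The sketch below is an assumption‑laden illustration — the Kafka topic, catalog, and schema names are invented, and a 60‑second checkpoint interval stands in for "minute‑level" visibility:

```sql
-- Iceberg commits happen on checkpoint; a 60 s interval gives
-- roughly minute-level data visibility with exactly-once semantics.
SET 'execution.checkpointing.interval' = '60 s';

-- Hypothetical Kafka source carrying the raw ad logs.
CREATE TABLE kafka_ad_logs (
    log_time      TIMESTAMP(3),
    site_set      INT,
    placement_set INT,
    request_id    STRING,
    payload       STRING
) WITH (
    'connector' = 'kafka',
    'topic'     = 'ad-logs',
    'properties.bootstrap.servers' = 'broker:9092',
    'format'    = 'json'
);

-- Continuous streaming insert into the Iceberg log table.
INSERT INTO iceberg_catalog.ad_logs.request_log
SELECT log_time, site_set, placement_set, request_id, payload
FROM kafka_ad_logs;
```

Keeping the hourly Spark jobs alongside this pipeline means any bad or late hour can still be repaired by an idempotent partition overwrite.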

The lake‑warehouse brings atomic commits, unified storage, higher compression via columnar Parquet, flexible schema and partition evolution, and multi‑engine support (Spark, Flink, Presto, StarRocks). Additional optimizations include task‑plan reduction, read‑split size tuning, dynamic partition pruning (spark.sql.iceberg.enable-dynamic-partition-pruning = true), and vectorized reads for complex types.
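The read‑side tunings above could be applied roughly as follows. The dynamic‑partition‑pruning flag is quoted from the article; the table properties follow standard Iceberg naming conventions but the specific values (a 256 MB split target, vectorized Parquet reads) are illustrative assumptions, not figures from the text:

```sql
-- Enable dynamic partition pruning for joins (setting quoted from the article).
SET spark.sql.iceberg.enable-dynamic-partition-pruning = true;

-- Illustrative table-level read tunings (standard Iceberg property names):
ALTER TABLE ad_logs.request_log SET TBLPROPERTIES (
    'read.split.target-size' = '268435456',         -- 256 MB splits: fewer tasks per scan
    'read.parquet.vectorization.enabled' = 'true'   -- vectorized Parquet reads
);
```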

Operational benefits include ~50 % storage savings, ~40 % compute cost reduction, simplified data access, integrated security/audit, and a foundation for future features such as full Flink real‑time ingestion, async I/O acceleration, index‑based query acceleration, and column‑level lifecycle management.

Tags: optimization, big data, Flink, data lake, Spark, Iceberg, log-processing
Written by

Tencent Advertising Technology

Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.
