Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink
The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.
Background: Tencent advertising logs are divided into real‑time and offline streams; real‑time logs are consumed via message queues, while offline logs are stored for minute‑level, hour‑level, and ad‑hoc analysis.
The existing architecture consists of a Subscriber (Spark Streaming writing to HDFS), an hourly Spark batch job for log merging, and a custom columnar "dragon" format based on Parquet. This design leads to heterogeneous log formats, storage redundancy, poor usability, high resource consumption, and weak governance.
To address these problems, a lake‑warehouse solution built on Apache Iceberg was designed. The schema introduces three‑level partitioning (hour, traffic site set, ad‑placement set) and stores logs uniformly as Parquet files, enabling SQL/Spark access without worrying about underlying formats.
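The three-level partitioning above can be sketched as a function that maps one log record to its partition. This is a minimal illustration, not the article's actual schema: the field names event_time, site_set, and placement_set, the UTC hour format, and the Hive-style path layout are all assumptions.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class PartitionKey(NamedTuple):
    hour: str           # hour bucket, formatted as YYYYMMDDHH (assumed format)
    site_set: int       # traffic site set (hypothetical field name)
    placement_set: int  # ad-placement set (hypothetical field name)


def partition_key(record: dict) -> PartitionKey:
    """Derive the three-level partition key from a log record."""
    ts = datetime.fromtimestamp(record["event_time"], tz=timezone.utc)
    return PartitionKey(
        hour=ts.strftime("%Y%m%d%H"),
        site_set=record["site_set"],
        placement_set=record["placement_set"],
    )


def partition_path(key: PartitionKey) -> str:
    """Render the key as a Hive-style directory path (illustrative only)."""
    return (
        f"hour={key.hour}/site_set={key.site_set}"
        f"/placement_set={key.placement_set}"
    )


record = {"event_time": 18010, "site_set": 1, "placement_set": 25}
key = partition_key(record)
print(partition_path(key))
```

A query that filters on hour and site set only has to open the directories whose key matches, which is the point of putting the most common filter columns into the partition spec.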
On the offline side, hourly Spark ingest jobs overwrite whole partitions, so rerunning a job replaces data instead of duplicating it and the ingest stays idempotent. Queries leverage Iceberg's column-level metrics for partition and filter pruning, reducing the number of files read and the memory used. A "commit-by-manifest" option (spark.sql.iceberg.write.commit-by-manifest = true) lowers driver memory pressure.
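The two ideas in this paragraph, idempotent partition overwrite and pruning by column-level metrics, can be sketched with a small in-memory model. This is a simplified illustration under stated assumptions: the Table and DataFile classes and the advertiser_id column are hypothetical, and real Iceberg keeps these bounds in manifest files rather than in a Python dict.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    path: str
    lower: dict  # column name -> minimum value stored in the file
    upper: dict  # column name -> maximum value stored in the file


@dataclass
class Table:
    # partition key -> data files currently visible in that partition
    partitions: dict = field(default_factory=dict)

    def overwrite_partition(self, key: str, files: list) -> None:
        # Replace rather than append: a rerun of the same hourly job
        # leaves the partition in exactly the same state.
        self.partitions[key] = list(files)

    def plan_scan(self, key: str, column: str, value) -> list:
        # Skip every file whose [lower, upper] range cannot contain `value`.
        return [
            f for f in self.partitions.get(key, [])
            if f.lower[column] <= value <= f.upper[column]
        ]


table = Table()
hourly_output = [
    DataFile("part-0.parquet", {"advertiser_id": 1}, {"advertiser_id": 500}),
    DataFile("part-1.parquet", {"advertiser_id": 501}, {"advertiser_id": 900}),
]
table.overwrite_partition("2024010113", hourly_output)
table.overwrite_partition("2024010113", hourly_output)  # rerun, same result

selected = table.plan_scan("2024010113", "advertiser_id", 650)
print([f.path for f in selected])
```

Only part-1.parquet survives planning for advertiser_id = 650, so the other file is never opened. Pruning works best when data is clustered so that file ranges overlap as little as possible.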
Real‑time enhancements introduce a Flink‑based minute‑level ingest pipeline with exactly‑once semantics, while retaining the original hourly Spark jobs for back‑fill and repair.
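One common way to get exactly-once results from a checkpointed streaming writer is to make the table commit idempotent per checkpoint. The sketch below shows that idea only. It is not the article's Flink job, and the LakeTable class and its fields are hypothetical.

```python
class LakeTable:
    """Toy table that remembers the last checkpoint it committed."""

    def __init__(self):
        self.files = []
        self.max_committed_checkpoint = -1

    def commit(self, checkpoint_id: int, staged_files: list) -> bool:
        # After a failure, the writer may replay a checkpoint that was
        # already committed. Skipping it keeps the table free of duplicates.
        if checkpoint_id <= self.max_committed_checkpoint:
            return False
        # In a real table format, this update is a single atomic commit.
        self.files.extend(staged_files)
        self.max_committed_checkpoint = checkpoint_id
        return True


table = LakeTable()
first = table.commit(1, ["minute-00.parquet"])
replayed = table.commit(1, ["minute-00.parquet"])  # replay after restart
second = table.commit(2, ["minute-01.parquet"])
print(table.files)
```

Keeping the hourly Spark job alongside this path, as the article describes, gives a way to overwrite a partition wholesale if the streaming output ever needs repair.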
The lake‑warehouse brings atomic commits, unified storage, higher compression via columnar Parquet, flexible schema and partition evolution, and multi‑engine support (Spark, Flink, Presto, StarRocks). Additional optimizations include task‑plan reduction, read‑split size tuning, dynamic partition pruning (spark.sql.iceberg.enable-dynamic-partition-pruning = true), and vectorized reads for complex types.
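Read-split size tuning trades task count against per-task work. The sketch below is a simplified planner, assuming sizes in MB and a greedy packing strategy; the function plan_splits is hypothetical and is not the engine's real planner. In open-source Iceberg the corresponding knob is the read.split.target-size table property, though whether the article tunes that exact property is an assumption.

```python
def plan_splits(file_sizes: list, target_size: int) -> list:
    """Group file bytes into read splits of at most `target_size` each."""
    # Step 1: cut large files into chunks no bigger than the target.
    chunks = []
    for size in file_sizes:
        while size > target_size:
            chunks.append(target_size)
            size -= target_size
        if size > 0:
            chunks.append(size)

    # Step 2: greedily pack consecutive chunks into one split
    # until adding the next chunk would exceed the target.
    splits, current, current_size = [], [], 0
    for chunk in chunks:
        if current and current_size + chunk > target_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(chunk)
        current_size += chunk
    if current:
        splits.append(current)
    return splits


sizes_mb = [300, 10, 20, 30]
small_target = plan_splits(sizes_mb, 64)
large_target = plan_splits(sizes_mb, 128)
print(len(small_target), len(large_target))
```

A larger target yields fewer, heavier tasks, which cuts scheduling overhead for tables with many small files. A smaller target increases parallelism for a few very large files.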
Operational benefits include ~50% storage savings, ~40% compute cost reduction, simplified data access, integrated security/audit, and a foundation for future features such as full Flink real‑time ingestion, async I/O acceleration, index‑based query acceleration, and column‑level lifecycle management.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.