Building a High‑Performance Advertising Feature Data Lake with Apache Iceberg at Tencent
Tencent's advertising team replaced a traditional HDFS‑Hive warehouse with an Apache Iceberg‑based data lake, adding primary‑key tables, multi‑stream merging, adaptive compaction, and Spark SPJ optimizations to achieve minute‑level feature update latency, 10× faster back‑fill, and up to 60% storage savings.
Tencent advertising processes trillions of new records and petabytes of intermediate data daily; to improve read/write performance and storage management, the feature engineering team selected Apache Iceberg as the foundation for a unified feature data lake.
Compared with Hudi and the legacy HDFS+Hive solution, Iceberg offered flexible metadata handling, efficient primary‑key updates, and better extensibility, making it suitable for the stream‑batch hybrid workload.
The team built a primary‑key table that buckets data by key and keeps the files within each bucket ordered. Writes are append‑only and reads perform merge‑on‑read, which enables high‑throughput upserts with low latency.
Multi‑stream, column‑level merging is handled on the storage side: per‑column timestamps are stored in a special column, so streams that update overlapping columns can be merged without extra write‑time overhead.
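To make the write and read paths concrete, here is a minimal Python sketch of an append-only, bucketed table with merge-on-read and per-column timestamps. It is a toy model under stated assumptions, not the team's implementation: `PrimaryKeyTable`, `bucket_of`, and `NUM_BUCKETS` are hypothetical names, the hash function is arbitrary, and each value carries its own timestamp instead of the timestamps living in one special column.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for the example


def bucket_of(key: str) -> int:
    """Map a primary key to a bucket with a stable hash."""
    return crc32(key.encode("utf-8")) % NUM_BUCKETS


class PrimaryKeyTable:
    """Toy append-only table: writes add rows, reads merge them per key."""

    def __init__(self):
        # bucket id -> list of (key, {column: (timestamp, value)})
        self.buckets = defaultdict(list)

    def append(self, key, ts, **columns):
        """Append a partial row; nothing is rewritten at write time."""
        row = {name: (ts, value) for name, value in columns.items()}
        self.buckets[bucket_of(key)].append((key, row))

    def read(self, key):
        """Merge-on-read: per column, keep the value with the newest timestamp."""
        merged = {}
        for row_key, row in self.buckets[bucket_of(key)]:
            if row_key != key:
                continue
            for name, (ts, value) in row.items():
                if name not in merged or ts >= merged[name][0]:
                    merged[name] = (ts, value)
        return {name: value for name, (_, value) in merged.items()}


table = PrimaryKeyTable()
table.append("user_1", ts=100, clicks=3, ctr=0.10)   # stream A
table.append("user_1", ts=105, ctr=0.12, spend=7.5)  # stream B, overlaps on ctr
table.append("user_1", ts=101, clicks=4)             # later record from stream A
print(table.read("user_1"))  # {'clicks': 4, 'ctr': 0.12, 'spend': 7.5}
```

Because each column is resolved by its own timestamp, the two streams never need to coordinate at write time. The cost moves to the reader, which is why the ordering and compaction of files in each bucket matter.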
Storage management was optimized through adaptive compaction: files are merged based on bucket size, file count, and update‑ratio thresholds, reducing small‑file proliferation and cutting storage redundancy by about 60% while maintaining query performance.
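A compaction trigger of this kind might look like the sketch below. The article names the three inputs (bucket size, file count, update ratio) but not the thresholds or how they combine, so the constants and the rule here are assumptions: bucket size sets an ideal file count, and a bucket is rewritten when it is too fragmented relative to that ideal or when too many of its rows have been superseded.

```python
import math
from dataclasses import dataclass


@dataclass
class BucketStats:
    total_bytes: int    # bytes across all files in the bucket
    file_count: int     # number of data files in the bucket
    updated_rows: int   # rows superseded by a later append
    total_rows: int     # all rows stored in the bucket


# Hypothetical thresholds; the article does not give the real values.
TARGET_FILE_BYTES = 128 * 1024 * 1024
FILE_COUNT_SLACK = 4
MAX_UPDATE_RATIO = 0.3


def should_compact(stats: BucketStats) -> bool:
    """Return True when the bucket is too fragmented or too redundant."""
    # Bucket size decides how many files the bucket "should" have.
    ideal_files = max(1, math.ceil(stats.total_bytes / TARGET_FILE_BYTES))
    too_fragmented = stats.file_count > ideal_files * FILE_COUNT_SLACK

    # Update ratio measures how much stored data is already obsolete.
    update_ratio = stats.updated_rows / stats.total_rows if stats.total_rows else 0.0
    too_redundant = update_ratio > MAX_UPDATE_RATIO

    return too_fragmented or too_redundant


mb = 1024 * 1024
print(should_compact(BucketStats(256 * mb, 40, 0, 1000)))   # True: 40 files, 2 ideal
print(should_compact(BucketStats(256 * mb, 3, 100, 1000)))  # False: tidy, 10% updated
print(should_compact(BucketStats(256 * mb, 3, 500, 1000)))  # True: 50% updated
```

Scaling the file-count limit with bucket size keeps large buckets from being rewritten just for holding many full-sized files, while small buckets with many tiny files still qualify.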
Latency for publishing features to the online KV system fell by nearly half, streaming ingestion latency improved from hours to minutes, and historical feature back‑fill became ten times faster. A custom KV‑Bucket transform enables map‑only reads, and Spark Storage‑Partitioned Join (SPJ) eliminates the shuffle in multi‑table joins, improving join performance by about 50%.
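The reason a storage-partitioned join can skip the shuffle is that both tables already agree on which bucket holds a given key. The Python sketch below illustrates that idea only; it is not Spark code, and `write_bucketed`, `bucket_local_join`, and the bucket count are hypothetical.

```python
from zlib import crc32

NUM_BUCKETS = 4  # both tables must use the same bucket count and hash


def bucket_of(key: str) -> int:
    """Stable hash shared by every table that wants to be join-compatible."""
    return crc32(key.encode("utf-8")) % NUM_BUCKETS


def write_bucketed(rows):
    """Lay rows out as a list of buckets, each a {key: value} mapping."""
    buckets = [dict() for _ in range(NUM_BUCKETS)]
    for key, value in rows:
        buckets[bucket_of(key)][key] = value
    return buckets


def bucket_local_join(left, right):
    """Inner join that pairs bucket i of left only with bucket i of right."""
    joined = {}
    for left_bucket, right_bucket in zip(left, right):
        # Matching keys can only live in the same bucket, so no row moves.
        for key in left_bucket.keys() & right_bucket.keys():
            joined[key] = (left_bucket[key], right_bucket[key])
    return joined


features = write_bucketed([("u1", 0.12), ("u2", 0.30), ("u3", 0.05)])
labels = write_bucketed([("u1", 1), ("u3", 0), ("u4", 1)])
result = bucket_local_join(features, labels)
print(sorted(result.items()))  # [('u1', (0.12, 1)), ('u3', (0.05, 0))]
```

Each bucket pair can be processed by an independent task, which is what removes the shuffle stage. In Spark itself, storage-partitioned joins on V2 sources are gated by settings such as `spark.sql.sources.v2.bucketing.enabled`; the exact set of options depends on the Spark and Iceberg versions in use, and the article does not list the ones the team set.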
Versioned branches (Tmp, Main, History) and Iceberg procedures were introduced to provide unified, fine‑grained read/write interfaces that support CDC‑level rollbacks at minute granularity and efficient archival.
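One way to picture branch-based publishing and rollback is as named pointers over immutable snapshots. The sketch below is a toy model with assumed roles (writes land on a `tmp` branch and are then published to `main`); the article does not describe how the Tmp, Main, and History branches are actually used, and `BranchedTable` is a hypothetical class, not an Iceberg API.

```python
class BranchedTable:
    """Toy model: snapshots are immutable, branches are pointers to them."""

    def __init__(self):
        self.snapshots = [frozenset()]          # snapshot id = list index
        self.branches = {"main": 0, "tmp": 0}   # branch name -> snapshot id

    def commit(self, branch, rows):
        """Create a new snapshot on top of the branch head and advance it."""
        base = self.snapshots[self.branches[branch]]
        self.snapshots.append(base | frozenset(rows))
        self.branches[branch] = len(self.snapshots) - 1
        return self.branches[branch]

    def fast_forward(self, target, source):
        """Publish: move target to the snapshot that source points at."""
        self.branches[target] = self.branches[source]

    def rollback(self, branch, snapshot_id):
        """Point the branch back at an earlier snapshot; no data is rewritten."""
        self.branches[branch] = snapshot_id

    def read(self, branch):
        return set(self.snapshots[self.branches[branch]])


t = BranchedTable()
good = t.commit("tmp", {"feature_v1"})
t.fast_forward("main", "tmp")          # readers of main now see feature_v1
t.commit("tmp", {"bad_feature"})
t.fast_forward("main", "tmp")          # a bad batch gets published
t.rollback("main", good)               # main returns to the last good snapshot
print(t.read("main"))                  # {'feature_v1'}
```

Rollback is cheap in this model because it only moves a pointer. Iceberg's Spark procedures include `fast_forward` and `rollback_to_snapshot`, which play similar roles, though the toy `fast_forward` above skips the ancestry check a real implementation needs, and the article does not say which procedures the team relies on.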
Future work will continue to refine stream‑batch read/write performance, reduce data visibility latency, and implement smarter, adaptive merge logic for further storage and query efficiency.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.