Bilibili's Iceberg‑Based Streaming‑Batch Integration: Architecture, Optimizations, and Practice
This article presents Bilibili's end‑to‑end exploration of a streaming‑batch unified data pipeline built on Apache Iceberg, detailing the original and iterated architectures for massive user behavior transmission, online AI training, DB synchronization, and dimension‑join, along with performance gains, cost savings, and future plans.
Introduction – Zhang Chenyi from Bilibili introduces a five‑part technical sharing on the company's Iceberg‑based streaming‑batch integration.
1. Massive User‑Behavior Data Transmission – The original architecture routes log, user‑behavior, and business‑DB data through log‑agents into a Kafka buffer, then splits into real‑time and offline paths (Kafka → Flink → Kafka DWD for real‑time; Kafka → Hive → Spark for offline). Its limitations: hours‑level latency on the offline path, costly Kafka retention on the real‑time path, and limited batch performance.
After iteration, the offline ODS‑to‑DWD layer is migrated from Hive to Iceberg, and Spark batch jobs are replaced by Flink streaming jobs that read Iceberg snapshots, reducing data freshness from hours to minutes and cutting storage duplication via Iceberg views.
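The core of this iteration is that the streaming job consumes Iceberg snapshots incrementally instead of re-scanning the table. A minimal sketch in plain Python of that polling pattern is below; the class and field names are illustrative, not real Iceberg or Flink APIs, and in a real job the last-consumed snapshot id would live in checkpointed Flink state.

```python
# Simplified model of incremental Iceberg snapshot consumption: the
# streaming job remembers the last snapshot it has read and, on each
# poll, processes only files appended by newer snapshots.
# All names here are illustrative, not real Iceberg/Flink APIs.

class Snapshot:
    def __init__(self, snapshot_id, added_files):
        self.snapshot_id = snapshot_id
        self.added_files = added_files  # data files committed by this snapshot

class IcebergTable:
    def __init__(self):
        self.snapshots = []

    def commit(self, files):
        self.snapshots.append(Snapshot(len(self.snapshots) + 1, files))

class IncrementalReader:
    """Polls the table and returns only newly appended files."""
    def __init__(self, table):
        self.table = table
        self.last_snapshot_id = 0  # checkpointed in real Flink state

    def poll(self):
        new_files = []
        for snap in self.table.snapshots:
            if snap.snapshot_id > self.last_snapshot_id:
                new_files.extend(snap.added_files)
                self.last_snapshot_id = snap.snapshot_id
        return new_files

table = IcebergTable()
reader = IncrementalReader(table)
table.commit(["f1.parquet", "f2.parquet"])
first = reader.poll()   # picks up both initial files
table.commit(["f3.parquet"])
second = reader.poll()  # picks up only the new commit
```

Because each poll advances a durable cursor rather than re-reading history, freshness is bounded by the commit interval, which is how the hours-to-minutes improvement is achieved.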
Additional optimizations include automatic SQL conversion from Flink streaming to batch, reuse of UDFs, and reduced data duplication.
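The streaming-to-batch conversion relies on the same SQL text running under different runtime settings, so the query and its UDFs are written once. A toy sketch of that idea, under the assumption that only the execution mode and source boundedness differ between the two runs (the option keys mimic Flink's configuration style but are assumptions here):

```python
# Illustrative sketch of streaming/batch SQL reuse: identical SQL and
# UDFs, with only runtime configuration differing between the two modes.
# Option keys are assumptions modeled on Flink-style configuration.

def build_job_config(sql, mode):
    assert mode in ("streaming", "batch")
    config = {"execution.runtime-mode": mode}
    if mode == "batch":
        # a batch run reads a bounded snapshot instead of tailing new commits
        config["scan.bounded"] = "true"
    return {"sql": sql, "config": config}

streaming_job = build_job_config("INSERT INTO dwd SELECT * FROM ods", "streaming")
batch_job = build_job_config("INSERT INTO dwd SELECT * FROM ods", "batch")
```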
2. Commercial and AI Online Training – Real‑time feature pipelines generate training samples in Kafka, which are consumed by both real‑time and offline training. Problems such as divergent storage (Kafka vs Hive), duplicated APIs, and heavy Protobuf (PB) deserialization are addressed by unifying storage on Iceberg, compressing data, and applying hot‑cold separation (SSD for recent data). This yields a 20% reduction in compute cost and improves latency.
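The hot-cold separation described above amounts to a tiering rule keyed on partition age. A toy sketch of that routing decision, where the seven-day window and the tier names are illustrative assumptions rather than Bilibili's actual configuration:

```python
# Toy model of hot-cold storage tiering: recent partitions stay on SSD
# ("hot"); older partitions move to cheaper storage ("cold").
# The 7-day window and tier names are illustrative assumptions.
from datetime import date

HOT_DAYS = 7  # assumed retention window for the SSD tier

def storage_tier(partition_date, today, hot_days=HOT_DAYS):
    """Return which tier a partition belongs on, by age in days."""
    return "ssd" if (today - partition_date).days < hot_days else "hdd"

today = date(2024, 5, 20)
hot = storage_tier(date(2024, 5, 18), today)   # 2 days old -> hot tier
cold = storage_tier(date(2024, 4, 1), today)   # well past the window
```

Training jobs that mostly read recent samples then hit SSD, while the long tail of historical data sits on cheaper media, which is where the compute and storage savings come from.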
3. DB Data Synchronization – Flink CDC streams full‑load and binlog data to Kafka, then writes to Iceberg MOR tables. Downstream Spark merge jobs produce Hive snapshots for users. Iterations introduce direct writes to Iceberg MOR tables, enabling minute‑level query latency, change‑log incremental reads, and enhanced file handling (Data, Position Delete, Equality Delete files). Additional keyed‑state caching prevents replay on restarts, and Magnus provides asynchronous compaction.
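The MOR read path combines a data file with its delete files at query time: position deletes remove specific row ordinals of a file, while equality deletes remove any row whose key matches. A simplified sketch of that resolution logic, with in-memory structures standing in for Iceberg's on-disk formats:

```python
# Simplified merge-on-read: a data file's rows are filtered against
# position-delete entries (drop row N of this file) and equality-delete
# entries (drop any row whose key matches). Structures are illustrative
# stand-ins for Iceberg's actual delete-file formats.

def merge_on_read(data_rows, position_deletes, equality_deletes, key_col):
    """data_rows: list of dicts in file order;
    position_deletes: set of row ordinals within this file;
    equality_deletes: set of key values to drop."""
    out = []
    for pos, row in enumerate(data_rows):
        if pos in position_deletes:
            continue  # row superseded by a position delete
        if row[key_col] in equality_deletes:
            continue  # row superseded by an equality delete
        out.append(row)
    return out

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
result = merge_on_read(rows,
                       position_deletes={0},    # drop the first row
                       equality_deletes={3},    # drop any row with id=3
                       key_col="id")
```

Because deletes are applied at read time rather than by rewriting data files, writes stay cheap and fresh, which is what enables the minute-level query latency; the deferred rewrite cost is then paid by Magnus's asynchronous compaction.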
4. Iceberg Dimension Join – Traditional double‑stream joins to external DBs cause high QPS pressure and state loss on restarts. By leveraging Iceberg change‑log incremental reads, a new lookup join stores the full dimension table in Flink state, supports TTL‑free updates, and maintains primary‑key consistency. Adjustments to Calcite SQL hints, Flink project rules, and operator ordering ensure correct state updates and prevent null joins.
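The lookup-join design can be sketched as a dict standing in for Flink keyed state, kept current by applying Iceberg change-log records (insert, update-before/after, delete) instead of probing an external DB per event. Names and the changelog op codes' handling below are illustrative of the pattern, not Bilibili's implementation:

```python
# Sketch of a changelog-driven lookup join: the full dimension table is
# held in an in-memory dict (standing in for Flink keyed state) and kept
# current by applying change-log records, so no per-event external-DB
# lookup and no TTL-based eviction is needed. Names are illustrative.

class DimState:
    def __init__(self):
        self.state = {}  # primary key -> latest dimension row

    def apply_changelog(self, op, key, row=None):
        if op in ("+I", "+U"):
            self.state[key] = row      # insert, or new value of an update
        elif op in ("-U", "-D"):
            self.state.pop(key, None)  # retraction half of update, or delete

    def lookup(self, key):
        return self.state.get(key)

dim = DimState()
dim.apply_changelog("+I", 42, {"name": "alice"})
dim.apply_changelog("-U", 42)                      # retract old value
dim.apply_changelog("+U", 42, {"name": "alice_v2"})  # apply new value
joined = {"event_key": 42, **(dim.lookup(42) or {})}
```

Replaying the change log from Iceberg on restart is also what lets the state be rebuilt without hammering the source DB, addressing the QPS-pressure and state-loss problems of the double-stream approach.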
5. Q&A – Discusses conflict resolution between Magnus and Flink during compaction, noting that JobManager now triggers Magnus compaction at configurable intervals, achieving minute‑level compaction without impacting data freshness.
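The coordination described in the Q&A reduces to a time-gated trigger: the job's coordinator fires a compaction request only when a configured interval has elapsed since the last one, so streaming commits and compaction don't contend. A minimal sketch of that gate (the interval value is an assumption):

```python
# Toy scheduler mirroring the described coordination: compaction is
# triggered only when a configured interval has elapsed since the last
# trigger, keeping it out of the way of streaming commits.
# The 300-second interval is an illustrative assumption.

class CompactionTrigger:
    def __init__(self, interval_s=300):
        self.interval_s = interval_s
        self.last_trigger = None  # timestamp of the last fired compaction

    def should_compact(self, now_s):
        """Return True (and record the trigger time) if the interval elapsed."""
        if self.last_trigger is None or now_s - self.last_trigger >= self.interval_s:
            self.last_trigger = now_s
            return True
        return False

trigger = CompactionTrigger(interval_s=300)
```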
Conclusion – The migration to Iceberg and associated optimizations deliver approximately ¥3.55 million annual cost savings, a 20% CPU reduction, 22% memory reduction, and a 48.9% performance boost across the pipeline.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.