Bilibili's Practice of Building a Streaming Data Lake with Hudi and Flink
This article presents Bilibili's practical experience of building a streaming data lake on Hudi and Flink, covering the background challenges, four typical cases, batch‑stream integration optimizations, infrastructure and kernel enhancements, and future work, with a focus on the problems encountered during batch‑stream unification and the corresponding solutions.
Background and Challenges: Bilibili's real‑time data warehouse spans acquisition, processing, and AI/BI layers; its dual batch‑stream architecture brings high maintenance cost, poor observability, data silos, and low query efficiency.
Solution Overview: Introduce a data lake to enable efficient data flow and unified data management, and accelerate analytics through clustering, indexing, materialized views, and Alluxio caching.
Typical Cases:
1. RDB One‑Click Lake: Replaced DataX+Hive with CDC+Hudi, handling out‑of‑order data, schema evolution, and data drift via a snapshot view and timeline‑based versioning.
2. Traffic Log Splitting: Adopted Hudi Append to replace Hive, implemented dynamic BU‑level routing at the transport layer, and introduced logical partitioning with view‑based subscription to improve timeliness and isolation.
3. Materialized Query Acceleration: Added Flink materialized view support, enabling hint‑driven incremental consumption of Hudi tables and reducing query latency while providing fallback to source queries.
4. Real‑Time Warehouse Evolution: Explored further scenarios such as Hudi replacing Kafka, near‑real‑time data quality checks, and direct BI access without data export.
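The snapshot view with timeline‑based versioning in case 1 can be sketched as follows. This is a minimal conceptual model in plain Python, not Hudi's actual implementation: each write produces an "instant" on a timeline, and the snapshot view resolves the latest value per key using an event‑time field (analogous to Hudi's precombine field) so that late, out‑of‑order CDC records do not overwrite newer state. All class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    event_ts: int  # precombine-style field: the larger timestamp wins

class SnapshotView:
    def __init__(self):
        self.timeline = []  # list of (instant_ts, records) completed commits
        self._state = {}    # key -> Record holding the latest event_ts seen

    def commit(self, instant_ts, records):
        # Append a completed instant; a record whose event_ts is older than
        # the current state for its key is treated as out-of-order and dropped.
        self.timeline.append((instant_ts, records))
        for r in records:
            cur = self._state.get(r.key)
            if cur is None or r.event_ts >= cur.event_ts:
                self._state[r.key] = r

    def snapshot(self):
        # Read-optimized view: latest value per key across all instants.
        return {k: r.value for k, r in self._state.items()}

view = SnapshotView()
view.commit(1, [Record("u1", "a", 100), Record("u2", "b", 100)])
# A late-arriving update for u1 carries an older event_ts and is ignored:
view.commit(2, [Record("u1", "stale", 90), Record("u2", "b2", 110)])
print(view.snapshot())  # {'u1': 'a', 'u2': 'b2'}
```

The same idea also covers data drift: because every commit is an immutable instant on the timeline, readers can pin a consistent version instead of observing partially written partitions.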
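The fallback behavior in case 3 can be illustrated with a small sketch. This is not Flink's or Hudi's real API; it only models the routing decision: serve a query from the materialized view's incrementally maintained state when the view is fresh enough, otherwise fall back to the slower source scan. The names `MaterializedView`, `refresh_lag_ms`, and `answer_query` are assumptions made for the example.

```python
class MaterializedView:
    def __init__(self, refresh_lag_ms):
        self.refresh_lag_ms = refresh_lag_ms  # max tolerated staleness
        self.last_refreshed_ts = 0
        self.state = {}                       # pre-aggregated results

    def refresh(self, now_ts, new_rows):
        # Incremental consumption: fold in only the rows committed since the
        # last refresh, rather than recomputing the view from scratch.
        for k, v in new_rows:
            self.state[k] = self.state.get(k, 0) + v
        self.last_refreshed_ts = now_ts

def answer_query(mv, now_ts, source_scan):
    # Hint-driven routing: use the view if its staleness is within bounds,
    # otherwise fall back to querying the source table directly.
    if now_ts - mv.last_refreshed_ts <= mv.refresh_lag_ms:
        return dict(mv.state), "mv"
    return source_scan(), "source"

mv = MaterializedView(refresh_lag_ms=500)
mv.refresh(now_ts=1000, new_rows=[("uv", 3), ("pv", 10)])
print(answer_query(mv, 1200, lambda: {"uv": 3, "pv": 10}))
# served from the view: ({'uv': 3, 'pv': 10}, 'mv')
print(answer_query(mv, 2000, lambda: {"uv": 4, "pv": 12}))
# stale view, falls back: ({'uv': 4, 'pv': 12}, 'source')
```

The fallback path is what makes the acceleration safe to enable by default: a stale or missing view degrades to the original query plan instead of returning wrong results.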
Infrastructure and Kernel Optimizations:
TableService Optimization: Externalized table services via Hudi Manager for better stability, flexible policy updates, and platform‑wide management.
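The benefit of externalizing table services can be sketched as follows. This is a hypothetical model, not Hudi Manager's real API: a central manager owns the compaction policies and schedules jobs from reported table metrics, so a policy change takes effect without restarting any ingestion job, and the most backlogged table is served first.

```python
import heapq

class TableServiceManager:
    """Illustrative external scheduler for compaction-style table services."""

    def __init__(self):
        self.policies = {}  # table -> max small files tolerated
        self.queue = []     # min-heap of (negated backlog, table, action)

    def set_policy(self, table, max_small_files):
        # Policies live in the manager, not in the writer: updating one here
        # takes effect without touching the running ingestion pipeline.
        self.policies[table] = max_small_files

    def report_metrics(self, table, small_files):
        threshold = self.policies.get(table, 10)
        if small_files > threshold:
            # Negate the backlog so the most backlogged table pops first.
            heapq.heappush(
                self.queue, (-(small_files - threshold), table, "compact")
            )

    def next_job(self):
        # Hand the highest-priority pending job to an executor, if any.
        return heapq.heappop(self.queue)[1:] if self.queue else None

mgr = TableServiceManager()
mgr.set_policy("ods.t1", 5)
mgr.set_policy("ods.t2", 5)
mgr.report_metrics("ods.t1", 7)
mgr.report_metrics("ods.t2", 20)
print(mgr.next_job())  # ('ods.t2', 'compact')
print(mgr.next_job())  # ('ods.t1', 'compact')
```

Running the services in a separate process also isolates their failures from the writers, which is the stability gain the talk refers to.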
Partition Advancement Support: Implemented a watermark‑driven two‑step commit (arrival and ready), with Hive Metastore extensions and Flink state tracking, to keep batch and stream partitions in sync.
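The two‑step commit can be modeled as a small state machine. This is a minimal sketch under assumed semantics, not Bilibili's actual implementation: a partition is marked "arrived" when the watermark first passes its end time (stream consumers may start reading), and "ready" once the watermark also clears the allowed lateness (the partition is complete and can be published to batch, e.g. via the Hive Metastore extension). All names are illustrative.

```python
class PartitionCommitter:
    """Watermark-driven two-step partition commit: open -> arrived -> ready."""

    def __init__(self, partition_end_ts, allowed_lateness_ms):
        self.end_ts = partition_end_ts
        self.lateness = allowed_lateness_ms
        self.state = "open"

    def on_watermark(self, wm):
        committed = []
        if self.state == "open" and wm >= self.end_ts:
            # Step 1: data for this partition has started landing; notify
            # stream consumers that incremental reads may begin.
            self.state = "arrived"
            committed.append("arrival")
        if self.state == "arrived" and wm >= self.end_ts + self.lateness:
            # Step 2: no more late data is expected; publish the partition
            # to the batch side (e.g. register it in the metastore).
            self.state = "ready"
            committed.append("ready")
        return committed

p = PartitionCommitter(partition_end_ts=1000, allowed_lateness_ms=200)
print(p.on_watermark(900))   # []
print(p.on_watermark(1100))  # ['arrival']
print(p.on_watermark(1300))  # ['ready']
```

Tracking `state` per partition in Flink state is what makes the protocol restart-safe: after recovery the committer resumes from the last persisted step instead of re-emitting commits.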
Enhanced Data Rollback: Integrated Instant Rollback and Savepoint Rollback mechanisms, using Spark procedures for metadata repair, and provided one‑click re‑run capabilities.
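The planning step behind combining the two rollback mechanisms can be sketched as follows. This is an illustrative helper, not Hudi's actual Spark procedure: given the timeline of committed instants, the available savepoints, and a known‑bad instant, pick the latest savepoint before the bad instant as the restore target and list every later instant as rolled back and due for re‑run (the "one‑click re‑run" set).

```python
def plan_rollback(instants, savepoints, bad_instant):
    """Pick the restore point and the instants to undo and re-run.

    instants: sorted list of committed instant timestamps.
    savepoints: subset of instants where a savepoint exists.
    bad_instant: first instant known to contain bad data.
    """
    candidates = [s for s in savepoints if s < bad_instant]
    if not candidates:
        # Without an earlier savepoint, Instant Rollback alone cannot help;
        # the table would need a full rebuild.
        raise ValueError("no savepoint before the bad instant")
    target = max(candidates)
    rolled_back = [i for i in instants if i > target]
    return target, rolled_back

target, rolled_back = plan_rollback(
    instants=[10, 20, 30, 40, 50], savepoints=[10, 30], bad_instant=40
)
print(target, rolled_back)  # 30 [40, 50]
```

Rolling back to the savepoint and re-running the listed instants restores a consistent timeline; the metadata repair via Spark procedures mentioned above handles the case where the timeline itself was left inconsistent.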
Future Work Outlook: Plans include strengthening core data‑lake capabilities (e.g., dimension tables, lock‑free updates), unifying metastore management with Hudi Manager, expanding batch‑stream unified scenarios, and deeper platform integration to improve the user experience.
DataFunSummit
Official account of the DataFun community, dedicated to sharing news from big data and AI industry summits and speakers' talks, with regularly published downloadable resource packs.