Bilibili's Practice of Building a Streaming Data Lake with Hudi and Flink
This article presents Bilibili's practical experience of building a streaming data lake on Hudi and Flink, covering the background challenges, four typical cases, batch‑stream integration optimizations, infrastructure and kernel enhancements, and future work, with a focus on the problems encountered during batch‑stream unification and the corresponding solutions.
Background and Challenges: Bilibili's real‑time data warehouse spans acquisition, processing, and AI/BI layers; its dual batch‑stream architecture brings high maintenance cost, poor observability, data silos, and low query efficiency.
Solution Overview: Introduce a data lake to enable efficient data flow and unified data management, and accelerate analytics through clustering, indexing, materialized views, and Alluxio caching.
Typical Cases:
1. RDB One‑Click Lake: Replaced DataX+Hive with CDC+Hudi, handling out‑of‑order data, schema evolution, and data drift via a snapshot view and timeline‑based versioning.
2. Traffic Log Splitting: Adopted Hudi Append to replace Hive, implemented dynamic BU‑level routing at the transport layer, and introduced logical partitioning with view‑based subscription to improve timeliness and isolation.
3. Materialized Query Acceleration: Added Flink materialized view support, enabling hint‑driven incremental consumption of Hudi tables and reducing query latency while providing fallback to source queries.
4. Real‑Time Warehouse Evolution: Explored further scenarios such as Hudi replacing Kafka, near‑real‑time data quality checks, and direct BI access without data export.
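The snapshot view with timeline‑based versioning in case 1 can be sketched as follows. This is a minimal conceptual model in plain Python, not Hudi's actual implementation: each write produces an "instant" on a timeline, and the snapshot view resolves the latest value per key using an event‑time field (analogous to Hudi's precombine field) so that late, out‑of‑order CDC records do not overwrite newer state. All class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    event_ts: int  # precombine-style field: the larger timestamp wins

class SnapshotView:
    def __init__(self):
        self.timeline = []  # list of (instant_ts, records) completed commits
        self._state = {}    # key -> Record holding the latest event_ts seen

    def commit(self, instant_ts, records):
        # Append a completed instant; a record whose event_ts is older than
        # the current state for its key is treated as out-of-order and dropped.
        self.timeline.append((instant_ts, records))
        for r in records:
            cur = self._state.get(r.key)
            if cur is None or r.event_ts >= cur.event_ts:
                self._state[r.key] = r

    def snapshot(self):
        # Read-optimized view: latest value per key across all instants.
        return {k: r.value for k, r in self._state.items()}

view = SnapshotView()
view.commit(1, [Record("u1", "a", 100), Record("u2", "b", 100)])
# A late-arriving update for u1 carries an older event_ts and is ignored:
view.commit(2, [Record("u1", "stale", 90), Record("u2", "b2", 110)])
print(view.snapshot())  # {'u1': 'a', 'u2': 'b2'}
```

The same idea also covers data drift: because every commit is an immutable instant on the timeline, readers can pin a consistent version instead of observing partially written partitions.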
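The fallback behavior in case 3 can be illustrated with a small sketch. This is not Flink's or Hudi's real API; it only models the routing decision: serve a query from the materialized view's incrementally maintained state when the view is fresh enough, otherwise fall back to the slower source scan. The names `MaterializedView`, `refresh_lag_ms`, and `answer_query` are assumptions made for the example.

```python
class MaterializedView:
    def __init__(self, refresh_lag_ms):
        self.refresh_lag_ms = refresh_lag_ms  # max tolerated staleness
        self.last_refreshed_ts = 0
        self.state = {}                       # pre-aggregated results

    def refresh(self, now_ts, new_rows):
        # Incremental consumption: fold in only the rows committed since the
        # last refresh, rather than recomputing the view from scratch.
        for k, v in new_rows:
            self.state[k] = self.state.get(k, 0) + v
        self.last_refreshed_ts = now_ts

def answer_query(mv, now_ts, source_scan):
    # Hint-driven routing: use the view if its staleness is within bounds,
    # otherwise fall back to querying the source table directly.
    if now_ts - mv.last_refreshed_ts <= mv.refresh_lag_ms:
        return dict(mv.state), "mv"
    return source_scan(), "source"

mv = MaterializedView(refresh_lag_ms=500)
mv.refresh(now_ts=1000, new_rows=[("uv", 3), ("pv", 10)])
print(answer_query(mv, 1200, lambda: {"uv": 3, "pv": 10}))
# served from the view: ({'uv': 3, 'pv': 10}, 'mv')
print(answer_query(mv, 2000, lambda: {"uv": 4, "pv": 12}))
# stale view, falls back: ({'uv': 4, 'pv': 12}, 'source')
```

The fallback path is what makes the acceleration safe to enable by default: a stale or missing view degrades to the original query plan instead of returning wrong results.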
Infrastructure and Kernel Optimizations:
TableService Optimization: Externalized table services via Hudi Manager for better stability, flexible policy updates, and platform‑wide management.
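The benefit of externalizing table services can be sketched as follows. This is a hypothetical model, not Hudi Manager's real API: a central manager owns the compaction policies and schedules jobs from reported table metrics, so a policy change takes effect without restarting any ingestion job, and the most backlogged table is served first.

```python
import heapq

class TableServiceManager:
    """Illustrative external scheduler for compaction-style table services."""

    def __init__(self):
        self.policies = {}  # table -> max small files tolerated
        self.queue = []     # min-heap of (negated backlog, table, action)

    def set_policy(self, table, max_small_files):
        # Policies live in the manager, not in the writer: updating one here
        # takes effect without touching the running ingestion pipeline.
        self.policies[table] = max_small_files

    def report_metrics(self, table, small_files):
        threshold = self.policies.get(table, 10)
        if small_files > threshold:
            # Negate the backlog so the most backlogged table pops first.
            heapq.heappush(
                self.queue, (-(small_files - threshold), table, "compact")
            )

    def next_job(self):
        # Hand the highest-priority pending job to an executor, if any.
        return heapq.heappop(self.queue)[1:] if self.queue else None

mgr = TableServiceManager()
mgr.set_policy("ods.t1", 5)
mgr.set_policy("ods.t2", 5)
mgr.report_metrics("ods.t1", 7)
mgr.report_metrics("ods.t2", 20)
print(mgr.next_job())  # ('ods.t2', 'compact')
print(mgr.next_job())  # ('ods.t1', 'compact')
```

Running the services in a separate process also isolates their failures from the writers, which is the stability gain the talk refers to.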
Partition Advancement Support: Implemented a watermark‑driven two‑step commit (arrival and ready), with Hive Metastore extensions and Flink state tracking, to keep batch and stream partitions in sync.
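The two‑step commit can be modeled as a small state machine. This is a minimal sketch under assumed semantics, not Bilibili's actual implementation: a partition is marked "arrived" when the watermark first passes its end time (stream consumers may start reading), and "ready" once the watermark also clears the allowed lateness (the partition is complete and can be published to batch, e.g. via the Hive Metastore extension). All names are illustrative.

```python
class PartitionCommitter:
    """Watermark-driven two-step partition commit: open -> arrived -> ready."""

    def __init__(self, partition_end_ts, allowed_lateness_ms):
        self.end_ts = partition_end_ts
        self.lateness = allowed_lateness_ms
        self.state = "open"

    def on_watermark(self, wm):
        committed = []
        if self.state == "open" and wm >= self.end_ts:
            # Step 1: data for this partition has started landing; notify
            # stream consumers that incremental reads may begin.
            self.state = "arrived"
            committed.append("arrival")
        if self.state == "arrived" and wm >= self.end_ts + self.lateness:
            # Step 2: no more late data is expected; publish the partition
            # to the batch side (e.g. register it in the metastore).
            self.state = "ready"
            committed.append("ready")
        return committed

p = PartitionCommitter(partition_end_ts=1000, allowed_lateness_ms=200)
print(p.on_watermark(900))   # []
print(p.on_watermark(1100))  # ['arrival']
print(p.on_watermark(1300))  # ['ready']
```

Tracking `state` per partition in Flink state is what makes the protocol restart-safe: after recovery the committer resumes from the last persisted step instead of re-emitting commits.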
Enhanced Data Rollback: Integrated Instant Rollback and Savepoint Rollback mechanisms, using Spark procedures for metadata repair, and provided one‑click re‑run capabilities.
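The planning step behind combining the two rollback mechanisms can be sketched as follows. This is an illustrative helper, not Hudi's actual Spark procedure: given the timeline of committed instants, the available savepoints, and a known‑bad instant, pick the latest savepoint before the bad instant as the restore target and list every later instant as rolled back and due for re‑run (the "one‑click re‑run" set).

```python
def plan_rollback(instants, savepoints, bad_instant):
    """Pick the restore point and the instants to undo and re-run.

    instants: sorted list of committed instant timestamps.
    savepoints: subset of instants where a savepoint exists.
    bad_instant: first instant known to contain bad data.
    """
    candidates = [s for s in savepoints if s < bad_instant]
    if not candidates:
        # Without an earlier savepoint, Instant Rollback alone cannot help;
        # the table would need a full rebuild.
        raise ValueError("no savepoint before the bad instant")
    target = max(candidates)
    rolled_back = [i for i in instants if i > target]
    return target, rolled_back

target, rolled_back = plan_rollback(
    instants=[10, 20, 30, 40, 50], savepoints=[10, 30], bad_instant=40
)
print(target, rolled_back)  # 30 [40, 50]
```

Rolling back to the savepoint and re-running the listed instants restores a consistent timeline; the metadata repair via Spark procedures mentioned above handles the case where the timeline itself was left inconsistent.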
Future Work Outlook: Plans include strengthening core data‑lake capabilities (e.g., dimension tables, lock‑free updates), unifying metastore management with Hudi Manager, expanding batch‑stream unified scenarios, and deeper platform integration to improve the user experience.
DataFunSummit
Official account of the DataFun community, dedicated to sharing news from big data and AI industry summits and speakers' talks, with regularly published downloadable resource packs.