Incremental Data Lake Design and Hudi Core Optimizations with Flink
This article describes how combining Apache Flink with Apache Hudi enables an incremental data lake that delivers near‑real‑time analytics. The key engineering work includes switching from copy‑on‑write to merge‑on‑read, fixing log‑handling bugs, improving compaction planning, and refactoring table‑service scheduling. The article showcases use cases such as CDC ingestion, data quality control, and real‑time materialized views, and outlines future enhancements including optimistic concurrency control and unified schema evolution.
This article presents the design, challenges, and optimization techniques of an incremental data lake built on Apache Hudi and Flink, targeting near‑real‑time analytics scenarios such as live streaming, recommendation, and content moderation.
Background : Real‑time data is increasingly valuable. Traditional data warehouses with hour‑ or day‑level partitions cannot meet the timeliness requirements of many business scenarios. The authors encountered three main pain points: (1) insufficient timeliness, (2) complex data integration and synchronization, and (3) inconsistent storage between batch and stream processing.
Proposed Solution : Adopt Flink + Hudi as the core of the incremental data lake. Hudi provides ACID guarantees, upsert support, and efficient incremental reads, while Flink offers unified stream‑batch computation. Compared with Iceberg, Hudi’s upsert and small‑file merging capabilities were deemed more suitable for the use case.
Hudi Kernel Optimizations :
Table type: transition from COW (copy‑on‑write) to MOR (merge‑on‑read), so that updates are appended to log files instead of rewriting whole base files, reducing I/O amplification.
Log handling bugs: fix log‑block skipping that caused data loss, improve the merging of compaction and rollback operations to avoid duplicate records, and ensure the final data batch triggers compaction so the read‑optimized (RO) table does not lag.
CompactionPlanOperator improvements: process all pending compaction plans instead of only the latest, and switch to a synchronous execution model to reduce plan backlog.
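The CompactionPlanOperator change can be illustrated with a minimal sketch. This is plain Python, not actual Hudi code; the function and variable names (`schedule_latest_only`, `pending`) are invented for illustration. It contrasts the old behavior, which picked only the most recent pending plan per cycle, with the improved behavior of draining every pending plan:

```python
# Illustrative sketch (not Hudi source): scheduling only the latest pending
# compaction plan lets older plans pile up, while draining all pending plans
# keeps the queue bounded.

def schedule_latest_only(pending):
    """Old behavior: run only the most recent plan; older plans linger."""
    if pending:
        return [pending[-1]], pending[:-1]
    return [], pending

def schedule_all(pending):
    """Improved behavior: process every pending plan this cycle."""
    return list(pending), []

pending = ["plan-001", "plan-002", "plan-003"]

done_old, backlog_old = schedule_latest_only(pending)
done_new, backlog_new = schedule_all(pending)

print(backlog_old)  # ['plan-001', 'plan-002'] -- never compacted this cycle
print(backlog_new)  # [] -- no backlog accumulates
```

With steady ingestion, the old policy could leave a growing tail of uncompacted plans; processing all pending plans each cycle removes that backlog.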
Table Service Enhancements : Refactor table‑service scheduling and execution, introduce a configurable inlineScheduleEnabled flag, and support dynamic configuration providers for external policy integration. The redesign separates ingestion jobs from orchestration jobs, enabling standalone, inline‑sync, and inline‑async modes.
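The three execution modes can be sketched as a small dispatcher. This is a conceptual Python analogue, not the actual implementation; the class and method names are invented, though the `inlineScheduleEnabled` flag comes from the article:

```python
# Conceptual sketch of the three table-service execution modes:
# standalone (external orchestration job), inline-sync (run in the commit
# path), and inline-async (scheduled inline, executed in the background).
from concurrent.futures import ThreadPoolExecutor

class TableServiceRunner:
    def __init__(self, inline_schedule_enabled=True, async_execute=False):
        # Mirrors the article's inlineScheduleEnabled flag: when False,
        # scheduling is delegated to a standalone orchestration job.
        self.inline_schedule_enabled = inline_schedule_enabled
        self.async_execute = async_execute
        self.pool = ThreadPoolExecutor(max_workers=1)

    def after_commit(self, run_service):
        """Hypothetical hook called by the ingestion job after each commit."""
        if not self.inline_schedule_enabled:
            return "standalone"       # external job schedules and executes
        if self.async_execute:
            self.pool.submit(run_service)
            return "inline-async"     # scheduled inline, run in background
        run_service()
        return "inline-sync"          # scheduled and run in the commit path

runner = TableServiceRunner(inline_schedule_enabled=True, async_execute=False)
mode = runner.after_commit(lambda: None)
print(mode)  # inline-sync
```

Separating ingestion from orchestration this way lets heavy services such as compaction run without blocking or destabilizing the write path.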
Practical Scenarios :
Incremental Data Warehouse: Use Hudi source in Flink to perform incremental consumption across ODS → DWD → ADS layers, achieving minute‑level data visibility.
CDC to Hudi: Replace slow DataX‑based MySQL synchronization with Flink CDC (full snapshot plus incremental changelog) feeding Kafka and then Hudi, supporting upserts, deletes, and minute‑level visibility.
Real‑time Data Quality Control (DQC): Dump Kafka streams into Hudi tables, enabling minute‑level monitoring that aligns with offline DQC logic.
Real‑time Materialization: Define materialized views via SQL, let Flink aggregate detail data and upsert results into Hudi materialized tables, providing sub‑second query latency within the HDFS ecosystem.
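The materialization flow above can be sketched in plain Python. This is an analogue of what the Flink job does, not real Flink or Hudi code; the dictionaries stand in for Flink keyed state and the Hudi table, and the names (`on_detail_record`, `item-1`) are invented:

```python
# Plain-Python analogue of real-time materialization: keep a running
# aggregate per key (Flink keyed state) and upsert the result row into the
# materialized table (here a dict keyed by the record key, like a Hudi table).

materialized_table = {}   # stand-in for the Hudi materialized table
running_sums = {}         # stand-in for Flink keyed aggregation state

def on_detail_record(key, amount):
    """Incrementally aggregate one detail record, then upsert the view row."""
    running_sums[key] = running_sums.get(key, 0) + amount
    # Upsert semantics: a row with the same key overwrites the previous one.
    materialized_table[key] = {"key": key, "total": running_sums[key]}

for key, amount in [("item-1", 10), ("item-2", 5), ("item-1", 7)]:
    on_detail_record(key, amount)

print(materialized_table["item-1"]["total"])  # 17
```

Because each detail record updates only its own key's row, queries against the materialized table read precomputed results rather than re-aggregating detail data, which is what enables the sub‑second latency mentioned above.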
Future Outlook : Enhance Hudi core capabilities (optimistic concurrency control, richer schema evolution, meta‑server for unified real‑time/offline tables), and migrate weak‑real‑time workloads from Kafka to Hudi to leverage scalable file‑system storage.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.