
Integrating Apache Flink with Apache Hudi: From Data Warehouse to Data Lake

This article explains how Apache Flink integrates with Apache Hudi to enable real‑time data lake ingestion, covering the evolution from traditional data warehouses to data lakes, Hudi’s core concepts such as timeline and file grouping, copy‑on‑write vs merge‑on‑read modes, and Flink’s CDC‑based ETL pipeline.

DataFunTalk

May 24, 2022


01 Data Warehouse to Data Lake

In recent years, data lakes have become popular as a flexible alternative to traditional data warehouses. Early warehouses (e.g., Teradata, Vertica) used tightly coupled storage‑compute architectures with proprietary formats. Cloud‑based services (EMR, Redshift, etc.) introduced storage‑compute separation, but the warehouses among them still relied on closed formats. Since around 2018, open‑source lake formats such as Hudi, Iceberg, and Delta Lake have emerged.

Data lakes provide a table‑format abstraction that operates directly on object storage, supporting a wide range of query engines (Presto, SparkSQL, Hive). They also expose transactional and upsert capabilities, which are essential for modern real‑time analytics.

02 Database to Warehouse/Lake

Traditional batch pipelines pull data from source databases (e.g., MySQL) with a full extract, then periodically merge incremental binlog changes into it. Modern CDC tools (Debezium, Canal) stream binlog events in near‑real time, but they require downstream storage that supports upsert semantics, since each event may insert, update, or delete a row by key. Hudi offers mature upsert support, enabling CDC pipelines that write directly to the lake without an intermediate Kafka layer, or with Kafka as a buffer for large historical loads.
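To make "upsert semantics" concrete, here is a minimal Python sketch that applies a stream of change events to a keyed table. The event tuple shape and the `apply_changes` helper are illustrative assumptions, not the Debezium or Canal wire format.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: (operation, primary key, row payload).
ChangeEvent = Tuple[str, int, dict]


def apply_changes(table: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply CDC-style events in order; the table stays keyed by primary key."""
    for op, key, row in events:
        if op in ("insert", "update"):
            table[key] = row          # upsert: insert if absent, overwrite if present
        elif op == "delete":
            table.pop(key, None)      # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


events = [
    ("insert", 1, {"name": "alice", "city": "Berlin"}),
    ("insert", 2, {"name": "bob", "city": "Paris"}),
    ("update", 1, {"name": "alice", "city": "Munich"}),
    ("delete", 2, {}),
]
snapshot = apply_changes({}, events)
```

An append-only store would end up holding all four events as rows. A store with upsert support ends up with a single current row per key, which is what a lake table mirroring a MySQL table needs.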

03 Hudi Core Concepts

1. Timeline – Every Hudi action (commit, clean, compaction, rollback, savepoint) is recorded as an Instant with a unique timestamp, an action type, and a state (requested, inflight, completed). The timeline provides a coherent view of all operations and enables consistent reads.
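As a rough illustration of how a timeline gives readers a consistent view, the sketch below models instants as plain records and exposes only completed ones. The `Instant` class and helper names are hypothetical and far simpler than Hudi's actual timeline classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Instant:
    timestamp: str  # fixed-width string, so lexical order equals time order
    action: str     # e.g. "commit", "clean", "compaction", "rollback", "savepoint"
    state: str      # "requested", "inflight", or "completed"


def visible_instants(timeline: List[Instant]) -> List[Instant]:
    """Readers only consider instants that have fully completed."""
    return sorted((i for i in timeline if i.state == "completed"), key=lambda i: i.timestamp)


def latest_commit_time(timeline: List[Instant]) -> Optional[str]:
    """Timestamp of the newest completed commit, or None if there is none."""
    commits = [i for i in visible_instants(timeline) if i.action == "commit"]
    return commits[-1].timestamp if commits else None


timeline = [
    Instant("20220524100000", "commit", "completed"),
    Instant("20220524100500", "clean", "completed"),
    Instant("20220524101000", "commit", "inflight"),  # still being written
]
read_as_of = latest_commit_time(timeline)
```

Because the inflight commit is ignored, a query started now reads the table as of the first commit and never sees partially written files.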

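2. File Grouping – Hudi groups files within a partition into logical file groups (similar to Hive buckets), each identified by a file ID (a UUID) embedded in the file name. Successive versions of the same data share a file ID, so updates are routed to existing groups instead of creating ever more small files, which reduces small‑file pressure on HDFS and improves read efficiency.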
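The sketch below shows how a reader could resolve the latest version of each file group from file names alone. It assumes base files are named `<fileId>_<writeToken>_<instantTime>.parquet`; the sample names are made up, and real tables also contain log files and metadata that this ignores.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def latest_base_files(file_names: Iterable[str]) -> Dict[str, str]:
    """Map each file ID to the base file written by its newest instant."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]                        # drop the extension
        file_id, _write_token, instant_time = stem.split("_")
        groups[file_id].append((instant_time, name))
    # Instant times are fixed-width strings, so max() picks the newest version.
    return {file_id: max(versions)[1] for file_id, versions in groups.items()}


partition_files = [
    "3f1c-0_1-0-1_20220524100000.parquet",
    "3f1c-0_1-0-1_20220524101000.parquet",  # newer version of the same group
    "9ab2-0_2-0-1_20220524100000.parquet",
]
latest = latest_base_files(partition_files)
```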

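3. Copy‑On‑Write (COW) – Each write rewrites the affected Parquet files: incoming records are merged with the existing data in the file group, and the result is written as a new file version, producing a fresh snapshot. The primary key identifies which records match, and the pre‑combine key decides which record wins when several share a key; by default the incoming record replaces the stored one.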
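A simplified Python sketch of that merge follows. The function name and the `id`/`ts` field names are assumptions for illustration; the sketch deduplicates the incoming batch by the pre-combine field and then lets incoming records replace stored ones, which is how the default replace strategy is described above.

```python
from typing import Dict, List


def cow_write(base: List[dict], incoming: List[dict],
              key: str = "id", precombine: str = "ts") -> List[dict]:
    """Return the rows of a new file version; the old version is left untouched."""
    # Step 1: within the batch, keep one record per key (largest precombine value).
    deduped: Dict[object, dict] = {}
    for row in incoming:
        current = deduped.get(row[key])
        if current is None or row[precombine] >= current[precombine]:
            deduped[row[key]] = row
    # Step 2: copy the existing rows, then let incoming rows replace matches.
    merged = {row[key]: row for row in base}
    merged.update(deduped)
    return list(merged.values())


base_v1 = [{"id": 1, "ts": 1, "v": "a"}, {"id": 2, "ts": 1, "v": "b"}]
batch = [
    {"id": 2, "ts": 3, "v": "c"},
    {"id": 2, "ts": 2, "v": "stale"},  # loses to ts=3 during deduplication
    {"id": 3, "ts": 1, "v": "d"},
]
base_v2 = cow_write(base_v1, batch)
```

The whole file is rewritten even though only one existing row changed. That write amplification is the cost COW pays for files that can be read without any merge step.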

4. Merge‑On‑Read (MOR) – Writes append log (Avro) files to existing file groups; compaction later merges the logs into Parquet. MOR offers higher write throughput, but reads are slower because they must merge base and log files on the fly, so the table needs periodic compaction.
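The trade-off can be sketched with a toy file group that holds a base map plus an append-only log. The `FileGroup` class is hypothetical and ignores file formats entirely; it only shows where the merge work happens.

```python
from typing import Dict, List, Tuple


class FileGroup:
    """Toy merge-on-read file group: a base snapshot plus an append-only log."""

    def __init__(self, base: Dict[int, dict]) -> None:
        self.base = dict(base)
        self.log: List[Tuple[int, dict]] = []

    def append(self, key: int, row: dict) -> None:
        # Writes are cheap: nothing is rewritten, the change is only logged.
        self.log.append((key, row))

    def snapshot_read(self) -> Dict[int, dict]:
        # Reads pay the cost: base and log are merged on every query.
        merged = dict(self.base)
        for key, row in self.log:
            merged[key] = row
        return merged

    def compact(self) -> None:
        # Compaction folds the log into a new base so later reads are cheap again.
        self.base = self.snapshot_read()
        self.log = []


group = FileGroup({1: {"v": "a"}})
group.append(1, {"v": "a2"})
group.append(2, {"v": "b"})
before = group.snapshot_read()
group.compact()
after = group.snapshot_read()
```

Compaction does not change what a snapshot read returns; it only moves the merge cost from query time to a background job.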

5. Flink Write Pipeline (COW)

The pipeline ingests rows from Flink SQL, converts them to Hudi records, and shuffles them by primary key so that each write task owns the file groups it writes to. Each task buffers records in memory and flushes them to disk when a single bucket's buffer fills, when the task's total buffer exceeds its limit, or when a checkpoint fires. A coordinator aggregates the write metadata from all tasks, commits the transaction, and synchronizes partitions to the Hive metastore.
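The three flush triggers can be sketched as follows. The `WriteBuffer` class, its limits, and the choice to flush the largest bucket when the total limit is hit are assumptions made for illustration, not a copy of Hudi's writer.

```python
from typing import Dict, List


class WriteBuffer:
    """Toy per-task buffer that tracks only the buffered size of each bucket."""

    def __init__(self, bucket_limit: int, total_limit: int) -> None:
        self.bucket_limit = bucket_limit
        self.total_limit = total_limit
        self.buckets: Dict[str, int] = {}
        self.flushed: List[str] = []   # order in which buckets were flushed

    def _flush(self, bucket_id: str) -> None:
        self.buckets.pop(bucket_id)
        self.flushed.append(bucket_id)

    def add(self, bucket_id: str, size: int) -> None:
        self.buckets[bucket_id] = self.buckets.get(bucket_id, 0) + size
        if self.buckets[bucket_id] >= self.bucket_limit:
            self._flush(bucket_id)                               # trigger 1: bucket is full
        elif sum(self.buckets.values()) >= self.total_limit:
            self._flush(max(self.buckets, key=self.buckets.get))  # trigger 2: task is full

    def on_checkpoint(self) -> None:
        for bucket_id in list(self.buckets):
            self._flush(bucket_id)                               # trigger 3: checkpoint


buf = WriteBuffer(bucket_limit=100, total_limit=150)
buf.add("fg-1", 70)
buf.add("fg-2", 50)
buf.add("fg-3", 40)   # total reaches 160, so the largest bucket (fg-1) is flushed
buf.add("fg-2", 60)   # fg-2 reaches 110, so it is flushed on its own limit
buf.on_checkpoint()   # whatever remains (fg-3) is flushed
```

Flushing everything at the checkpoint is what lets the coordinator commit on checkpoint boundaries: once a checkpoint completes, no data belonging to it is still sitting in memory.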

04 Flink + Hudi ETL

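By combining Flink CDC with Hudi's upsert capability, a near‑real‑time ODS layer can be built: Flink CDC reads the source database's changelog, and the Hudi sink applies it to a lake table keyed by primary key. Copy‑on‑write is currently recommended for stability, while MOR is being improved with offline compaction jobs.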
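As a hedged sketch of what such a pipeline could look like, the Python helper below renders Flink SQL statements for a CDC source and a Hudi sink. All table names, hosts, credentials, and paths are hypothetical. The option keys are the ones I recall from the `mysql-cdc` and `hudi` connectors and vary between releases, so check them against the documentation for your version.

```python
from typing import Dict, List


def render_create_table(name: str, columns: List[str], options: Dict[str, str]) -> str:
    """Render a Flink SQL CREATE TABLE statement from columns and WITH options."""
    cols = ",\n  ".join(columns)
    opts = ",\n  ".join(f"'{k}' = '{v}'" for k, v in options.items())
    return f"CREATE TABLE {name} (\n  {cols}\n) WITH (\n  {opts}\n);"


columns = [
    "id BIGINT",
    "amount DECIMAL(10, 2)",
    "updated_at TIMESTAMP(3)",
    "PRIMARY KEY (id) NOT ENFORCED",
]

# Source: changelog of a hypothetical MySQL table, read through Flink CDC.
source_ddl = render_create_table("orders_cdc", columns, {
    "connector": "mysql-cdc",
    "hostname": "localhost",
    "port": "3306",
    "username": "flink",
    "password": "secret",
    "database-name": "shop",
    "table-name": "orders",
})

# Sink: a copy-on-write Hudi table; option keys are assumptions, see the lead-in.
sink_ddl = render_create_table("orders_hudi", columns, {
    "connector": "hudi",
    "path": "hdfs:///tmp/hudi/orders",
    "table.type": "COPY_ON_WRITE",
    "write.precombine.field": "updated_at",
})

pipeline_sql = "INSERT INTO orders_hudi SELECT * FROM orders_cdc;"
```

Submitting the three statements in the Flink SQL client would start a continuous job. The primary key declared on both tables is what lets the sink treat each change event as an upsert.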

05 Q&A

Hudi vs. Iceberg/Delta Lake: Hudi provides the most mature upsert support and rich lake‑management tools, though its write pipeline is heavier.

Recommended mode: Copy‑on‑write for most workloads due to predictable memory usage.

Version stability: Hudi 0.9’s COW mode is stable; MOR will become stable after compaction improvements.

PrestoDB now supports both COW and MOR snapshot reads.

Overall, the integration of Flink with Hudi enables a flexible, transactional data lake architecture that bridges traditional warehouses and modern streaming analytics.
