Big Data 21 min read

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

ITPUB
ITPUB
ITPUB
Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

Background and Core Characteristics of Apache Hudi

Apache Hudi is a cloud‑native data‑lake format that provides open integration with upstream sources (change‑log databases, message queues) and downstream query engines (Presto, Trino, Hive, StarRocks, Redshift, Impala). Its four defining traits are:

Openness : supports many source formats and can be queried by a wide range of engines.

Rich ACID transaction support : file‑level ACID semantics enable minute‑ or second‑level data freshness without full‑table overwrites.

Incremental processing : a built‑in timeline and TimeTravel API allow stream‑compatible incremental ETL and point‑in‑time queries.

Intelligent file management : automatic bucket sizing, compaction and cleaning reduce small‑file explosion compared with formats such as Delta Lake or Iceberg.

Flink + Hudi Architecture

Write pipeline (serverless micro‑service)

The pipeline consists of several Flink operators that together provide a serverless, self‑governing service:

ACID transaction layer : guarantees exactly‑once semantics, fast rollback, and coordinated commits via a coordinator that aligns with Flink checkpoints.

Copy‑on‑Write (COW) implementation : incoming records are converted to Hudi’s internal format, shuffled by a generated bucket_id, and written per bucket to preserve consistency.

Bucket assignment : inserts are placed into the bucket with the most remaining space; updates locate the bucket that already contains the primary key, keeping file size stable.

Shuffle by bucket ID : prevents concurrent writes to the same bucket, avoiding update conflicts.

Concurrency tuning : a common practice is to allocate 1 write task per 2 buckets under high load, balancing throughput and small‑file generation.

Small‑file strategy

Each bucket targets a size of roughly 120 MB. When a bucket reaches the threshold, new buckets are created. Insert operations select the bucket with the most free space; update operations write to the bucket that already stores the record’s key, so file growth is limited. Bucket‑assign tasks are distributed using a hash of the bucket ID, enabling parallel decision making and controlling the trade‑off between write throughput and small‑file count.

Full‑ and incremental‑read mechanisms

Hudi maintains a timeline of commit timestamps. Full reads scan all files, optionally using a metadata table (KV store) for faster file discovery. Incremental reads poll the timeline (default every 60 s) and emit only new commits downstream. Recent releases also support batch‑mode TimeTravel queries that read a consistent snapshot at a specific point in time.

Hudi full and incremental read
Hudi full and incremental read

Typical Application Scenarios

Near‑real‑time DB ingestion : CDC tools such as Debezium or Maxwell stream change logs into Hudi, achieving minute‑level freshness while handling out‑of‑order updates with ACID guarantees.

Near‑real‑time OLAP : a single lake table serves multiple engines (Presto, Trino, StarRocks, Redshift, Impala), eliminating double‑write pipelines and providing a unified truth layer.

Near‑real‑time ETL : Hudi acts as both storage and a queue abstraction, enabling incremental processing without a separate Kafka layer and simplifying Lambda/Kappa architectures.

Alibaba Cloud VVP real‑time ingestion : VVP bundles a built‑in Flink version with a Hudi connector, supporting schema evolution, CTAS, and DLF catalog metadata management for seamless EMR querying.

VVP real‑time ingestion
VVP real‑time ingestion

Upcoming Hudi Roadmap (0.12 / 1.0)

CDC Feed : accepts non‑CDC inputs and tolerates missing CDC records, trading some throughput for broader flexibility.

Meta Service : pluggable metadata management that unifies table and job handling.

Secondary Index : column‑level min/max statistics and future LSM‑style indexes to accelerate point queries.

Column‑level update (Feature Store) : efficient per‑column updates to support high‑dimensional feature‑engineering workloads.

Practical Guidance

Hudi vs. StarRocks for CDC ingestion

StarRocks offers higher QPS due to LSM‑based primary‑key indexing but incurs higher operational cost and memory usage. Hudi’s serverless architecture and open integration make it more suitable for data‑warehouse scenarios where flexibility and low‑cost operation are priorities.

Choosing table type for high‑volume streaming

For workloads below ~20 k QPS, a Copy‑on‑Write (COW) table is sufficient. When QPS exceeds this threshold, a Merge‑on‑Read (MOR) table with online compaction is recommended to avoid write amplification.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataFlinkStreamingData Lakereal-time ETLApache Hudi
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.