Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes
This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.
Background and Core Characteristics of Apache Hudi
Apache Hudi is a cloud‑native data‑lake format that provides open integration with upstream sources (change‑log databases, message queues) and downstream query engines (Presto, Trino, Hive, StarRocks, Redshift, Impala). Its four defining traits are:
Openness : supports many source formats and can be queried by a wide range of engines.
Rich ACID transaction support : file‑level ACID semantics enable minute‑ or second‑level data freshness without full‑table overwrites.
Incremental processing : a built‑in timeline and TimeTravel API allow stream‑compatible incremental ETL and point‑in‑time queries.
Intelligent file management : automatic bucket sizing, compaction and cleaning reduce small‑file explosion compared with formats such as Delta Lake or Iceberg.
Flink + Hudi Architecture
Write pipeline (serverless micro‑service)
The pipeline consists of several Flink operators that together provide a serverless, self‑governing service:
ACID transaction layer : guarantees exactly‑once semantics, fast rollback, and coordinated commits via a coordinator that aligns with Flink checkpoints.
Copy‑on‑Write (COW) implementation : incoming records are converted to Hudi’s internal format, shuffled by a generated bucket_id, and written per bucket to preserve consistency.
Bucket assignment : inserts are placed into the bucket with the most remaining space; updates locate the bucket that already contains the primary key, keeping file size stable.
Shuffle by bucket ID : prevents concurrent writes to the same bucket, avoiding update conflicts.
Concurrency tuning : a common practice is to allocate 1 write task per 2 buckets under high load, balancing throughput and small‑file generation.
Small‑file strategy
Each bucket targets a size of roughly 120 MB. When a bucket reaches the threshold, new buckets are created. Insert operations select the bucket with the most free space; update operations write to the bucket that already stores the record’s key, so file growth is limited. Bucket‑assign tasks are distributed using a hash of the bucket ID, enabling parallel decision making and controlling the trade‑off between write throughput and small‑file count.
Full‑ and incremental‑read mechanisms
Hudi maintains a timeline of commit timestamps. Full reads scan all files, optionally using a metadata table (KV store) for faster file discovery. Incremental reads poll the timeline (default every 60 s) and emit only new commits downstream. Recent releases also support batch‑mode TimeTravel queries that read a consistent snapshot at a specific point in time.
Typical Application Scenarios
Near‑real‑time DB ingestion : CDC tools such as Debezium or Maxwell stream change logs into Hudi, achieving minute‑level freshness while handling out‑of‑order updates with ACID guarantees.
Near‑real‑time OLAP : a single lake table serves multiple engines (Presto, Trino, StarRocks, Redshift, Impala), eliminating double‑write pipelines and providing a unified truth layer.
Near‑real‑time ETL : Hudi acts as both storage and a queue abstraction, enabling incremental processing without a separate Kafka layer and simplifying Lambda/Kappa architectures.
Alibaba Cloud VVP real‑time ingestion : VVP bundles a built‑in Flink version with a Hudi connector, supporting schema evolution, CTAS, and DLF catalog metadata management for seamless EMR querying.
Upcoming Hudi Roadmap (0.12 / 1.0)
CDC Feed : accepts non‑CDC inputs and tolerates missing CDC records, trading some throughput for broader flexibility.
Meta Service : pluggable metadata management that unifies table and job handling.
Secondary Index : column‑level min/max statistics and future LSM‑style indexes to accelerate point queries.
Column‑level update (Feature Store) : efficient per‑column updates to support high‑dimensional feature‑engineering workloads.
Practical Guidance
Hudi vs. StarRocks for CDC ingestion
StarRocks offers higher QPS due to LSM‑based primary‑key indexing but incurs higher operational cost and memory usage. Hudi’s serverless architecture and open integration make it more suitable for data‑warehouse scenarios where flexibility and low‑cost operation are priorities.
Choosing table type for high‑volume streaming
For workloads below ~20 k QPS, a Copy‑on‑Write (COW) table is sufficient. When QPS exceeds this threshold, a Merge‑on‑Read (MOR) table with online compaction is recommended to avoid write amplification.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
