
Comprehensive Hudi Real-Time Data Lake Ingestion Solutions

This article presents a complete guide to Hudi-based real-time data lake ingestion, covering overall data integration architecture, batch and streaming ingestion strategies, advanced table design, and practical recommendations for handling challenges such as deduplication, latency, partitioning, and performance optimization.

DataFunSummit

Overview

The article introduces a step‑by‑step Hudi real‑time data lake ingestion solution, starting from basic concepts and progressing to advanced techniques.

1. Data Integration Overall Solution

Three integration modes are described: direct JDBC/Sqoop loading into Hive, CDC‑based log capture into CDC‑compatible storage, and file/message‑queue based ingestion with downstream processing via Spark/Flink. Batch and streaming ingestion modes are distinguished, with batch handling millions of records and streaming handling tens of thousands of TPS with second‑level latency.

2. Batch Ingestion Solution

Challenges include data duplication, JDBC performance impact, file coordination, and data aging.

Recommendations: use Hudi's row‑level upsert for deduplication, partition tables, read from replicas or use file‑based ingestion, compress files and flag upload status, and apply throttling with monitoring to avoid data loss.
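The deduplication recommendation rests on Hudi's upsert semantics: records sharing a record key are merged, and the one with the larger precombine (ordering) field wins, so a replayed batch is idempotent. A minimal sketch of that merge rule in plain Python (field names `id` and `ts` are illustrative, not Hudi defaults):

```python
def upsert(table, incoming, key_field="id", precombine_field="ts"):
    """Merge incoming records into table by record key.

    Mirrors Hudi's upsert semantics: for duplicate keys, the record
    with the larger precombine (ordering) field wins, so replaying
    the same batch deduplicates automatically.
    """
    merged = {row[key_field]: row for row in table}
    for row in incoming:
        key = row[key_field]
        existing = merged.get(key)
        if existing is None or row[precombine_field] >= existing[precombine_field]:
            merged[key] = row
    return list(merged.values())
```

Because the merge is keyed and ordered, re-running a failed batch load cannot double-count rows.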

3. Real‑Time Ingestion Solution

Key characteristics: high frequency, low volume per batch, includes inserts/updates/deletes, requires fast computation for real‑time decisions, and must limit resource consumption.

Challenges: limited scalability of Flink direct source connections, high development cost of Spark source connections, and need for ordered DDL/DML.

Recommendations: adopt professional CDC tools for multi‑table capture, ensure DDL precedes source changes, and use Hudi's upsert/append modes.
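The ordering requirement can be pictured as strictly replaying the CDC log, so a schema change (DDL) is always applied before any row change (DML) that depends on it. A toy sketch, with event dicts of my own invention standing in for real CDC records:

```python
def apply_cdc_events(events):
    """Replay a CDC log strictly in order, so DDL precedes dependent DML.

    Illustrative only: each event is a dict with op 'ddl' (add a column)
    or 'dml' (upsert a row). Processing in log order guarantees a row
    never references a column whose DDL has not yet arrived.
    """
    schema = set()
    rows = {}
    for ev in events:
        if ev["op"] == "ddl":
            schema.add(ev["column"])
        elif ev["op"] == "dml":
            unknown = set(ev["row"]) - schema
            if unknown:
                raise ValueError(f"columns {sorted(unknown)} not yet defined by DDL")
            rows[ev["key"]] = ev["row"]
    return schema, rows
```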

4. Hudi Table Model Design

Hudi supports two table types, Copy-on-Write (COW) and Merge-on-Read (MOR); MOR is preferred for low-latency real-time scenarios because it appends updates to log files and defers merging to read time.
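The MOR trade-off can be sketched as a snapshot read: start from the columnar base file and overlay row-level updates from the append-only logs, so writers pay only for appends while readers pay for the merge:

```python
def merge_on_read(base_rows, log_records, key_field="id"):
    """Sketch of a MOR snapshot read: overlay log updates on the base file.

    Writes stay cheap (pure appends to the log); the merge cost is paid
    by readers. Field names are illustrative.
    """
    view = {row[key_field]: row for row in base_rows}
    for rec in log_records:          # log records arrive in write order
        view[rec[key_field]] = rec   # later updates shadow earlier ones
    return list(view.values())
```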

Indexing strategies: bucket index for large tables, simple or Bloom index for smaller tables, and appropriate partitioning (date‑based for streaming, coarse‑grained for dimension tables).
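The reason a bucket index suits large tables is that the record key hashes directly to a fixed bucket, so locating a key never requires probing per-file indexes. A sketch of that idea (the hash function and bucket count here are illustrative; Hudi's actual hashing differs):

```python
import hashlib

def bucket_of(record_key, num_buckets=256):
    """Sketch of a bucket index: hash the record key to a fixed bucket,
    making key lookup O(1) regardless of table size. Illustrative only.
    """
    digest = hashlib.md5(str(record_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets
```

Note that the bucket count is fixed at table creation, which is why this index fits large tables with predictable volume rather than small, fast-changing ones.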

Write modes: upsert for automatic deduplication, append for log‑type data.

5. Real‑Time General Solution

Overall architecture uses Hudi on HDFS or cloud storage, Spark for batch backfill, and Flink for streaming.

Hudi provides table services for cleaning small files and timeline management.

Partition TTL can be configured; batch writes may run TTL synchronously, while streaming writes use asynchronous clean tasks.
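Selecting the partitions a TTL pass should drop is a simple date comparison against the retention window; whether the drop runs synchronously (batch) or as an async clean task (streaming) is an orchestration choice. A minimal sketch, assuming `YYYY-MM-DD` date partitions:

```python
from datetime import date, timedelta

def expired_partitions(partitions, today, retention_days=30):
    """Pick date partitions older than the retention window.

    A TTL task (synchronous for batch writes, an asynchronous clean
    task for streaming writes) would then drop the returned partitions.
    Partition names in ISO 'YYYY-MM-DD' format are assumed.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if date.fromisoformat(p) < cutoff)
```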

6. Real‑Time Advanced Solutions

ChangeLog: stores row-level change events, enabling downstream Flink jobs to consume only the latest state.

High-Speed Stream Table: leverages HDMS to asynchronously write Kafka data into Hudi stream tables, reducing Kafka data aging.

Column Cluster: groups frequently updated columns for MOR tables with bucket index, improving write concurrency.
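One way to read the column-cluster idea is that each writer touches only its own column group, so partial updates from different writers merge without conflict. A guess at those semantics, with entirely illustrative names:

```python
def merge_column_groups(base_row, group_updates):
    """Sketch of column clustering: frequently updated columns are split
    into groups written independently, so two writers touching different
    groups do not conflict and their partial updates merge cleanly.
    Illustrative only; not Hudi's actual mechanism.
    """
    merged = dict(base_row)
    for update in group_updates:  # each update carries one group's columns
        merged.update(update)
    return merged
```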

MOW (Merge on Write): combines the fast writes of MOR with the fast reads of COW by using bitmap-based deletions.
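The bitmap trick is that an update writes a new row and marks the superseded row's position in a delete bitmap, so readers simply skip marked positions instead of merging log files. A sketch, with the bitmap modeled as a plain set of row positions:

```python
def read_with_delete_bitmap(base_rows, delete_bitmap):
    """Sketch of a merge-on-write read: skip row positions marked in
    the delete bitmap rather than merging logs at read time. The bitmap
    is modeled as a set of integer positions; illustrative only.
    """
    return [row for pos, row in enumerate(base_rows) if pos not in delete_bitmap]
```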

MDT (Metadata Index Table): builds a partition-file index cached in a JDBC server to accelerate queries on large COW tables.
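The speedup comes from answering "which files in this partition can contain these keys?" from a cached index instead of listing the file system. A sketch with a plain dict standing in for the cached index, and per-file min/max key statistics driving the pruning:

```python
def prune_files(metadata_index, partition, key_range):
    """Sketch of metadata-table pruning: a cached partition->files map
    (a dict here, standing in for the server-side cache) lets the query
    planner list candidate files without touching the file system.
    Each file entry carries (min_key, max_key) statistics; illustrative.
    """
    lo, hi = key_range
    return [f["name"] for f in metadata_index.get(partition, ())
            if not (f["max_key"] < lo or f["min_key"] > hi)]
```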

The article concludes with a summary of the presented solutions and thanks the audience.

Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
