
Building a Real-Time Data Lake with Hudi: Architecture, Challenges, and Practices

This article presents Huawei's end‑to‑end solution for constructing a real‑time data lake on Hudi, covering requirement analysis, technology selection, architectural design, ingestion and processing challenges, practical optimizations, and future improvement directions.

DataFunSummit

Introduction – The article introduces Huawei's overall solution for building a data lake based on Apache Hudi, outlining the business scenario of ERP processes (supply, procurement, payment, shipping) and the associated requirements such as low latency, high reliability, and support for updates and schema evolution.

1. Requirement Analysis & Technology Selection – Data sources are relational databases with frequent updates and deletions, heavy table joins, and variable traffic. Processing must meet sub‑20‑second write latency, sub‑5‑minute end‑to‑end ETL, batch‑stream convergence, data back‑tracking, and low development cost. Reliability demands full‑link monitoring, alerting, and cross‑region disaster recovery.

2. Technical Architecture Selection – After comparing Iceberg and Hudi, Hudi was chosen for its real‑time capabilities, update support, ACID guarantees, and data ordering. The architecture shifts from traditional ETL to ELT, adopts incremental batch plus stream processing, and replaces Kafka storage with Hudi for unified persistence. The system features four integration channels (batch and real‑time), a unified metadata layer, a single compute engine (FlinkSQL on Hudi), and code consistency across batch and streaming.

3. Real‑Time Data Ingestion – Challenges such as Bloom filter performance, strict ordering, idempotent writes, fault‑tolerant ingestion, and overload protection are addressed by using partitioned tables with MOR and Bucket indexing, pre‑combine fields based on transaction timestamps, heartbeat tables for completeness, and dynamic resource scaling.
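Two of the ingestion techniques above can be sketched in a few lines of Python: a bucket index maps each record key to a fixed bucket via a stable hash (so writes locate their file group without probing Bloom filters), and a pre-combine step keeps only the latest version of each key by transaction timestamp, making replayed writes idempotent. The `Record` type and the `bucket_of` / `pre_combine` names are illustrative assumptions, not Hudi APIs.

```python
import zlib
from dataclasses import dataclass, field

@dataclass
class Record:
    key: str        # primary key from the source table
    tx_ts: int      # transaction timestamp, used as the pre-combine field
    payload: dict = field(default_factory=dict)

def bucket_of(key: str, num_buckets: int = 16) -> int:
    # A bucket index assigns every key to a fixed bucket with a stable
    # hash, avoiding the Bloom-filter probing that degraded performance.
    return zlib.crc32(key.encode("utf-8")) % num_buckets

def pre_combine(records: list[Record]) -> list[Record]:
    # Idempotent upsert: among duplicate keys, keep the record with the
    # latest transaction timestamp, so replays cannot reorder history.
    latest: dict[str, Record] = {}
    for r in records:
        cur = latest.get(r.key)
        if cur is None or r.tx_ts >= cur.tx_ts:
            latest[r.key] = r
    return list(latest.values())
```

In Hudi itself the pre-combine field is declared on the table and the comparison happens inside the write path; the sketch only shows why a transaction-timestamp pre-combine field yields strict ordering under retries.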

4. Real‑Time Data Processing – Data is stored in layered warehouses (source → detail → summary → application). FlinkSQL handles most streaming workloads, while SparkSQL processes monthly/annual aggregates. Issues like flow stability, cross‑center data alignment, and consistency between Spark and Flink writes are mitigated through layered modeling, record‑level throttling, bucket‑based lookups, MOR read optimizations, and TTL adjustments.
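The record-level throttling mentioned above can be approximated with a token bucket that caps how many records a stage may pass downstream per second. This is a minimal sketch; the `RecordThrottle` name and interface are assumptions for illustration, not a Flink or Hudi API.

```python
import time

class RecordThrottle:
    """Token-bucket limiter capping records per second, sketching
    record-level throttling between warehouse layers."""

    def __init__(self, records_per_sec: float):
        self.rate = records_per_sec
        self.tokens = records_per_sec   # start with a full bucket
        self.last = time.monotonic()

    def try_emit(self, n: int = 1) -> bool:
        # Refill tokens in proportion to elapsed time, capped at the rate.
        now = time.monotonic()
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False    # caller should buffer or back-pressure
```

A stage that exhausts its budget returns `False` and lets upstream back-pressure take over, which is the stabilizing effect the article attributes to record-level throttling.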

5. Real‑Time Query – BI reports and data exploration are served via ClickHouse for low‑latency queries and HetuEngine on Hudi for exploratory analysis, with performance enhancements such as dynamic filtering, push‑down computation, and cost‑based optimization.

6. Project Experience – Initial use of Bloom filter indexes caused performance degradation; switching to Bucket indexes and partitioned tables improved write throughput. Compaction, cleaning, and archiving services were decoupled and run asynchronously to reduce resource consumption.

7. Typical Development Patterns – Two patterns are described: (a) streaming processing with auxiliary correction streams for data anomalies, and (b) incremental batch processing for fact‑dimension joins, using union‑based backfill (re‑processing) when needed.
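The union-based backfill in pattern (b) can be illustrated with a toy merge: the streaming output is unioned with a batch correction set, and for any key present in both, the correction wins. The `union_backfill` name and dict layout are illustrative assumptions.

```python
def union_backfill(streaming_results, batch_corrections):
    """Toy sketch of union-based backfill: batch corrections are
    unioned with streaming output, and a correction replaces the
    streaming record that shares its key."""
    merged = {r["key"]: r for r in streaming_results}
    for c in batch_corrections:
        merged[c["key"]] = c    # the corrected value wins
    return sorted(merged.values(), key=lambda r: r["key"])
```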

8. Future Improvements – Focus areas include enhancing Hudi read/write performance to achieve sub‑second latency at million‑TPS scale, improving indexing and statistics, increasing self‑management to lower maintenance cost, and implementing elastic scaling for resource efficiency.

9. Q&A – Answers address heartbeat‑table consistency, Hudi's role as storage accessed via compute engines (Flink, Spark, Hetu), and bucket count sizing recommendations (≈2 GB per bucket).
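The bucket-sizing guidance from the Q&A (roughly 2 GB of data per bucket) reduces to simple arithmetic: divide the expected table size by the per-bucket target and round up. `recommend_bucket_count` is an illustrative helper, not a Hudi utility.

```python
import math

def recommend_bucket_count(table_size_gb: float,
                           target_gb_per_bucket: float = 2.0) -> int:
    # Round up so no bucket exceeds the ~2 GB target from the Q&A,
    # and keep at least one bucket for tiny tables.
    return max(1, math.ceil(table_size_gb / target_gb_per_bucket))
```

For example, a 100 GB table lands at 50 buckets under this rule. Since bucket counts are fixed at table creation for a plain bucket index, it pays to size against projected rather than current volume.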

Tags: Big Data, Real-time Processing, Flink, Data Lake, Spark, Hudi, ETL/ELT
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
