Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop
This article explains how Apache Hudi provides an incremental processing framework for efficient, low‑latency data ingestion, storage, and querying on Hadoop. It covers Hudi's architecture, storage layout, compaction, write and read paths, and its support for both real‑time and batch analytics.
With storage formats such as Apache Parquet and ORC and query engines like Presto and Apache Impala, the Hadoop ecosystem can serve as a unified service layer for minute‑level latency scenarios, but this requires efficient, low‑latency data ingestion and preparation within HDFS.
To address this, Uber developed the open‑source Hudi project, an incremental processing framework that efficiently supports all business‑critical data pipelines with low latency.
Motivation
The Lambda architecture relies on both a streaming layer and a batch layer, leading to duplicated computation and complex service systems. The Kappa architecture removes the batch layer but still leaves service‑layer complexity, especially when supporting row‑level updates and analytical scans.
Using HDFS as a unified service layer therefore demands fast ingestion, change‑log support, and partitioned, compressed, deduplicated state management, with capabilities such as rapid changes to large datasets, columnar storage optimization, and effective joins.
Hudi Overview
Hudi is a scan‑optimized storage abstraction for analytical workloads that enables minute‑level latency updates on HDFS and supports downstream incremental processing.
Hudi Dataset Storage
Hudi’s directory layout mirrors Hive: a root directory contains partitions as sub‑folders, and each data file within a partition is identified by a unique fileId plus the commit timestamp that produced it. Metadata is stored as a timeline of actions (commits, cleans, compactions). Indexing is plugin‑based (Bloom filter or HBase), and data can be stored in columnar (Parquet) or row‑oriented (Avro) formats.
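Because every file name carries a fileId and a commit timestamp, a reader can reconstruct the latest "file slice" per fileId with simple bookkeeping. The sketch below uses a hypothetical `<fileId>_<commitTime>.parquet` naming convention for illustration (Hudi's actual file names carry additional tokens):

```python
# Illustrative sketch of Hudi-style file bookkeeping within one partition.
# The "<fileId>_<commitTime>.parquet" naming here is an assumption for
# clarity, not Hudi's exact on-disk scheme.

def parse_file_name(name):
    """Split a file name into (fileId, commitTime)."""
    base = name.rsplit(".", 1)[0]
    file_id, commit_time = base.split("_")
    return file_id, commit_time

def latest_file_slices(names):
    """For each fileId, keep only the file written by the newest commit."""
    latest = {}
    for n in names:
        fid, ts = parse_file_name(n)
        if fid not in latest or ts > latest[fid][1]:
            latest[fid] = (n, ts)
    return {fid: n for fid, (n, _) in latest.items()}
```

This per-fileId "pick the newest version" step is what lets readers see a consistent snapshot while writers keep producing new versions of the same logical file.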
Compaction
Compaction converts write‑optimized (row‑oriented) files to read‑optimized (columnar) files, aligning file sizes with HDFS block size and balancing parallelism, query performance, and file count; the process is extensible via plugins.
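A compaction policy essentially decides which fileIds have accumulated enough row-oriented log data to justify a columnar rewrite. The toy planner below stands in for Hudi's pluggable strategy; the 128 MB block size and the size-fraction threshold are assumptions for illustration:

```python
# Toy stand-in for a pluggable compaction policy: pick fileIds whose
# accumulated log bytes exceed a fraction of the HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MB HDFS block

def plan_compactions(log_sizes, threshold=0.5):
    """log_sizes maps fileId -> total bytes of pending log data.
    Returns the fileIds worth rewriting into Parquet this cycle."""
    return [fid for fid, size in log_sizes.items()
            if size >= threshold * BLOCK_SIZE]
```

Tuning the threshold trades write amplification (rewriting too eagerly) against read cost (merging too many log entries at query time), which is exactly the balance the article describes.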
Write Path
Hudi runs as a Spark library, typically using Spark Streaming micro‑batches of 1–2 minutes (or scheduled via Oozie/Airflow). During ingestion, Hudi loads per‑file Bloom filters to map incoming record keys to existing fileIds (tagging those records as updates), groups the remaining inserts by partition, and packs records into files until the HDFS block size is reached before allocating new fileIds; updates are appended to the log files of their fileIds. Background compaction rewrites log files into Parquet, and each successful write creates a commit entry in the timeline.
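The core of the write path is the tagging step: deciding, per incoming record, whether it updates an existing file or is a fresh insert. The sketch below uses a plain key-to-fileId dict where Hudi would consult per-file Bloom filters (and then verify candidates against the actual files); the record shape is assumed:

```python
# Sketch of upsert tagging. A dict stands in for the per-file Bloom
# filters (plus verification) that Hudi actually uses; false positives
# are ignored here for simplicity.

def tag_records(records, key_index):
    """Split records into (updates, inserts).

    records   -- iterable of dicts with a "key" field (assumed shape)
    key_index -- maps known record key -> fileId holding that key
    """
    updates, inserts = [], []
    for rec in records:
        fid = key_index.get(rec["key"])
        if fid is not None:
            updates.append((fid, rec))   # routed to that file's log
        else:
            inserts.append(rec)          # packed into new/small files
    return updates, inserts
```

After tagging, updates are appended to the log of their fileId while inserts are binned by partition and packed toward the HDFS block size, as described above.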
HDFS Block Alignment & Fault Recovery
Hudi aligns file sizes with HDFS block boundaries; partial writes or failed compactions are handled by storing offsets in commit metadata and filtering out incomplete files during query time, with failed compactions rolled back in the next cycle.
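Filtering out incomplete files is a query-time check against the commit timeline: a file is visible only if the commit that wrote it completed successfully. A minimal sketch, again assuming the illustrative `<fileId>_<commitTime>.parquet` naming:

```python
# Query-time visibility filter: a file written by commit T is readable
# only if T appears in the set of successfully completed commits, so
# partial output from failed writes never reaches readers.

def committed_files(all_files, completed_commits):
    keep = []
    for name in all_files:
        commit_ts = name.rsplit(".", 1)[0].split("_")[-1]
        if commit_ts in completed_commits:
            keep.append(name)
    return keep
```

Rolling back a failed compaction then amounts to never adding its instant to the completed set, leaving its partial files invisible until the next cycle cleans them up.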
Read Path
Two InputFormats are provided: HoodieReadOptimizedInputFormat for a scan‑optimized view that reads the latest Parquet files, and HoodieRealtimeInputFormat for a real‑time view that merges log files with Parquet data. Both extend Parquet input classes and are usable by Presto and Spark SQL.
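Conceptually, the real‑time view differs from the read‑optimized view by one merge step: overlaying not-yet-compacted log records on top of the columnar base file. A minimal sketch of that merge, assuming records carry a `"key"` field (not Hudi's actual InputFormat code):

```python
# Merge-on-read sketch: what HoodieRealtimeInputFormat does conceptually.
# Base rows come from the latest Parquet file; log rows are newer updates
# that have not been compacted yet, so they win on key collisions.

def realtime_view(base_rows, log_rows):
    merged = {r["key"]: r for r in base_rows}
    for r in log_rows:
        merged[r["key"]] = r  # log entry supersedes the base row
    return list(merged.values())
```

The read‑optimized view simply skips the overlay and returns `base_rows`, trading freshness for pure columnar scan speed.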
Incremental Processing
Because each commit records timestamps and file versions, Hudi can extract incremental changes between a start and end timestamp, enabling watermark‑based joins and upserts on HDFS‑based model tables.
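An incremental pull then reduces to filtering records by the commit instant that produced them. The sketch below assumes each record exposes its commit timestamp in a `_commit_time` field (Hudi exposes commit metadata via similar record-level meta fields):

```python
# Incremental-view sketch: return only records committed in the
# half-open window (begin_ts, end_ts], i.e. "everything new since the
# consumer's last watermark".

def incremental_pull(rows, begin_ts, end_ts):
    return [r for r in rows if begin_ts < r["_commit_time"] <= end_ts]
```

A downstream job stores `end_ts` as its watermark and passes it as `begin_ts` on the next run, giving exactly-once, change-log-style consumption over HDFS data.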