Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop
This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.
With the evolution of storage formats such as Apache Parquet and Apache ORC and query engines like Presto and Apache Impala, the Hadoop ecosystem can serve as a unified service layer for minute‑level latency scenarios, provided that data ingestion and preparation on HDFS are efficient and low‑latency.
To address this, Uber created the Hudi project, an incremental processing framework that supports fast, low‑latency data pipelines and is open‑sourced at https://github.com/uber/hudi . Before diving into Hudi, the article discusses why using Hadoop as a unified service layer makes sense.
Motivation
The Lambda architecture relies on both a streaming layer and a batch layer, requiring separate serving layers for real‑time and batch results, which adds complexity. The Kappa architecture removes the batch layer, but service‑layer challenges remain, especially when row‑level updates and analytical query performance are needed.
When latency requirements are modest (e.g., ~10 minutes), a fast ingestion path on HDFS can eliminate the need for a separate “Speed Serving” layer, simplifying the overall system.
Using HDFS as a unified service layer requires support for change logs, partitioning, compression, and deduplication based on business dimensions, along with three key capabilities:
Fast change capability for large HDFS datasets
Column‑store optimized for analytical scans
Efficient joins and propagation of updates to downstream models
Hudi Overview
Hudi is an incremental processing framework that satisfies all the above requirements. It provides a scan‑optimized, column‑store abstraction that enables minute‑level latency updates on HDFS datasets and supports downstream incremental processing.
Hudi datasets are compatible with the Hadoop ecosystem via a custom InputFormat, allowing seamless integration with Apache Hive, Presto, and Apache Spark.
Hudi Storage Architecture
Hudi storage consists of three parts:
Metadata – Maintained as a timeline in the root metadata directory, including commits, cleans, and compactions.
Index – Plugin‑based index (e.g., Bloom filter, Apache HBase) that maps record keys to fileId s.
Data – Stores ingested data in either a read‑optimized columnar format (default Parquet) or a write‑optimized row format (default Avro).
Each file has a unique fileId and a commit identifier; updates share the same fileId but have different commit s.
Writing Hudi Files
Compaction
Compaction converts write‑optimized files to read‑optimized files. The basic parallel unit is a rewrite of a single fileId, aligning file sizes with HDFS block sizes to balance compaction, query parallelism, and total file count.
Write Path
Hudi runs as a Spark library, typically using 1–2 minute micro‑batches. The default write flow is:
Load Bloom filter indexes from Parquet files, map incoming keys to target files, and decide between insert or update.
Group inserts by partition, assign a fileId, and append to log files until the HDFS block size is reached, then generate a new fileId.
Background compaction runs periodically, merging log files into new Parquet versions; priorities are based on log size.
For each fileId, append to existing log files or create new ones.
On successful ingestion, a commit is recorded in the metadata timeline, renaming inflight files to committed files and storing partition and fileId details.
HDFS Block Alignment
Hudi strives to align file sizes with HDFS block sizes. Depending on partition volume and columnar compression, compaction may still produce small Parquet files, but repeated iterations gradually grow files to match block size.
Failure Recovery
Spark’s retry mechanism handles transient errors; if retries exceed thresholds, the job fails and is retried on the next iteration. Two key failure scenarios are addressed:
Partial Avro blocks in log files are skipped using commit metadata that records start offsets and log versions.
Partial Parquet files from failed compactions are filtered out during query time using commit metadata, with failed files rolled back in the next compaction cycle.
Reading Hudi Files
The commit timeline enables both read‑optimized and real‑time views. Clients choose a view based on latency and query performance requirements.
Hudi provides two custom InputFormat s: HoodieReadOptimizedInputFormat – Scans only the latest Parquet files. HoodieRealtimeInputFormat – Merges log files with the latest Parquet files using a RecordReader for real‑time visibility.
Both extend MapredParquetInputFormat and VectorizedParquetRecordReader, preserving all Parquet optimizations. The hoodie-hadoop-mr library enables Presto and Spark SQL to query Hudi tables directly.
Incremental Processing
Because Hudi records commit timestamps and file versions in its metadata, users can extract incremental changes by specifying start and end timestamps. The query engine pushes these predicates down to the file scan stage, and automatic cleaning may prune unavailable files.
This enables watermark‑based stream‑stream joins and stream‑static joins, supporting upserts on HDFS‑stored model tables.
References
[1] https://eng.uber.com/hoodie/
[2] https://whatis.techtarget.com/definition/data-ingestion
[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[4] https://github.com/uber/hudi
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
