Big Data 17 min read

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Big Data Technology & Architecture

Jan 23, 2020

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

With the evolution of storage formats such as Apache Parquet and Apache ORC and query engines like Presto and Apache Impala, the Hadoop ecosystem can serve as a unified service layer for minute‑level latency scenarios, provided that data ingestion and preparation on HDFS are efficient and low‑latency.

To address this, Uber created the Hudi project, an incremental processing framework that supports fast, low‑latency data pipelines and is open‑sourced at https://github.com/uber/hudi . Before diving into Hudi, the article discusses why using Hadoop as a unified service layer makes sense.

Motivation

The Lambda architecture relies on both a streaming layer and a batch layer, requiring separate serving layers for real‑time and batch results, which adds complexity. The Kappa architecture removes the batch layer, but service‑layer challenges remain, especially when row‑level updates and analytical query performance are needed.

When latency requirements are modest (e.g., ~10 minutes), a fast ingestion path on HDFS can eliminate the need for a separate “Speed Serving” layer, simplifying the overall system.

Using HDFS as a unified service layer requires support for change logs, partitioning, compression, and deduplication based on business dimensions, along with three key capabilities:

Fast change capability for large HDFS datasets

Column‑store optimized for analytical scans

Efficient joins and propagation of updates to downstream models

Hudi Overview

Hudi is an incremental processing framework that satisfies all the above requirements. It provides a scan‑optimized, column‑store abstraction that enables minute‑level latency updates on HDFS datasets and supports downstream incremental processing.

Hudi datasets are compatible with the Hadoop ecosystem via a custom InputFormat, allowing seamless integration with Apache Hive, Presto, and Apache Spark.

Hudi Storage Architecture

Hudi storage consists of three parts:

Metadata – Maintained as a timeline in the root metadata directory, including commits, cleans, and compactions.

Index – Plugin‑based index (e.g., Bloom filter, Apache HBase) that maps record keys to fileId s.

Data – Stores ingested data in either a read‑optimized columnar format (default Parquet) or a write‑optimized row format (default Avro).

Each file has a unique fileId and a commit identifier; updates share the same fileId but have different commit s.

Writing Hudi Files

Compaction

Compaction converts write‑optimized files to read‑optimized files. The basic parallel unit is a rewrite of a single fileId, aligning file sizes with HDFS block sizes to balance compaction, query parallelism, and total file count.

Write Path

Hudi runs as a Spark library, typically using 1–2 minute micro‑batches. The default write flow is:

Load Bloom filter indexes from Parquet files, map incoming keys to target files, and decide between insert or update.

Group inserts by partition, assign a fileId, and append to log files until the HDFS block size is reached, then generate a new fileId.

Background compaction runs periodically, merging log files into new Parquet versions; priorities are based on log size.

For each fileId, append to existing log files or create new ones.

On successful ingestion, a commit is recorded in the metadata timeline, renaming inflight files to committed files and storing partition and fileId details.

HDFS Block Alignment

Hudi strives to align file sizes with HDFS block sizes. Depending on partition volume and columnar compression, compaction may still produce small Parquet files, but repeated iterations gradually grow files to match block size.

Failure Recovery

Spark’s retry mechanism handles transient errors; if retries exceed thresholds, the job fails and is retried on the next iteration. Two key failure scenarios are addressed:

Partial Avro blocks in log files are skipped using commit metadata that records start offsets and log versions.

Partial Parquet files from failed compactions are filtered out during query time using commit metadata, with failed files rolled back in the next compaction cycle.

Reading Hudi Files

The commit timeline enables both read‑optimized and real‑time views. Clients choose a view based on latency and query performance requirements.

Hudi provides two custom InputFormat s: HoodieReadOptimizedInputFormat – Scans only the latest Parquet files. HoodieRealtimeInputFormat – Merges log files with the latest Parquet files using a RecordReader for real‑time visibility.

Both extend MapredParquetInputFormat and VectorizedParquetRecordReader, preserving all Parquet optimizations. The hoodie-hadoop-mr library enables Presto and Spark SQL to query Hudi tables directly.

Incremental Processing

Because Hudi records commit timestamps and file versions in its metadata, users can extract incremental changes by specifying start and end timestamps. The query engine pushes these predicates down to the file scan stage, and automatic cleaning may prune unavailable files.

This enables watermark‑based stream‑stream joins and stream‑static joins, supporting upserts on HDFS‑stored model tables.

References

[1] https://eng.uber.com/hoodie/

[2] https://whatis.techtarget.com/definition/data-ingestion

[3] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

[4] https://github.com/uber/hudi

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

low-latency Hadoop Apache Hudi Incremental Processing data ingestion

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.