Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop
This article explains how Apache Hudi provides an incremental processing framework for efficient, low‑latency data ingestion, storage, and querying on Hadoop. It covers Hudi's architecture, storage layout, compaction, write and read paths, and its support for both real‑time and batch analytics.
With storage formats such as Apache Parquet and ORC and query engines like Presto and Apache Impala, the Hadoop ecosystem can serve as a unified service layer for minute‑level latency scenarios, but this requires efficient, low‑latency data ingestion and preparation within HDFS.
To address this, Uber developed the open‑source Hudi project, an incremental processing framework that efficiently supports all business‑critical data pipelines with low latency.
Motivation
The Lambda architecture relies on both a streaming layer and a batch layer, leading to duplicated computation and complex service systems. The Kappa architecture removes the batch layer but still leaves service‑layer complexity, especially when supporting row‑level updates and analytical scans.
Using HDFS as a unified service layer therefore demands fast ingestion, change‑log support, and partitioned, compressed, deduplicated state management, with capabilities such as rapid changes to large datasets, columnar storage optimization, and effective joins.
Hudi Overview
Hudi is a scan‑optimized storage abstraction for analytical workloads that enables minute‑level latency updates on HDFS and supports downstream incremental processing.
Hudi Dataset Storage
Hudi’s directory layout mirrors Hive: a root directory contains partitions as sub‑folders, and each data file within a partition is identified by a unique fileId plus the commit timestamp that produced it. Metadata is stored as a timeline of actions (commits, cleans, compactions). Indexing is plugin‑based (Bloom filter or HBase), and data can be stored in columnar (Parquet) or row‑oriented (Avro) formats.
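Because every file name carries a fileId and a commit timestamp, a reader can reconstruct the latest "file slice" per fileId with simple bookkeeping. The sketch below uses a hypothetical `<fileId>_<commitTime>.parquet` naming convention for illustration (Hudi's actual file names carry additional tokens):

```python
# Illustrative sketch of Hudi-style file bookkeeping within one partition.
# The "<fileId>_<commitTime>.parquet" naming here is an assumption for
# clarity, not Hudi's exact on-disk scheme.

def parse_file_name(name):
    """Split a file name into (fileId, commitTime)."""
    base = name.rsplit(".", 1)[0]
    file_id, commit_time = base.split("_")
    return file_id, commit_time

def latest_file_slices(names):
    """For each fileId, keep only the file written by the newest commit."""
    latest = {}
    for n in names:
        fid, ts = parse_file_name(n)
        if fid not in latest or ts > latest[fid][1]:
            latest[fid] = (n, ts)
    return {fid: n for fid, (n, _) in latest.items()}
```

This per-fileId "pick the newest version" step is what lets readers see a consistent snapshot while writers keep producing new versions of the same logical file.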
Compaction
Compaction converts write‑optimized (row‑oriented) files to read‑optimized (columnar) files, aligning file sizes with HDFS block size and balancing parallelism, query performance, and file count; the process is extensible via plugins.
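A compaction policy essentially decides which fileIds have accumulated enough row-oriented log data to justify a columnar rewrite. The toy planner below stands in for Hudi's pluggable strategy; the 128 MB block size and the size-fraction threshold are assumptions for illustration:

```python
# Toy stand-in for a pluggable compaction policy: pick fileIds whose
# accumulated log bytes exceed a fraction of the HDFS block size.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MB HDFS block

def plan_compactions(log_sizes, threshold=0.5):
    """log_sizes maps fileId -> total bytes of pending log data.
    Returns the fileIds worth rewriting into Parquet this cycle."""
    return [fid for fid, size in log_sizes.items()
            if size >= threshold * BLOCK_SIZE]
```

Tuning the threshold trades write amplification (rewriting too eagerly) against read cost (merging too many log entries at query time), which is exactly the balance the article describes.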
Write Path
Hudi runs as a Spark library, typically using Spark Streaming micro‑batches of 1–2 minutes (or scheduled via Oozie/Airflow). During ingestion, Hudi loads per‑file Bloom filters to map incoming record keys to existing fileIds (tagging those records as updates), groups the remaining inserts by partition, and packs records into files until the HDFS block size is reached before allocating new fileIds; updates are appended to the log files of their fileIds. Background compaction rewrites log files into Parquet, and each successful write creates a commit entry in the timeline.
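The core of the write path is the tagging step: deciding, per incoming record, whether it updates an existing file or is a fresh insert. The sketch below uses a plain key-to-fileId dict where Hudi would consult per-file Bloom filters (and then verify candidates against the actual files); the record shape is assumed:

```python
# Sketch of upsert tagging. A dict stands in for the per-file Bloom
# filters (plus verification) that Hudi actually uses; false positives
# are ignored here for simplicity.

def tag_records(records, key_index):
    """Split records into (updates, inserts).

    records   -- iterable of dicts with a "key" field (assumed shape)
    key_index -- maps known record key -> fileId holding that key
    """
    updates, inserts = [], []
    for rec in records:
        fid = key_index.get(rec["key"])
        if fid is not None:
            updates.append((fid, rec))   # routed to that file's log
        else:
            inserts.append(rec)          # packed into new/small files
    return updates, inserts
```

After tagging, updates are appended to the log of their fileId while inserts are binned by partition and packed toward the HDFS block size, as described above.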
HDFS Block Alignment & Fault Recovery
Hudi aligns file sizes with HDFS block boundaries; partial writes or failed compactions are handled by storing offsets in commit metadata and filtering out incomplete files during query time, with failed compactions rolled back in the next cycle.
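Filtering out incomplete files is a query-time check against the commit timeline: a file is visible only if the commit that wrote it completed successfully. A minimal sketch, again assuming the illustrative `<fileId>_<commitTime>.parquet` naming:

```python
# Query-time visibility filter: a file written by commit T is readable
# only if T appears in the set of successfully completed commits, so
# partial output from failed writes never reaches readers.

def committed_files(all_files, completed_commits):
    keep = []
    for name in all_files:
        commit_ts = name.rsplit(".", 1)[0].split("_")[-1]
        if commit_ts in completed_commits:
            keep.append(name)
    return keep
```

Rolling back a failed compaction then amounts to never adding its instant to the completed set, leaving its partial files invisible until the next cycle cleans them up.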
Read Path
Two InputFormats are provided: HoodieReadOptimizedInputFormat for a scan‑optimized view that reads the latest Parquet files, and HoodieRealtimeInputFormat for a real‑time view that merges log files with Parquet data. Both extend Parquet input classes and are usable by Presto and Spark SQL.
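Conceptually, the real‑time view differs from the read‑optimized view by one merge step: overlaying not-yet-compacted log records on top of the columnar base file. A minimal sketch of that merge, assuming records carry a `"key"` field (not Hudi's actual InputFormat code):

```python
# Merge-on-read sketch: what HoodieRealtimeInputFormat does conceptually.
# Base rows come from the latest Parquet file; log rows are newer updates
# that have not been compacted yet, so they win on key collisions.

def realtime_view(base_rows, log_rows):
    merged = {r["key"]: r for r in base_rows}
    for r in log_rows:
        merged[r["key"]] = r  # log entry supersedes the base row
    return list(merged.values())
```

The read‑optimized view simply skips the overlay and returns `base_rows`, trading freshness for pure columnar scan speed.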
Incremental Processing
Because each commit records timestamps and file versions, Hudi can extract incremental changes between a start and end timestamp, enabling watermark‑based joins and upserts on HDFS‑based model tables.
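An incremental pull then reduces to filtering records by the commit instant that produced them. The sketch below assumes each record exposes its commit timestamp in a `_commit_time` field (Hudi exposes commit metadata via similar record-level meta fields):

```python
# Incremental-view sketch: return only records committed in the
# half-open window (begin_ts, end_ts], i.e. "everything new since the
# consumer's last watermark".

def incremental_pull(rows, begin_ts, end_ts):
    return [r for r in rows if begin_ts < r["_commit_time"] <= end_ts]
```

A downstream job stores `end_ts` as its watermark and passes it as `begin_ts` on the next run, giving exactly-once, change-log-style consumption over HDFS data.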