
Understanding Apache Hudi: Concepts, Architecture, Usage, and Best Practices

This article introduces Apache Hudi, explains its architecture and storage models, describes how it enables upserts and incremental queries on Hadoop, provides step‑by‑step guidance for integrating Hudi with Apache Spark, and outlines best practices and comparisons with Apache Kudu.


1. What is Hudi?

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source library for managing large analytical datasets on HDFS that reduces data latency during ingestion. Originally developed at Uber, it offers two table types: a Read Optimized Table for columnar query performance and a Near-Real-Time Table for low-latency queries over combined row and columnar data.

2. How does Hudi work?

Hudi provides two core primitives: upserts and incremental consumption. It maintains a timeline of all operations performed on a table and organizes data into partitioned directories, much like Hive tables. Files are versioned by a unique file ID plus the commit timestamp that produced them. On this foundation Hudi supports two write paths: copy-on-write, which rewrites the affected columnar files as new versions on each commit, and merge-on-read, which logs updates in row format and compacts them into columnar files later, enabling near-real-time access.
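The copy-on-write path above can be illustrated with a small, self-contained sketch. This is a toy in-memory model, not Hudi internals: the class and field names (`CopyOnWriteTable`, `timeline`, `versions`) are hypothetical, chosen only to show how each commit produces a new file version while older versions remain readable via the timeline.

```python
# Toy model of Hudi-style copy-on-write file versioning (illustrative, not Hudi code).
from collections import defaultdict

class CopyOnWriteTable:
    """Each upsert commit rewrites the affected file group as a new version."""
    def __init__(self):
        self.timeline = []                  # ordered commit timestamps
        self.versions = defaultdict(dict)   # file_id -> {commit_ts: records}

    def commit(self, ts, file_id, records):
        self.timeline.append(ts)
        # Copy-on-write: merge the previous version with the incoming
        # records and write the result as a brand-new file version.
        prev = self.latest(file_id) or {}
        self.versions[file_id][ts] = {**prev, **records}

    def latest(self, file_id):
        """Read-optimized view: always serve the newest file version."""
        if not self.versions[file_id]:
            return None
        return self.versions[file_id][max(self.versions[file_id])]

t = CopyOnWriteTable()
t.commit("001", "f1", {"k1": "a", "k2": "b"})
t.commit("002", "f1", {"k2": "B"})   # upsert rewrites file f1 as a new version
print(t.latest("f1"))                # → {'k1': 'a', 'k2': 'B'}
```

Note that the version written at commit `001` is still intact, which is what lets Hudi serve queries as of an older commit on the timeline.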

Views determine how data is read:

Read Optimized View: Queries only compressed columnar files (e.g., Parquet) for high performance.

Near-Real-Time View: Combines row-based and columnar storage for ~1-5 minute latency.

Incremental View: Exposes only the records changed since a given checkpoint, enabling efficient incremental pulls.
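The incremental view amounts to filtering records by commit timestamp against a consumer's checkpoint. A minimal sketch of that idea, with hypothetical record and field names (`commit_ts`, `incremental_pull`):

```python
# Illustrative sketch: an incremental pull returns only records
# committed after the consumer's last checkpoint timestamp.
records = [
    {"key": "k1", "commit_ts": "20240101010101", "value": "a"},
    {"key": "k2", "commit_ts": "20240101020202", "value": "b"},
    {"key": "k3", "commit_ts": "20240101030303", "value": "c"},
]

def incremental_pull(records, checkpoint_ts):
    """Keep only records whose commit timestamp is newer than the checkpoint."""
    return [r for r in records if r["commit_ts"] > checkpoint_ts]

changed = incremental_pull(records, "20240101020202")
print([r["key"] for r in changed])   # → ['k3']
```

The consumer then advances its checkpoint to the newest commit it saw, so the next pull again reads only the delta rather than the whole table.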

The storage layer consists of three parts:

Metadata: Timeline-based metadata recording commit, clean, compaction, and index operations.

Index: A pluggable index (Bloom filter by default) that maps record keys to files; an HBase-backed index is available as an alternative for faster key lookups at scale.

Data: Two formats – a read-optimized columnar format (Parquet by default) and a write-optimized row format (Avro by default).
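To make the index's role concrete, here is a toy Bloom-filter sketch. In real Hudi the filter lives in each data file's footer; everything below (class name, sizes, the per-file dictionary) is a simplified stand-in used only to show how an upsert narrows down which files might hold a given record key.

```python
# Toy Bloom-filter index mapping record keys to candidate files (illustrative only).
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several bit positions per key from salted SHA-256 digests.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # May return a false positive, but never a false negative.
        return all(self.bits >> p & 1 for p in self._positions(key))

# One filter per data file: an upsert probes the filters to find
# candidate files that may need to be rewritten.
file_filters = {"file_1": BloomFilter(), "file_2": BloomFilter()}
file_filters["file_1"].add("k1")
file_filters["file_2"].add("k9")
candidates = [f for f, bf in file_filters.items() if bf.might_contain("k1")]
print(candidates)   # likely ['file_1']; false positives are possible
```

Because false negatives are impossible, any file whose filter rejects the key can safely be skipped, which is what keeps upserts from scanning the whole table.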

3. Why is Hudi important for large‑scale and near‑real‑time applications?

Hudi addresses Hadoop’s scalability limits, provides faster data presentation, adds native support for updates and deletes, accelerates ETL and modeling, and allows incremental queries using the latest checkpoint timestamp without scanning the entire source table.

4. Using Apache Spark with Hudi for data pipelines

4.1 Download Hudi

# Default build
$ mvn clean install -DskipTests -DskipITs
# Build against older Hive 1.1.x
$ mvn clean install -DskipTests -DskipITs -Dhive11

4.2 Version compatibility

Hudi requires Java 8 and works with Spark 2.x. Example compatibility matrix:

| Hadoop | Hive | Spark | Build command |
| --- | --- | --- | --- |
| Apache Hadoop-2.8.4 | Apache Hive-2.3.3 | spark-2.[1-3].x | mvn clean install -DskipTests |
| Apache Hadoop-2.7.3 | Apache Hive-1.2.1 | spark-2.[1-3].x | mvn clean install -DskipTests |

4.3 Generate a Hudi dataset

Set environment variables before running Spark jobs:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
export HIVE_HOME=/var/hadoop/setup/apache-hive-1.1.0-cdh5.7.2-bin
export HADOOP_HOME=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
export HADOOP_INSTALL=/var/hadoop/setup/hadoop-2.6.0-cdh5.7.2
export HADOOP_CONF_DIR=$HADOOP_INSTALL/etc/hadoop
export SPARK_HOME=/var/hadoop/setup/spark-2.3.1-hadoop2.7
export SPARK_INSTALL=$SPARK_HOME
export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH

4.4 API support

Hudi can be accessed through the Spark DataSource API with just a few lines of code, or through the RDD API when more advanced control over reads and writes is needed.
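A hedged sketch of the DataSource path: the option keys below are standard Hudi write configurations, but the table name, field names, and output path are hypothetical placeholders, and the actual write (which requires a live SparkSession) is shown only in comments.

```python
# Hudi write options for the Spark DataSource API.
# Table/field/path names here are hypothetical examples.
hudi_options = {
    "hoodie.table.name": "trips",                                # example table name
    "hoodie.datasource.write.recordkey.field": "trip_id",        # unique record key
    "hoodie.datasource.write.partitionpath.field": "event_date", # partition column
    "hoodie.datasource.write.precombine.field": "ts",            # keep latest by ts
    "hoodie.datasource.write.operation": "upsert",
}

# With a SparkSession available, the upsert is a few lines:
#   df.write.format("org.apache.hudi") \
#       .options(**hudi_options) \
#       .mode("append") \
#       .save("/data/hudi/trips")
print(sorted(hudi_options))
```

The precombine field is what lets Hudi deduplicate multiple incoming versions of the same key within a batch, keeping only the record with the latest value.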

5. Hudi best practices

Use a custom HoodieRecordPayload type and retain the previous payload in combineAndGetUpdateValue(...), so downstream incremental ETL does not double-count records.

Perform a left join that keeps all records by key and insert only the rows where persisted_data.key is null, so the BloomIndex and file metadata are fully utilized.

Add a flag field to the HoodieRecord, read from the payload metadata, indicating whether the old record should be copied during writes.

Pass a flag through the DataFrame options to force the job to copy old records when necessary.
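The payload-merge and flag-field practices above can be sketched together in plain Python. This is not the Hudi Java API: the function name mirrors combineAndGetUpdateValue only loosely, and the `_copied_old_record` flag field is a hypothetical marker for whether old fields were carried over.

```python
# Illustrative merge semantics (hypothetical names, not Hudi's Java API):
# keep old fields that the incoming update does not set, and flag the record
# when anything from the old version was carried over.
def combine_and_get_update_value(old, new):
    merged = {**old, **{k: v for k, v in new.items() if v is not None}}
    merged["_copied_old_record"] = any(
        k not in new or new[k] is None for k in old
    )
    return merged

old = {"id": 1, "amount": 10, "note": "first"}
new = {"id": 1, "amount": 12, "note": None}   # incremental update without a note
print(combine_and_get_update_value(old, new))
# → {'id': 1, 'amount': 12, 'note': 'first', '_copied_old_record': True}
```

Retaining the old payload this way is what prevents a downstream incremental consumer from seeing a partially-populated record and counting it as a brand-new row.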

6. Advantages of Hudi

Overcomes HDFS scalability limits.

Provides fast data presentation within Hadoop.

Supports updates and deletes on existing data.

Enables rapid ETL and modeling.

7. Comparison with Apache Kudu

Both Hudi and Kudu target real‑time analytics on petabyte‑scale data, but Kudu is designed for OLTP workloads while Hudi focuses on OLAP. Kudu lacks incremental pull support, whereas Hudi provides it. Hudi runs on any Hadoop‑compatible file system (HDFS, S3, Ceph) and scales like other Spark jobs, while Kudu relies on its own storage servers communicating via Raft.

8. Summary

Hudi fills a critical gap for managing data on HDFS, coexisting well with other big‑data technologies. It is best suited for performing insert and update operations on Parquet‑formatted data stored on top of HDFS.

Tags: Big Data, Data Lake, Spark, Hadoop, Apache Hudi, Incremental Processing
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
