Big Data 18 min read

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Big Data Technology & Architecture

Aug 23, 2020

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

1. Introduction

Apache Hudi (Hudi) enables you to store large amounts of data on Hadoop‑compatible storage and provides two primitives that allow not only classic batch processing but also stream processing on a data lake.

Update/Delete records: Hudi uses fine‑grained file/record‑level indexes to support updates and deletes, offering transactional guarantees for write operations. Queries read the latest committed snapshot.

Change stream: Hudi can retrieve incremental streams of updated/inserted/deleted records from a given point in time, unlocking new query patterns.

If you need to quickly ingest data into HDFS or cloud storage, or if your ETL/Hive/Spark jobs are slow or resource‑heavy, Hudi can help by providing incremental read/write capabilities.

2. Basic Concepts

Storage Types

Copy on Write (COW) : Stores data only in columnar files (Parquet). Writes create new base files by rewriting entire column files, making writes expensive but reads cheap. Suitable for read‑heavy workloads.

Merge on Read (MOR) : Stores data in a combination of columnar (Parquet) and row‑based (Avro) files. Updates are written to incremental Avro files and later compacted into new Parquet files. Suitable for write‑heavy workloads.

Views (Query Modes)

Read Optimized View : Queries only the latest base Parquet files, offering the same performance as native columnar datasets.

Incremental View : Queries only files written after a specified commit/compaction, enabling change‑data capture.

Real‑time View : Dynamically merges the latest base files with incremental files, providing near‑real‑time results (a few minutes latency). This view works only with MOR.

3. Timeline

Hudi maintains a timeline for every operation (write, delete, compaction, etc.) with timestamps. By querying the timeline, you can retrieve data as of a specific commit or before a certain time, avoiding full scans and efficiently consuming only changed files.

4. Typical Application Scenarios

4.1 Near‑real‑time Ingestion

Hudi can ingest data from event logs, relational databases (via binlog or Sqoop), and NoSQL stores (Cassandra, HBase, etc.) with upserts, providing faster and more efficient loading than bulk batch jobs.

4.2 Near‑real‑time Analytics

By shortening data freshness to minutes, Hudi offers an efficient alternative to dedicated streaming analytics engines, allowing interactive SQL queries (Presto, SparkSQL) on up‑to‑date data without extra infrastructure.

4.3 Incremental Processing Pipelines

Hudi’s record‑level incremental reads enable downstream pipelines to consume only newly changed data, reducing latency and avoiding repeated processing of large historical partitions.

4.4 Data Distribution on DFS

Instead of duplicating data between DFS and a messaging system, you can write to a Hudi table and use its incremental view to stream new data to downstream services, achieving unified storage.

5. Getting Started Example

5.1 Compile

Clone the repository and build with Maven (skip tests):

cd incubator-hudi-hoodie-0.4.7
mvn clean install -DskipITs -DskipTests -Dhadoop.version=2.6.0-cdh5.13.0 -Dhive.version=1.1.0-cdh5.13.0

5.2 Quick Start

1. Create Project

Add Scala, Spark, and Hudi dependencies to your Maven pom.xml (excerpt):

<properties>
   <scala.version>2.11</scala.version>
   <spark.version>2.4.0</spark.version>
   <parquet.version>1.10.1</parquet.version>
   <hudi.version>0.4.7</hudi.version>
</properties>

<dependencies>
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_${scala.version}</artifactId>
       <version>${spark.version}</version>
   </dependency>
   ... (other Spark, Hudi, Avro, Parquet dependencies) ...
</dependencies>

2. Insert Data

Prepare a JSON file with records and build a Spark session:

val spark = SparkSession.builder
      .master("local")
      .appName("Demo2")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .enableHiveSupport()
      .getOrCreate

Read the JSON and write to a Hudi table:

val jsonData = spark.read.json("file:///path/to/insert.json")
val tableName = "test_data"
val basePath = "file:///path/to/hudi_data/" + tableName

jsonData.write.format("com.uber.hoodie")
      .option("hoodie.upsert.shuffle.parallelism", "1")
      .option(PRECOMBINE_FIELD_OPT_KEY, "id")
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(KEYGENERATOR_CLASS_OPT_KEY, "com.mbp.study.DayKeyGenerator")
      .option(TABLE_NAME, tableName)
      .mode(SaveMode.Overwrite)
      .save(basePath)

3. Query Data

val df = spark.read.format("com.uber.hoodie").load(basePath + "/*/*")
df.show(false)

4. Update Data

Create an update JSON and perform an upsert (append mode):

val updateJson = spark.read.json("/path/to/update.json")
updateJson.write.format("com.uber.hoodie")
  .option("hoodie.insert.shuffle.parallelism", "2")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(PRECOMBINE_FIELD_OPT_KEY, "id")
  .option(RECORDKEY_FIELD_OPT_KEY, "id")
  .option(TABLE_NAME, tableName)
  .mode(SaveMode.Append)
  .save(basePath)

5. Incremental Query

val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(_.getString(0)).take(50)
val beginTime = commits(commits.length - 2)

val incViewDF = spark.read.format("org.apache.hudi")
    .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
    .option(BEGIN_INSTANTTIME_OPT_KEY, beginTime)
    .load(basePath)
incViewDF.createOrReplaceTempView("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()

6. Point‑in‑Time Query

val beginTime = "000"
val endTime = commits(commits.length - 2)
val incViewDF = spark.read.format("org.apache.hudi")
    .option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
    .option(BEGIN_INSTANTTIME_OPT_KEY, beginTime)
    .option(END_INSTANTTIME_OPT_KEY, endTime)
    .load(basePath)
incViewDF.createOrReplaceTempView("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()

7. Sync to Hive

When writing, enable Hive sync and provide Hive connection details:

jsonData.write.format("com.uber.hoodie")
      .option("hoodie.upsert.shuffle.parallelism", "1")
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, "etl_tx_dt")
      .option(HIVE_URL_OPT_KEY, "jdbc:hive2://xxx.xxx.xxx.xxx:10000")
      .option(HIVE_USER_OPT_KEY, "hive")
      .option(HIVE_PASS_OPT_KEY, "123")
      .option(HIVE_DATABASE_OPT_KEY, "test")
      .option(HIVE_SYNC_ENABLED_OPT_KEY, true)
      .option(HIVE_TABLE_OPT_KEY, tableName)
      .option(PRECOMBINE_FIELD_OPT_KEY, "id")
      .option(RECORDKEY_FIELD_OPT_KEY, "id")
      .option(TABLE_NAME, tableName)
      .mode(SaveMode.Append)
      .save(basePath)

Upload the required Hudi jars (e.g., hoodie-hadoop-mr-0.4.7.jar, hoodie-common-0.4.7.jar) to the Hive cluster and query the table directly.

6. Conclusion

The article provides a comprehensive introduction to Apache Hudi, covering its architecture, query capabilities, common use cases, and a hands‑on guide to get started with Scala and Spark, including data ingestion, updates, incremental queries, and Hive integration.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Data Lake Spark Apache Hudi Incremental Processing Scala Hive Integration

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.