Apache Hudi Overview, Core Concepts, and Quick‑Start Guide
This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.
1. Introduction
Apache Hudi (Hudi) enables you to store large amounts of data on Hadoop‑compatible storage and provides two primitives that allow not only classic batch processing but also stream processing on a data lake.
Update/Delete records: Hudi uses fine‑grained file/record‑level indexes to support updates and deletes, offering transactional guarantees for write operations. Queries read the latest committed snapshot.
Change stream: Hudi can retrieve incremental streams of updated/inserted/deleted records from a given point in time, unlocking new query patterns.
If you need to quickly ingest data into HDFS or cloud storage, or if your ETL/Hive/Spark jobs are slow or resource‑heavy, Hudi can help by providing incremental read/write capabilities.
2. Basic Concepts
Storage Types
Copy on Write (COW) : Stores data only in columnar files (Parquet). Writes create new base files by rewriting entire column files, making writes expensive but reads cheap. Suitable for read‑heavy workloads.
Merge on Read (MOR) : Stores data in a combination of columnar (Parquet) and row‑based (Avro) files. Updates are written to incremental Avro files and later compacted into new Parquet files. Suitable for write‑heavy workloads.
Views (Query Modes)
Read Optimized View : Queries only the latest base Parquet files, offering the same performance as native columnar datasets.
Incremental View : Queries only files written after a specified commit/compaction, enabling change‑data capture.
Real‑time View : Dynamically merges the latest base files with incremental files, providing near‑real‑time results (a few minutes latency). This view works only with MOR.
3. Timeline
Hudi maintains a timeline for every operation (write, delete, compaction, etc.) with timestamps. By querying the timeline, you can retrieve data as of a specific commit or before a certain time, avoiding full scans and efficiently consuming only changed files.
4. Typical Application Scenarios
4.1 Near‑real‑time Ingestion
Hudi can ingest data from event logs, relational databases (via binlog or Sqoop), and NoSQL stores (Cassandra, HBase, etc.) with upserts, providing faster and more efficient loading than bulk batch jobs.
4.2 Near‑real‑time Analytics
By shortening data freshness to minutes, Hudi offers an efficient alternative to dedicated streaming analytics engines, allowing interactive SQL queries (Presto, SparkSQL) on up‑to‑date data without extra infrastructure.
4.3 Incremental Processing Pipelines
Hudi’s record‑level incremental reads enable downstream pipelines to consume only newly changed data, reducing latency and avoiding repeated processing of large historical partitions.
4.4 Data Distribution on DFS
Instead of duplicating data between DFS and a messaging system, you can write to a Hudi table and use its incremental view to stream new data to downstream services, achieving unified storage.
5. Getting Started Example
5.1 Compile
Clone the repository and build with Maven (skip tests):
cd incubator-hudi-hoodie-0.4.7
mvn clean install -DskipITs -DskipTests -Dhadoop.version=2.6.0-cdh5.13.0 -Dhive.version=1.1.0-cdh5.13.05.2 Quick Start
1. Create Project
Add Scala, Spark, and Hudi dependencies to your Maven pom.xml (excerpt):
<properties>
<scala.version>2.11</scala.version>
<spark.version>2.4.0</spark.version>
<parquet.version>1.10.1</parquet.version>
<hudi.version>0.4.7</hudi.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.version}</artifactId>
<version>${spark.version}</version>
</dependency>
... (other Spark, Hudi, Avro, Parquet dependencies) ...
</dependencies>2. Insert Data
Prepare a JSON file with records and build a Spark session:
val spark = SparkSession.builder
.master("local")
.appName("Demo2")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.enableHiveSupport()
.getOrCreateRead the JSON and write to a Hudi table:
val jsonData = spark.read.json("file:///path/to/insert.json")
val tableName = "test_data"
val basePath = "file:///path/to/hudi_data/" + tableName
jsonData.write.format("com.uber.hoodie")
.option("hoodie.upsert.shuffle.parallelism", "1")
.option(PRECOMBINE_FIELD_OPT_KEY, "id")
.option(RECORDKEY_FIELD_OPT_KEY, "id")
.option(KEYGENERATOR_CLASS_OPT_KEY, "com.mbp.study.DayKeyGenerator")
.option(TABLE_NAME, tableName)
.mode(SaveMode.Overwrite)
.save(basePath)3. Query Data
val df = spark.read.format("com.uber.hoodie").load(basePath + "/*/*")
df.show(false)4. Update Data
Create an update JSON and perform an upsert (append mode):
val updateJson = spark.read.json("/path/to/update.json")
updateJson.write.format("com.uber.hoodie")
.option("hoodie.insert.shuffle.parallelism", "2")
.option("hoodie.upsert.shuffle.parallelism", "2")
.option(PRECOMBINE_FIELD_OPT_KEY, "id")
.option(RECORDKEY_FIELD_OPT_KEY, "id")
.option(TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath)5. Incremental Query
val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(_.getString(0)).take(50)
val beginTime = commits(commits.length - 2)
val incViewDF = spark.read.format("org.apache.hudi")
.option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
.option(BEGIN_INSTANTTIME_OPT_KEY, beginTime)
.load(basePath)
incViewDF.createOrReplaceTempView("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()6. Point‑in‑Time Query
val beginTime = "000"
val endTime = commits(commits.length - 2)
val incViewDF = spark.read.format("org.apache.hudi")
.option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL)
.option(BEGIN_INSTANTTIME_OPT_KEY, beginTime)
.option(END_INSTANTTIME_OPT_KEY, endTime)
.load(basePath)
incViewDF.createOrReplaceTempView("hudi_incr_table")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()7. Sync to Hive
When writing, enable Hive sync and provide Hive connection details:
jsonData.write.format("com.uber.hoodie")
.option("hoodie.upsert.shuffle.parallelism", "1")
.option(HIVE_PARTITION_FIELDS_OPT_KEY, "etl_tx_dt")
.option(HIVE_URL_OPT_KEY, "jdbc:hive2://xxx.xxx.xxx.xxx:10000")
.option(HIVE_USER_OPT_KEY, "hive")
.option(HIVE_PASS_OPT_KEY, "123")
.option(HIVE_DATABASE_OPT_KEY, "test")
.option(HIVE_SYNC_ENABLED_OPT_KEY, true)
.option(HIVE_TABLE_OPT_KEY, tableName)
.option(PRECOMBINE_FIELD_OPT_KEY, "id")
.option(RECORDKEY_FIELD_OPT_KEY, "id")
.option(TABLE_NAME, tableName)
.mode(SaveMode.Append)
.save(basePath)Upload the required Hudi jars (e.g., hoodie-hadoop-mr-0.4.7.jar, hoodie-common-0.4.7.jar) to the Hive cluster and query the table directly.
6. Conclusion
The article provides a comprehensive introduction to Apache Hudi, covering its architecture, query capabilities, common use cases, and a hands‑on guide to get started with Scala and Spark, including data ingestion, updates, incremental queries, and Hive integration.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
