Using Spark SQL to Operate on Apache Hudi Tables – Step‑by‑Step Guide
This tutorial demonstrates how to use Spark SQL to create, insert, update, delete, merge, and drop Apache Hudi tables, covering environment setup, Spark‑SQL launch, configuration, and a series of SQL commands with example outputs.
Abstract: The community has long awaited the integration of Apache Hudi with Spark SQL (tracked as HUDI-1659). The integration is close to completion and will greatly simplify DDL/DML operations on Hudi tables.
Environment preparation: Clone the HUDI-1659 branch locally and build the Spark bundle JAR, producing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar.
2.1 Start spark‑sql
spark-sql --jars $PATH_TO_SPARK_BUNDLE_JAR \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
2.2 Set parallelism and metadata sync
set hoodie.upsert.shuffle.parallelism = 1;
set hoodie.insert.shuffle.parallelism = 1;
set hoodie.delete.shuffle.parallelism = 1;
set hoodie.datasource.meta.sync.enable = false;
Create Table
Run the following SQL to create a MOR-type Hudi table with primary key id and partition column dt:
create table test_hudi_table (
id int,
name string,
price double,
ts long,
dt string
) using hudi
partitioned by (dt)
options (
primaryKey = 'id',
type = 'mor'
)
location 'file:///tmp/test_hudi_table';
The table is created with MOR storage, id as the primary key, and dt as the partition field.
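The definition can be verified with standard Spark SQL commands; this check is an optional addition to the walkthrough, assuming the table was created as above:

```sql
-- Optional sanity checks after creation:
show create table test_hudi_table;  -- prints the stored DDL, including the Hudi options
describe test_hudi_table;           -- lists the declared columns and partition field
```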
Insert Into
Insert a single record:
insert into test_hudi_table select 1 as id, 'hudi' as name, 10 as price, 1000 as ts, '2021-05-05' as dt;
After insertion, the local directory shows the same metadata, partition, and data layout as a normal Spark datasource write.
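Multiple rows can be inserted in one statement as well; a sketch against the same table, assuming standard INSERT INTO ... SELECT support as in the single-row example (the extra ids and values are illustrative):

```sql
-- Illustrative multi-row insert via UNION ALL:
insert into test_hudi_table
select 2 as id, 'hudi_2' as name, 12 as price, 1000 as ts, '2021-05-05' as dt
union all
select 3 as id, 'hudi_3' as name, 15 as price, 1000 as ts, '2021-05-06' as dt;
```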
Select
Query the table:
select * from test_hudi_table;
The result displays the inserted row.
Update
Update the price of the row with id = 1 to 20.0:
update test_hudi_table set price = 20.0 where id = 1;
Querying again shows the updated price, and a new deltacommit folder with an incremental log file appears.
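An UPDATE is not limited to a single column; a sketch combining several assignments with a compound predicate, assuming the branch supports standard multi-assignment SET clauses (the values are illustrative):

```sql
-- Illustrative multi-column update:
update test_hudi_table
set price = 25.0, name = 'hudi_new'
where id = 1 and dt = '2021-05-05';
```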
Delete
Delete the row with id = 1:
delete from test_hudi_table where id = 1;
A new deltacommit folder is generated, and subsequent queries return no rows.
Merge Into
Insert via MERGE WHEN NOT MATCHED:
merge into test_hudi_table as t0 using (
select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when not matched and s0.id % 2 = 1 then insert *;
Query shows one record.
Update via MERGE WHEN MATCHED:
merge into test_hudi_table as t0 using (
select 1 as id, 'a1' as name, 12 as price, 1001 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when matched and s0.id % 2 = 1 then update set *;
Query shows the partition and data updated.
Delete via MERGE WHEN MATCHED:
merge into test_hudi_table t0 using (
select 1 as s_id, 'a2' as s_name, 15 as s_price, 1001 as s_ts, '2021-03-21' as dt
) s0 on t0.id = s0.s_id
when matched and s0.s_ts = 1001 then delete;
After this operation the table is empty.
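The three MERGE forms shown separately above can also be combined into one statement; a sketch, assuming the standard multi-clause MERGE syntax with per-clause conditions (the ids and values are illustrative):

```sql
-- Illustrative combined MERGE: update, delete, or insert in one pass
merge into test_hudi_table as t0 using (
  select 2 as id, 'a3' as name, 18 as price, 1002 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when matched and s0.ts > 1001 then update set *
when matched and s0.ts <= 1001 then delete
when not matched then insert *;
```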
Drop Table
Remove the Hudi table:
drop table test_hudi_table;
Running show tables; confirms that the table no longer exists.
Conclusion
The example demonstrates how Spark SQL can be used to perform Insert, Update, Delete, and Merge operations on Hudi tables, lowering the barrier to adopting Hudi. Future work will continue to align Hudi's SQL syntax with that of systems such as Snowflake and BigQuery, adding features such as multi-table INSERT, schema evolution, and Hudi service calls like CALL Cleaner and CALL Clustering.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies