Using Spark SQL to Operate on Apache Hudi Tables – Step‑by‑Step Guide
This tutorial demonstrates how to use Spark SQL to create, insert, update, delete, merge, and drop Apache Hudi tables, covering environment setup, Spark‑SQL launch, configuration, and a series of SQL commands with example outputs.
Abstract: The community has long awaited the integration of Apache Hudi with Spark SQL (tracked as HUDI-1659). The integration is close to completion and will greatly simplify DDL/DML operations on Hudi tables.
Environment preparation: Clone the HUDI-1659 branch locally and build the Spark bundle JAR, producing hudi-spark-bundle_2.11-0.9.0-SNAPSHOT.jar.
2.1 Start spark‑sql
spark-sql --jars $PATH_TO_SPARK_BUNDLE_JAR \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
2.2 Set parallelism and metadata sync
set hoodie.upsert.shuffle.parallelism = 1;
set hoodie.insert.shuffle.parallelism = 1;
set hoodie.delete.shuffle.parallelism = 1;
set hoodie.datasource.meta.sync.enable = false;
Create Table
Run the following SQL to create a MOR-type Hudi table with primary key id and partition column dt:
create table test_hudi_table (
id int,
name string,
price double,
ts long,
dt string
) using hudi
partitioned by (dt)
options (
primaryKey = 'id',
type = 'mor'
)
location 'file:///tmp/test_hudi_table';
The table is created with MOR storage, id as the primary key, and dt as the partition field.
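The definition can be verified with standard Spark SQL commands; this check is an optional addition to the walkthrough, assuming the table was created as above:

```sql
-- Optional sanity checks after creation:
show create table test_hudi_table;  -- prints the stored DDL, including the Hudi options
describe test_hudi_table;           -- lists the declared columns and partition field
```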
Insert Into
Insert a single record:
insert into test_hudi_table select 1 as id, 'hudi' as name, 10 as price, 1000 as ts, '2021-05-05' as dt;
After insertion, the local directory shows the same metadata, partition, and data layout as a normal Spark datasource write.
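Multiple rows can be inserted in one statement as well; a sketch against the same table, assuming standard INSERT INTO ... SELECT support as in the single-row example (the extra ids and values are illustrative):

```sql
-- Illustrative multi-row insert via UNION ALL:
insert into test_hudi_table
select 2 as id, 'hudi_2' as name, 12 as price, 1000 as ts, '2021-05-05' as dt
union all
select 3 as id, 'hudi_3' as name, 15 as price, 1000 as ts, '2021-05-06' as dt;
```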
Select
Query the table:
select * from test_hudi_table;
The result displays the inserted row.
Update
Update the price of the row with id = 1 to 20.0:
update test_hudi_table set price = 20.0 where id = 1;
Querying again shows the updated price, and a new deltacommit folder with an incremental log file appears.
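An UPDATE is not limited to a single column; a sketch combining several assignments with a compound predicate, assuming the branch supports standard multi-assignment SET clauses (the values are illustrative):

```sql
-- Illustrative multi-column update:
update test_hudi_table
set price = 25.0, name = 'hudi_new'
where id = 1 and dt = '2021-05-05';
```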
Delete
Delete the row with id = 1:
delete from test_hudi_table where id = 1;
A new deltacommit folder is generated, and subsequent queries return no rows.
Merge Into
Insert via MERGE WHEN NOT MATCHED:
merge into test_hudi_table as t0 using (
select 1 as id, 'a1' as name, 10 as price, 1000 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when not matched and s0.id % 2 = 1 then insert *;
Query shows one record.
Update via MERGE WHEN MATCHED:
merge into test_hudi_table as t0 using (
select 1 as id, 'a1' as name, 12 as price, 1001 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when matched and s0.id % 2 = 1 then update set *;
Query shows the partition and data updated.
Delete via MERGE WHEN MATCHED:
merge into test_hudi_table t0 using (
select 1 as s_id, 'a2' as s_name, 15 as s_price, 1001 as s_ts, '2021-03-21' as dt
) s0 on t0.id = s0.s_id
when matched and s0.s_ts = 1001 then delete;
After this operation the table is empty.
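The three MERGE forms shown separately above can also be combined into one statement; a sketch, assuming the standard multi-clause MERGE syntax with per-clause conditions (the ids and values are illustrative):

```sql
-- Illustrative combined MERGE: update, delete, or insert in one pass
merge into test_hudi_table as t0 using (
  select 2 as id, 'a3' as name, 18 as price, 1002 as ts, '2021-03-21' as dt
) as s0 on t0.id = s0.id
when matched and s0.ts > 1001 then update set *
when matched and s0.ts <= 1001 then delete
when not matched then insert *;
```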
Drop Table
Remove the Hudi table:
drop table test_hudi_table;
Running show tables; confirms that the table no longer exists.
Conclusion
The example demonstrates how Spark SQL can be used to perform Insert, Update, Delete, and Merge operations on Hudi tables, lowering the barrier to adopting Hudi. Future work will continue to align Hudi's SQL syntax with that of systems such as Snowflake and BigQuery, adding features such as multi-table INSERT, schema evolution, and Hudi service calls like CALL Cleaner and CALL Clustering.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies