Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples
This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.
1. Hive and Presto Integration
Copy the Hudi jar to the Hive $HIVE_HOME/lib directory and create an external Hive table that points to the Hudi data path. The table definition uses org.apache.hudi.hadoop.HoodieParquetInputFormat and the Parquet SerDe.
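The external table definition described above could look roughly like the DDL rendered by this sketch. The helper name `externalTableDdl`, the column list, and the column types are illustrative assumptions, since the source does not show the table schema. The input format and SerDe class names are the ones named in the text, and the output format is Hive's standard Parquet output format.

```scala
object HudiHiveDdl {
  // Hypothetical helper (not part of Hudi): renders the DDL for an external
  // Hive table whose LOCATION is the base path of a Hudi dataset.
  def externalTableDdl(table: String,
                       columns: Seq[(String, String)],
                       location: String): String = {
    // One "`name` type" entry per column, comma separated
    val cols = columns.map { case (name, tpe) => s"  `$name` $tpe" }.mkString(",\n")
    s"""CREATE EXTERNAL TABLE $table (
       |$cols
       |)
       |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
       |STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
       |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
       |LOCATION '$location'""".stripMargin
  }
}
```

Calling `HudiHiveDdl.externalTableDdl("test", Seq("rowkey" -> "string", "lastupdatedttm" -> "string"), "/tmp/hudi")` yields a statement that can be pasted into the Hive CLI or Beeline once the bundle jar is on Hive's classpath. Adjust the columns to match the real schema of the data.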
cp ./packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.5.2-SNAPSHOT.jar $HIVE_HOME/lib
For Presto, copy the same jar to the Presto plugin/hive-hadoop2 directory.
cp ./packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.5.2-SNAPSHOT.jar $PRESTO_HOME/plugin/hive-hadoop2/
2. Hudi Core Operations (Scala)
Insert (non‑partitioned):
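import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.junit.Test // assuming JUnit supplies @Test

@Test
def insert(): Unit = {
  val spark = SparkSession.builder.appName("hudi insert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .master("local[3]")
    .getOrCreate()
  // Source rows to load into the Hudi table
  val insertData = spark.read.parquet("/tmp/1563959377698.parquet")
  // rowkey identifies a record; lastupdatedttm picks the winner among duplicates.
  // SaveMode.Overwrite recreates the table at the target path.
  insertData.write.format("org.apache.hudi")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rowkey")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "lastupdatedttm")
    .option("hoodie.insert.shuffle.parallelism", "2")
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .option(HoodieWriteConfig.TABLE_NAME, "test")
    .mode(SaveMode.Overwrite)
    .save("/tmp/hudi")
}
Insert with partition: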
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
import org.apache.hudi.index.HoodieIndex
import org.apache.spark.sql.{SaveMode, SparkSession}

@Test
def insertPartition(): Unit = {
  val spark = SparkSession.builder.appName("hudi insert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .master("local[3]")
    .getOrCreate()
  // Util.readFromTxtByLineToDf is the author's own helper that loads a text file into a DataFrame
  val insertData = Util.readFromTxtByLineToDf(spark, "/home/huangjing/soft/git/experiment/hudi-test/src/main/resources/test_insert_data.txt")
  // Partition by dt and use a global Bloom index, so a record whose dt changes
  // is moved to its new partition instead of being duplicated.
  insertData.write.format("org.apache.hudi")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rowkey")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "lastupdatedttm")
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
    .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
    .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
    .option("hoodie.insert.shuffle.parallelism", "2")
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .option(HoodieWriteConfig.TABLE_NAME, "test_partition")
    .mode(SaveMode.Overwrite)
    .save("/tmp/hudi")
}
Upsert (non‑partitioned and partitioned), delete, query, and Hive synchronization follow the same pattern, using the appropriate Hudi options such as DataSourceWriteOptions.TABLE_TYPE_OPT_KEY for MOR tables and DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY for Hive sync.
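Since the upsert and delete variants are only described, here is a minimal sketch of how their options might be assembled. The object `HudiWriteOptions` and its method names are hypothetical. The string keys are, to my understanding, the values behind the `DataSourceWriteOptions` and `HoodieWriteConfig` constants in releases around 0.5.x, so verify them against the Hudi version you run.

```scala
object HudiWriteOptions {
  // Options every write needs: table name, record key, and precombine field.
  def base(table: String, recordKey: String, preCombine: String): Map[String, String] = Map(
    "hoodie.table.name" -> table,
    "hoodie.datasource.write.recordkey.field" -> recordKey,
    "hoodie.datasource.write.precombine.field" -> preCombine
  )

  // Upsert: update rows whose key already exists, insert the rest.
  def upsert(table: String, recordKey: String, preCombine: String): Map[String, String] =
    base(table, recordKey, preCombine) + ("hoodie.datasource.write.operation" -> "upsert")

  // Delete: remove the rows whose keys appear in the DataFrame being written.
  def delete(table: String, recordKey: String, preCombine: String): Map[String, String] =
    base(table, recordKey, preCombine) + ("hoodie.datasource.write.operation" -> "delete")

  // Adds the partition path field for partitioned tables.
  def partitionedBy(opts: Map[String, String], field: String): Map[String, String] =
    opts + ("hoodie.datasource.write.partitionpath.field" -> field)

  // Switches the table type from the default copy-on-write to merge-on-read.
  def mergeOnRead(opts: Map[String, String]): Map[String, String] =
    opts + ("hoodie.datasource.write.table.type" -> "MERGE_ON_READ")
}
```

With a DataFrame `df` of changed rows, a write would then look like `df.write.format("org.apache.hudi").options(HudiWriteOptions.upsert("test", "rowkey", "lastupdatedttm")).mode(SaveMode.Append).save("/tmp/hudi")`. Note the `SaveMode.Append`: upserts and deletes modify an existing table, whereas `SaveMode.Overwrite` as used in the insert examples would replace it.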
3. Spark SQL Integration
Start Spark‑SQL with the Hudi bundle jar and enable the Hudi session extension:
spark-sql --jars $PATH_TO_SPARK_BUNDLE_JAR \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
Set low parallelism for demo purposes and disable automatic metadata sync:
set hoodie.upsert.shuffle.parallelism = 1;
set hoodie.insert.shuffle.parallelism = 1;
set hoodie.delete.shuffle.parallelism = 1;
set hoodie.datasource.meta.sync.enable = false;
Create a Hudi table using Spark SQL DDL:
create table test_hudi_table (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'mor'
)
location 'file:///tmp/test_hudi_table';
Perform DML operations:
Insert:
INSERT INTO test_hudi_table SELECT 1 AS id, 'hudi' AS name, 10 AS price, 1000 AS ts, '2021-05-05' AS dt;
Update:
UPDATE test_hudi_table SET price = 20.0 WHERE id = 1;
Delete:
DELETE FROM test_hudi_table WHERE id = 1;
Merge Into:
Insert, update, and delete can be combined in one statement using standard Spark SQL MERGE syntax.
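The MERGE statement is mentioned but not shown, so here is a hedged sketch of an upsert-style merge against `test_hudi_table`, wrapped in Scala so it can be submitted through `spark.sql`. The source row values and the object name `HudiMergeExample` are illustrative assumptions. The source query deliberately exposes the same columns as the target so that `UPDATE SET *` and `INSERT *` can be used.

```scala
object HudiMergeExample {
  // Upsert one row: update it if id = 1 already exists, otherwise insert it.
  // The ON condition joins on the table's primary key (id).
  val mergeSql: String =
    """MERGE INTO test_hudi_table AS t
      |USING (
      |  SELECT 1 AS id, 'hudi_v2' AS name, 30 AS price, 1001 AS ts, '2021-05-05' AS dt
      |) AS s
      |ON t.id = s.id
      |WHEN MATCHED THEN UPDATE SET *
      |WHEN NOT MATCHED THEN INSERT *""".stripMargin
}
```

In a session started with the Hudi extension, this would run as `spark.sql(HudiMergeExample.mergeSql)`. A `WHEN MATCHED AND <condition> THEN DELETE` clause can be added for deletes. Which clause combinations are accepted may depend on the Hudi release, so treat this as a starting point rather than a guaranteed form.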
Finally, drop the table with DROP TABLE test_hudi_table; and verify its removal using SHOW TABLES;.
4. Common Issues & Tips
For Merge‑On‑Read tables, set
option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
instead of the generic HOODIE_TABLE_TYPE_PROP_NAME option.
Avoid adding spark‑hive dependencies that bring Hive 1.2.1 jars; Hudi requires Hive 2.x. Exclude them if necessary.
When syncing to Hive, include hive-site.xml in the classpath to prevent metastore errors.
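For the Hive sync case, the extra writer options could be gathered as in the sketch below. The object name `HudiHiveSyncOptions` and the example JDBC URL are assumptions. The keys are the `hoodie.datasource.hive_sync.*` settings as I understand them, and the two extractor classes ship with Hudi's Hive sync module, so check both against your Hudi version.

```scala
object HudiHiveSyncOptions {
  // Builds the options that turn on Hive sync for a Hudi write.
  // partitionField is None for a non-partitioned table.
  def apply(database: String,
            table: String,
            jdbcUrl: String,
            partitionField: Option[String]): Map[String, String] = {
    val common = Map(
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.database" -> database,
      "hoodie.datasource.hive_sync.table" -> table,
      "hoodie.datasource.hive_sync.jdbcurl" -> jdbcUrl
    )
    // The extractor tells the sync tool how to map storage paths to Hive partitions.
    val partitioning = partitionField match {
      case Some(field) => Map(
        "hoodie.datasource.hive_sync.partition_fields" -> field,
        "hoodie.datasource.hive_sync.partition_extractor_class" ->
          "org.apache.hudi.hive.MultiPartKeysValueExtractor"
      )
      case None => Map(
        "hoodie.datasource.hive_sync.partition_extractor_class" ->
          "org.apache.hudi.hive.NonPartitionedExtractor"
      )
    }
    common ++ partitioning
  }
}
```

These would be passed alongside the regular write options, for example `.options(HudiHiveSyncOptions("default", "test_partition", "jdbc:hive2://localhost:10000", Some("dt")))`, where the database name and URL are placeholders for your own HiveServer2 endpoint.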
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.