Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Big Data Technology & Architecture

1. Hive and Presto Integration

Copy the hudi-hadoop-mr-bundle jar into $HIVE_HOME/lib, then create an external Hive table whose location is the Hudi data path. The table definition uses org.apache.hudi.hadoop.HoodieParquetInputFormat as its input format, together with the Parquet SerDe.

cp ./packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.5.2-SNAPSHOT.jar $HIVE_HOME/lib
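The external table definition mentioned above is not shown in the article, so the following is a minimal sketch of what it could look like, written as a small Scala helper that renders the Hive DDL as a string. The helper name, the column list, and the column types are illustrative assumptions; only the input format class and the Parquet SerDe come from the text. For a partitioned table, partitions would still need to be registered in the metastore separately.

```scala
// Sketch: render Hive DDL for an external table over a Hudi dataset.
// HudiHiveDdl, the column names, and the column types are hypothetical;
// adapt them to the actual schema of your data.
object HudiHiveDdl {
  def externalTable(
      table: String,
      columns: Seq[(String, String)],     // (name, Hive type) of data columns
      partitions: Seq[(String, String)],  // (name, Hive type) of partition columns
      location: String): String = {
    val cols = columns.map { case (n, t) => s"  `$n` $t" }.mkString(",\n")
    // Emit a PARTITIONED BY clause only when partition columns are given.
    val partitionClause =
      if (partitions.isEmpty) ""
      else partitions
        .map { case (n, t) => s"`$n` $t" }
        .mkString("PARTITIONED BY (", ", ", ")\n")
    s"""CREATE EXTERNAL TABLE `$table` (
       |$cols
       |)
       |${partitionClause}ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
       |STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
       |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
       |LOCATION '$location'""".stripMargin
  }
}
```

Calling HudiHiveDdl.externalTable("test_partition", Seq("rowkey" -> "string", "lastupdatedttm" -> "string"), Seq("dt" -> "string"), "/tmp/hudi") yields a statement that can be pasted into the Hive CLI or beeline.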

For Presto, copy the same jar to the Presto plugin/hive-hadoop2 directory.

cp ./packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.5.2-SNAPSHOT.jar $PRESTO_HOME/plugin/hive-hadoop2/

2. Hudi Core Operations (Scala)

Insert (non‑partitioned):

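import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.junit.Test

@Test
def insert(): Unit = {
  // Hudi expects Spark to use the Kryo serializer.
  val spark = SparkSession.builder.appName("hudi insert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .master("local[3]")
    .getOrCreate()
  val insertData = spark.read.parquet("/tmp/1563959377698.parquet")
  insertData.write.format("org.apache.hudi")
    // Field that uniquely identifies a record.
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rowkey")
    // When several records share a key, the one with the largest value here wins.
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "lastupdatedttm")
    // Low parallelism is enough for a local test.
    .option("hoodie.insert.shuffle.parallelism", "2")
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .option(HoodieWriteConfig.TABLE_NAME, "test")
    // Overwrite replaces whatever already exists at the target path.
    .mode(SaveMode.Overwrite)
    .save("/tmp/hudi")
}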

Insert with partition:

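// Additional imports needed for the index settings below.
import org.apache.hudi.config.HoodieIndexConfig
import org.apache.hudi.index.HoodieIndex

@Test
def insertPartition(): Unit = {
  val spark = SparkSession.builder.appName("hudi insert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .master("local[3]")
    .getOrCreate()
  // Util.readFromTxtByLineToDf is a helper from the author's test project.
  val insertData = Util.readFromTxtByLineToDf(spark, "/home/huangjing/soft/git/experiment/hudi-test/src/main/resources/test_insert_data.txt")
  insertData.write.format("org.apache.hudi")
    .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "rowkey")
    .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "lastupdatedttm")
    // Column whose value determines the partition directory.
    .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "dt")
    // With a global index, move a record when its partition value changes.
    .option(HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH, "true")
    // GLOBAL_BLOOM keeps record keys unique across all partitions.
    .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.GLOBAL_BLOOM.name())
    .option("hoodie.insert.shuffle.parallelism", "2")
    .option("hoodie.upsert.shuffle.parallelism", "2")
    .option(HoodieWriteConfig.TABLE_NAME, "test_partition")
    // Same path as the previous example: Overwrite replaces the earlier table.
    .mode(SaveMode.Overwrite)
    .save("/tmp/hudi")
}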

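Upsert (non‑partitioned and partitioned), delete, query, and Hive synchronization follow the same pattern. Writes to an existing table use SaveMode.Append instead of SaveMode.Overwrite so that existing data is kept, and select the operation through DataSourceWriteOptions.OPERATION_OPT_KEY. Merge‑On‑Read tables additionally set DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, and Hive sync is switched on with DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY.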
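As a rough sketch of those variants, the option sets can be assembled as plain maps and handed to the writer. The map below uses Hudi's string configuration keys rather than the DataSourceWriteOptions constants so that it compiles without Hudi on the classpath. The object name HudiWriteOptions and the database, table, and JDBC URL values used with it are placeholders, not values from the article.

```scala
// Sketch: Hudi write options for upsert, delete, MOR tables, and Hive sync.
// HudiWriteOptions is a hypothetical helper; the keys are Hudi's string
// configuration names.
object HudiWriteOptions {
  // Options shared by every write to the table.
  def base(table: String, recordKey: String, precombine: String): Map[String, String] =
    Map(
      "hoodie.table.name" -> table,
      "hoodie.datasource.write.recordkey.field" -> recordKey,
      "hoodie.datasource.write.precombine.field" -> precombine
    )

  // Insert new keys and update existing ones.
  def upsert(table: String, recordKey: String, precombine: String): Map[String, String] =
    base(table, recordKey, precombine) + ("hoodie.datasource.write.operation" -> "upsert")

  // Remove the records whose keys appear in the written DataFrame.
  def delete(table: String, recordKey: String, precombine: String): Map[String, String] =
    base(table, recordKey, precombine) + ("hoodie.datasource.write.operation" -> "delete")

  // Write a Merge-On-Read table instead of the default Copy-On-Write.
  def mergeOnRead(opts: Map[String, String]): Map[String, String] =
    opts + ("hoodie.datasource.write.table.type" -> "MERGE_ON_READ")

  // Register or update the table in the Hive metastore after each write.
  def withHiveSync(
      opts: Map[String, String],
      database: String,
      hiveTable: String,
      jdbcUrl: String): Map[String, String] =
    opts ++ Map(
      "hoodie.datasource.hive_sync.enable" -> "true",
      "hoodie.datasource.hive_sync.database" -> database,
      "hoodie.datasource.hive_sync.table" -> hiveTable,
      "hoodie.datasource.hive_sync.jdbcurl" -> jdbcUrl
    )
}
```

With Spark available, the result would be used as df.write.format("org.apache.hudi").options(opts).mode(SaveMode.Append).save("/tmp/hudi"), and the table read back with spark.read.format("org.apache.hudi").load(...) on the same base path.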

3. Spark SQL Integration

Start Spark‑SQL with the Hudi bundle jar and enable the Hudi session extension:

spark-sql --jars $PATH_TO_SPARK_BUNDLE_JAR \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

Set low parallelism for demo purposes and disable automatic metadata sync:

set hoodie.upsert.shuffle.parallelism = 1;
set hoodie.insert.shuffle.parallelism = 1;
set hoodie.delete.shuffle.parallelism = 1;
set hoodie.datasource.meta.sync.enable = false;

Create a Hudi table using Spark SQL DDL:

create table test_hudi_table (
  id int,
  name string,
  price double,
  ts long,
  dt string
) using hudi
partitioned by (dt)
options (
  primaryKey = 'id',
  type = 'mor'
)
location 'file:///tmp/test_hudi_table';

Perform DML operations:

Insert:

INSERT INTO test_hudi_table SELECT 1 AS id, 'hudi' AS name, 10 AS price, 1000 AS ts, '2021-05-05' AS dt;

Update:

UPDATE test_hudi_table SET price = 20.0 WHERE id = 1;

Delete:

DELETE FROM test_hudi_table WHERE id = 1;

Merge Into:

Insert, update, and delete can be combined in a single statement using standard Spark SQL MERGE syntax.
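The article does not spell out a MERGE statement, so here is one possible shape for the test_hudi_table defined earlier, kept in a Scala string so that it can be passed to spark.sql. The inline SELECT stands in for a real source table, and its values and the flag column are illustrative assumptions. Which clauses are accepted may depend on the Hudi and Spark versions in use.

```scala
// Sketch: a MERGE INTO statement covering insert, update, and delete.
// HudiMergeExample and the `flag` column are hypothetical.
object HudiMergeExample {
  val mergeSql: String =
    """MERGE INTO test_hudi_table AS t0
      |USING (
      |  SELECT 1 AS id, 'hudi' AS name, 30.0 AS price, 1001 AS ts,
      |         '2021-05-05' AS dt, 'update' AS flag
      |) AS s0
      |ON t0.id = s0.id
      |WHEN MATCHED AND s0.flag = 'delete' THEN DELETE
      |WHEN MATCHED THEN UPDATE SET
      |  id = s0.id, name = s0.name, price = s0.price, ts = s0.ts, dt = s0.dt
      |WHEN NOT MATCHED THEN INSERT (id, name, price, ts, dt)
      |  VALUES (s0.id, s0.name, s0.price, s0.ts, s0.dt)""".stripMargin

  // `execute` is whatever runs SQL in your environment, e.g. spark.sql(_).
  def run(execute: String => Unit): Unit = execute(mergeSql)
}
```

In a Spark application this would be invoked as HudiMergeExample.run(sql => spark.sql(sql)); in the spark-sql shell the string content can be entered directly, followed by a semicolon.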

Finally, drop the table and verify its removal:

DROP TABLE test_hudi_table;
SHOW TABLES;

4. Common Issues & Tips

For Merge‑On‑Read tables, set

option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)

instead of the generic HOODIE_TABLE_TYPE_PROP_NAME option.

Avoid adding spark‑hive dependencies that bring Hive 1.2.1 jars; Hudi requires Hive 2.x. Exclude them if necessary.
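One way to spot such jars before they cause trouble is to scan the runtime classpath for Hive artifacts at version 1.2.1. The following is a minimal sketch in plain Scala. The object name and the file-name pattern are assumptions based on the usual hive-<module>-<version>.jar naming, so adjust the pattern if your jars are named differently.

```scala
import java.io.File
import java.util.regex.Pattern

// Sketch: list classpath entries that look like Hive 1.2.1 jars.
// HiveJarCheck is a hypothetical helper, not part of Hudi or Spark.
object HiveJarCheck {
  // Matches file names such as hive-exec-1.2.1.spark2.jar.
  private val Hive1Jar = Pattern.compile("""hive-[a-z-]+-1\.2\.1.*\.jar""")

  def hive1Jars(classpath: String, separator: String = File.pathSeparator): Seq[String] =
    classpath
      .split(Pattern.quote(separator))
      .toSeq
      .filter(entry => Hive1Jar.matcher(new File(entry).getName).matches())
}
```

Running HiveJarCheck.hive1Jars(System.getProperty("java.class.path")) inside the application prints nothing suspicious when the returned sequence is empty; any entries it returns are candidates for exclusion in the build file.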

When syncing to Hive, include hive-site.xml in the classpath to prevent metastore errors.
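A quick way to confirm that the file is actually visible is to ask the class loader for it before starting the write. This is a small sketch using only the JDK; the object name HiveSiteCheck is hypothetical.

```scala
import java.net.URL

// Sketch: check whether hive-site.xml can be found on the classpath.
object HiveSiteCheck {
  def locate(resource: String = "hive-site.xml"): Option[URL] = {
    // Prefer the context class loader, fall back to this class's loader.
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)
    // getResource returns null when the resource is absent.
    Option(loader.getResource(resource))
  }
}
```

If HiveSiteCheck.locate() returns None, the Hive sync step will likely not pick up the intended metastore configuration; a Some(url) result shows which copy of the file is being used.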

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
