Incremental Query of Hudi Tables Using Hive, Spark SQL, and Flink SQL
This guide explains how to perform incremental queries on Hudi tables by configuring Hive synchronization, using Spark SQL both programmatically and via pure SQL, and leveraging Flink SQL in batch and streaming modes, with detailed parameter settings and code examples.
Hive Incremental Query
By enabling Hive synchronization during data writes, Hudi creates a Hive table that can query the underlying Hudi data. Two views are generated: hudi_tbl for read‑optimized (columnar) access using HoodieParquetInputFormat, and hudi_tbl_rt for real‑time access using HoodieParquetRealtimeInputFormat. The real‑time view ( _rt) appears only for MOR tables when skipROSuffix=true is set; otherwise the read‑optimized view is used.
Configuration
Add Hudi packages to the Hive SQL whitelist and set the incremental query parameters via SET statements.
set hoodie.hudi_tbl.consume.mode=INCREMENTAL;
set hoodie.hudi_tbl.consume.start.timestamp=20211015182330;
set hoodie.hudi_tbl.consume.max.commits=-1;
select * from hudi_tbl where `_hoodie_commit_time` > "20211015182330";Spark SQL Incremental Query
Two approaches are covered: a programmatic DataFrame‑plus‑SQL method and a pure‑SQL method using Call Procedures.
Programmatic (DF + SQL)
import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, INCR_PATH_GLOB, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
val tableName = "test_hudi_incremental"
spark.sql(s"""
|create table $tableName (
| id int,
| name string,
| price double,
| ts long,
| dt string
|) using hudi
|partitioned by (dt)
|options (
| primaryKey = 'id',
| preCombineField = 'ts',
| type = 'cow'
|)
""")
val incrementalDF = spark.read.format("hudi")
.option(QUERY_TYPE.key, QUERY_TYPE_INCREMENTAL_OPT_VAL)
.option(BEGIN_INSTANTTIME.key, "20221126170009762")
.option(END_INSTANTTIME.key, "20221126170023240")
.option(INCR_PATH_GLOB.key, "/dt=2022-11/*")
.load(basePath)
incrementalDF.createOrReplaceTempView("temp_$tableName")
spark.sql("select * from temp_$tableName").show()Pure SQL (Call Procedures)
call copy_to_temp_view(
table=>'test_hudi_incremental',
query_type=>'incremental',
view_name=>'temp_incremental',
begin_instance_time=>'20221130163703640',
end_instance_time=>'20221130163726780'
);
select _hoodie_commit_time, id, name, price, ts, dt from temp_incremental;When the start time changes, the temporary view must be dropped and recreated.
Flink SQL Incremental Query
Flink can query Hudi tables in both batch and streaming modes. Table creation uses the Hudi connector; incremental parameters are supplied via an options hint.
CREATE TABLE test_flink_incremental (
id int PRIMARY KEY NOT ENFORCED,
name VARCHAR(10),
price double,
ts bigint,
dt VARCHAR(10)
) PARTITIONED BY (dt) WITH (
'connector' = 'hudi',
'path' = 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_flink_incremental'
);Batch Incremental Query
select * from test_flink_incremental /*+ options('read.start-commit'='20221205152723') */;Specifying both read.start-commit and read.end-commit defines a closed interval [BEGIN, END]. Omitting read.end-commit reads up to the latest commit.
Streaming Incremental Query
select * from test_flink_incremental /*+ options(
'read.streaming.enabled'='true',
'read.streaming.check-interval'='4',
'read.start-commit'='20221205152712'
) */;When read.streaming.enabled is true, the query continuously consumes new commits; the start commit can be set to an early instant to read historic data.
Additional notes cover refreshing table metadata after parameter changes, using database‑qualified table names, and sinking streaming results to MySQL via a JDBC connector with primary‑key enforcement.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
