Big Data 20 min read

Incremental Query of Hudi Tables Using Hive, Spark SQL, and Flink SQL

This guide explains how to perform incremental queries on Hudi tables by configuring Hive synchronization, using Spark SQL both programmatically and via pure SQL, and leveraging Flink SQL in batch and streaming modes, with detailed parameter settings and code examples.

Big Data Technology & Architecture

Jul 17, 2023

Incremental Query of Hudi Tables Using Hive, Spark SQL, and Flink SQL

Hive Incremental Query

By enabling Hive synchronization during data writes, Hudi creates a Hive table that can query the underlying Hudi data. Two views are generated: hudi_tbl for read‑optimized (columnar) access using HoodieParquetInputFormat, and hudi_tbl_rt for real‑time access using HoodieParquetRealtimeInputFormat. The real‑time view ( _rt) appears only for MOR tables when skipROSuffix=true is set; otherwise the read‑optimized view is used.

Configuration

Add Hudi packages to the Hive SQL whitelist and set the incremental query parameters via SET statements.

set hoodie.hudi_tbl.consume.mode=INCREMENTAL;
set hoodie.hudi_tbl.consume.start.timestamp=20211015182330;
set hoodie.hudi_tbl.consume.max.commits=-1;
select * from hudi_tbl where `_hoodie_commit_time` > "20211015182330";

Spark SQL Incremental Query

Two approaches are covered: a programmatic DataFrame‑plus‑SQL method and a pure‑SQL method using Call Procedures.

Programmatic (DF + SQL)

import org.apache.hudi.DataSourceReadOptions.{BEGIN_INSTANTTIME, END_INSTANTTIME, INCR_PATH_GLOB, QUERY_TYPE, QUERY_TYPE_INCREMENTAL_OPT_VAL}
val tableName = "test_hudi_incremental"

spark.sql(s"""
  |create table $tableName (
  |  id int,
  |  name string,
  |  price double,
  |  ts long,
  |  dt string
  |) using hudi
  |partitioned by (dt)
  |options (
  |  primaryKey = 'id',
  |  preCombineField = 'ts',
  |  type = 'cow'
  |)
""")

val incrementalDF = spark.read.format("hudi")
  .option(QUERY_TYPE.key, QUERY_TYPE_INCREMENTAL_OPT_VAL)
  .option(BEGIN_INSTANTTIME.key, "20221126170009762")
  .option(END_INSTANTTIME.key, "20221126170023240")
  .option(INCR_PATH_GLOB.key, "/dt=2022-11/*")
  .load(basePath)

incrementalDF.createOrReplaceTempView("temp_$tableName")
spark.sql("select * from temp_$tableName").show()

Pure SQL (Call Procedures)

call copy_to_temp_view(
  table=>'test_hudi_incremental',
  query_type=>'incremental',
  view_name=>'temp_incremental',
  begin_instance_time=>'20221130163703640',
  end_instance_time=>'20221130163726780'
);

select _hoodie_commit_time, id, name, price, ts, dt from temp_incremental;

When the start time changes, the temporary view must be dropped and recreated.

Flink SQL Incremental Query

Flink can query Hudi tables in both batch and streaming modes. Table creation uses the Hudi connector; incremental parameters are supplied via an options hint.

CREATE TABLE test_flink_incremental (
  id int PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  price double,
  ts bigint,
  dt VARCHAR(10)
) PARTITIONED BY (dt) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://cluster1/warehouse/tablespace/managed/hive/hudi.db/test_flink_incremental'
);

Batch Incremental Query

select * from test_flink_incremental /*+ options('read.start-commit'='20221205152723') */;

Specifying both read.start-commit and read.end-commit defines a closed interval [BEGIN, END]. Omitting read.end-commit reads up to the latest commit.

Streaming Incremental Query

select * from test_flink_incremental /*+ options(
  'read.streaming.enabled'='true',
  'read.streaming.check-interval'='4',
  'read.start-commit'='20221205152712'
) */;

When read.streaming.enabled is true, the query continuously consumes new commits; the start commit can be set to an early instant to read historic data.

Additional notes cover refreshing table metadata after parameter changes, using database‑qualified table names, and sinking streaming results to MySQL via a JDBC connector with primary‑key enforcement.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive Spark SQL Hudi Flink SQL Incremental Query

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.