Migrating Hive SQL Jobs to Flink Using the SQL Gateway
This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.
Introduction
The author describes a landscape with tens of thousands of offline jobs that currently run on Hive on Tez, Spark, or Flink, noting that Hive SQL is easy to write but often inefficient. With Flink 1.16 a new SQL Gateway feature aligns Flink's dialect with Hive, promising a smoother migration path.
Hive on Flink Principles
Flink now supports running Hive tasks directly. It can read and write Hive Metastore tables, reuse Hive UDFs, and handle both streaming and batch workloads. The integration also brings automatic small‑file merging similar to Hudi's Merge On Read model.
SQL Parsing
Flink 1.16 uses a pluggable parser that converts HQL into a logical plan, then applies CBO/ROB optimizations to produce a physical plan and finally a Flink job graph, making most Hive queries runnable without modification.
Demo
Streaming Example
Creating a target table and a Kafka source table with Hive‑compatible DDL:
-- create target table
create table if not exists dwd_category_by_day(
i_category string,
cate_sales double,
cayehory_day_order_cnt bigint
) partitioned by (year bigint, day bigint)
TBLPROPERTIES('sink.partition-commit.policy.kind'='metastore,success-file');
-- set dialect to default (Hive)
set table.sql-dialect=default;
create table if not exists s_dwd_store_sales(
ss_item_sk bigint,
i_brand string,
i_class string,
i_category string,
ss_sales_price double,
d_date date,
d_timestamp as cast(d_date as timestamp(3)),
watermark for d_timestamp as d_timestamp
) with (
'connector'='kafka',
'topic'='dwd_store_sales',
'properties.bootstrap.servers'='192.168.88.101:9092,192.168.88.102:9092,192.168.88.103:9092',
'properties.group.id'='FFA',
'key.fields'='ss_item_sk',
'scan.startup_mode'='earlist-offset',
'key.format'='json'
);Running a tumbling‑window aggregation and a top‑3 query demonstrates that streaming SQL in Flink is concise and requires fewer aliases than Hive.
Batch Example
Using the same target table, a batch insert aggregates daily sales from an ORC‑backed Hive table:
insert overwrite dwd_category_by_day
select
i_category,
sum(ss_sales_price) as month_sales,
count(1) as order_cnt,
year(d_date) as `year`,
datediff(d_date, concat(year(d_date)-1,'-12-31'))
from tpcds_bin_orc_2.dwd_store_sales
group by year(d_date), datediff(d_date, concat(year(d_date)-1,'-12-31')), i_category;The batch job does not need watermarks or windows, making it even simpler.
SQL Gateway Deployment
The gateway can be started with ./bin/sql-gateway.sh start or in foreground mode. A typical yaml configuration defines server address, session timeouts, catalog definitions, and table sources/sinks. Example snippet:
server:
bind-address: 127.0.0.1
address: 127.0.0.1
port: 8083
jvm_args: "-Xmx2018m -Xms1024m"
session:
idle-timeout: 1d
max-count: 1000000
catalogs:
- name: catalog_1
type: hive
property-version: 1
hive-conf-dir: ...Supported endpoints include REST and HiveServer2, allowing Java/JDBC clients to connect via URLs such as jdbc:flink://localhost:8083?planner=blink. The article also shows a minimal Java program that creates a table, inserts data, and queries it using the Flink JDBC driver.
Conclusion
From Flink 1.16 onward, Hive‑on‑Flink is production‑ready, enabling SQL developers to migrate most Hive queries without changes and allowing platform engineers to replace custom gateways with the built‑in SQL Gateway, simplifying operations and maintenance.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
