Big Data 19 min read

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Big Data Technology & Architecture

Jan 3, 2023

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

Introduction

The author describes a landscape with tens of thousands of offline jobs that currently run on Hive on Tez, Spark, or Flink, noting that Hive SQL is easy to write but often inefficient. With Flink 1.16 a new SQL Gateway feature aligns Flink's dialect with Hive, promising a smoother migration path.

Hive on Flink Principles

Flink now supports running Hive tasks directly. It can read and write Hive Metastore tables, reuse Hive UDFs, and handle both streaming and batch workloads. The integration also brings automatic small‑file merging similar to Hudi's Merge On Read model.

SQL Parsing

Flink 1.16 uses a pluggable parser that converts HQL into a logical plan, then applies CBO/ROB optimizations to produce a physical plan and finally a Flink job graph, making most Hive queries runnable without modification.

Demo

Streaming Example

Creating a target table and a Kafka source table with Hive‑compatible DDL:

-- create target table
create table if not exists dwd_category_by_day(
  i_category string,
  cate_sales double,
  cayehory_day_order_cnt bigint
) partitioned by (year bigint, day bigint)
TBLPROPERTIES('sink.partition-commit.policy.kind'='metastore,success-file');

-- set dialect to default (Hive)
set table.sql-dialect=default;

create table if not exists s_dwd_store_sales(
  ss_item_sk bigint,
  i_brand string,
  i_class string,
  i_category string,
  ss_sales_price double,
  d_date date,
  d_timestamp as cast(d_date as timestamp(3)),
  watermark for d_timestamp as d_timestamp
) with (
  'connector'='kafka',
  'topic'='dwd_store_sales',
  'properties.bootstrap.servers'='192.168.88.101:9092,192.168.88.102:9092,192.168.88.103:9092',
  'properties.group.id'='FFA',
  'key.fields'='ss_item_sk',
  'scan.startup_mode'='earlist-offset',
  'key.format'='json'
);

Running a tumbling‑window aggregation and a top‑3 query demonstrates that streaming SQL in Flink is concise and requires fewer aliases than Hive.

Batch Example

Using the same target table, a batch insert aggregates daily sales from an ORC‑backed Hive table:

insert overwrite dwd_category_by_day
select
  i_category,
  sum(ss_sales_price) as month_sales,
  count(1) as order_cnt,
  year(d_date) as `year`,
  datediff(d_date, concat(year(d_date)-1,'-12-31'))
from tpcds_bin_orc_2.dwd_store_sales
group by year(d_date), datediff(d_date, concat(year(d_date)-1,'-12-31')), i_category;

The batch job does not need watermarks or windows, making it even simpler.

SQL Gateway Deployment

The gateway can be started with ./bin/sql-gateway.sh start or in foreground mode. A typical yaml configuration defines server address, session timeouts, catalog definitions, and table sources/sinks. Example snippet:

server:
  bind-address: 127.0.0.1
  address: 127.0.0.1
  port: 8083
  jvm_args: "-Xmx2018m -Xms1024m"

session:
  idle-timeout: 1d
  max-count: 1000000

catalogs:
  - name: catalog_1
    type: hive
    property-version: 1
    hive-conf-dir: ...

Supported endpoints include REST and HiveServer2, allowing Java/JDBC clients to connect via URLs such as jdbc:flink://localhost:8083?planner=blink. The article also shows a minimal Java program that creates a table, inserts data, and queries it using the Flink JDBC driver.

Conclusion

From Flink 1.16 onward, Hive‑on‑Flink is production‑ready, enabling SQL developers to migrate most Hive queries without changes and allowing platform engineers to replace custom gateways with the built‑in SQL Gateway, simplifying operations and maintenance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink Batch Processing Streaming Hive SQL Gateway

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.