Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Big Data Technology & Architecture

Dec 15, 2022

This article is compiled from Apache Flink PMC & Committer Wu Chong’s presentation at the Apache Flink Meetup on September 24, outlining the motivation, challenges, practical steps, demonstration, and future roadmap for migrating Hive SQL workloads to Flink SQL.

Motivation: Flink is the de‑facto standard for stream computing, and extending support for Hive SQL helps attract Hive offline warehouse users, lowers the entry barrier for batch workloads, integrates the rich Hive ecosystem, and enables a unified stream‑batch processing model.

Challenges: The migration faces three major hurdles: (1) Compatibility: Hive offline jobs and Hive tools must work on Flink without modification; (2) Stability: batch jobs need production‑grade reliability, addressed by features such as FLIP‑168 (speculative execution for batch jobs) and Adaptive Hash Join; (3) Performance: execution speed is improved through Dynamic Partition Pruning, metadata acceleration, and other optimizations.

Practice: Flink reuses its core SQL processing pipeline for Hive compatibility: a pluggable parser converts Hive SQL to Flink RelNode, which then follows the usual logical‑plan, physical‑plan, and JobGraph generation. Flink 1.16 raised Hive compatibility from 85 % to 94.1 % (over 12,000 qtest cases) and added HiveServer2 protocol support via the SQLGateway component.
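The shape of that pipeline can be sketched in a few lines of Python. This is a toy model, not Flink code: `RelNode`, `PARSERS`, `SHARED_STAGES`, and `compile_query` are hypothetical names chosen for illustration, and the stage labels are placeholders. It only shows the design point described above: the dialect selects the parser, while every later stage is shared.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RelNode:
    """Hypothetical stand-in for the relational tree that both parsers emit."""
    source_dialect: str
    sql: str


def parse_default(sql: str) -> RelNode:
    # Placeholder for the default-dialect parser.
    return RelNode("default", sql.strip())


def parse_hive(sql: str) -> RelNode:
    # Placeholder for the pluggable Hive-dialect parser.
    return RelNode("hive", sql.strip())


# Registry of pluggable parsers, keyed by dialect name (illustrative only).
PARSERS: Dict[str, Callable[[str], RelNode]] = {
    "default": parse_default,
    "hive": parse_hive,
}

# Stages every query passes through once it has become a RelNode.
SHARED_STAGES = ["logical-plan", "physical-plan", "job-graph"]


def compile_query(sql: str, dialect: str = "default") -> List[str]:
    """Return the ordered list of stages a query passes through."""
    if dialect not in PARSERS:
        raise ValueError(f"unknown dialect: {dialect}")
    rel = PARSERS[dialect](sql)
    return [f"parse:{rel.source_dialect}"] + SHARED_STAGES
```

Calling `compile_query("SELECT 1", dialect="hive")` and `compile_query("SELECT 1")` yields lists that differ only in their first element, which is the sense in which the Hive parser is "pluggable".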

Demo: The demonstration shows how to configure a Hive interpreter in Zeppelin, create a wide table using Hive DDL, run the original Hive SQL, then switch the JDBC endpoint to Flink’s SQLGateway (port 20002) and re‑execute the same queries. The results are identical, but the queries now execute on Flink instead of Hive, illustrating that migration requires only minimal changes.
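The switch in the demo amounts to changing one connection setting. The sketch below builds the two JDBC URLs in Python to make that concrete. Port 20002 comes from the demo; the host `localhost`, the `default` database, port 10000 for the original HiveServer2, and the helper name `hive2_jdbc_url` are assumptions for illustration. Depending on how authentication is configured, a real URL may need additional parameters.

```python
def hive2_jdbc_url(host: str, port: int, database: str = "default") -> str:
    """Build a HiveServer2-protocol JDBC URL (hypothetical helper)."""
    return f"jdbc:hive2://{host}:{port}/{database}"


# Assumed original endpoint: HiveServer2 on its commonly used port.
HIVE_URL = hive2_jdbc_url("localhost", 10000)

# Endpoint used in the demo: Flink SQLGateway speaking the same protocol.
FLINK_URL = hive2_jdbc_url("localhost", 20002)
```

Because both URLs use the `jdbc:hive2://` scheme, the interpreter's driver and the SQL text can stay the same; in this sketch only the port differs.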

Future Plans: Flink will continue to strengthen batch processing (stability, performance, parity with leading batch engines), enhance data‑lake analytics (support for Iceberg, Hudi, Table Store, richer DML like UPDATE/DELETE/MERGE), and expand the batch ecosystem (Remote Shuffle Service, lineage management).

Q&A Highlights: Answers address memory management (spill‑to‑disk prevents OOM), UDF migration (supported directly), dual‑run migration strategy, task manager allocation across deployment modes, and shuffle storage options for Kubernetes deployments.
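For the dual‑run strategy, one plausible check is to run each job on both engines and compare the outputs before cutting over. The following is a minimal sketch, assuming both result sets have already been fetched as lists of row tuples; the function name `same_rows` and the sample rows are hypothetical. It compares the results as multisets, since row order is generally not guaranteed without an ORDER BY clause.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple

Row = Tuple[Hashable, ...]


def same_rows(hive_rows: Iterable[Row], flink_rows: Iterable[Row]) -> bool:
    """True if both runs produced the same rows with the same multiplicities."""
    return Counter(hive_rows) == Counter(flink_rows)


# Made-up sample data: same rows, returned in a different order.
hive_result = [("2022-09-24", 3), ("2022-09-25", 5)]
flink_result = [("2022-09-25", 5), ("2022-09-24", 3)]

results_match = same_rows(hive_result, flink_result)
```

Exact equality suits integer, string, and date columns; floating‑point aggregates may need a tolerance‑based comparison instead, because the two engines can sum values in different orders.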

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
