
Design and Integration of Flink Batch Processing with Hive: Architecture, Features, and Performance Evaluation

This article presents Flink's batch processing architecture and its integration with Hive through a unified Catalog API, details the enhancements introduced in Flink 1.10, outlines future work, and reports a performance test showing a roughly seven-fold speedup over Hive on MapReduce.

DataFunTalk

Flink's growing adoption in stream processing motivates extending its capabilities to batch workloads, aiming to lower development and maintenance costs while enriching the Flink ecosystem. Since SQL is the dominant interface for batch jobs, Flink focuses on enhancing FlinkSQL for batch processing and Hive integration.

Background: Support for batch processing is driven by the goals of reducing customers' development and maintenance costs and improving the completeness of the Flink ecosystem. FlinkSQL is being enhanced to address shortcomings such as the lack of full metadata management, limited DDL support, and cumbersome Hive integration.

Goals include defining a unified Catalog interface for easier external system integration, providing both in‑memory and persistent implementations, and enabling seamless read/write access to Hive data via a Data Connector.

New Catalog API (FLIP-30): Users submit SQL or Table API jobs, which create a TableEnvironment that loads a CatalogManager. Two implementations exist in Flink 1.9: GenericInMemoryCatalog (in-memory metadata scoped to the session) and HiveCatalog (which talks to the Hive Metastore through HiveShim to absorb version incompatibilities). This design allows multiple catalogs to coexist and supports cross-catalog queries.
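The multi-catalog resolution described above can be sketched in a few lines. This is a pure-Python toy, not Flink's actual API: class and method names (`InMemoryCatalog`, `CatalogManager.resolve`) are illustrative stand-ins for `GenericInMemoryCatalog` and the real CatalogManager.

```python
# Simplified sketch of multi-catalog metadata resolution, loosely modeled on
# Flink's CatalogManager (FLIP-30). All names here are illustrative.

class InMemoryCatalog:
    """Session-scoped metadata store, analogous to GenericInMemoryCatalog."""
    def __init__(self):
        self._tables = {}  # (database, table) -> schema

    def register_table(self, database, table, schema):
        self._tables[(database, table)] = schema

    def get_table(self, database, table):
        return self._tables.get((database, table))


class CatalogManager:
    """Holds multiple named catalogs and resolves catalog.database.table paths."""
    def __init__(self, default_catalog, default_database):
        self._catalogs = {}
        self.default_catalog = default_catalog
        self.default_database = default_database

    def register_catalog(self, name, catalog):
        self._catalogs[name] = catalog

    def resolve(self, path):
        # Accept "table", "db.table", or "catalog.db.table",
        # filling in defaults for the omitted parts.
        parts = path.split(".")
        if len(parts) == 1:
            catalog, db, table = self.default_catalog, self.default_database, parts[0]
        elif len(parts) == 2:
            catalog, (db, table) = self.default_catalog, parts
        else:
            catalog, db, table = parts
        return self._catalogs[catalog].get_table(db, table)


manager = CatalogManager("memory", "default")
mem = InMemoryCatalog()
mem.register_table("default", "orders", {"id": "BIGINT", "amount": "DECIMAL(10,2)"})
manager.register_catalog("memory", mem)

print(manager.resolve("orders"))                 # resolved via default catalog/database
print(manager.resolve("memory.default.orders"))  # fully qualified path
```

A HiveCatalog would plug into the same manager as a second entry, delegating `get_table` to the Metastore instead of an in-memory map, which is what makes cross-catalog queries possible.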

Reading and Writing Hive Data: The Data Connector uses HiveTableSource and HiveTableInputFormat for reads, and HiveTableSink and HiveTableOutputFormat for writes, reusing Hive's Input/Output Formats and SerDe to ensure compatibility.
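The read path's shape — a source enumerating splits and delegating record decoding to a pluggable SerDe-style codec — can be illustrated with a toy model. This is not Flink connector code; `ToyTableSource` and `CsvSerDe` are hypothetical names standing in for the real HiveTableSource/SerDe pairing.

```python
import csv
import io

# Toy model of the read path: a source enumerates "splits", and a pluggable
# deserializer (playing the role of Hive's SerDe) turns raw records into rows.
# Class names are illustrative, not Flink's actual connector classes.

class CsvSerDe:
    """Decodes one raw text record into a row, like a Hive SerDe would."""
    def deserialize(self, raw_line):
        return next(csv.reader(io.StringIO(raw_line)))


class ToyTableSource:
    def __init__(self, splits, serde):
        self._splits = splits   # each split is a list of raw records
        self._serde = serde

    def read(self):
        for split in self._splits:   # real engines process splits in parallel
            for raw in split:
                yield self._serde.deserialize(raw)


source = ToyTableSource([["1,alice", "2,bob"], ["3,carol"]], CsvSerDe())
print(list(source.read()))  # [['1', 'alice'], ['2', 'bob'], ['3', 'carol']]
```

Reusing Hive's own formats and SerDes on this boundary is what guarantees that data written by Hive remains byte-compatible when Flink reads it back, and vice versa.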

Flink 1.9.0 Status: FlinkSQL was experimental, lacking full data-type support (e.g., DECIMAL, CHAR), with incomplete partition handling and no INSERT OVERWRITE support.

Flink 1.10.0 New Features include full static and dynamic partition support, INSERT OVERWRITE at both table and partition level, expanded data-type coverage (including UNION), richer DDL (CREATE TABLE/DATABASE), access to roughly 200 Hive built-in functions, compatibility with Hive versions 1.0.0 through 3.1.1, and performance optimizations such as projection/predicate push-down and vectorized ORC reads.
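The distinction between static and dynamic partition writes, and what INSERT OVERWRITE changes, can be simulated without any Flink or Hive APIs. The sketch below is a pure-Python illustration with hypothetical names: a static partition write sends every row to the partition named in the statement, a dynamic write derives each row's partition from a column value, and overwrite replaces only the partitions it touches.

```python
# Pure-Python simulation of static vs. dynamic partition inserts and
# INSERT OVERWRITE semantics. No Flink/Hive APIs; names are illustrative.

def insert(table, rows, static_partition=None, partition_col=None, overwrite=False):
    """table maps partition-value -> list of rows."""
    buckets = {}
    for row in rows:
        if static_partition is not None:
            key = static_partition      # static: all rows go to the named partition
        else:
            key = row[partition_col]    # dynamic: partition value read from the row
        buckets.setdefault(key, []).append(row)
    for key, new_rows in buckets.items():
        if overwrite:
            table[key] = list(new_rows)  # INSERT OVERWRITE replaces the partition
        else:
            table.setdefault(key, []).extend(new_rows)
    return table


t = {}
# Dynamic partitioning: rows fan out to partitions by their "dt" value.
insert(t, [{"dt": "2020-01-01", "v": 1}, {"dt": "2020-01-02", "v": 2}],
       partition_col="dt")
# Static partition + overwrite: replaces 2020-01-01, leaves 2020-01-02 intact.
insert(t, [{"dt": "2020-01-01", "v": 9}],
       static_partition="2020-01-01", overwrite=True)
print(t)
```

Note that overwrite only replaces partitions that receive new rows; untouched partitions keep their existing data, which matches the partition-level (as opposed to table-level) overwrite mode.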

Module Interface: Introduced to load external functions. Users configure Modules via the Table API or YAML. Two implementations exist: CoreModule (Flink native functions) and HiveModule (Hive functions). Loading order determines function resolution when names clash.
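The name-clash rule — first module in load order wins — can be sketched as follows. The `CoreModule`/`HiveModule` names mirror the article, but the code itself is an illustrative simulation, not Flink's Module API.

```python
# Sketch of the Module mechanism's resolution rule: modules are consulted in
# load order, and the first module providing a function with the requested
# name wins. Illustrative only; not Flink's actual Module interface.

class Module:
    def __init__(self, name, functions):
        self.name = name
        self._functions = functions  # function name -> callable

    def get_function(self, fn_name):
        return self._functions.get(fn_name)


def resolve_function(modules, fn_name):
    """Return (module name, function) for the first match in load order."""
    for module in modules:
        fn = module.get_function(fn_name)
        if fn is not None:
            return module.name, fn
    raise KeyError(f"no function named {fn_name!r}")


core = Module("CoreModule", {"upper": str.upper})
hive = Module("HiveModule", {"upper": lambda s: s.upper(),
                             "date_add": lambda d, n: f"{d}+{n}d"})

# CoreModule loaded first: its 'upper' shadows HiveModule's.
print(resolve_function([core, hive], "upper")[0])     # CoreModule
# 'date_add' exists only in HiveModule, so resolution falls through to it.
print(resolve_function([core, hive], "date_add")[0])  # HiveModule
```

Reversing the load order to `[hive, core]` would make Hive's `upper` shadow Flink's, which is exactly why the article stresses that loading order matters when names clash.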

Future Work includes view support, enhanced SQL CLI usability (paging, scrolling, non-interactive mode), fuller Hive DDL support (e.g., CREATE TABLE AS), seamless Hive-to-Flink migration, a remote CLI mode akin to HiveServer2, and streaming writes to Hive.

Performance Test: Conducted on a 21-node cluster (1 master, 20 slaves, each with 32 cores, 64 threads, 256 GB RAM, 12 × HDD). Using Hortonworks hive-testbench, a 10 TB TPC-DS dataset was generated and processed by both FlinkSQL (master branch) and Hive 3.1.1 on MapReduce. Results show FlinkSQL achieving approximately 7× speedup, attributed to scheduling and execution-plan optimizations.

Readers are invited to join the DataFunTalk big‑data community for further discussion.

Tags: Big Data, FlinkSQL, Batch Processing, Performance Testing, Hive, Catalog API
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
