Flink 1.11 Integration with Hive: New Features and Real‑time Data Warehouse
The article explains how Flink 1.11 deepens its integration with Hive, covering background, new connector features, simplified dependency management, enhanced Hive dialect, streaming writes and reads, temporal table joins, and how these capabilities enable a unified batch‑streaming data warehouse.
This talk, presented by Alibaba technical expert Li Rui, introduces the evolution of Flink‑Hive integration from the early experimental support in Flink 1.9 to the production‑ready features in Flink 1.11.
Background: Flink aims to apply its streaming engine to batch workloads as well, with SQL as the primary interface. Because Hive is the de facto SQL engine in the Hadoop ecosystem, integrating with Hive is essential.
Flink 1.10 Production‑Ready Hive Support: The integration provides three core capabilities: access to the Hive Metastore, reading and writing Hive tables, and production‑ready stability. A new Catalog API abstracts external metadata sources and has two implementations, GenericInMemoryCatalog and HiveCatalog. HiveCatalog persists metadata in the Hive Metastore and relies on HiveShim to handle differences between Hive versions.
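As a rough illustration of how the catalog abstraction appears from SQL, the statements below browse Hive metadata through a HiveCatalog. This is only a sketch: the catalog name `myhive` and the table `orders` are hypothetical, and the catalog is assumed to be registered already (for example in the SQL Client configuration).

```sql
-- Assumption: a HiveCatalog has already been registered under the
-- hypothetical name `myhive`; `orders` is a hypothetical Hive table.
SHOW CATALOGS;

-- Make the Hive-backed catalog the current one, so that unqualified
-- table names resolve against the Hive Metastore.
USE CATALOG myhive;

SHOW DATABASES;
SHOW TABLES;

-- Inspect the schema that Flink reads from the metastore.
DESCRIBE orders;
```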
Flink 1.11 New Features:
Simplified dependency management with pre‑built Hive connector bundles for different Hive versions.
Enhanced Hive dialect: parameterized dialect selection (default vs. hive), full Hive DDL/DML compatibility, and dynamic session‑level switching.
Streaming writes to Hive using the SQL‑based StreamingFileSink, supporting partitioned tables, exactly‑once semantics, and configurable commit delay, trigger, and policy.
Streaming reads from Hive via continuous file monitoring (both non‑partitioned and partitioned tables), with configurable consumption order, start offset, and monitoring interval.
Temporal table joins: Hive tables can be used as temporal lookup tables for stream‑side joins, with in‑memory caching and optional cache expiration.
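The dialect switch, streaming write, and streaming read listed above might look roughly like the following in the SQL Client. This is a sketch under stated assumptions, not the article's own example: the table names `hive_orders` and `orders_src` are hypothetical, the current catalog is assumed to be a HiveCatalog, and checkpointing is assumed to be enabled, since the streaming sink commits files on checkpoints.

```sql
-- Switch to the Hive dialect to create the sink table with Hive DDL.
SET table.sql-dialect=hive;

CREATE TABLE hive_orders (
  user_id STRING,
  order_amount DOUBLE
) PARTITIONED BY (dt STRING, hr STRING) STORED AS parquet TBLPROPERTIES (
  -- How to derive a partition's timestamp from its dt/hr values.
  'partition.time-extractor.timestamp-pattern' = '$dt $hr:00:00',
  -- Commit a partition once the watermark passes partition time + delay.
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  -- On commit: add the partition to the metastore and write a _SUCCESS file.
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);

-- Switch back to the default dialect for Flink-specific DDL.
SET table.sql-dialect=default;

-- Hypothetical source; datagen stands in for a real stream such as Kafka.
CREATE TABLE orders_src (
  user_id STRING,
  order_amount DOUBLE,
  log_ts AS LOCALTIMESTAMP,
  WATERMARK FOR log_ts AS log_ts
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5'
);

-- Streaming write: partition values are derived from the event timestamp.
INSERT INTO hive_orders
SELECT user_id, order_amount,
       DATE_FORMAT(log_ts, 'yyyy-MM-dd'), DATE_FORMAT(log_ts, 'HH')
FROM orders_src;

-- Streaming read: table hints require dynamic table options to be enabled.
SET table.dynamic-table-options.enabled=true;

SELECT * FROM hive_orders
/*+ OPTIONS('streaming-source.enable' = 'true',
            'streaming-source.monitor-interval' = '1 min',
            'streaming-source.consume-order' = 'partition-time',
            'streaming-source.consume-start-offset' = '2020-05-20') */;
```

With the `partition-time` trigger, a partition becomes visible to downstream readers only after the watermark has passed its end time plus the configured delay, which trades some latency for complete partitions.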
The article also shows practical examples (SQL snippets and configuration images) for enabling the Hive dialect, defining streaming sinks, and performing temporal table joins.
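For the temporal table join, a minimal sketch could look like the following. The tables `dim_rates` and `orders` and their columns are hypothetical, the current catalog is assumed to be a HiveCatalog, and the `datagen` source merely stands in for a real stream (its random currency values will rarely match the dimension table).

```sql
-- Create the Hive dimension table using the Hive dialect.
SET table.sql-dialect=hive;

CREATE TABLE dim_rates (
  currency STRING,
  conversion_rate DOUBLE
) STORED AS orc TBLPROPERTIES (
  -- The Hive table is cached in memory; reload it after this TTL expires.
  'lookup.join.cache.ttl' = '12 h'
);

SET table.sql-dialect=default;

-- Hypothetical stream side; the join needs a processing-time attribute.
CREATE TABLE orders (
  order_id BIGINT,
  currency STRING,
  amount DOUBLE,
  proctime AS PROCTIME()
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5'
);

-- Each order is enriched with the rate cached at its processing time.
SELECT o.order_id, o.currency, o.amount * r.conversion_rate AS converted_amount
FROM orders AS o
JOIN dim_rates FOR SYSTEM_TIME AS OF o.proctime AS r
  ON o.currency = r.currency;
```

Because the whole Hive table is held in memory on each join task, this pattern suits small, slowly changing dimension tables; a shorter TTL picks up changes sooner at the cost of more frequent reloads.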
Overall, Flink 1.11 transforms Hive warehouses into real‑time data platforms, allowing low‑latency ETL, immediate analytical queries, and seamless batch‑stream convergence.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
