Building a Real-Time Data Warehouse with Flink: Hive Integration, Upsert‑Kafka, and CDC Connectors

This tutorial explains how to use Apache Flink 1.12 to construct a unified streaming‑batch data warehouse by integrating Hive via HiveCatalog and HiveDialect, performing read/write operations, configuring upsert‑Kafka sinks, and leveraging Flink CDC connectors for change data capture from MySQL and other sources.

DataFunSummit

Dec 4, 2021

Flink 1.12 provides built‑in support for Hive integration, allowing users to persist metadata in Hive Metastore via HiveCatalog, read and write Hive tables in both batch and streaming modes, and switch between default and Hive SQL dialects for DDL/DML.
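As a minimal sketch of that setup, the statements below register a HiveCatalog from the SQL client and switch dialects around a Hive-style DDL. The catalog name `myhive`, the path `/opt/hive-conf`, and the table `hive_orders` are hypothetical placeholders, not values from the article. The unquoted `SET` form is the Flink 1.12 SQL client syntax; later releases also accept a quoted key and value.

```sql
-- Register a catalog backed by the Hive Metastore.
-- 'hive-conf-dir' must point at a directory containing hive-site.xml
-- (the path here is an assumed example).
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'default-database' = 'default',
  'hive-conf-dir' = '/opt/hive-conf'
);

USE CATALOG myhive;

-- Switch to the Hive dialect to write Hive-compatible DDL.
SET table.sql-dialect=hive;

CREATE TABLE hive_orders (
  order_id BIGINT,
  amount   DOUBLE
) PARTITIONED BY (dt STRING) STORED AS PARQUET
TBLPROPERTIES (
  -- Commit a partition to the metastore once it has been written.
  'sink.partition-commit.trigger' = 'process-time',
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);

-- Switch back to the default dialect for regular Flink SQL.
SET table.sql-dialect=default;
```

The Hive dialect only works while a HiveCatalog is the current catalog, which is why `USE CATALOG` comes before the dialect switch.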

The article details steps to add required Hive and Hadoop dependencies, configure sql-client-defaults.yaml, create Hive‑compatible and generic tables, and perform temporal joins with the latest Hive partitions using streaming source options.
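The temporal join against the latest Hive partition could look roughly like the following sketch. It assumes a HiveCatalog is already the current catalog; the table names, columns, topic, and broker address are illustrative assumptions. The `streaming-source.*` properties tell Flink to reload the newest partition periodically so the dimension data stays fresh.

```sql
-- Hive dimension table; Flink reloads its latest partition every 12 hours.
SET table.sql-dialect=hive;
CREATE TABLE dim_product (
  product_id   STRING,
  product_name STRING
) PARTITIONED BY (dt STRING) STORED AS PARQUET
TBLPROPERTIES (
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '12 h',
  'streaming-source.partition-order' = 'partition-name'
);

-- Kafka fact stream, stored in the catalog as a generic (non-Hive) table.
SET table.sql-dialect=default;
CREATE TABLE orders_stream (
  order_id   STRING,
  product_id STRING,
  amount     DOUBLE,
  proctime AS PROCTIME()  -- processing-time attribute used by the join
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',                                  -- assumed topic
  'properties.bootstrap.servers' = 'localhost:9092',   -- assumed broker
  'properties.group.id' = 'orders-consumer',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Each order is enriched with the dimension row valid at processing time.
SELECT o.order_id, o.amount, d.product_name
FROM orders_stream AS o
JOIN dim_product FOR SYSTEM_TIME AS OF o.proctime AS d
  ON o.product_id = d.product_id;
```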

It also introduces the upsert‑Kafka connector, describing its requirement for primary keys, key/value serialization formats, and configuration parameters such as value.fields-include and key.fields-prefix, with example DDL and insert statements.
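A hedged illustration of such a sink follows. The `pageviews` source uses the built-in `datagen` connector purely so the example is self-contained; the table names, topic, and broker address are assumptions rather than values from the article.

```sql
-- Demo source generating random rows (stand-in for a real pageview stream).
CREATE TABLE pageviews (
  user_id     BIGINT,
  user_region STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5',
  'fields.user_id.min' = '1',
  'fields.user_id.max' = '100',
  'fields.user_region.length' = '1'
);

-- Upsert sink: the primary key becomes the Kafka message key.
CREATE TABLE pageviews_per_region (
  user_region STRING,
  pv BIGINT,
  uv BIGINT,
  PRIMARY KEY (user_region) NOT ENFORCED
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'pageviews_per_region',                    -- assumed topic
  'properties.bootstrap.servers' = 'localhost:9092',   -- assumed broker
  'key.format' = 'json',
  'value.format' = 'json',
  -- Keep key columns out of the message value (the default is 'ALL').
  'value.fields-include' = 'EXCEPT_KEY'
);

-- Each updated aggregate is written as an upsert for its key.
INSERT INTO pageviews_per_region
SELECT user_region, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv
FROM pageviews
GROUP BY user_region;
```

When `key.fields-prefix` is set to disambiguate key and value column names, `value.fields-include` has to be `EXCEPT_KEY`.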

Furthermore, the guide covers Flink CDC connectors, including MySQL‑CDC and Canal‑JSON, showing how to create CDC source tables, capture change events, and write aggregated results to Kafka using the changelog-json format.
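The CDC pipeline described here might be sketched as below, assuming the `mysql-cdc` connector and the `changelog-json` format jars from the Flink CDC connectors project are on the classpath. The host, credentials, database, table, and topic names are placeholders.

```sql
-- CDC source: reads a snapshot of the MySQL table, then follows its binlog.
CREATE TABLE orders (
  order_id     INT,
  order_time   TIMESTAMP(3),
  price        DECIMAL(10, 5),
  order_status BOOLEAN,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'localhost',      -- assumed connection details
  'port' = '3306',
  'username' = 'flinkuser',
  'password' = 'flinkpw',
  'database-name' = 'mydb',
  'table-name' = 'orders'
);

-- Kafka sink that preserves insert/update/delete semantics as changelog JSON.
CREATE TABLE kafka_gmv (
  day_str STRING,
  gmv     DECIMAL(38, 5)         -- matches the result type of SUM over DECIMAL(10, 5)
) WITH (
  'connector' = 'kafka',
  'topic' = 'kafka_gmv',
  'scan.startup.mode' = 'earliest-offset',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'changelog-json'
);

-- Daily GMV, updated continuously as orders change in MySQL.
INSERT INTO kafka_gmv
SELECT DATE_FORMAT(order_time, 'yyyy-MM-dd') AS day_str, SUM(price) AS gmv
FROM orders
WHERE order_status = true
GROUP BY DATE_FORMAT(order_time, 'yyyy-MM-dd');
```

Because the aggregation emits updates rather than appends, a plain `json` format would be rejected by the Kafka sink; a changelog-capable format or the upsert-kafka connector is needed for this kind of result.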

Throughout, practical SQL examples, table properties, and execution hints are provided to help readers build a real‑time data warehouse that combines batch and streaming processing.




