Lakehouse Architecture Practice with Flink and Iceberg: Real‑time Data Ingestion and Management
This article details a lakehouse architecture built on Flink and Iceberg that addresses Hive‑based warehouse limitations by enabling ACID transactions, incremental snapshots, stream‑batch unification, CDC support, and various operational optimizations, ultimately achieving near real‑time data ingestion and analytics.
The presentation by Di Xingxing at the Shanghai meetup on April 17 introduced a lakehouse architecture that combines Flink and Iceberg to overcome the shortcomings of a traditional Hive‑based data warehouse, such as lack of ACID support, poor timeliness, and limited schema evolution.
Iceberg’s key features—full ACID semantics, incremental snapshot mechanism, open table format, and native stream‑batch interfaces—provide the foundation for a unified storage layer.
Implementation highlights include an append‑only log ingestion pipeline (Kafka → Flink → Iceberg on HDFS), Flink SQL integration with an Iceberg catalog, and the introduction of a table‑level proxy‑user property ('iceberg.user.proxy' = 'targetUser') to align with budgeting and permission systems.
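The pipeline above can be sketched in Flink SQL. The catalog registration and the streaming INSERT use the standard Flink‑Iceberg connector syntax; the 'iceberg.user.proxy' property is the custom table‑level addition described in the talk (not part of upstream Iceberg), and all host names, database names, and columns here are illustrative assumptions:

```sql
-- Register an Iceberg catalog backed by the Hive Metastore
-- (hostnames and paths are placeholders)
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 'hdfs://namenode:8020/warehouse'
);

-- Append-only log table; 'iceberg.user.proxy' is the talk's custom
-- proxy-user property for budgeting and permission alignment
CREATE TABLE iceberg_catalog.logs_db.app_logs (
  log_time TIMESTAMP(3),
  app_id   STRING,
  payload  STRING
) WITH (
  'iceberg.user.proxy' = 'targetUser'
);

-- Continuous ingestion from a previously defined Kafka source table
INSERT INTO iceberg_catalog.logs_db.app_logs
SELECT log_time, app_id, payload FROM kafka_app_logs;
```

With this in place, the same Iceberg table can be read in batch mode by other engines, which is the stream‑batch unification the architecture is after.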
The CDC workflow leverages the AutoDTS platform to capture source database changes into Kafka, then uses a customized Flink‑Iceberg sink (a modified AppendStreamTableSink) that supports upserts, primary keys, and the Iceberg V2 format ('iceberg.format.version' = '2').
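A CDC sink table for this workflow might look like the following sketch. The 'iceberg.format.version' property name is the one quoted in the talk (upstream Iceberg's Flink connector spells the equivalent option 'format-version'); the table, key, and columns are illustrative assumptions:

```sql
-- Upsert table for CDC records; the primary key lets the customized
-- sink turn change events into row-level updates and deletes
CREATE TABLE iceberg_catalog.ods_db.orders (
  order_id BIGINT,
  status   STRING,
  updated  TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  -- V2 format enables row-level deletes, required for upserts
  'iceberg.format.version' = '2'
);

-- Changelog stream from AutoDTS (via Kafka) into the Iceberg table
INSERT INTO iceberg_catalog.ods_db.orders
SELECT order_id, status, updated FROM autodts_orders_changelog;
```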
Operational optimizations cover reducing empty commits via the flink.max-continuous-empty-commits property, recording watermarks in table properties, fast table deletion through an extended FileIO, a copy‑on‑write sink implementation, bucket configuration (partition.bucket.source = 'id', partition.bucket.num = '10'), and regular small‑file merging plus orphan‑file cleanup.
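Several of these knobs are plain table properties and can be applied after table creation. In this sketch, flink.max-continuous-empty-commits is a real Iceberg Flink sink option, while the partition.bucket.* properties are the talk's own bucket‑configuration extension; the table name and values are assumptions:

```sql
-- Tune commit and bucketing behavior on an existing table
ALTER TABLE iceberg_catalog.logs_db.app_logs SET (
  -- Skip snapshot commits when checkpoints carry no new data,
  -- committing at most once per N empty checkpoints
  'flink.max-continuous-empty-commits' = '10',
  -- Custom extension from the talk: bucket rows by 'id' into 10 buckets
  'partition.bucket.source' = 'id',
  'partition.bucket.num' = '10'
);
```

Small‑file merging and orphan‑file cleanup, by contrast, run as separate scheduled maintenance jobs rather than table properties.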
Resulting benefits include ingestion latency dropping from over two hours to under ten minutes, real‑time analytical capabilities, accelerated feature‑engineering pipelines, and near real‑time CDC data availability, all supporting a unified stream‑batch data warehouse.
Future plans focus on tracking newer Iceberg releases, building a near real‑time warehouse, deepening stream‑batch integration, and enabling multi‑dimensional analysis with Presto and Spark.