Lakehouse Architecture Practice with Flink and Iceberg: Real‑time Data Ingestion and Management
This article details a lakehouse architecture built on Flink and Iceberg that addresses Hive‑based warehouse limitations by enabling ACID transactions, incremental snapshots, stream‑batch unification, CDC support, and various operational optimizations, ultimately achieving near real‑time data ingestion and analytics.
The presentation by Di Xingxing at the Shanghai meetup on April 17 introduced a lakehouse architecture that combines Flink and Iceberg to overcome the shortcomings of a traditional Hive‑based data warehouse, such as lack of ACID support, poor timeliness, and limited schema evolution.
Iceberg’s key features—full ACID semantics, incremental snapshot mechanism, open table format, and native stream‑batch interfaces—provide the foundation for a unified storage layer.
Implementation highlights include an append‑only log ingestion pipeline (Kafka → Flink → Iceberg on HDFS), Flink SQL integration with an Iceberg catalog, and the introduction of a table‑level proxy‑user property ('iceberg.user.proxy' = 'targetUser') to align with budgeting and permission systems.
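The pipeline above can be sketched in Flink SQL. The catalog registration and the streaming INSERT use the standard Flink‑Iceberg connector syntax; the 'iceberg.user.proxy' property is the custom table‑level addition described in the talk (not part of upstream Iceberg), and all host names, database names, and columns here are illustrative assumptions:

```sql
-- Register an Iceberg catalog backed by the Hive Metastore
-- (hostnames and paths are placeholders)
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 'hdfs://namenode:8020/warehouse'
);

-- Append-only log table; 'iceberg.user.proxy' is the talk's custom
-- proxy-user property for budgeting and permission alignment
CREATE TABLE iceberg_catalog.logs_db.app_logs (
  log_time TIMESTAMP(3),
  app_id   STRING,
  payload  STRING
) WITH (
  'iceberg.user.proxy' = 'targetUser'
);

-- Continuous ingestion from a previously defined Kafka source table
INSERT INTO iceberg_catalog.logs_db.app_logs
SELECT log_time, app_id, payload FROM kafka_app_logs;
```

With this in place, the same Iceberg table can be read in batch mode by other engines, which is the stream‑batch unification the architecture is after.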
The CDC workflow leverages the AutoDTS platform to capture source database changes into Kafka, then uses a customized Flink‑Iceberg sink (a modified AppendStreamTableSink) that supports upserts, primary keys, and the Iceberg V2 format ('iceberg.format.version' = '2').
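A CDC sink table for this workflow might look like the following sketch. The 'iceberg.format.version' property name is the one quoted in the talk (upstream Iceberg's Flink connector spells the equivalent option 'format-version'); the table, key, and columns are illustrative assumptions:

```sql
-- Upsert table for CDC records; the primary key lets the customized
-- sink turn change events into row-level updates and deletes
CREATE TABLE iceberg_catalog.ods_db.orders (
  order_id BIGINT,
  status   STRING,
  updated  TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  -- V2 format enables row-level deletes, required for upserts
  'iceberg.format.version' = '2'
);

-- Changelog stream from AutoDTS (via Kafka) into the Iceberg table
INSERT INTO iceberg_catalog.ods_db.orders
SELECT order_id, status, updated FROM autodts_orders_changelog;
```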
Operational optimizations cover reducing empty commits via the flink.max-continuous-empty-commits property, recording watermarks in table properties, fast table deletion through an extended FileIO, a copy‑on‑write sink implementation, bucket configuration (partition.bucket.source = 'id', partition.bucket.num = '10'), and regular small‑file merging plus orphan‑file cleanup.
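Several of these knobs are plain table properties and can be applied after table creation. In this sketch, flink.max-continuous-empty-commits is a real Iceberg Flink sink option, while the partition.bucket.* properties are the talk's own bucket‑configuration extension; the table name and values are assumptions:

```sql
-- Tune commit and bucketing behavior on an existing table
ALTER TABLE iceberg_catalog.logs_db.app_logs SET (
  -- Skip snapshot commits when checkpoints carry no new data,
  -- committing at most once per N empty checkpoints
  'flink.max-continuous-empty-commits' = '10',
  -- Custom extension from the talk: bucket rows by 'id' into 10 buckets
  'partition.bucket.source' = 'id',
  'partition.bucket.num' = '10'
);
```

Small‑file merging and orphan‑file cleanup, by contrast, run as separate scheduled maintenance jobs rather than table properties.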
Resulting benefits include ingestion latency dropping from over two hours to under ten minutes, real‑time analytical capabilities, accelerated feature‑engineering pipelines, and near real‑time CDC data availability, all supporting a unified stream‑batch data warehouse.
Future plans focus on tracking newer Iceberg releases, building a near real‑time warehouse, deepening stream‑batch integration, and enabling multi‑dimensional analysis with Presto and Spark.