Building a Lakehouse Architecture with Apache Iceberg and Flink: Practices and Insights
This article explains how to construct a lake‑house architecture using Apache Iceberg, detailing the migration from Hive, Flink‑SQL integration, proxy user support, CDC handling, copy‑on‑write sinks, and the resulting benefits for near‑real‑time data visibility and unified batch‑stream processing.
Introduction
This article introduces a method to build a lake‑house architecture based on Apache Iceberg, aiming to achieve minute‑level data visibility and explore the benefits of multi‑dimensional analysis.
Background: Data Warehouse Upgrade
The original data warehouse was built entirely on Hive and suffered from three major pain points (not listed here). Moving to Iceberg addresses these issues.
Key Features of Iceberg
Iceberg provides four essential capabilities: ACID semantics, an incremental snapshot mechanism, an open table format, and interfaces for both streaming and batch access.
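To make the snapshot idea concrete, here is a minimal toy model in Python. It is not the Iceberg API: the `ToyTable` class and its method names are invented for illustration, and appending to a list stands in for the atomic metadata swap that a real commit performs. It only shows how one snapshot log can serve both a full batch scan and an incremental read.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    added_files: Tuple[str, ...]  # files added by this commit only


@dataclass
class ToyTable:
    """Toy snapshot log (hypothetical, not the Iceberg API)."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit_append(self, files) -> int:
        parent = self.snapshots[-1].snapshot_id if self.snapshots else None
        snap = Snapshot(len(self.snapshots) + 1, parent, tuple(files))
        # Readers see either the old list or the new one, never a partial commit.
        self.snapshots.append(snap)
        return snap.snapshot_id

    def files_at(self, snapshot_id: int) -> List[str]:
        """Batch read: every file visible as of the given snapshot."""
        return [f for s in self.snapshots
                if s.snapshot_id <= snapshot_id
                for f in s.added_files]

    def files_between(self, after_id: int, up_to_id: int) -> List[str]:
        """Incremental read: only files added in (after_id, up_to_id]."""
        return [f for s in self.snapshots
                if after_id < s.snapshot_id <= up_to_id
                for f in s.added_files]


table = ToyTable()
s1 = table.commit_append(["a.parquet"])
s2 = table.commit_append(["b.parquet", "c.parquet"])
print(table.files_at(s2))           # full view for a batch job
print(table.files_between(s1, s2))  # only the new files for a streaming job
```

A streaming consumer would remember the last snapshot it processed and ask only for what was added since, while a batch job reads the full view at a chosen snapshot.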
Lakehouse Architecture Practice
The lake‑house concept eliminates the distinction between lake and warehouse, allowing data to flow freely and integrate with diverse compute ecosystems.
1. Append Flow into the Lake
Log data (client, user, and server logs) is ingested into Kafka, written to Iceberg tables by Flink jobs, and stored on HDFS.
2. Flink SQL Integration
The Flink 1.11 + Iceberg 0.11 stack was used. The following enhancements were made:
Meta Server now supports Iceberg Catalog.
SQL SDK extended to handle Iceberg Catalog.
Additionally, the platform opened Iceberg table management, allowing users to create tables via SQL.
3. Proxy‑User Support for Ingestion
To align with budgeting and permission systems, proxy‑user functionality was added so that data can be written to Iceberg under a specified account.
Key steps include:
Add a table‑level configuration: 'iceberg.user.proxy' = 'targetUser'.
Enable superuser and team‑account authentication.
Use the proxy user when accessing HDFS.
Specify the proxy user when accessing the Hive Metastore (e.g., Spark’s org.apache.spark.deploy.security.HiveDelegationTokenProvider).
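The steps above can be read as a small decision: which account should the write run under? The sketch below is one possible interpretation, not the platform's implementation. Only the property name `iceberg.user.proxy` comes from the article; the function `resolve_write_user`, its parameters, and the rule that a superuser may impersonate only registered team accounts are assumptions made for illustration.

```python
from typing import Mapping, Set

PROXY_KEY = "iceberg.user.proxy"  # table-level property named in the article


def resolve_write_user(table_props: Mapping[str, str],
                       login_user: str,
                       superusers: Set[str],
                       team_accounts: Set[str]) -> str:
    """Return the account a job should write under (hypothetical rule set)."""
    target = table_props.get(PROXY_KEY)
    if target is None:
        # No proxy configured: write as the user who runs the job.
        return login_user
    if login_user not in superusers:
        # Assumption: only a configured superuser may impersonate others.
        raise PermissionError(f"{login_user} may not act as a proxy")
    if target not in team_accounts:
        # Assumption: the target must be a known team account.
        raise PermissionError(f"{target} is not a registered team account")
    return target


props = {PROXY_KEY: "targetUser"}
print(resolve_write_user(props, "flink", {"flink"}, {"targetUser"}))
```

The resolved account would then be the one used for HDFS and Hive Metastore access, so that files and metadata are owned by the team account rather than by the job's service user.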
4. Flink SQL Ingestion Example (DDL + DML)
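The original example is not reproduced here. As a rough illustration only, the Python sketch below assembles the kind of statements such a job might contain, using the general open-source Flink SQL syntax for a Kafka source and an Iceberg catalog. Every host, path, topic, catalog, database, table, and column name is hypothetical, the `logs` database is assumed to exist already, and the platform described in the article may expose a different syntax through its own Meta Server and SQL SDK.

```python
def with_clause(props):
    """Render a dict as the body of a Flink SQL WITH (...) clause."""
    return ",\n".join(f"  '{k}' = '{v}'" for k, v in props.items())


# All names and addresses below are hypothetical placeholders.
CREATE_CATALOG = "CREATE CATALOG iceberg_catalog WITH (\n" + with_clause({
    "type": "iceberg",
    "catalog-type": "hive",
    "uri": "thrift://metastore-host:9083",
    "warehouse": "hdfs://namenode:8020/warehouse/path",
}) + "\n)"

CREATE_SOURCE = (
    "CREATE TABLE kafka_client_log (\n"
    "  user_id BIGINT,\n  event_name STRING,\n  dt STRING\n"
    ") WITH (\n" + with_clause({
        "connector": "kafka",
        "topic": "client_log",
        "properties.bootstrap.servers": "kafka-host:9092",
        "properties.group.id": "iceberg-ingest",
        "scan.startup.mode": "latest-offset",
        "format": "json",
    }) + "\n)"
)

# DDL for the Iceberg sink table, partitioned by day.
CREATE_SINK = (
    "CREATE TABLE iceberg_catalog.logs.client_log (\n"
    "  user_id BIGINT,\n  event_name STRING,\n  dt STRING\n"
    ") PARTITIONED BY (dt)"
)

# DML: a continuous append from Kafka into the Iceberg table.
INSERT = (
    "INSERT INTO iceberg_catalog.logs.client_log\n"
    "SELECT user_id, event_name, dt FROM kafka_client_log"
)

STATEMENTS = [CREATE_CATALOG, CREATE_SOURCE, CREATE_SINK, INSERT]

if __name__ == "__main__":
    for statement in STATEMENTS:
        print(statement + ";\n")
```

Running the script prints the four statements in order; they could then be submitted through whichever SQL client or platform editor is in use.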
5. CDC Data Ingestion Flow
The AutoDTS platform captures business‑DB changes, streams them to Kafka, and distributes them to Iceberg via Flink.
6. CDC Integration with Flink SQL
Supporting CDC required two kinds of changes. First, the Iceberg sink was reworked, because an AppendStreamTableSink accepts only insert records and cannot handle CDC streams. Second, table management was extended as follows:
Primary key support (PR1978).
Set 'iceberg.format.version' = '2'.
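A short Python sketch can show why an append-only sink is not enough for CDC and why a primary key is needed. It uses the row-kind markers found in Flink changelogs (`+I`, `-U`, `+U`, `-D`); the functions `apply_changelog` and `apply_append_only` and the sample rows are invented for illustration and do not reflect how the Iceberg sink is implemented.

```python
from typing import Dict, Iterable, List, Tuple

# (op, row): "+I" insert, "-U" update-before, "+U" update-after, "-D" delete
Change = Tuple[str, dict]


def apply_changelog(changes: Iterable[Change], primary_key: str) -> Dict[object, dict]:
    """Materialize the current table state by upserting on the primary key."""
    table: Dict[object, dict] = {}
    for op, row in changes:
        key = row[primary_key]
        if op in ("+I", "+U"):
            table[key] = row      # insert or overwrite the row for this key
        elif op in ("-U", "-D"):
            table.pop(key, None)  # retract the old image or delete the row
        else:
            raise ValueError(f"unknown row kind: {op}")
    return table


def apply_append_only(changes: Iterable[Change]) -> List[dict]:
    """What an append-only sink would store: every record, retractions included."""
    return [row for _, row in changes]


changes = [
    ("+I", {"id": 1, "city": "Beijing"}),
    ("-U", {"id": 1, "city": "Beijing"}),
    ("+U", {"id": 1, "city": "Shanghai"}),
    ("+I", {"id": 2, "city": "Shenzhen"}),
    ("-D", {"id": 2, "city": "Shenzhen"}),
]
print(apply_changelog(changes, "id"))   # one live row: id 1 in Shanghai
print(len(apply_append_only(changes)))  # five stored rows, most of them stale
```

The upsert path needs to know which column identifies a row, which is what primary key support provides, and it needs a way to express deletes, which is what the version 2 table format adds.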
7. Copy‑on‑Write Sink
Because Merge‑on‑Read does not merge small files, a copy‑on‑write sink was implemented. It uses a parallel StreamWriter for writes and a single‑threaded FileCommitter for commits.
Buffer incoming records.
Check that the checkpoint succeeded before writing.
Group records by bucket and write them bucket by bucket.
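The three points above can be sketched as a toy in-memory sink. This is one reading of the design, not the actual implementation: the class `ToyCopyOnWriteSink`, the CRC32-based bucketing, and the use of a dict as a stand-in for one data file per bucket are all assumptions, and the parallel StreamWriter and single-threaded FileCommitter are collapsed into a single object for brevity.

```python
import zlib
from collections import defaultdict


class ToyCopyOnWriteSink:
    """Buffers changes and rewrites whole bucket 'files' on checkpoint success."""

    def __init__(self, num_buckets, primary_key):
        self.num_buckets = num_buckets
        self.primary_key = primary_key
        self.buffer = []        # changes received since the last flush
        self.bucket_files = {}  # bucket id -> {key: row}, one "file" per bucket

    def _bucket_of(self, key):
        # Stable hash so the same key always lands in the same bucket.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_buckets

    def write(self, op, row):
        self.buffer.append((op, row))  # buffering: nothing is written yet

    def on_checkpoint(self, succeeded):
        """Flush the buffer only after a successful checkpoint.

        Returns the number of bucket files that were rewritten.
        """
        if not succeeded:
            return 0  # keep the buffer and try again at the next checkpoint
        grouped = defaultdict(list)
        for op, row in self.buffer:
            grouped[self._bucket_of(row[self.primary_key])].append((op, row))
        for bucket, changes in grouped.items():
            # Copy-on-write: build a merged replacement, then swap it in.
            new_file = dict(self.bucket_files.get(bucket, {}))
            for op, row in changes:
                key = row[self.primary_key]
                if op in ("+I", "+U"):
                    new_file[key] = row
                else:  # "-U" or "-D"
                    new_file.pop(key, None)
            self.bucket_files[bucket] = new_file
        self.buffer.clear()
        return len(grouped)

    def scan(self):
        rows = [r for f in self.bucket_files.values() for r in f.values()]
        return sorted(rows, key=lambda r: r[self.primary_key])


sink = ToyCopyOnWriteSink(num_buckets=4, primary_key="id")
sink.write("+I", {"id": 1, "city": "Beijing"})
sink.write("+I", {"id": 2, "city": "Shenzhen"})
sink.on_checkpoint(succeeded=True)
sink.write("+U", {"id": 1, "city": "Shanghai"})
sink.write("-D", {"id": 2, "city": "Shenzhen"})
sink.on_checkpoint(succeeded=True)
print(sink.scan())
```

Because each flush replaces a bucket's file with a fully merged one, readers never have to apply deletes at query time and the number of files per bucket stays constant. The cost is write amplification: every touched bucket is rewritten in full.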
8‑10. Additional Practices
Other topics covered include small‑file merging, data cleanup, and integration with various compute engines.
Compute Engine – Flink
Flink is the core real‑time engine, tightly integrated with Iceberg for near‑real‑time ingestion and analysis.
Compute Engine – Hive
Hive provides batch SQL support for Iceberg, offering snapshot queries, small‑file compaction, and offline writes (INSERT, INSERT OVERWRITE, MERGE).
Compute Engine – Trino/Presto
AutoBI integrates Presto/Trino to query Iceberg tables directly, with ongoing work on metadata caching.
Pitfalls
The article lists encountered challenges (details omitted).
Benefits and Summary
Lakehouse integration yields unified storage, consistent data formats, and a unified compute layer, while batch‑stream convergence enables near‑real‑time analytics (minute‑level visibility).
Business benefits include faster data delivery, unified metrics, and a foundation for a near‑real‑time data warehouse.
Future Plans
Further roadmap items are illustrated with images (not reproduced here).
Thank you for reading.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.