Building a Lakehouse Architecture with Apache Iceberg and Flink: Practices and Insights

This article explains how to construct a lake‑house architecture using Apache Iceberg, detailing the migration from Hive, Flink‑SQL integration, proxy user support, CDC handling, copy‑on‑write sinks, and the resulting benefits for near‑real‑time data visibility and unified batch‑stream processing.

DataFunTalk

Jul 10, 2021

Introduction

This article introduces a method to build a lake‑house architecture based on Apache Iceberg, aiming to achieve minute‑level data visibility and explore the benefits of multi‑dimensional analysis.

Background: Data Warehouse Upgrade

The original data warehouse was built entirely on Hive and suffered from three major pain points (not listed here). Moving to Iceberg addresses these issues.

Key Features of Iceberg

Iceberg provides four essential capabilities: ACID semantics, an incremental snapshot mechanism, an open table format, and both streaming and batch interfaces.
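To make the snapshot mechanism concrete, here is a toy in-memory model in Python. It is not Iceberg's API: `ToyTable`, `commit`, `files_at`, and `incremental_files` are hypothetical names, and the model is append-only. Each commit publishes a new immutable snapshot as a whole, and an incremental read returns only the files added between two snapshots, which is what lets a downstream job consume new data without rescanning the table.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    added_files: Tuple[str, ...]  # files introduced by this commit


@dataclass
class ToyTable:
    snapshots: List[Snapshot] = field(default_factory=list)

    def commit(self, new_files: List[str]) -> int:
        """Publish a new snapshot; readers see all of new_files or none."""
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snapshot_id, tuple(new_files)))
        return snapshot_id

    def files_at(self, snapshot_id: int) -> List[str]:
        """Full table state as of a snapshot (append-only toy model)."""
        return [f for s in self.snapshots
                if s.snapshot_id <= snapshot_id
                for f in s.added_files]

    def incremental_files(self, start_exclusive: int, end_inclusive: int) -> List[str]:
        """Files added after start_exclusive, up to and including end_inclusive."""
        return [f for s in self.snapshots
                if start_exclusive < s.snapshot_id <= end_inclusive
                for f in s.added_files]


table = ToyTable()
s1 = table.commit(["a.parquet"])
s2 = table.commit(["b.parquet", "c.parquet"])
new_since_s1 = table.incremental_files(s1, s2)
```

A reader pinned to `s1` keeps seeing a consistent view while `s2` is being written, and a streaming consumer only needs to remember the last snapshot id it processed.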

Lakehouse Architecture Practice

The lake‑house concept eliminates the distinction between lake and warehouse, allowing data to flow freely and integrate with diverse compute ecosystems.

1. Append Flow into the Lake

Log data (client, user, and server logs) is ingested into Kafka, and Flink jobs write it to Iceberg tables whose files are stored on HDFS.

2. Flink SQL Integration

The Flink 1.11 + Iceberg 0.11 stack was used. The following enhancements were made:

Meta Server now supports Iceberg Catalog.

SQL SDK extended to handle Iceberg Catalog.

Additionally, the platform opened Iceberg table management, allowing users to create tables via SQL.

3. Proxy‑User Support for Ingestion

To align with budgeting and permission systems, proxy‑user functionality was added so that data can be written to Iceberg under a specified account.

Key steps include:

Table-level configuration: 'iceberg.user.proxy' = 'targetUser'.

Enable superuser and team-account authentication.

Use proxy user when accessing HDFS.

Specify proxy user for Hive Metastore access (e.g., Spark’s org.apache.spark.deploy.security.HiveDelegationTokenProvider).
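The steps above can be sketched as a small resolution function. This is a minimal illustration under stated assumptions, not the platform's implementation: only the property key 'iceberg.user.proxy' comes from the article, while `resolve_write_user` and the `superusers` set are hypothetical. In a Hadoop deployment the returned account would typically be passed to `UserGroupInformation.createProxyUser` before HDFS and Hive Metastore are accessed.

```python
from typing import Dict, Set

PROXY_KEY = "iceberg.user.proxy"  # table-level property named in the article


def resolve_write_user(login_user: str,
                       table_props: Dict[str, str],
                       superusers: Set[str]) -> str:
    """Return the account under which storage and metastore are accessed.

    Hypothetical helper: only superusers may write on behalf of another account.
    """
    target = table_props.get(PROXY_KEY)
    if target is None or target == login_user:
        return login_user  # no impersonation requested
    if login_user not in superusers:
        raise PermissionError(f"{login_user} may not impersonate {target}")
    return target


effective_user = resolve_write_user(
    login_user="flink",
    table_props={PROXY_KEY: "targetUser"},
    superusers={"flink"},
)
```

Resolving the user once per table keeps the budgeting and permission checks tied to the target account rather than to the service account that runs the Flink job.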

4. Flink SQL Ingestion Example (DDL + DML)
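The original example is not reproduced in this text. The following is a minimal sketch of what such a DDL + DML job could look like, held as Python strings so it can be submitted through whatever SQL client the platform exposes. Every identifier (catalog, database, table, topic, and host names) is a placeholder. The catalog and Kafka property keys follow the public Iceberg and Flink connector documentation, and the `logs` database is assumed to exist already.

```python
# All identifiers below are placeholders, not names from the article.
CREATE_CATALOG = """
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://metastore-host:9083',
  'warehouse' = 'hdfs://namenode:8020/warehouse/iceberg'
)
"""

# Kafka source table registered in Flink's built-in in-memory catalog.
CREATE_SOURCE = """
CREATE TABLE default_catalog.default_database.client_log_source (
  user_id BIGINT,
  event   STRING,
  dt      STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'client_log',
  'properties.bootstrap.servers' = 'kafka-host:9092',
  'properties.group.id' = 'iceberg-ingest',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
)
"""

# Iceberg sink table created through the Iceberg catalog.
CREATE_SINK = """
CREATE TABLE iceberg_catalog.logs.client_log (
  user_id BIGINT,
  event   STRING,
  dt      STRING
)
"""

# Continuous append from Kafka into the Iceberg table.
INSERT = """
INSERT INTO iceberg_catalog.logs.client_log
SELECT user_id, event, dt
FROM default_catalog.default_database.client_log_source
"""

STATEMENTS = [s.strip() for s in (CREATE_CATALOG, CREATE_SOURCE, CREATE_SINK, INSERT)]
```

The Iceberg Flink sink commits data files when a checkpoint completes, so checkpointing has to be enabled on the job, and the checkpoint interval effectively sets how quickly new data becomes visible.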

5. CDC Data Ingestion Flow

The AutoDTS platform captures business‑DB changes, streams them to Kafka, and distributes them to Iceberg via Flink.

6. CDC Integration with Flink SQL

Supporting CDC required two kinds of changes. The Iceberg sink was improved, because AppendStreamTableSink cannot handle CDC streams, and table management gained the following features:

Primary key support (PR1978).

Set 'iceberg.format.version' = '2'.
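A short sketch shows why an append-only sink is not enough and why a primary key is needed. A CDC stream carries inserts, update images, and deletes, so the sink must be able to remove or replace a row identified by its key. The row kinds below mirror Flink's changelog notation (+I, -U, +U, -D). `apply_changelog` and the sample rows are hypothetical, and the in-memory dictionary stands in for a table. It is not how Iceberg stores row-level deletes in format version 2.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]
Change = Tuple[str, Row]  # (row kind, row)


def apply_changelog(state: Dict[object, Row],
                    changes: Iterable[Change],
                    primary_key: str) -> Dict[object, Row]:
    """Fold a CDC stream into the latest row per primary key."""
    for kind, row in changes:
        key = row[primary_key]
        if kind in ("+I", "+U"):
            state[key] = row        # insert, or new image of an update
        elif kind in ("-U", "-D"):
            state.pop(key, None)    # old image of an update, or a delete
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return state


changes = [
    ("+I", {"id": 1, "city": "Beijing"}),
    ("+I", {"id": 2, "city": "Shanghai"}),
    ("-U", {"id": 1, "city": "Beijing"}),
    ("+U", {"id": 1, "city": "Shenzhen"}),
    ("-D", {"id": 2, "city": "Shanghai"}),
]
result = apply_changelog({}, changes, primary_key="id")
```

Appending all five events as plain rows would leave stale and deleted records in the table. Keying on `id` leaves only the current row for key 1.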

7. Copy‑on‑Write Sink

Because Merge‑on‑Read does not merge small files, a copy‑on‑write sink was implemented. It uses a parallel StreamWriter for writes and a single‑threaded FileCommitter for commits.

Buffering added.

Checkpoint success check before write.

Bucket‑based grouping and write.
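One plausible reading of these three points is sketched below as a toy, single-process model. It is not the authors' sink. `ToyCopyOnWriteSink` and its methods are hypothetical, it folds the writer and committer roles into one class, it assumes integer primary keys, and a dictionary stands in for each bucket's data file. Records are buffered as they arrive, and only after a checkpoint completes are they grouped by bucket, with each touched bucket's file rewritten as a whole.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


class ToyCopyOnWriteSink:
    def __init__(self, num_buckets: int, primary_key: str) -> None:
        self.num_buckets = num_buckets
        self.primary_key = primary_key
        self.buffer: List[Row] = []                         # rows since last checkpoint
        self.committed: Dict[int, Dict[int, Row]] = {}      # bucket -> current "file"
        self.file_versions: Dict[int, int] = defaultdict(int)

    def write(self, row: Row) -> None:
        """Buffer the row; nothing is rewritten per record."""
        self.buffer.append(row)

    def _bucket(self, key: int) -> int:
        return key % self.num_buckets  # assumes integer keys

    def on_checkpoint_complete(self) -> List[int]:
        """Rewrite only the buckets touched since the last checkpoint."""
        grouped: Dict[int, List[Row]] = defaultdict(list)
        for row in self.buffer:
            grouped[self._bucket(row[self.primary_key])].append(row)

        # Writer role: build a full replacement file per touched bucket.
        staged: Dict[int, Dict[int, Row]] = {}
        for bucket, rows in grouped.items():
            new_file = dict(self.committed.get(bucket, {}))  # copy existing rows
            for row in rows:
                new_file[row[self.primary_key]] = row        # upsert by key
            staged[bucket] = new_file

        # Committer role: swap in all replacement files in one step.
        for bucket, new_file in staged.items():
            self.committed[bucket] = new_file
            self.file_versions[bucket] += 1
        self.buffer.clear()
        return sorted(staged)


sink = ToyCopyOnWriteSink(num_buckets=2, primary_key="id")
for r in ({"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}):
    sink.write(r)
first_touched = sink.on_checkpoint_complete()

sink.write({"id": 1, "v": "a2"})
second_touched = sink.on_checkpoint_complete()
```

Because each bucket is always replaced by a single rewritten file, the number of files stays bounded by the number of buckets, at the cost of rewriting unchanged rows in every touched bucket.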

8‑10. Additional Practices

Other topics covered include small‑file merging, data cleanup, and integration with various compute engines.

Compute Engine – Flink

Flink is the core real‑time engine, tightly integrated with Iceberg for near‑real‑time ingestion and analysis.

Compute Engine – Hive

Hive provides batch SQL support for Iceberg, offering snapshot queries, small‑file compaction, and offline writes (INSERT, INSERT OVERWRITE, MERGE).

Compute Engine – Trino/Presto

AutoBI integrates Presto/Trino to query Iceberg tables directly, with ongoing work on metadata caching.

Pitfalls

The article lists encountered challenges (details omitted).

Benefits and Summary

Lakehouse integration yields unified storage, consistent data formats, and a unified compute layer, while batch‑stream convergence enables near‑real‑time analytics (minute‑level visibility).

Business benefits include faster data delivery, unified metrics, and a foundation for a near‑real‑time data warehouse.

Future Plans

Further roadmap items are illustrated with images (not reproduced here).

Thank you for reading.
