Building a Real‑Time Data Warehouse with Apache Flink and Apache Iceberg: Architecture, Challenges, and Best Practices
This article presents Tencent's practical experience building a real‑time data warehouse that integrates Apache Flink with Apache Iceberg, covering the pain points of traditional Lambda architectures, Iceberg's table format and capabilities, the design of the Flink‑Iceberg sink, small‑file handling, and the roadmap toward a unified streaming‑batch data lake.
Abstract – Apache Flink is a popular unified stream‑batch compute engine, and modern data‑lake table formats such as Iceberg, Hudi, and Delta enable unified storage beneath it. Iceberg currently integrates with Flink 1.11.x through both the DataStream API and the Table API.
Background & Pain Points – High‑volume applications (e.g., mini‑programs, video accounts) generate petabytes to exabytes of data daily. Traditional data platforms rely on a Lambda architecture with separate offline (batch) and online (streaming) layers, leading to duplicated pipelines, high operational cost, and data inconsistency.
Lambda Architecture Issues – Offline jobs (Spark, scheduled on a T+1 basis) introduce day‑level latency, while real‑time jobs (Flink + Kafka) require a separate code base. Maintaining two engines inflates development and O&M effort, and every dataset is processed twice, increasing the risk of inconsistency.
Kappa Architecture Issues – Although Kappa removes the batch layer, it still depends heavily on message queues: it suffers from ordering problems, Kafka offers no OLAP‑style optimizations, and complexity grows when downstream systems (ClickHouse, Elasticsearch, Hive) need to consume the data.
Iceberg Overview – Iceberg is an open table format that sits between compute engines (Flink, Spark) and underlying file formats (Parquet, ORC, Avro). It provides snapshot‑based read/write isolation, unified stream‑batch access, engine‑agnostic tables, ACID semantics, and schema/partition evolution.
Iceberg File Organization – A table consists of snapshots; each snapshot references multiple manifest files, which in turn list DataFiles. A new snapshot inherits the data files of its parent, enabling efficient incremental reads.
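The snapshot → manifest → DataFile hierarchy can be sketched in plain Java. The types below (`Snapshot`, `Manifest`, `DataFile`) are hypothetical stand‑ins, not the actual Iceberg API; the sketch only illustrates how a child snapshot that inherits its parent's manifests supports an incremental read (the set difference of reachable files).

```java
import java.util.*;

// Hypothetical model of Iceberg's file organization (not the real API).
record DataFile(String path) {}
record Manifest(List<DataFile> files) {}
record Snapshot(long id, Long parentId, List<Manifest> manifests) {
    // All data files reachable from this snapshot.
    List<DataFile> allFiles() {
        List<DataFile> out = new ArrayList<>();
        for (Manifest m : manifests) out.addAll(m.files());
        return out;
    }
}

public class SnapshotDemo {
    // Incremental read: files present in `current` but not in `previous`.
    static List<DataFile> incrementalFiles(Snapshot previous, Snapshot current) {
        Set<String> seen = new HashSet<>();
        for (DataFile f : previous.allFiles()) seen.add(f.path());
        List<DataFile> delta = new ArrayList<>();
        for (DataFile f : current.allFiles())
            if (!seen.contains(f.path())) delta.add(f);
        return delta;
    }

    public static void main(String[] args) {
        Manifest m1 = new Manifest(List.of(new DataFile("data/f1.parquet")));
        Snapshot s1 = new Snapshot(1, null, List.of(m1));
        Manifest m2 = new Manifest(List.of(new DataFile("data/f2.parquet")));
        // Snapshot 2 inherits manifest m1 from its parent and adds m2.
        Snapshot s2 = new Snapshot(2, 1L, List.of(m1, m2));
        System.out.println(incrementalFiles(s1, s2).size()); // prints 1
    }
}
```

A streaming reader only needs the delta between two snapshots, which is why inheritance of parent manifests makes incremental consumption cheap.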
Read/Write Process – A write produces a provisional snapshot that becomes visible only after commit. Multiple snapshots can be read concurrently, allowing time‑travel queries. In Flink's Iceberg sink, writers emit DataFiles and notify an IcebergFileCommitter, which performs a single commit to make the files visible.
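The writer/committer split can be illustrated with a small sketch. The `Committer` class below is hypothetical, not the connector's actual `IcebergFilesCommitter`; it only shows the invariant that collected files remain invisible to readers until the single commit call publishes them all at once.

```java
import java.util.*;

// Hypothetical sketch of the write path: writers hand completed data files
// to a single committer, and one atomic commit makes them visible.
public class CommitterDemo {
    static class Committer {
        private final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        // Called by writer tasks as they finish data files.
        void collect(String dataFile) { pending.add(dataFile); }

        // One commit publishes every pending file at once.
        void commit() {
            committed.addAll(pending);
            pending.clear();
        }

        // What a reader of the latest snapshot would see.
        List<String> visibleFiles() { return List.copyOf(committed); }
    }

    public static void main(String[] args) {
        Committer c = new Committer();
        c.collect("f1.parquet");
        c.collect("f2.parquet");
        System.out.println(c.visibleFiles().size()); // prints 0: not yet committed
        c.commit();
        System.out.println(c.visibleFiles().size()); // prints 2: visible after commit
    }
}
```

Concentrating the commit in one place avoids concurrent snapshot commits from parallel writer tasks and gives readers an all‑or‑nothing view of each write.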
Real‑Time Small‑File Problem – Frequent commits (e.g., every 30 s) generate many tiny DataFiles, which can overwhelm the sink and degrade query performance. Tencent adds a compaction operator that merges small files before they are committed, reducing file count and latency.
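The effect of pre‑commit compaction can be shown with a toy greedy packer. This is not Tencent's actual operator; it is a sketch under the assumption that small files are simply binned toward a target output size before the commit, so the committed snapshot references far fewer files.

```java
import java.util.*;

// Hypothetical pre-commit compaction sketch: greedily pack small files
// (sizes in MB) into merged outputs of roughly targetMb each.
public class CompactionDemo {
    static List<List<Integer>> compact(List<Integer> fileSizesMb, int targetMb) {
        List<List<Integer>> merged = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        int currentSize = 0;
        for (int size : fileSizesMb) {
            // Start a new output file once the target size would be exceeded.
            if (currentSize + size > targetMb && !current.isEmpty()) {
                merged.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) merged.add(current);
        return merged;
    }

    public static void main(String[] args) {
        // Ten 2 MB files produced by frequent commits, packed toward 8 MB outputs.
        List<Integer> small = new ArrayList<>(Collections.nCopies(10, 2));
        System.out.println(compact(small, 8).size()); // prints 3
    }
}
```

Here ten tiny files collapse into three merged outputs, so the committer registers 3 DataFiles in the snapshot instead of 10.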
Flink + Iceberg Real‑Time Warehouse – Leveraging Iceberg's snapshot visibility and incremental reads yields a near‑real‑time data lake: Iceberg replaces Kafka as the streaming backbone, offering unified storage, OLAP‑friendly columnar formats, and efficient backtracking over historical data.
Advantages & Trade‑offs – Benefits include unified stream‑batch storage, OLAP support in the middle layer, efficient back‑tracking, and lower storage cost. Drawbacks are a shift from true real‑time to near‑real‑time latency and additional integration effort.
Future Plans – Enhance Iceberg core (row‑level deletes, richer SQL extensions, unified indexing), integrate Alluxio for cache‑accelerated queries, automate schema extraction, improve metadata management, and connect the platform with internal systems.