Big Data 12 min read

Using Apache Iceberg 0.11 with Flink for Real‑time Data Lake: Architecture, Pain Points, and Solutions

This article examines the challenges of using Kafka, Flink, and Hive for real‑time data warehousing, introduces Apache Iceberg 0.11 as a solution, details its architecture, query planning, Flink integration, code examples, optimization techniques, and summarizes the benefits for large‑scale data processing.

Qunar Tech Salon

Jun 21, 2021

Using Apache Iceberg 0.11 with Flink for Real‑time Data Lake: Architecture, Pain Points, and Solutions

The author, who joined Qunar in 2021 and works on Flink operations and platform development, describes problems encountered when using Flink for real‑time data warehousing with Kafka and Hive, such as Kafka data loss and Hive metadata pressure, and explains how Iceberg 0.11 addresses these scenarios.

The original architecture stored real‑time data in Kafka, then consumed it via Flink SQL or Flink DataStream through a self‑built platform that submitted SQL and DataStream jobs.

Pain points include high Kafka storage costs, short retention causing data loss under back‑pressure, and growing Hive metadata that slows query planning and burdens the metadata database.

Iceberg’s architecture is introduced, covering key concepts: data files (Parquet files stored in the data directory), manifest files (describing each data file’s path, partition, and column statistics), snapshots (table state at a point in time), and how query planning uses metadata filtering, snapshot IDs, and manifest lists to locate relevant files efficiently.

The query plan section explains metadata filtering on partition data, column‑level statistics, and the role of snapshot IDs linking to groups of manifest files.

Flink integration is detailed with two main components: IcebergStreamWriter, which writes records to Avro/Parquet/ORC files and creates Iceberg DataFiles, and IcebergFilesCommitter, which collects DataFiles at each checkpoint and commits a transaction to Iceberg, ensuring data visibility after checkpoint completion.

A practical Flink‑Iceberg demo is provided. It shows how to enable streaming execution (set execution.type = streaming), activate table SQL hints (set table.dynamic-table-options.enabled=true), register an Iceberg catalog, and run SQL statements such as:

CREATE CATALOG Iceberg_catalog WITH (
  'type'='Iceberg',
  'catalog-type'='Hive',
  'uri'='thrift://localhost:9083'
);

and data ingestion commands:

INSERT INTO Iceberg_catalog.Iceberg_db.tbl1 SELECT * FROM Kafka_tbl;

and streaming inserts with options:

INSERT INTO Iceberg_catalog.Iceberg_db.tbl2 SELECT * FROM Iceberg_catalog.Iceberg_db.tbl1 /*+ OPTIONS('streaming'='true','monitor-interval'='10s','snapshot-id'='3821550127947089987') */;

Parameter explanations clarify that monitor-interval controls how often new data files are watched (default 1 s) and start-snapshot-id determines the snapshot from which data is read.

The author shares a pitfall: writing to Iceberg via SQL Client without enabling checkpoints results in data files being written but metadata not updated, causing empty query results; enabling checkpoints is mandatory for both SQL and DataStream ingestion.

Real‑time query examples illustrate how data changes within one second after ingestion, demonstrating Iceberg’s near‑real‑time capabilities.

Optimization practices are discussed. For small‑file handling, Iceberg 0.11 introduces streaming small‑file merging using hash distribution mode, eliminating the need for batch‑style rewrite actions. Example code shows how to rewrite data files programmatically and how to create a table with 'write.distribution-mode'='hash'.

Sorting support added in Iceberg 0.11 is described. Previously only Spark supported sorting; now Flink can also sort data during ingestion, improving scan efficiency by allowing predicate push‑down on min‑max statistics. A demo insert with ORDER BY days, province_id is provided.

The manifest after sorting contains file paths, partitions, and lower/upper bounds for sorted columns, enabling the query planner to prune files effectively and reduce Hive metadata pressure.

In summary, Iceberg 0.11 brings valuable features such as streaming small‑file merging, hash‑based distribution, and sorting support, which together alleviate Kafka storage costs, prevent data loss, reduce Hive metadata bottlenecks, and enhance real‑time data lake performance for large‑scale big‑data workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Real-time Processing Flink SQL Data Lake metadata management Iceberg

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.