Big Data 13 min read

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

DataFunTalk

Jun 21, 2021

Flink + Iceberg 0.11 Practices in Qunar Data Platform

Background and Pain Points

Qunar encountered issues when using Flink for real‑time data warehousing, including Kafka data loss and performance pressure from near‑real‑time Hive queries. Iceberg 0.11 introduced features that address these scenarios, offering advantages over Kafka in specific use cases.

Iceberg Architecture Overview

Iceberg stores data in data files (typically Parquet) located in a data directory. Each data file is described by a manifest file containing metadata such as file path, partition, and column statistics. A snapshot represents the table state at a point in time and references a list of manifest files.

Pain Point 1: Kafka Data Loss

High storage costs and short retention periods in Kafka cause data loss when consumption lags. The solution is to offload less time‑critical data to Iceberg, which supports real‑time SQL reads and retains historical data, reducing Kafka pressure while ensuring data availability.

Why Iceberg is Near‑Real‑Time

Iceberg commits transactions at the file level, preventing per‑second transaction bursts that would explode file counts. It lacks an online service node, so writes are checkpoint‑driven; data becomes visible only after a checkpoint completes.

Pain Point 2: Hive Near‑Real‑Time Slowdown

Increasing partitions and metadata in Hive lead to heavy metastore pressure and slower query planning. Iceberg stores metadata in a scalable distributed file system, avoiding a centralized metastore bottleneck.

Optimization Practices

Small File Handling : Prior to Iceberg 0.11, batch APIs were needed to merge small files. Iceberg 0.11 adds streaming small‑file merging via hash distribution mode.

Table table = findTable(options, conf);
Actions.forTable(table).rewriteDataFiles()
       .targetSizeInBytes(10 * 1024) // 10KB
       .execute();

Creating a table with hash distribution:

CREATE TABLE city_table (
    province BIGINT,
    city STRING
) PARTITIONED BY (province, city) WITH (
    'write.distribution-mode'='hash'
);

Sorting Support : Iceberg 0.11 adds native sorting for Flink, enabling faster scans by pruning files using column min/max statistics.

insert into Iceberg_table select days from Kafka_tbl order by days, province_id;

Sorting improves query efficiency by allowing the planner to skip files whose bounds do not satisfy the query predicates.

Flink‑Iceberg Integration Demo

Enable streaming execution and table options:

set execution.type = streaming;
set table.dynamic-table-options.enabled=true;

CREATE CATALOG Iceberg_catalog WITH (
    'type'='Iceberg',
    'catalog-type'='Hive',
    'uri'='thrift://localhost:9083'
);

Ingest Kafka data into Iceberg:

insert into Iceberg_catalog.Iceberg_db.tbl1
    select * from Kafka_tbl;

Stream data between Iceberg tables with options:

insert into Iceberg_catalog.Iceberg_db.tbl2
    select * from Iceberg_catalog.Iceberg_db.tbl1
    /*+ OPTIONS('streaming'='true', 'monitor-interval'='10s', snapshot-id='3821550127947089987') */;

Key Takeaways

Iceberg 0.11 brings valuable features such as streaming small‑file merging, native sorting, and efficient metadata handling, which together alleviate Kafka data loss, reduce Hive metadata pressure, and enable near‑real‑time analytics with lower operational overhead.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Real-time Processing Flink Kafka Hive Data Lake Iceberg

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.