Flink + Iceberg 0.11 Practices in Qunar Data Platform
This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.
Background and Pain Points
Qunar encountered issues when using Flink for real‑time data warehousing, including Kafka data loss and performance pressure from near‑real‑time Hive queries. Iceberg 0.11 introduced features that address these scenarios, offering advantages over Kafka in specific use cases.
Iceberg Architecture Overview
Iceberg stores data in data files (typically Parquet) under the table's data directory. A manifest file lists a set of data files and records metadata for each one, such as its path, partition, and column statistics. A snapshot represents the table state at a point in time and references the manifest files that make up that state.
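The relationship between snapshots, manifests, and data files can be sketched with a few plain Java classes. This is a toy model for illustration only: the class names, fields, and the `planScan` helper are assumptions made here and are not Iceberg's actual API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the metadata tree; all names here are hypothetical. */
final class MetadataTreeSketch {

    /** One manifest entry: where a data file lives and what it holds. */
    static final class DataFile {
        final String path;
        final String partition;
        final long recordCount;

        DataFile(String path, String partition, long recordCount) {
            this.path = path;
            this.partition = partition;
            this.recordCount = recordCount;
        }
    }

    /** A manifest lists a group of data files together with their metadata. */
    static final class Manifest {
        final List<DataFile> files;

        Manifest(DataFile... files) {
            this.files = Arrays.asList(files);
        }
    }

    /** A snapshot is the set of manifests describing the table at one point in time. */
    static final class Snapshot {
        final long id;
        final List<Manifest> manifests;

        Snapshot(long id, Manifest... manifests) {
            this.id = id;
            this.manifests = Arrays.asList(manifests);
        }
    }

    /** Plans a scan by walking snapshot -> manifests -> data files for one partition. */
    static List<String> planScan(Snapshot snapshot, String partition) {
        List<String> paths = new ArrayList<>();
        for (Manifest manifest : snapshot.manifests) {
            for (DataFile file : manifest.files) {
                if (file.partition.equals(partition)) {
                    paths.add(file.path);
                }
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Manifest m1 = new Manifest(
                new DataFile("data/day=1/a.parquet", "day=1", 100),
                new DataFile("data/day=2/b.parquet", "day=2", 80));
        Manifest m2 = new Manifest(new DataFile("data/day=2/c.parquet", "day=2", 40));
        Snapshot s1 = new Snapshot(1L, m1);     // table state after the first commit
        Snapshot s2 = new Snapshot(2L, m1, m2); // second commit reuses m1 and adds m2
        System.out.println(planScan(s1, "day=2"));
        System.out.println(planScan(s2, "day=2"));
    }
}
```

Reading an older snapshot simply means starting the walk from a different root, which is why each snapshot can describe a consistent table state without copying data files.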
Pain Point 1: Kafka Data Loss
Kafka storage is expensive, so retention periods are kept short. When consumption lags, data can expire before it is consumed and is lost. The solution is to offload less time‑critical data to Iceberg, which supports real‑time SQL reads and retains historical data, reducing Kafka pressure while ensuring data availability.
Why Iceberg is Near‑Real‑Time
Iceberg commits transactions at file granularity, so committing every second is impractical: the number of files would explode. Iceberg also has no online service node, so writes are checkpoint‑driven, and data becomes visible only after a checkpoint completes.
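The checkpoint-driven visibility described above can be illustrated with a small simulation. This is a simplified sketch, not the Iceberg Flink sink: the class and method names are hypothetical, and a real committer also handles failures and recovery.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of checkpoint-driven commits; names are hypothetical. */
final class CheckpointCommitSketch {
    private final List<String> pendingFiles = new ArrayList<>();   // written, not yet committed
    private final List<String> committedFiles = new ArrayList<>(); // visible to readers
    private int snapshotCount = 0;

    /** A writer task finishes a data file; readers cannot see it yet. */
    void writeFile(String path) {
        pendingFiles.add(path);
    }

    /** On checkpoint completion, all pending files are committed together as one snapshot. */
    void onCheckpointComplete() {
        committedFiles.addAll(pendingFiles);
        pendingFiles.clear();
        snapshotCount++;
    }

    /** What a query sees: only files that belong to committed snapshots. */
    List<String> visibleFiles() {
        return Collections.unmodifiableList(committedFiles);
    }

    int snapshotCount() {
        return snapshotCount;
    }

    public static void main(String[] args) {
        CheckpointCommitSketch table = new CheckpointCommitSketch();
        table.writeFile("f1.parquet");
        table.writeFile("f2.parquet");
        System.out.println(table.visibleFiles().size()); // 0: no checkpoint yet
        table.onCheckpointComplete();
        System.out.println(table.visibleFiles().size()); // 2: the checkpoint committed both files
    }
}
```

In this model the checkpoint interval bounds data freshness, which is why the result is near‑real‑time rather than real‑time.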
Pain Point 2: Hive Near‑Real‑Time Slowdown
Increasing partitions and metadata in Hive lead to heavy metastore pressure and slower query planning. Iceberg stores metadata in a scalable distributed file system, avoiding a centralized metastore bottleneck.
Optimization Practices
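Small File Handling: Before Iceberg 0.11, small files were merged by periodically running the batch rewrite API. Iceberg 0.11 adds a hash distribution mode for streaming writes: rows are shuffled by partition key before writing, so each partition is written by a single task and fewer small files are produced in the first place. The batch rewrite looks like this: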
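// Actions is org.apache.iceberg.flink.actions.Actions; findTable is assumed to be a helper that loads the table.
Table table = findTable(options, conf);
Actions.forTable(table)
    .rewriteDataFiles()
    .targetSizeInBytes(10 * 1024) // 10 KB target file size
    .execute();
Creating a table with hash distribution: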
CREATE TABLE city_table (
    province BIGINT,
    city STRING
) PARTITIONED BY (province, city) WITH (
    'write.distribution-mode'='hash'
);
Sorting Support: Iceberg 0.11 adds native sorting for Flink, so that scans can prune files using column min/max statistics.
insert into Iceberg_table select days from Kafka_tbl order by days, province_id;
Sorting improves query efficiency by allowing the planner to skip files whose bounds do not satisfy the query predicates.
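Why sorting helps can be seen in a small sketch of min/max pruning. This is an illustrative model under simplifying assumptions (a single integer column and an equality predicate); the class and method names are hypothetical and do not correspond to Iceberg's planner API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of min/max file pruning; names are hypothetical. */
final class FilePruningSketch {

    /** Per-file bounds for one column, as a manifest would record them. */
    static final class FileStats {
        final String path;
        final int min;
        final int max;

        FileStats(String path, int[] values) {
            this.path = path;
            this.min = Arrays.stream(values).min().getAsInt();
            this.max = Arrays.stream(values).max().getAsInt();
        }
    }

    /** Returns the files that may contain rows where the column equals target. */
    static List<String> filesToScan(List<FileStats> files, int target) {
        List<String> result = new ArrayList<>();
        for (FileStats f : files) {
            if (f.min <= target && target <= f.max) {
                result.add(f.path); // bounds overlap the predicate, so the file must be read
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Unsorted write: every file spans almost the whole range of days.
        List<FileStats> unsorted = Arrays.asList(
                new FileStats("u1", new int[] {1, 5, 9}),
                new FileStats("u2", new int[] {2, 6, 8}),
                new FileStats("u3", new int[] {3, 4, 7}));
        // Sorted write: the same values, but each file covers a narrow range.
        List<FileStats> sorted = Arrays.asList(
                new FileStats("s1", new int[] {1, 2, 3}),
                new FileStats("s2", new int[] {4, 5, 6}),
                new FileStats("s3", new int[] {7, 8, 9}));
        System.out.println(filesToScan(unsorted, 5)); // [u1, u2, u3]
        System.out.println(filesToScan(sorted, 5));   // [s2]
    }
}
```

With unsorted data every file's bounds overlap the predicate, so nothing can be skipped; with sorted data only one of the three files has to be opened.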
Flink‑Iceberg Integration Demo
Enable streaming execution and table options:
set execution.type = streaming;
set table.dynamic-table-options.enabled=true;
Register Iceberg catalog:
CREATE CATALOG Iceberg_catalog WITH (
    'type'='iceberg',
    'catalog-type'='hive',
    'uri'='thrift://localhost:9083'
);
Ingest Kafka data into Iceberg:
insert into Iceberg_catalog.Iceberg_db.tbl1
select * from Kafka_tbl;
Stream data between Iceberg tables with options:
insert into Iceberg_catalog.Iceberg_db.tbl2
select * from Iceberg_catalog.Iceberg_db.tbl1
/*+ OPTIONS('streaming'='true', 'monitor-interval'='10s', 'start-snapshot-id'='3821550127947089987') */;
Key Takeaways
Iceberg 0.11 brings valuable features such as streaming small‑file merging, native sorting, and efficient metadata handling, which together alleviate Kafka data loss, reduce Hive metadata pressure, and enable near‑real‑time analytics with lower operational overhead.
DataFunTalk