Practical Experience and Optimizations of Apache Iceberg in Tencent’s Big Data Ecosystem

This article reviews the advantages of Apache Iceberg for data lake storage, details Tencent’s custom optimizations and integration with Flink and Spark, and shares multiple real‑world implementations that demonstrate how Iceberg improves data consistency, reduces small‑file overhead, and enables near‑real‑time analytics in large‑scale big‑data environments.

Big Data Technology & Architecture

Jun 16, 2021

The author investigated the strengths and weaknesses of Apache Iceberg and its adoption in major companies, aiming to provide practical insights for building unified data lake storage and data pipelines with ACID guarantees.

As big‑data storage and processing demands diversify, constructing a unified data lake that supports various analytics becomes crucial, and fast, consistent, atomic data pipelines are urgently needed.

To address this, Uber open‑sourced Apache Hudi, Databricks introduced Delta Lake, and Netflix launched Apache Iceberg, making ACID‑enabled table‑format middleware a hot topic in the data‑lake domain.

Why Choose Iceberg?

Provides T+0 data landing and processing, simplifying pipeline design and reducing latency through ACID capabilities.

Lowers data correction costs by supporting row‑level updates and deletes, avoiding expensive read‑modify‑write cycles.
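One way to picture how row-level deletes avoid rewriting data is the merge-on-read pattern, where deletes are recorded separately and applied when the table is read. The sketch below is a toy in-memory model, not Tencent's implementation; `DataFile`, `TableState`, `delete_where`, and `scan` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """An immutable data file: once written, its rows are never edited."""
    path: str
    rows: list


@dataclass
class TableState:
    data_files: list = field(default_factory=list)
    # Position deletes: (file path, row position) pairs marking dead rows.
    position_deletes: set = field(default_factory=set)


def delete_where(table, predicate):
    """Record deletes in a side structure without rewriting any data file."""
    for f in table.data_files:
        for pos, row in enumerate(f.rows):
            if predicate(row):
                table.position_deletes.add((f.path, pos))


def scan(table):
    """Readers merge data files with the delete set at read time."""
    return [
        row
        for f in table.data_files
        for pos, row in enumerate(f.rows)
        if (f.path, pos) not in table.position_deletes
    ]


# Hypothetical file names and rows, for illustration only.
table = TableState(data_files=[
    DataFile("part-0.orc", [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]),
    DataFile("part-1.orc", [{"id": 3, "name": "c"}]),
])
delete_where(table, lambda r: r["id"] == 2)
live_rows = scan(table)
```

The data files stay untouched after the delete; only the small delete set grows, which is why a correction does not require reading and rewriting the whole file.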

Technical reasons for preferring Iceberg over other projects include:

Engine‑agnostic architecture that works with Flink, Hive, Spark, facilitating integration across Tencent’s heterogeneous pipelines.

Elegant design with a well‑defined type system and evolvable schema.

Optimized for object storage, avoiding costly listing and rename operations.
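The point about avoiding renames can be illustrated with a commit that swaps a single metadata pointer instead of moving directories. This is a minimal sketch assuming an in-memory catalog; the `Catalog` class, `commit` function, and metadata file names are hypothetical and stand in for whatever catalog service actually holds the pointer.

```python
import threading


class Catalog:
    """Holds one pointer per table to its current metadata file."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}

    def current(self, table):
        return self._current.get(table)

    def compare_and_swap(self, table, expected, new):
        """Atomically move the pointer only if nobody else committed first."""
        with self._lock:
            if self._current.get(table) != expected:
                return False
            self._current[table] = new
            return True


def commit(catalog, table, new_metadata, max_retries=3):
    """Optimistic commit: write new metadata aside, then swap the pointer."""
    for _ in range(max_retries):
        base = catalog.current(table)
        # A real writer would re-apply its changes on top of `base` here.
        if catalog.compare_and_swap(table, base, new_metadata):
            return True
    return False


catalog = Catalog()
first = commit(catalog, "db.events", "v1.metadata.json")
# A writer holding a stale view (it still expects no metadata) is rejected.
stale = catalog.compare_and_swap("db.events", None, "v2.metadata.json")
```

Because the commit is one pointer update, no data file is listed or renamed, and a losing writer simply retries against the new base.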

Additional evaluation criteria covered code quality, community vitality, and technology neutrality.

Tencent’s Optimizations and Improvements

Implemented row‑level delete and update operations, greatly reducing data‑fixing overhead.

Adapted Spark 3.0 DataSource V2 for seamless Iceberg integration.

Added Flink support to enable data landing in Iceberg format.

These enhancements improve Iceberg’s usability within Tencent, and many of them have been contributed back to the open‑source community.

Typical Practices

Flink Integration at Ctrip/Elong

Pain point: Columnar ORC storage caused HDFS small‑file issues and slow queries.

Flink + Iceberg solution: After evaluating Delta, Iceberg, and Hudi, Iceberg was chosen for its deep Flink integration and low migration cost.

Original Hive SQL:

INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table

After migration to Iceberg, only the catalog changes:

INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table

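Iceberg’s manifest files record the partition and column statistics of every data file, so the query planner can locate the files it needs from metadata instead of listing directories (the source describes this as O(1) file location). Queries became dramatically faster; for example, filter‑task time dropped from 61.5 h to 22 min.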
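A minimal sketch of how manifest statistics let a planner skip files follows. It assumes each manifest entry carries a partition value and min/max bounds for one column; `ManifestEntry`, `plan_files`, and the file paths are hypothetical names for illustration, not Iceberg's actual classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    path: str
    partition: str  # partition value, e.g. the dt column
    min_id: int     # lower bound of the id column in this file
    max_id: int     # upper bound of the id column in this file


def plan_files(manifest, partition, id_value):
    """Keep only files whose partition and id range can contain the row."""
    return [
        e.path
        for e in manifest
        if e.partition == partition and e.min_id <= id_value <= e.max_id
    ]


manifest = [
    ManifestEntry("dt=2021-06-15/f0.orc", "2021-06-15", 1, 100),
    ManifestEntry("dt=2021-06-16/f1.orc", "2021-06-16", 1, 500),
    ManifestEntry("dt=2021-06-16/f2.orc", "2021-06-16", 501, 900),
]
# Equivalent in spirit to: WHERE dt = '2021-06-16' AND id = 620
selected = plan_files(manifest, "2021-06-16", 620)
```

Only one of the three files survives planning, and no directory listing or file open was needed to decide that.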

Flink CDC can write MySQL binlog directly to Iceberg, simplifying pipelines and reducing component maintenance.

Small‑file compaction example:

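import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.flink.actions.Actions;

// `table` is an org.apache.iceberg.Table already loaded from the catalog.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Actions.forTable(env, table)
    .rewriteDataFiles()  // merge small data files into larger ones
    .execute();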
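To make the idea behind small-file merging concrete, the following sketch plans which files to combine using a simple greedy bin-packing pass. It is an illustration of the general technique only and does not reproduce the planning logic of Iceberg's rewrite action; `plan_compaction`, the file names, and the sizes are assumptions.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files into bins of at most target_size."""
    groups, current, current_size = [], [], 0
    largest_first = sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for path, size in largest_first:
        if size >= target_size:
            continue  # already large enough; leave it alone
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # A group of one file gains nothing from being rewritten.
    return [g for g in groups if len(g) > 1]


# Sizes in MB, purely illustrative.
sizes = {"a.orc": 128, "b.orc": 60, "c.orc": 50, "d.orc": 30, "e.orc": 10}
groups = plan_compaction(sizes, target_size=128)
```

Each resulting group would be rewritten as one larger file, after which the small inputs are no longer referenced by the new snapshot.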

Batch job scheduling example:

/home/flink/bin/flink run -p 10 -m yarn-cluster /home/work/iceberg-scheduler.jar my.sql

Operational tasks include orphan file cleanup, snapshot expiration, and data management.
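The two maintenance tasks named above can be sketched with a toy snapshot model: expiring snapshots drops old table versions, and orphan cleanup finds files on storage that no snapshot references. The `Snapshot` class, both helper functions, and all file names are hypothetical; real maintenance jobs also handle manifests and in-flight writes, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    files: frozenset  # data files reachable from this snapshot


def expire_snapshots(snapshots, older_than_ms, retain_last=1):
    """Drop snapshots older than the cutoff, always keeping the newest ones."""
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms)
    protected = ordered[-retain_last:] if retain_last > 0 else []
    return [s for s in ordered if s.timestamp_ms >= older_than_ms or s in protected]


def find_orphans(files_on_storage, snapshots):
    """Files present on storage but referenced by no snapshot."""
    referenced = set()
    for s in snapshots:
        referenced |= s.files
    return sorted(set(files_on_storage) - referenced)


snapshots = [
    Snapshot(1, 1000, frozenset({"a.orc", "b.orc"})),
    Snapshot(2, 2000, frozenset({"b.orc", "c.orc"})),
    Snapshot(3, 3000, frozenset({"c.orc", "d.orc"})),
]
storage = ["a.orc", "b.orc", "c.orc", "d.orc", "tmp-x.orc"]

orphans = find_orphans(storage, snapshots)  # e.g. left by a failed write
kept = expire_snapshots(snapshots, older_than_ms=2500)
```

Running orphan detection against the live snapshots first avoids treating files that are still needed for time travel as garbage.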

Flink + Iceberg Real‑Time Warehouse at Qunar

Iceberg supports near‑real‑time data ingestion, read‑write separation, concurrent reads, incremental reads, and small‑file merging, enabling a unified batch‑stream analytics platform.
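Incremental reads can be pictured as asking only for the files appended between two snapshots. The sketch below assumes an append-only table whose snapshot log lists the files each commit added; `incremental_files`, the snapshot IDs, and the file names are hypothetical.

```python
def incremental_files(snapshot_log, from_id, to_id):
    """Files appended after snapshot from_id, up to and including to_id."""
    ids = [sid for sid, _ in snapshot_log]
    start, end = ids.index(from_id), ids.index(to_id)
    added = []
    for _, appended in snapshot_log[start + 1 : end + 1]:
        added.extend(appended)
    return added


# (snapshot id, files appended by that commit), oldest first.
snapshot_log = [
    (101, ["f0.orc"]),
    (102, ["f1.orc", "f2.orc"]),
    (103, ["f3.orc"]),
]
new_files = incremental_files(snapshot_log, from_id=101, to_id=103)
```

A downstream job that remembers the last snapshot it consumed can process only `new_files` on each run, which is what makes the table usable as a streaming source as well as a batch one.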

Key benefits of replacing Kafka with Iceberg:

Unified streaming and batch storage.

OLAP‑friendly middle layer.

Efficient time‑travel queries.

Reduced storage costs.
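Time travel in the list above amounts to resolving a timestamp to the snapshot that was current at that moment. A minimal sketch, assuming the table keeps a history of commit times sorted in ascending order; `snapshot_as_of` and the sample history are illustrative names and values.

```python
import bisect


def snapshot_as_of(history, timestamp_ms):
    """Return the snapshot that was current at timestamp_ms.

    history is a list of (commit_ms, snapshot_id) sorted by commit_ms.
    """
    times = [t for t, _ in history]
    idx = bisect.bisect_right(times, timestamp_ms) - 1
    if idx < 0:
        raise ValueError("no snapshot at or before the requested time")
    return history[idx][1]


history = [(1000, 101), (2000, 102), (3000, 103)]
between_commits = snapshot_as_of(history, 2500)
exact_commit = snapshot_as_of(history, 2000)
```

Because every snapshot is immutable, reading an older one needs no log replay: the query simply plans against that snapshot's file list.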

Iceberg’s support for Alluxio caching further accelerates data‑lake access.

Iceberg 0.11 and Spark 3.0 Integration

Steps to compile Iceberg 0.11.1, install the Spark runtime JAR, and configure Spark to use Iceberg with either Hadoop or Hive metastore are provided, along with commands to create databases, tables, and alter file formats.
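As a rough illustration of the configuration step, the helper below assembles the Spark properties that register an Iceberg catalog backed by either a Hive metastore or a Hadoop warehouse path. The property keys follow Iceberg's Spark catalog configuration; the catalog name `hadoop_prod`, the HDFS path, and the helper functions themselves are assumptions, not values from the original article.

```python
def iceberg_catalog_conf(name, catalog_type, location):
    """Return Spark properties registering an Iceberg catalog called `name`."""
    if catalog_type not in ("hive", "hadoop"):
        raise ValueError("catalog_type must be 'hive' or 'hadoop'")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
    }
    # Hive catalogs point at a metastore URI; Hadoop catalogs at a warehouse path.
    key = "uri" if catalog_type == "hive" else "warehouse"
    conf[f"{prefix}.{key}"] = location
    return conf


def to_cli_flags(conf):
    """Render the properties as --conf arguments for spark-sql or spark-shell."""
    return [arg for k, v in conf.items() for arg in ("--conf", f"{k}={v}")]


# Hypothetical catalog name and warehouse location.
hadoop_conf = iceberg_catalog_conf(
    "hadoop_prod", "hadoop", "hdfs://namenode:8020/warehouse/iceberg"
)
flags = to_cli_flags(hadoop_conf)
```

With such a catalog registered, tables are addressed as `hadoop_prod.db.table` in Spark SQL.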

Example to create an Iceberg table with ORC format:

CREATE TABLE iceberg_spark(id int, name string) USING iceberg TBLPROPERTIES ('write.format.default' = 'orc');

Metadata locations are shown via Hive configuration properties.

Summary

Iceberg is rapidly evolving with contributions from major tech companies worldwide; its open, neutral architecture and continuous improvements make it a promising foundation for data‑lake solutions in both international and Chinese enterprises.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
