Practical Experience and Optimizations of Apache Iceberg in Tencent’s Big Data Ecosystem
This article reviews the advantages of Apache Iceberg for data lake storage, details Tencent’s custom optimizations and integration with Flink and Spark, and shares multiple real‑world implementations that demonstrate how Iceberg improves data consistency, reduces small‑file overhead, and enables near‑real‑time analytics in large‑scale big‑data environments.
The author investigated the strengths and weaknesses of Apache Iceberg and its adoption in major companies, aiming to provide practical insights for building unified data lake storage and data pipelines with ACID guarantees.
As big‑data storage and processing demands diversify, constructing a unified data lake that supports various analytics becomes crucial, and fast, consistent, atomic data pipelines are urgently needed.
To address this, Uber open‑sourced Apache Hudi, Databricks introduced Delta Lake, and Netflix launched Apache Iceberg, making ACID‑enabled table‑format middleware a hot topic in the data‑lake domain.
Why Choose Iceberg?
Provides T+0 data landing and processing, simplifying pipeline design and reducing latency through ACID capabilities.
Lowers data correction costs by supporting row‑level updates and deletes, avoiding expensive read‑modify‑write cycles.
Technical reasons for preferring Iceberg over other projects include:
Engine‑agnostic architecture that works with Flink, Hive, Spark, facilitating integration across Tencent’s heterogeneous pipelines.
Elegant design with a well‑defined type system and evolvable schema.
Optimized for object storage, avoiding costly listing and rename operations.
Additional evaluation criteria covered code quality, community vitality, and technology neutrality.
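To make the object-storage point concrete: instead of renaming directories to publish data, a table format of this kind can commit by atomically swapping a pointer to a new metadata file. The following is a minimal in-memory sketch of that idea, not Iceberg's actual implementation; the `Catalog` class, its `commit` method, and the metadata file names are hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class Catalog:
    """Hypothetical stand-in for a catalog holding one table's metadata pointer."""

    def __init__(self, initial_metadata: str):
        self._current = initial_metadata
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._current

    def commit(self, expected: str, new: str) -> None:
        # Compare-and-swap: succeed only if nobody else committed in between.
        with self._lock:
            if self._current != expected:
                raise CommitConflict(f"expected {expected}, found {self._current}")
            self._current = new


catalog = Catalog("v1.metadata.json")

# Writer A and writer B both start from the same base version.
base_a = catalog.current()
base_b = catalog.current()

catalog.commit(base_a, "v2.metadata.json")  # A commits first

try:
    catalog.commit(base_b, "v2b.metadata.json")  # B's base is stale
    b_committed = True
except CommitConflict:
    b_committed = False  # B would re-read the table state and retry
```

Because a commit is a single pointer swap, readers see either the old table state or the new one, and no file listing or rename is needed to publish data.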
Tencent’s Optimizations and Improvements
Implemented row‑level delete and update operations, greatly reducing data‑fixing overhead.
Adapted Spark 3.0 DataSource V2 for seamless Iceberg integration.
Added Flink support to enable data landing in Iceberg format.
These enhancements improve Iceberg’s usability within Tencent, and many of the internal improvements have been contributed back to the open‑source community.
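The source does not describe how Tencent's row‑level deletes are implemented. One common approach, shown here only as an illustrative sketch, is merge‑on‑read: deletes are recorded as (file, row position) pairs and filtered out when the data is scanned, so the data files themselves are never rewritten. All names and sample data below are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Immutable "data files": file name -> rows (hypothetical sample data).
data_files: Dict[str, List[dict]] = {
    "data-001": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "data-002": [{"id": 3, "name": "c"}],
}

# Position deletes: (file name, row index) pairs written by a delete operation.
position_deletes: Set[Tuple[str, int]] = set()


def delete_where_id(target_id: int) -> None:
    """Record deletes for matching rows without touching the data files."""
    for fname, rows in data_files.items():
        for pos, row in enumerate(rows):
            if row["id"] == target_id:
                position_deletes.add((fname, pos))


def scan() -> List[dict]:
    """Read all live rows, skipping positions marked as deleted."""
    return [
        row
        for fname, rows in data_files.items()
        for pos, row in enumerate(rows)
        if (fname, pos) not in position_deletes
    ]


delete_where_id(2)
live_ids = [row["id"] for row in scan()]
```

The delete touches only the small delete set; the cost of rewriting data files can be deferred to a later compaction.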
Typical Practices
Flink Integration at Ctrip/Elong
Pain point: Columnar ORC storage caused HDFS small‑file issues and slow queries.
Flink + Iceberg solution: After evaluating Delta, Iceberg, and Hudi, Iceberg was chosen for its deep Flink integration and low migration cost.
Original Hive SQL:
INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table
After migration to Iceberg, only the catalog changes:
INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table
Iceberg’s manifest files store partition and statistics information, enabling O(1) file location and dramatically faster queries (e.g., filter‑task time reduced from 61.5 h to 22 min).
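The speedup described above comes from being able to skip files using metadata alone. The sketch below illustrates the idea with per‑file partition values and min/max column bounds; the `DataFileEntry` structure, its field names, and the sample entries are simplified assumptions, not Iceberg's manifest schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFileEntry:
    path: str
    partition_date: str  # partition value recorded for this file
    min_id: int          # lower bound of column `id` in this file
    max_id: int          # upper bound of column `id` in this file


manifest: List[DataFileEntry] = [
    DataFileEntry("f1.orc", "2021-03-01", 1, 100),
    DataFileEntry("f2.orc", "2021-03-01", 101, 200),
    DataFileEntry("f3.orc", "2021-03-02", 1, 150),
]


def plan_files(entries: List[DataFileEntry], date: str, id_value: int) -> List[str]:
    """Return only the files that could contain rows with the given date and id."""
    return [
        e.path
        for e in entries
        if e.partition_date == date and e.min_id <= id_value <= e.max_id
    ]


# Equivalent to: WHERE date = '2021-03-01' AND id = 150
selected = plan_files(manifest, "2021-03-01", 150)
```

Only the files whose partition and bounds match are opened; the others are eliminated without any directory listing or file read.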
Flink CDC can write MySQL binlog directly to Iceberg, simplifying pipelines and reducing component maintenance.
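Small‑file compaction example:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.flink.actions.Actions;
// `table` is an org.apache.iceberg.Table already loaded from the catalog.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Rewrite many small data files into fewer, larger ones.
Actions.forTable(env, table)
    .rewriteDataFiles()
    .execute();
Batch job scheduling example:
/home/flink/bin/flink run -p 10 -m yarn-cluster /home/work/iceberg-scheduler.jar my.sql
Operational tasks include orphan file cleanup, snapshot expiration, and data management.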
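As a rough illustration of what snapshot expiration and file cleanup involve, the sketch below keeps only recent snapshots and then finds files on storage that no retained snapshot references. It is a toy model under stated assumptions: the snapshot layout, timestamps, and file names are hypothetical, and real maintenance jobs should use Iceberg's own maintenance actions rather than logic like this.

```python
from typing import Dict, List, Set, Tuple

# snapshot id -> (commit timestamp, data files referenced by that snapshot)
snapshots: Dict[int, Tuple[int, Set[str]]] = {
    1: (100, {"a.orc", "b.orc"}),
    2: (200, {"b.orc", "c.orc"}),
    3: (300, {"c.orc", "d.orc"}),
}

# Everything physically present under the table location (hypothetical listing).
files_on_storage: Set[str] = {"a.orc", "b.orc", "c.orc", "d.orc", "tmp-123.orc"}


def expire_snapshots(
    snaps: Dict[int, Tuple[int, Set[str]]], older_than: int
) -> Dict[int, Tuple[int, Set[str]]]:
    """Keep snapshots committed at or after `older_than`; always keep the latest."""
    latest = max(snaps)
    return {sid: s for sid, s in snaps.items() if s[0] >= older_than or sid == latest}


def find_unreferenced(
    snaps: Dict[int, Tuple[int, Set[str]]], on_storage: Set[str]
) -> List[str]:
    """Files on storage that no retained snapshot references."""
    referenced: Set[str] = set().union(*(files for _, files in snaps.values()))
    return sorted(on_storage - referenced)


retained = expire_snapshots(snapshots, older_than=200)
removable = find_unreferenced(retained, files_on_storage)
```

Here `a.orc` becomes removable because its only referencing snapshot expired, while `tmp-123.orc` was never referenced at all, which is the orphan‑file case (for example, output left behind by a failed write).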
Flink + Iceberg Real‑Time Warehouse at Qunar
Iceberg supports near‑real‑time data ingestion, read‑write separation, concurrent reads, incremental reads, and small‑file merging, enabling a unified batch‑stream analytics platform.
Key benefits of replacing Kafka with Iceberg:
Unified streaming and batch storage.
OLAP‑friendly middle layer.
Efficient time‑travel queries.
Reduced storage costs.
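Time travel and incremental reads both fall out of keeping an ordered log of snapshots. The following sketch models an append‑only table; the log structure, timestamps, and file names are hypothetical and much simpler than Iceberg's real metadata.

```python
from typing import List

# Ordered snapshot log: each snapshot records the files it added.
snapshot_log: List[dict] = [
    {"id": 1, "ts": 100, "added": ["f1"]},
    {"id": 2, "ts": 200, "added": ["f2", "f3"]},
    {"id": 3, "ts": 300, "added": ["f4"]},
]


def files_as_of(ts: int) -> List[str]:
    """Time travel: files visible as of the last snapshot committed at or before `ts`."""
    return [f for s in snapshot_log if s["ts"] <= ts for f in s["added"]]


def files_between(start_id: int, end_id: int) -> List[str]:
    """Incremental read: files added after `start_id`, up to and including `end_id`."""
    return [f for s in snapshot_log if start_id < s["id"] <= end_id for f in s["added"]]


as_of_250 = files_as_of(250)
incremental = files_between(1, 3)
```

A streaming consumer can remember the last snapshot id it processed and repeatedly ask only for what was added since, which is what lets a table stand in for a message queue in a near‑real‑time pipeline.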
Iceberg’s support for Alluxio caching further accelerates data‑lake access.
Iceberg 0.11 and Spark 3.0 Integration
The original walkthrough covers compiling Iceberg 0.11.1, installing the Spark runtime JAR, and configuring Spark to use Iceberg with either a Hadoop catalog or a Hive metastore catalog, along with the commands to create databases and tables and to change a table's file format.
Example to create an Iceberg table with ORC format:
CREATE TABLE iceberg_spark(id int, name string) USING iceberg TBLPROPERTIES ('write.format.default' = 'orc');
Metadata locations are shown via Hive configuration properties.
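The catalog configuration mentioned in this section can be sketched as a small helper that assembles the Spark properties for either catalog type. The property keys follow Iceberg's documented Spark configuration; the catalog names, metastore URI, and warehouse path are placeholders chosen for illustration and do not come from the source.

```python
from typing import Dict, List, Optional

ICEBERG_EXTENSIONS = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"


def iceberg_catalog_conf(
    name: str,
    catalog_type: str,
    uri: Optional[str] = None,
    warehouse: Optional[str] = None,
) -> Dict[str, str]:
    """Build Spark properties for an Iceberg catalog of type 'hive' or 'hadoop'."""
    if catalog_type not in ("hive", "hadoop"):
        raise ValueError("catalog_type must be 'hive' or 'hadoop'")
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,
    }
    if catalog_type == "hive" and uri:
        conf[f"{prefix}.uri"] = uri  # Hive metastore thrift endpoint
    if catalog_type == "hadoop":
        if not warehouse:
            raise ValueError("a hadoop catalog needs a warehouse path")
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


def to_cli_flags(conf: Dict[str, str]) -> List[str]:
    """Render properties as `--conf key=value` arguments for spark-sql or spark-shell."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Placeholder endpoints; replace with values from your own environment.
hive_conf = iceberg_catalog_conf("hive_prod", "hive", uri="thrift://metastore-host:9083")
hadoop_conf = iceberg_catalog_conf(
    "hadoop_prod", "hadoop", warehouse="hdfs://namenode:8020/warehouse/path"
)
flags = to_cli_flags(hadoop_conf)
```

With a catalog registered this way, tables are addressed as `<catalog>.<database>.<table>` in Spark SQL.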
Summary
Iceberg is rapidly evolving with contributions from major tech companies worldwide; its open, neutral architecture and continuous improvements make it a promising foundation for data‑lake solutions in both international and Chinese enterprises.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
