Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices
This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.
In recent years, growing demand for diverse big‑data storage and processing has pushed enterprises to build unified data‑lake storage that supports a range of analytical workloads. Apache Iceberg, an open‑source table format with ACID capabilities, has become a popular choice alongside Apache Hudi and Delta Lake.
Why choose Iceberg? Iceberg addresses several key pain points: it enables T+0 data landing, reduces the cost of data corrections, and provides an engine‑agnostic data organization that works with Flink, Hive, Spark, and other engines. Its elegant architecture, open format, and optimization for object storage make it attractive for large‑scale deployments.
Tencent’s optimizations and improvements include implementing row‑level delete and update, adapting Spark 3.0 DataSource V2, adding Flink support, and contributing these changes back to the community. Tencent also faces challenges like building upstream/downstream adapters, validating core maturity at massive scale, and integrating with existing data‑access solutions.
Typical practices
Flink + Iceberg at Tongcheng‑Elong
Faced with ORC‑based small‑file issues, the team migrated Hive SQL to Iceberg by simply changing the catalog, achieving faster queries thanks to Iceberg’s manifest‑based pruning. Example migration:
INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table
to
INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table
Additional optimizations such as small‑file compaction, limit and filter push‑down, and snapshot expiration were implemented, with code like:
// Compact small data files; `table` is assumed to be an already loaded org.apache.iceberg.Table.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Actions.forTable(env, table)
    .rewriteDataFiles()
    .execute();
Real‑time data‑warehouse construction
Iceberg supports read‑write separation, concurrent reads, incremental reads, and near‑real‑time visibility, which together enable a unified batch‑stream architecture. Because Iceberg's commit semantics make data visible to readers as soon as a commit completes, end‑to‑end latency drops from hours to minutes.
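The commit and incremental‑read behavior described above can be illustrated with a toy model. This is a minimal sketch and not the Iceberg API: the class SnapshotTableSketch, its methods, and the file names are all hypothetical, and a real table tracks far more metadata. It only shows the idea that a writer publishes a whole snapshot atomically and that a reader can ask for the files added since a snapshot it has already processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model (hypothetical, not the Iceberg API): a table as a list of immutable snapshots. */
final class SnapshotTableSketch {

    /** One committed snapshot: its id and the data files that the commit added. */
    static final class Snapshot {
        final long id;
        final List<String> addedFiles;

        Snapshot(long id, List<String> addedFiles) {
            this.id = id;
            this.addedFiles = List.copyOf(addedFiles);
        }
    }

    // Readers always observe a complete snapshot list; writers replace it in one atomic step.
    private final AtomicReference<List<Snapshot>> snapshots =
            new AtomicReference<>(List.of());

    /** Publishes a new snapshot atomically and retries if another writer committed first. */
    long commit(List<String> newFiles) {
        while (true) {
            List<Snapshot> current = snapshots.get();
            long nextId = current.size() + 1L;
            List<Snapshot> updated = new ArrayList<>(current);
            updated.add(new Snapshot(nextId, newFiles));
            if (snapshots.compareAndSet(current, List.copyOf(updated))) {
                return nextId;
            }
        }
    }

    /** Incremental read: files added by every snapshot newer than afterSnapshotId. */
    List<String> filesAddedAfter(long afterSnapshotId) {
        List<String> files = new ArrayList<>();
        for (Snapshot s : snapshots.get()) {
            if (s.id > afterSnapshotId) {
                files.addAll(s.addedFiles);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        SnapshotTableSketch table = new SnapshotTableSketch();
        long first = table.commit(List.of("part-0001.parquet"));
        table.commit(List.of("part-0002.parquet", "part-0003.parquet"));
        // A downstream job that already processed `first` reads only the newer files.
        System.out.println(table.filesAddedAfter(first));
    }
}
```

In this sketch a reader never sees a half‑written commit, because the snapshot list is swapped in a single compare‑and‑set. That is the property that lets batch and streaming jobs share one table.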
Iceberg + Spark 3 integration
The original walkthrough covers compiling Iceberg 0.11, adding iceberg-spark3-runtime-0.11.1.jar to Spark’s classpath, configuring the catalog (either Hadoop‑type or Hive‑metastore), and creating tables with the USING ICEBERG syntax, along with examples of altering a table's file format to ORC.
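As a rough illustration of the catalog configuration and table statements involved, the sketch below assembles the Spark properties and SQL strings for both catalog types. The property keys, the org.apache.iceberg.spark.SparkCatalog class, and the write.format.default table property come from Iceberg's Spark integration. The helper class IcebergSparkConfSketch, the catalog names, the metastore URI, the warehouse path, and the db.sample table are assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds Iceberg catalog settings for Spark 3. */
final class IcebergSparkConfSketch {

    /** Properties for a catalog backed by a Hive metastore. */
    static Map<String, String> hiveCatalog(String name, String metastoreUri) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog." + name, "org.apache.iceberg.spark.SparkCatalog");
        conf.put("spark.sql.catalog." + name + ".type", "hive");
        conf.put("spark.sql.catalog." + name + ".uri", metastoreUri);
        return conf;
    }

    /** Properties for a Hadoop-type catalog rooted at a warehouse directory. */
    static Map<String, String> hadoopCatalog(String name, String warehousePath) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog." + name, "org.apache.iceberg.spark.SparkCatalog");
        conf.put("spark.sql.catalog." + name + ".type", "hadoop");
        conf.put("spark.sql.catalog." + name + ".warehouse", warehousePath);
        return conf;
    }

    /** DDL that creates an Iceberg table in the given catalog (table name is illustrative). */
    static String createTableSql(String catalog) {
        return "CREATE TABLE " + catalog + ".db.sample (id bigint, data string) USING iceberg";
    }

    /** DDL that switches the default write format of the table to ORC. */
    static String useOrcSql(String catalog) {
        return "ALTER TABLE " + catalog + ".db.sample"
                + " SET TBLPROPERTIES ('write.format.default'='orc')";
    }

    public static void main(String[] args) {
        // Placeholder locations; replace them with the real metastore URI and warehouse path.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.putAll(hiveCatalog("hive_prod", "thrift://metastore-host:9083"));
        conf.putAll(hadoopCatalog("hadoop_prod", "hdfs://namenode:8020/warehouse/path"));
        for (Map.Entry<String, String> e : conf.entrySet()) {
            System.out.println("--conf " + e.getKey() + "=" + e.getValue());
        }
        System.out.println(createTableSql("hadoop_prod") + ";");
        System.out.println(useOrcSql("hadoop_prod") + ";");
    }
}
```

The printed settings can be passed to spark-sql or spark-shell as --conf flags, or set through SparkSession.builder().config(key, value), and the two statements can then be run in that session.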
Overall, Iceberg is rapidly evolving with contributions from major companies worldwide. Tencent’s experience demonstrates that with proper optimizations and community involvement, Iceberg can serve as a robust foundation for enterprise‑grade, real‑time data lakes.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
