Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Big Data Technology & Architecture

Nov 8, 2021

In recent years, growing demand for diverse big-data storage and processing has pushed enterprises to build unified data-lake storage that supports a range of analytical workloads. Apache Iceberg, an open-source table format with ACID guarantees, has become a popular choice alongside Apache Hudi and Delta Lake.

Why choose Iceberg? It addresses several key pain points: landing data on a T+0 (same-day) basis, reducing the cost of data corrections, and providing an engine-agnostic data organization that works with Flink, Hive, Spark, and other engines. Its elegant architecture, open format, and optimization for object storage make it attractive for large-scale deployments.

Tencent’s optimizations and improvements include implementing row-level delete and update, adapting to the Spark 3.0 DataSource V2 API, and adding Flink support, all of which were contributed back to the community. Tencent also faces challenges such as building upstream and downstream adapters, validating the maturity of the core at massive scale, and integrating with existing data-access solutions.

Typical practices

Flink + Iceberg at Tongcheng‑Elong

Faced with ORC‑based small‑file issues, the team migrated Hive SQL to Iceberg by simply changing the catalog, achieving faster queries thanks to Iceberg’s manifest‑based pruning. Example migration:

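-- Before: write the Kafka source table into the Hive table
INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table;

-- After: only the catalog and table names change
INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table;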
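
To make the pruning claim concrete, here is a minimal, self-contained sketch of the idea behind manifest-based pruning. Iceberg manifests record per-file column bounds, so a planner can skip files whose value range cannot match a filter without opening them. The class and field names below (ManifestPruningSketch, FileEntry, minTs, maxTs) are hypothetical and only model the concept; this is not Iceberg's actual planner code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of manifest-based pruning; not Iceberg's real classes.
final class ManifestPruningSketch {

    // One manifest entry: a data file plus min/max statistics for one column.
    static final class FileEntry {
        final String path;
        final long minTs;
        final long maxTs;

        FileEntry(String path, long minTs, long maxTs) {
            this.path = path;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Keep only files whose [minTs, maxTs] range overlaps the query range [lo, hi].
    static List<String> plan(List<FileEntry> manifest, long lo, long hi) {
        List<String> selected = new ArrayList<>();
        for (FileEntry f : manifest) {
            boolean overlaps = f.maxTs >= lo && f.minTs <= hi;
            if (overlaps) {
                selected.add(f.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileEntry> manifest = Arrays.asList(
            new FileEntry("data/part-0.orc", 0, 99),
            new FileEntry("data/part-1.orc", 100, 199),
            new FileEntry("data/part-2.orc", 200, 299));
        // Query: WHERE ts BETWEEN 150 AND 250. part-0 is skipped using metadata alone.
        System.out.println(plan(manifest, 150, 250)); // [data/part-1.orc, data/part-2.orc]
    }
}
```

The fewer files a query has to open, the less the small-file problem hurts, which is consistent with the speedup reported above.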

The team also implemented further optimizations, including small-file compaction, limit and filter push-down, and snapshot expiration. Small-file compaction, for example, uses code like:

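import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.flink.actions.Actions;

// `table` is an org.apache.iceberg.Table already loaded from the catalog.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Actions.forTable(env, table)
    .rewriteDataFiles()   // merge small data files into larger ones
    .execute();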
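
As a rough illustration of what a compaction pass has to decide, the sketch below greedily groups small files so that each group stays at or below a target size, with each group then rewritten as one larger file. This is a simplified stand-in, not the algorithm Iceberg's rewriteDataFiles action actually uses; the class name CompactionPlanSketch, the 128 MB target, and the file sizes are all assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical greedy grouping of small files; not Iceberg's real planner.
final class CompactionPlanSketch {

    // Group file sizes (in MB) in order, starting a new group whenever
    // adding the next file would push the group past targetMb.
    static List<List<Long>> group(List<Long> fileSizesMb, long targetMb) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentMb = 0;
        for (long size : fileSizesMb) {
            if (!current.isEmpty() && currentMb + size > targetMb) {
                groups.add(current);
                current = new ArrayList<>();
                currentMb = 0;
            }
            current.add(size);
            currentMb += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five small files are planned into two rewrite groups.
        List<Long> sizes = Arrays.asList(40L, 50L, 30L, 90L, 10L);
        System.out.println(group(sizes, 128L)); // [[40, 50, 30], [90, 10]]
    }
}
```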

Real‑time data‑warehouse construction

Iceberg’s support for read-write separation, concurrent reads, incremental reads, and near-real-time visibility enables a unified batch-stream architecture. Because newly written data becomes visible to readers as soon as a commit completes, end-to-end latency drops from hours to minutes.
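
The following sketch models, in a deliberately simplified way, how snapshot-based commits give readers both immediate visibility and incremental reads. Each commit appends a snapshot listing the files it added; a full scan reads every snapshot, while an incremental scan reads only snapshots newer than one the reader has already consumed. SnapshotTableSketch and its methods are hypothetical names, and the model is append-only (it ignores deletes, overwrites, and snapshot expiration).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical append-only model of snapshot commits; not Iceberg's real API.
final class SnapshotTableSketch {

    static final class Snapshot {
        final long id;
        final List<String> addedFiles;

        Snapshot(long id, List<String> addedFiles) {
            this.id = id;
            this.addedFiles = addedFiles;
        }
    }

    private final List<Snapshot> snapshots = new ArrayList<>();

    // A commit publishes a new snapshot; its files are invisible until this returns.
    synchronized long commit(List<String> newFiles) {
        long id = snapshots.size() + 1;
        snapshots.add(new Snapshot(id, new ArrayList<>(newFiles)));
        return id;
    }

    // Full scan: every file added by any committed snapshot.
    synchronized List<String> scan() {
        return incrementalScan(0);
    }

    // Incremental scan: only files added by snapshots newer than afterSnapshotId.
    synchronized List<String> incrementalScan(long afterSnapshotId) {
        List<String> files = new ArrayList<>();
        for (Snapshot s : snapshots) {
            if (s.id > afterSnapshotId) {
                files.addAll(s.addedFiles);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        SnapshotTableSketch table = new SnapshotTableSketch();
        long first = table.commit(Arrays.asList("f1", "f2"));
        table.commit(Arrays.asList("f3"));
        System.out.println(table.scan());                 // [f1, f2, f3]
        System.out.println(table.incrementalScan(first)); // [f3]
    }
}
```

A streaming consumer that remembers the last snapshot it processed only needs the incremental scan, which is what lets one table serve both batch and streaming readers.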

Iceberg + Spark 3 integration

The original article walks through the integration steps: compiling Iceberg 0.11, adding iceberg-spark3-runtime-0.11.1.jar to Spark’s classpath, configuring the catalog (either Hadoop-type or Hive-metastore-backed), and creating tables with the USING ICEBERG syntax. It also gives examples of altering a table’s file format to ORC.
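
As a hedged illustration of the catalog configuration, the sketch below assembles the Spark properties for a Hadoop-type and a Hive-metastore catalog and prints them as spark-sql flags, followed by example SQL. The property keys and the write.format.default table property follow Iceberg's documented Spark settings, but the catalog names (hadoop_prod, hive_prod), the warehouse path, the metastore URI, and the table db.sample are placeholders assumed for the example, not values from the original article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds illustrative Spark settings for Iceberg catalogs; all names are placeholders.
final class IcebergSparkConfSketch {

    static final String CREATE_TABLE =
        "CREATE TABLE hadoop_prod.db.sample (id BIGINT, data STRING) USING ICEBERG";
    static final String SWITCH_TO_ORC =
        "ALTER TABLE hadoop_prod.db.sample SET TBLPROPERTIES ('write.format.default' = 'orc')";

    // Hadoop-type catalog: table metadata lives under a warehouse directory.
    static Map<String, String> hadoopCatalog(String name, String warehouse) {
        Map<String, String> conf = new LinkedHashMap<>();
        String prefix = "spark.sql.catalog." + name;
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".type", "hadoop");
        conf.put(prefix + ".warehouse", warehouse);
        return conf;
    }

    // Hive-type catalog: tables are tracked in a Hive metastore.
    static Map<String, String> hiveCatalog(String name, String metastoreUri) {
        Map<String, String> conf = new LinkedHashMap<>();
        String prefix = "spark.sql.catalog." + name;
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".type", "hive");
        conf.put(prefix + ".uri", metastoreUri);
        return conf;
    }

    // Render the settings as --conf flags for spark-sql or spark-shell.
    static String toFlags(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append(" --conf ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> conf = hadoopCatalog("hadoop_prod", "hdfs://nn:8020/warehouse");
        conf.putAll(hiveCatalog("hive_prod", "thrift://metastore-host:9083"));
        System.out.println("spark-sql --jars iceberg-spark3-runtime-0.11.1.jar" + toFlags(conf));
        System.out.println(CREATE_TABLE + ";");
        System.out.println(SWITCH_TO_ORC + ";");
    }
}
```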

Overall, Iceberg is rapidly evolving with contributions from major companies worldwide. Tencent’s experience demonstrates that with proper optimizations and community involvement, Iceberg can serve as a robust foundation for enterprise‑grade, real‑time data lakes.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
