
Why Choose Apache Iceberg? Tencent’s Optimizations and Real‑World Practices

This article examines the strengths and weaknesses of Apache Iceberg, explains why Tencent selected it over alternatives, details Tencent’s own enhancements and integration with Flink, Spark, and other engines, and shares multiple real‑world implementations for building enterprise‑grade real‑time data lakes.

Big Data Technology & Architecture

In recent years, demand for diverse big-data storage and processing has pushed enterprises to build unified data-lake storage that supports a range of analytical workloads. Apache Iceberg, an open-source table format with ACID guarantees, has become a popular choice alongside Apache Hudi and Delta Lake.

Why choose Iceberg? Iceberg addresses several key pain points: it lets data land on the day it is produced (T+0), lowers the cost of correcting data, and provides an engine-agnostic table layout that works with Flink, Hive, Spark, and other engines. Its clean architecture, open format, and optimizations for object storage make it attractive for large-scale deployments.

Tencent's optimizations and improvements include implementing row-level delete and update, adapting Iceberg to the Spark 3.0 DataSource V2 API, and adding Flink support, all of which were contributed back to the community. Tencent also faces challenges such as building upstream and downstream adapters, validating the maturity of the core at massive scale, and integrating with existing data-access solutions.

Typical practices

Flink + Iceberg at Tongcheng‑Elong

Facing small-file problems with its ORC-based tables, the team migrated its Hive SQL jobs to Iceberg simply by changing the catalog, and queries became faster thanks to Iceberg's manifest-based pruning. Example migration:

INSERT INTO hive_catalog.db.hive_table SELECT * FROM kafka_table

to

INSERT INTO iceberg_catalog.db.iceberg_table SELECT * FROM kafka_table
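
The speed-up attributed to manifest-based pruning can be illustrated with a toy model. The sketch below is not Iceberg's implementation: the class `ManifestPruningSketch`, the `FileEntry` type, the column name `ts`, and the file names are all hypothetical. It only assumes that a manifest records a minimum and maximum value per data file for some column, so a planner can skip files whose range cannot match the query filter without opening them.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of manifest-based file pruning; these are not Iceberg's classes.
class ManifestPruningSketch {

    // One manifest entry: a data file path plus min/max stats for a "ts" column.
    static final class FileEntry {
        final String path;
        final long minTs;
        final long maxTs;

        FileEntry(String path, long minTs, long maxTs) {
            this.path = path;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    // Keep only files whose [minTs, maxTs] range overlaps the query range [lo, hi].
    static List<String> prune(List<FileEntry> manifest, long lo, long hi) {
        List<String> selected = new ArrayList<>();
        for (FileEntry f : manifest) {
            boolean overlaps = f.maxTs >= lo && f.minTs <= hi;
            if (overlaps) {
                selected.add(f.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileEntry> manifest = new ArrayList<>();
        manifest.add(new FileEntry("part-0.orc", 0, 99));
        manifest.add(new FileEntry("part-1.orc", 100, 199));
        manifest.add(new FileEntry("part-2.orc", 200, 299));
        // Filter: ts BETWEEN 150 AND 180, so only part-1.orc has to be read.
        System.out.println(prune(manifest, 150, 180)); // prints [part-1.orc]
    }
}
```

The point of the sketch is that the decision is made from metadata alone, which is why planning cost depends on the number of manifest entries rather than on directory listings or file contents.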

Additional optimizations were implemented as well, including small-file compaction, limit and filter push-down, and snapshot expiration. Small-file compaction, for example, can be triggered through a Flink action:

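import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.flink.actions.Actions;

// `table` is an org.apache.iceberg.Table that has already been loaded from the catalog.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Actions.forTable(env, table)
    .rewriteDataFiles()  // compact small data files into larger ones
    .execute();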
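
To give a feel for what a compaction pass has to decide, here is a minimal sketch of grouping small files into rewrite batches up to a target size. It is a simplified greedy grouping written for illustration, not the planning logic Iceberg uses; the class `CompactionPlanSketch`, the method `plan`, and the sizes are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy compaction planner: groups file sizes into batches no larger than a target.
class CompactionPlanSketch {

    // Walk the files in order; start a new group when adding the next file
    // would push the current group past targetBytes.
    static List<List<Long>> plan(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five small files (sizes in MB for readability) and a 100 MB target.
        List<Long> sizes = Arrays.asList(40L, 30L, 50L, 20L, 90L);
        System.out.println(plan(sizes, 100)); // prints [[40, 30], [50, 20], [90]]
    }
}
```

Each resulting group would be rewritten as one larger file, so five input files become three in this example.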

Real‑time data‑warehouse construction

Iceberg's support for read-write separation, concurrent reads, incremental reads, and near-real-time visibility enables a unified batch-stream architecture. Because each Iceberg commit atomically publishes a new snapshot, newly written data becomes visible to readers as soon as the commit completes, reducing end-to-end latency from hours to minutes.
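
The commit and incremental-read behaviour described above can be sketched with a small in-memory model. This is an illustration under simplifying assumptions, not Iceberg's API: the class `SnapshotTableSketch`, its methods, and the sequential snapshot ids are hypothetical. The idea it shows is that writers publish a new immutable snapshot with a single atomic swap, readers always see a complete snapshot, and a downstream job can ask only for files added since the snapshot it last processed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of snapshot commits and incremental reads; not Iceberg's API.
class SnapshotTableSketch {

    // An immutable snapshot: an id plus the data files it added.
    static final class Snapshot {
        final long id;
        final List<String> addedFiles;

        Snapshot(long id, List<String> addedFiles) {
            this.id = id;
            this.addedFiles = Collections.unmodifiableList(new ArrayList<>(addedFiles));
        }
    }

    // Current table state: an immutable snapshot list, replaced atomically on commit.
    private final AtomicReference<List<Snapshot>> state =
            new AtomicReference<>(Collections.<Snapshot>emptyList());

    // Commit: build the next state and swap it in; retry if another writer won the race.
    long commit(List<String> newFiles) {
        while (true) {
            List<Snapshot> base = state.get();
            long id = base.size() + 1;
            List<Snapshot> next = new ArrayList<>(base);
            next.add(new Snapshot(id, newFiles));
            if (state.compareAndSet(base, Collections.unmodifiableList(next))) {
                return id;
            }
        }
    }

    // Incremental read: files added by snapshots newer than afterSnapshotId.
    List<String> filesAddedAfter(long afterSnapshotId) {
        List<String> files = new ArrayList<>();
        for (Snapshot s : state.get()) {
            if (s.id > afterSnapshotId) {
                files.addAll(s.addedFiles);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        SnapshotTableSketch table = new SnapshotTableSketch();
        long first = table.commit(Arrays.asList("f1.parquet", "f2.parquet"));
        table.commit(Arrays.asList("f3.parquet"));
        // A consumer that already processed the first snapshot only sees the new file.
        System.out.println(table.filesAddedAfter(first)); // prints [f3.parquet]
    }
}
```

In this model, visibility latency is bounded by how often writers commit, which matches the move from hourly batches to minute-level freshness.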

Iceberg + Spark 3 integration

The original article walks through the setup: compile Iceberg 0.11, add iceberg-spark3-runtime-0.11.1.jar to Spark's classpath, configure a catalog (either Hadoop-type or Hive-metastore-backed), and create tables with the USING ICEBERG syntax. It also shows how to switch a table's file format to ORC.
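
As a rough sketch of the catalog step, the helper below assembles the Spark settings for both catalog types as `--conf` arguments and prints the kind of SQL the summary mentions. The property keys (`spark.sql.catalog.<name>`, `.type`, `.uri`, `.warehouse`), the `org.apache.iceberg.spark.SparkCatalog` class, and the `write.format.default` table property come from Iceberg's Spark documentation. The class `IcebergSparkConfSketch`, the catalog names, the metastore URI, the warehouse path, and the table `db.sample` are placeholders, not values from the original article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds Spark --conf arguments for Iceberg catalogs; all concrete values are placeholders.
class IcebergSparkConfSketch {

    // Settings for an Iceberg catalog backed by a Hive metastore.
    static Map<String, String> hiveCatalog(String name, String metastoreUri) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog." + name, "org.apache.iceberg.spark.SparkCatalog");
        conf.put("spark.sql.catalog." + name + ".type", "hive");
        conf.put("spark.sql.catalog." + name + ".uri", metastoreUri);
        return conf;
    }

    // Settings for a Hadoop-type catalog rooted at a warehouse directory.
    static Map<String, String> hadoopCatalog(String name, String warehousePath) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.catalog." + name, "org.apache.iceberg.spark.SparkCatalog");
        conf.put("spark.sql.catalog." + name + ".type", "hadoop");
        conf.put("spark.sql.catalog." + name + ".warehouse", warehousePath);
        return conf;
    }

    // Render settings as the argument list passed to spark-sql or spark-submit.
    static List<String> toConfArgs(Map<String, String> conf) {
        List<String> args = new ArrayList<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            args.add("--conf");
            args.add(e.getKey() + "=" + e.getValue());
        }
        return args;
    }

    public static void main(String[] args) {
        // Placeholder endpoints; replace with the real metastore URI and warehouse path.
        System.out.println(String.join(" ",
                toConfArgs(hiveCatalog("hive_prod", "thrift://metastore-host:9083"))));
        System.out.println(String.join(" ",
                toConfArgs(hadoopCatalog("hadoop_prod", "hdfs://namenode:8020/warehouse"))));

        // SQL to run once a catalog is configured.
        System.out.println(
                "CREATE TABLE hadoop_prod.db.sample (id BIGINT, data STRING) USING iceberg;");
        System.out.println(
                "ALTER TABLE hadoop_prod.db.sample SET TBLPROPERTIES ('write.format.default'='orc');");
    }
}
```

The same key-value pairs can be placed in spark-defaults.conf instead of being passed on the command line.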

Overall, Iceberg is rapidly evolving with contributions from major companies worldwide. Tencent’s experience demonstrates that with proper optimizations and community involvement, Iceberg can serve as a robust foundation for enterprise‑grade, real‑time data lakes.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
