Apache Iceberg at Tencent: Architecture, Spark Read/Write, Production Practices, and Data Governance
This article presents an in‑depth overview of Apache Iceberg as used at Tencent, covering its table format architecture, Spark read/write mechanisms, production challenges and optimizations such as schema evolution, file filtering, upsert strategies, and the surrounding data‑governance services.
Introduction – The talk introduces Iceberg as an open table format whose core library is implemented in Java, highlighting key features such as transactional support, strong scalability, schema and partition evolution, and storage abstraction across HDFS and object stores.
Iceberg Table Structure – Iceberg tables consist of three layers: the catalog layer (pointers to storage locations with atomic updates via HiveCatalog, HadoopCatalog, JDBCCatalog), the metadata layer (JSON metadata files, ManifestList, ManifestFile, DataFile), and the data layer (actual Parquet/ORC/Avro files with min‑max and partition statistics).
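The three layers chain together as a series of pointers: the catalog points at the current metadata file, the metadata file points at a snapshot's manifest list, and each manifest enumerates data files. The following is a minimal illustrative sketch of that resolution path, using simplified in-memory stand-ins (all file names and structures here are invented for illustration, not Iceberg's real formats):

```python
# Illustrative sketch (not Iceberg's API): how the catalog, metadata, and
# data layers chain together. All names and structures are simplified stand-ins.

catalog = {"db.events": "metadata/v3.metadata.json"}  # catalog: table name -> current metadata pointer

metadata_files = {
    "metadata/v3.metadata.json": {
        "current-snapshot-id": 42,
        "snapshots": {42: "metadata/snap-42.manifest-list.avro"},
    }
}

manifest_lists = {
    "metadata/snap-42.manifest-list.avro": ["metadata/m1.avro", "metadata/m2.avro"]
}

manifest_files = {
    "metadata/m1.avro": ["data/part-00.parquet", "data/part-01.parquet"],
    "metadata/m2.avro": ["data/part-02.parquet"],
}

def data_files_for(table_name):
    """Resolve catalog pointer -> metadata file -> manifest list -> manifests -> data files."""
    meta = metadata_files[catalog[table_name]]
    manifest_list = manifest_lists[meta["snapshots"][meta["current-snapshot-id"]]]
    return [f for manifest in manifest_list for f in manifest_files[manifest]]

print(data_files_for("db.events"))
# ['data/part-00.parquet', 'data/part-01.parquet', 'data/part-02.parquet']
```

Because every layer is immutable and reached through the single catalog pointer, swapping that one pointer is all it takes to publish a new table version.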
ACID Guarantees – ACID properties are enforced by the catalog at commit time: each commit atomically swaps the table's metadata pointer, producing versioned snapshots that enable time‑travel reads and let readers and writers proceed concurrently without interfering with one another.
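Since all table state hangs off one metadata pointer, atomicity reduces to a compare-and-swap on that pointer: a writer's commit succeeds only if no one else committed since it read the table. This is a minimal sketch of that optimistic pattern (the `Catalog` class and pointer names are hypothetical stand-ins, not Iceberg's implementation):

```python
import threading

# Sketch of optimistic, catalog-level commits: a writer prepares a new
# snapshot, then atomically swaps the metadata pointer only if the pointer
# still matches what the writer originally read.

class Catalog:
    def __init__(self, initial_pointer):
        self._pointer = initial_pointer
        self._lock = threading.Lock()  # stands in for the catalog's atomic primitive

    def current(self):
        return self._pointer

    def commit(self, expected, new_pointer):
        """Compare-and-swap: succeeds only if no other writer committed in between."""
        with self._lock:
            if self._pointer != expected:
                return False  # conflict: caller must re-read metadata and retry
            self._pointer = new_pointer
            return True

catalog = Catalog("v1.metadata.json")
base = catalog.current()
assert catalog.commit(base, "v2.metadata.json")        # writer A wins
assert not catalog.commit(base, "v2b.metadata.json")   # writer B's base is stale, must retry
assert catalog.commit(catalog.current(), "v3.metadata.json")  # retry against fresh base
```

Readers are unaffected by in-flight commits: they keep reading the snapshot the pointer referenced when their scan began.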
Schema and Partition Evolution – Iceberg supports schema evolution (add/delete/modify columns) and hidden partition evolution, allowing users to change partition specs without altering query statements, with Spark automatically applying appropriate partition filters.
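The key idea behind hidden partitioning is that the partition value is *derived* from a source column through a transform recorded in the table spec, so a predicate on the raw column can be rewritten into a partition filter automatically. A minimal sketch of a `days()`-style transform (the file list and function names here are illustrative, not Iceberg's transform API):

```python
from datetime import datetime, timezone

# Sketch of hidden partitioning: the table spec records a transform (here,
# "day of the event timestamp"); the engine derives partition filters from
# row-level predicates, so queries never mention partition columns.

def days_transform(ts):
    """Iceberg-style days() transform: days since the Unix epoch (naive ts treated as UTC)."""
    return int(ts.replace(tzinfo=timezone.utc).timestamp() // 86400)

# Files recorded with the partition value produced by the transform.
files = [
    {"path": "data/d19723.parquet", "partition_day": 19723},  # 2024-01-01
    {"path": "data/d19724.parquet", "partition_day": 19724},  # 2024-01-02
]

def plan_scan(predicate_ts):
    """A filter on the raw timestamp is rewritten into a partition filter."""
    wanted_day = days_transform(predicate_ts)
    return [f["path"] for f in files if f["partition_day"] == wanted_day]

print(plan_scan(datetime(2024, 1, 1, 12, 30)))  # only the 2024-01-01 file is scanned
```

Because queries filter on the source column rather than a physical partition column, the table owner can later switch the transform (say, from days to hours) without breaking any existing SQL.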
Spark Integration – Spark reads and writes Iceberg via the DataSourceV2 API. Writes generate WriteTasks that produce DataFiles, which are aggregated into ManifestFiles and ManifestLists. Reads leverage metadata for efficient file filtering based on partition summaries and min‑max statistics, supporting parallel scans and projection handling.
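The min-max pruning on the read path can be sketched in a few lines: each DataFile entry carries per-column lower/upper bounds, and the planner drops any file whose range cannot satisfy the predicate before Spark ever opens it. This is an illustrative simplification (in real Iceberg the statistics live in manifest entries, and the structures below are invented for the example):

```python
# Sketch of metadata-driven file pruning: each DataFile entry carries
# min/max statistics per column, and the planner skips files whose value
# range cannot match the predicate.

data_files = [
    {"path": "data/a.parquet", "min": {"user_id": 1},   "max": {"user_id": 100}},
    {"path": "data/b.parquet", "min": {"user_id": 101}, "max": {"user_id": 500}},
    {"path": "data/c.parquet", "min": {"user_id": 501}, "max": {"user_id": 900}},
]

def prune(column, value):
    """Keep only files whose [min, max] range could contain `column == value`."""
    return [
        f["path"]
        for f in data_files
        if f["min"][column] <= value <= f["max"][column]
    ]

print(prune("user_id", 250))  # only data/b.parquet can contain user_id == 250
```

Partition summaries in the manifest list let the planner discard whole manifests the same way before it even reads their file entries, which is what keeps planning cheap on very large tables.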
Upsert Strategies – Two upsert models are discussed: Copy‑On‑Write (COW), which rewrites the affected data files on every update, and Merge‑On‑Read (MOR), which uses DeleteFiles (Position Delete and Equality Delete) to mark rows for removal at read time without rewriting entire files.
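The MOR read path can be illustrated with a small sketch: rows are emitted from the data file unless a position delete names their ordinal in that file, or an equality delete matches their column values (the structures below are simplified stand-ins, not Iceberg's delete-file formats):

```python
# Sketch of Merge-On-Read: data files are left untouched, and reads subtract
# rows marked by DeleteFiles. Position deletes name (file, row ordinal);
# equality deletes name column values.

data_file = {
    "path": "data/part-00.parquet",
    "rows": [
        {"id": 1, "name": "a"},
        {"id": 2, "name": "b"},
        {"id": 3, "name": "c"},
    ],
}

position_deletes = [("data/part-00.parquet", 0)]  # delete row 0 of that file
equality_deletes = [{"id": 3}]                    # delete any row with id == 3

def read_with_deletes(df):
    dead_positions = {pos for path, pos in position_deletes if path == df["path"]}
    survivors = []
    for pos, row in enumerate(df["rows"]):
        if pos in dead_positions:
            continue
        if any(all(row[k] == v for k, v in eq.items()) for eq in equality_deletes):
            continue
        survivors.append(row)
    return survivors

print(read_with_deletes(data_file))  # [{'id': 2, 'name': 'b'}]
```

The trade-off follows directly: MOR makes writes cheap but taxes every read with this merge, so deferred compaction is needed to fold deletes back into data files; COW pays the rewrite cost up front and keeps reads plain.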
Production Optimizations – Several real‑world challenges and solutions are described: handling wide tables by committing ManifestFiles incrementally, auto‑merge‑schema for frequent schema changes, schema‑aware file filtering, Z‑Order layout optimization, Parquet Bloom filters, Iceberg‑specific indexes, vectorized read improvements for Decimal types, multi‑threaded planning, and view support.
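Of the optimizations listed, Z‑Order is the most compact to illustrate: interleaving the bits of two sort keys yields a single ordering (a Morton code) that keeps rows clustered in both dimensions at once, so min-max pruning works on either column. A minimal sketch of the bit interleaving (illustrative only, not Iceberg's implementation):

```python
# Sketch of Z-Order (Morton) layout: interleaving the bits of two sort keys
# produces one ordering that preserves locality in both dimensions, which
# improves min-max file pruning on either column.

def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into a Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x contributes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y contributes the odd bit positions
    return z

rows = [(3, 5), (0, 0), (7, 1), (2, 2)]
rows.sort(key=lambda r: z_value(*r))
print(rows)  # [(0, 0), (2, 2), (7, 1), (3, 5)]
```

Sorting data files by such a code before writing means each file covers a tight rectangle of (x, y) values, so its min-max statistics stay selective for filters on either key.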
Data Governance Services – A suite of services built around Iceberg includes automatic compaction, snapshot expiration, clustering (Z‑Order), column lifecycle management (dropping unused columns), and monitoring via MQ/CDC pipelines, providing users with transparent management of storage size and performance.
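Snapshot expiration is representative of these background services: drop snapshots older than a retention window, but always keep the current one so normal reads and in-window time travel keep working. A simplified sketch of that retention logic (the retention period, snapshot records, and function names here are invented for illustration):

```python
from datetime import datetime, timedelta

# Sketch of automated snapshot expiration: remove snapshots older than a
# retention window, but never the current snapshot.

RETENTION = timedelta(days=7)  # assumed policy for this example

snapshots = [
    {"id": 1, "ts": datetime(2024, 1, 1)},
    {"id": 2, "ts": datetime(2024, 1, 10)},
    {"id": 3, "ts": datetime(2024, 1, 15)},
]
current_snapshot_id = 3

def expire(snapshots, now):
    cutoff = now - RETENTION
    return [
        s for s in snapshots
        if s["ts"] >= cutoff or s["id"] == current_snapshot_id
    ]

kept = expire(snapshots, now=datetime(2024, 1, 16))
# cutoff = 2024-01-09: snapshot 1 expires; snapshots 2 and 3 survive
print([s["id"] for s in kept])
```

Once expired snapshots are removed from the metadata, any data and delete files they alone referenced become unreachable and can be physically deleted, which is how these services keep storage size bounded transparently.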
Conclusion – The presentation summarizes the practical experience of deploying Iceberg at Tencent, emphasizing the importance of metadata‑driven optimizations and automated governance to achieve scalable, efficient lakehouse solutions.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.