Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans
This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.
Iceberg is a table format built for large‑scale analytical data that abstracts storage files (Parquet, ORC, Avro) from compute engines like Spark, Flink, Trino, Hive, and Presto, allowing flexible storage choices (S3, OSS, HDFS) while hiding file‑level details from users.
The format organizes data in four layers: data files, manifest files that map files to partitions, snapshots that capture the state after each commit, and metadata files that track snapshots, enabling time‑travel queries.
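The four-layer design can be sketched in a few lines. This is a hypothetical, heavily simplified model (real Iceberg metadata also tracks schemas, partition specs, and column statistics), but it shows how keeping every snapshot in the table metadata makes time-travel a lookup rather than a restore:

```python
from dataclasses import dataclass, field
from typing import List

# Simplified sketch of Iceberg's layering:
# data files -> manifests -> snapshots -> table metadata.

@dataclass
class DataFile:
    path: str
    partition: str          # e.g. "dt=2023-05-01"

@dataclass
class Manifest:
    data_files: List[DataFile]

@dataclass
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    manifests: List[Manifest]

@dataclass
class TableMetadata:
    snapshots: List[Snapshot] = field(default_factory=list)

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def as_of(self, timestamp_ms: int) -> Snapshot:
        """Time travel: latest snapshot committed at or before the timestamp."""
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not eligible:
            raise ValueError("no snapshot at or before that time")
        return max(eligible, key=lambda s: s.timestamp_ms)

# Each commit appends a new snapshot; old ones stay readable until expired.
meta = TableMetadata()
meta.snapshots.append(Snapshot(1, 1000, [Manifest([DataFile("f1.parquet", "dt=2023-05-01")])]))
meta.snapshots.append(Snapshot(2, 2000, [Manifest([DataFile("f1.parquet", "dt=2023-05-01")]),
                                         Manifest([DataFile("f2.parquet", "dt=2023-05-02")])]))

assert meta.as_of(1500).snapshot_id == 1   # the table as it was before the second commit
assert meta.current.snapshot_id == 2
```

Because readers always resolve a single snapshot first, writers can commit concurrently without disturbing in-flight reads, which is the basis of the isolation guarantee described below.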
Iceberg provides three major benefits: transactional guarantees that prevent dirty writes and give read/write isolation via snapshots; implicit partitioning with transform functions and partition evolution, so users no longer need to filter on explicit partition columns; and row‑level updates through two table versions, V1 (Copy‑On‑Write) and V2 (Merge‑On‑Read, which supports MERGE INTO along with position‑delete and equality‑delete files).
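The Merge‑On‑Read mechanism behind V2 tables can be illustrated with a toy reader. The structures below (a row list, a set of row positions, a set of key values) are illustrative stand-ins, not Iceberg's actual file formats; the point is that deletes are applied as a filter at scan time instead of rewriting the data file:

```python
# Hypothetical sketch of a Merge-On-Read (V2) scan applying delete files.
# A data file is a list of rows (primary_key, value); position = index in file.
data_file = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

# Position delete: identifies a row by (file, row position); here, position 1.
position_deletes = {1}

# Equality delete: identifies rows by column value; here, any row with key 4.
equality_deletes = {4}

def read_with_deletes(rows, pos_deletes, eq_deletes):
    """Merge-On-Read: filter out deleted rows at read time, leaving the data file intact."""
    return [row for pos, row in enumerate(rows)
            if pos not in pos_deletes and row[0] not in eq_deletes]

print(read_with_deletes(data_file, position_deletes, equality_deletes))
# -> [(1, 'a'), (3, 'c')]
```

Copy‑On‑Write (V1) would instead rewrite the whole data file without the deleted rows: cheaper for readers, more expensive for frequent small updates such as CDC streams.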
At Xiaomi, more than 4,000 Iceberg tables (≈8 PB) are in production, with over 1,000 V1 tables for log scenarios and 3,000+ V2 tables for CDC use cases. Data is ingested via Flink CDC pipelines that capture MySQL, Oracle, and TiDB changes, route them through MQ, and write to Iceberg V2 tables.
Key practical insights include:
Choosing between bucket and truncate partitioning for auto‑increment primary keys: both options are offered to users, but truncate is recommended for high‑cardinality, monotonically increasing IDs because it avoids the compaction overhead and unbounded partition‑size growth that bucket incurs.
Implementing a product‑level UI that maps MySQL schemas to Iceberg schemas and supports custom business keys.
Log ingestion benefits from implicit partitioning, which eliminates data drift around midnight (late events landing in the wrong day's partition) and improves handling of delayed TV telemetry.
Compaction, snapshot expiration, and orphan‑file cleaning services run as Spark jobs to manage small files and metadata bloat.
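The bucket-vs-truncate recommendation above can be made concrete with a small sketch. The hash below is a stand-in (Iceberg's bucket transform actually uses a 32-bit Murmur3 hash), but the effect is the same: bucket scatters a batch of sequential IDs across many partitions, while truncate keeps them in one, bounding how many partitions each commit and compaction run must touch:

```python
import hashlib

def bucket(n, value):
    # Stand-in hash; real Iceberg uses Murmur3_32 for the bucket transform.
    h = int.from_bytes(hashlib.md5(str(value).encode()).digest()[:4], "big")
    return h % n

def truncate(width, value):
    # Iceberg's truncate transform for integers: round down to a multiple of width.
    return value - (value % width)

new_ids = range(1_000_000, 1_000_010)   # ten freshly inserted auto-increment IDs

bucket_partitions = {bucket(16, i) for i in new_ids}
truncate_partitions = {truncate(100_000, i) for i in new_ids}

# bucket scatters the small batch across many partitions (many small files);
# truncate keeps sequential IDs together, so only the newest partition grows.
print(len(bucket_partitions), len(truncate_partitions))
```

For an append-mostly auto-increment key, truncate means each ingestion cycle writes small files into a single "hot" partition, so compaction can focus there instead of rewriting files in every bucket.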
For stream‑batch integration, Xiaomi replaces the traditional Lambda architecture with a unified Iceberg storage layer across ODS, DWD, and DWS. This eliminates the need for separate Hive and Kafka storage, allowing Flink to serve both real‑time and batch workloads, and supports back‑fill and change‑stream queries.
Data correction is handled via two approaches: overwrite (partition‑level replacement) and merge into (incremental upserts). Overwrite is simple and fast but can affect downstream real‑time jobs; merge into updates only changed records, reducing downstream impact but incurring join overhead.
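The trade-off between the two correction strategies can be sketched in plain Python. The table model below ({primary_key: (partition, value)}) is illustrative, not Iceberg's API, but it shows why overwrite disturbs every row in the partition while merge into touches only the updated keys:

```python
# Toy table: {primary_key: (partition, value)}
table = {1: ("p1", "old"), 2: ("p1", "old"), 3: ("p2", "old")}

def overwrite_partition(table, partition, new_rows):
    """Partition-level replacement: drop every row in the partition, insert the new set.
    Simple and fast, but downstream real-time consumers see the whole partition change."""
    kept = {k: v for k, v in table.items() if v[0] != partition}
    kept.update(new_rows)
    return kept

def merge_into(table, updates):
    """Key-based upsert: only touched keys change, reducing downstream impact,
    at the cost of joining the table against the update set."""
    merged = dict(table)
    merged.update(updates)
    return merged

after_overwrite = overwrite_partition(table, "p1", {1: ("p1", "fixed")})
after_merge = merge_into(table, {1: ("p1", "fixed")})

assert after_overwrite == {1: ("p1", "fixed"), 3: ("p2", "old")}   # key 2 was lost
assert after_merge == {1: ("p1", "fixed"), 2: ("p1", "old"), 3: ("p2", "old")}
```

Note the footgun the assertions expose: overwrite replaces the entire partition, so any row the correction job fails to re-supply (key 2 here) silently disappears, whereas merge into preserves untouched rows.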
Future work includes continued support for Flink CDC 2.0, optimizing compaction services to reduce resource consumption, and aligning Iceberg with Flink 1.14 to improve back‑pressure handling.
The Q&A section addresses common questions such as rolling back to a snapshot (rollback), handling upsert‑generated small files, partition vs. primary‑key updates, schema evolution, and integration with other storage engines.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.