
Iceberg Data Lake: Technology Overview, Xiaomi Practices, and Stream‑Batch Integration

This article presents an overview of the Iceberg table format, its core architecture and advantages, details Xiaomi’s large‑scale deployment and use cases, explores stream‑batch integration with Spark and Flink, outlines data correction methods, future plans, and answers common technical questions.

DataFunTalk

Iceberg is a table format for large analytical datasets that decouples the underlying storage files (Parquet, ORC, Avro) from compute engines such as Spark, Flink, Hive, and Trino, enabling flexible storage choices while providing transactional guarantees, implicit partitioning, and row‑level updates.

The architecture consists of four levels, from bottom to top: data files, manifest files (which track data files along with per‑file statistics), snapshots (each a consistent view of the table at one commit), and table metadata, enabling versioned reads and time travel.
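The four levels above can be sketched in plain Python (not the Iceberg library; real Iceberg also places a manifest list between the snapshot and its manifests, which this toy model omits). Keeping every committed snapshot in the table metadata is what makes time travel possible:

```python
# Toy model of Iceberg's metadata hierarchy and time travel.
from dataclasses import dataclass, field

@dataclass
class DataFile:            # level 1: an actual Parquet/ORC/Avro file
    path: str

@dataclass
class Manifest:            # level 2: lists data files (plus stats, in real Iceberg)
    data_files: list

@dataclass
class Snapshot:            # level 3: a consistent view of the table at one commit
    snapshot_id: int
    manifests: list

@dataclass
class TableMetadata:       # level 4: current pointer plus full snapshot history
    snapshots: list = field(default_factory=list)
    current_id: int = -1

    def commit(self, manifests):
        sid = len(self.snapshots)
        self.snapshots.append(Snapshot(sid, manifests))
        self.current_id = sid

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one (time travel)."""
        sid = self.current_id if snapshot_id is None else snapshot_id
        snap = self.snapshots[sid]
        return [f.path for m in snap.manifests for f in m.data_files]

table = TableMetadata()
table.commit([Manifest([DataFile("a.parquet")])])
table.commit([Manifest([DataFile("a.parquet")]), Manifest([DataFile("b.parquet")])])
table.read()                # current view: a.parquet and b.parquet
table.read(snapshot_id=0)   # time travel: only a.parquet
```

Because each commit produces a new snapshot rather than mutating the old one, readers pinned to an earlier snapshot are never affected by concurrent writes.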

Key advantages include transactionality to avoid dirty writes, snapshot‑based read/write isolation, implicit partitioning with partition evolution, and support for both Copy‑On‑Write (V1) and Merge‑On‑Read (V2) update models.
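The contrast between the two update models can be shown with a toy row‑level delete in plain Python (real Iceberg V2 records deletes as positional or equality delete files alongside the data files; the in‑memory lists here are illustrative):

```python
# Copy-On-Write vs Merge-On-Read for a single row-level delete.

def copy_on_write_delete(data_file, key):
    # COW (V1): the write path pays the cost -- rewrite the whole file
    # without the deleted row.
    return [row for row in data_file if row["id"] != key]

def merge_on_read_delete(delete_file, key):
    # MOR (V2): the write path just records the delete cheaply...
    delete_file.append(key)

def merge_on_read_scan(data_file, delete_file):
    # ...and the read path merges data files with delete files.
    return [row for row in data_file if row["id"] not in delete_file]

data = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

# Both models produce the same logical result:
cow_result = copy_on_write_delete(data, 1)

deletes = []
merge_on_read_delete(deletes, 1)
mor_result = merge_on_read_scan(data, deletes)
```

This is why V2 (Merge‑On‑Read) suits high‑frequency CDC upserts, while V1 (Copy‑On‑Write) keeps reads cheap for append‑mostly log data.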

At Xiaomi, over 4,000 Iceberg tables (~8 PB) are deployed, with V1 tables for log data and V2 tables for CDC streams. Data is ingested via Flink CDC from MySQL/Oracle/TiDB through MQ into Iceberg V2 tables, enabling near‑real‑time analytics, schema evolution, and cost‑effective storage replacement for systems like Kudu.

Two partitioning strategies are offered: bucket and truncate, with truncate preferred for auto‑increment primary keys to reduce small‑file issues and improve compaction.
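The two transforms behave very differently on auto‑increment keys, which is the reason for the preference above. A minimal sketch (hedged: real Iceberg buckets with a Murmur3 hash; Python's built‑in `hash` stands in here):

```python
# Sketch of Iceberg's bucket and truncate partition transforms.

def bucket(n, key):
    # bucket(N): hash the key, mod N -- spreads keys uniformly, so
    # consecutive auto-increment ids scatter across N buckets and a
    # single commit touches many partitions (many small files).
    return hash(key) % n

def truncate(width, key):
    # truncate(W): integer keys round down to a multiple of W -- consecutive
    # auto-increment ids stay in one partition, producing fewer, larger
    # files that are easier to compact.
    return key - (key % width)

ids = [1001, 1002, 1003, 1004]
[truncate(1000, i) for i in ids]   # all four ids land in partition 1000
```

With `bucket`, those same four ids would typically land in four different buckets, which is exactly the small‑file pattern truncate avoids.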

Log ingestion benefits from implicit partitioning, delayed‑data handling, Flink exactly‑once guarantees, and schema synchronization.

Background services include Compaction, Expire Snapshots, and Orphan Files cleaning to manage file proliferation.
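The core of a compaction pass is a bin‑packing plan: group small files until a target size is reached, so many small files are rewritten into a few near‑target‑size files. A minimal sketch (the target size, greedy ordering, and function name are illustrative, not Xiaomi's actual service):

```python
# Greedy bin-pack compaction plan: each output group's total size
# stays at or below target_mb where possible.

def plan_compaction(file_sizes_mb, target_mb=128):
    groups, current, total = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if total + size > target_mb and current:
            groups.append(current)       # close the full group
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups

small_files = [10, 20, 90, 40, 60, 30]
plan_compaction(small_files)   # six small files -> three rewrite groups
```

Expire Snapshots and Orphan Files cleaning complement this: once old snapshots are expired, the data and delete files only they referenced become removable, reclaiming the space compaction's rewrites left behind.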

For stream‑batch integration, the traditional Lambda architecture (separate batch and streaming paths) is replaced by a unified Iceberg storage layer across the ODS, DWD, and DWS layers, allowing a single compute engine (Spark or Flink) to serve both paths and simplifying data consistency.

Data correction can be performed using either partition overwrite or the more granular MERGE INTO syntax, which supports conditional delete, update, and insert operations while handling transaction isolation levels (serializable vs snapshot).
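MERGE INTO's per‑row semantics can be modeled in plain Python: each source row either matches a target row by key (and is updated or deleted) or does not (and is inserted). The `_deleted` flag and field names here are illustrative, not Iceberg's actual SQL syntax:

```python
# Toy model of MERGE INTO: WHEN MATCHED -> update/delete,
# WHEN NOT MATCHED -> insert.

def merge_into(target, source, key="id"):
    by_key = {row[key]: row for row in target}
    for row in source:
        if row[key] in by_key:
            if row.get("_deleted"):      # WHEN MATCHED ... THEN DELETE
                del by_key[row[key]]
            else:                        # WHEN MATCHED ... THEN UPDATE
                by_key[row[key]] = {k: v for k, v in row.items()
                                    if k != "_deleted"}
        elif not row.get("_deleted"):    # WHEN NOT MATCHED ... THEN INSERT
            by_key[row[key]] = row
    return sorted(by_key.values(), key=lambda r: r[key])

target = [{"id": 1, "v": "old"}, {"id": 2, "v": "stale"}]
source = [{"id": 1, "v": "new"},
          {"id": 2, "_deleted": True},
          {"id": 3, "v": "fresh"}]
merge_into(target, source)
```

Compared with partition overwrite, which rewrites whole partitions, this row‑level semantics is what makes MERGE INTO the finer‑grained correction tool; the isolation level (serializable vs. snapshot) then governs whether concurrent writes to the touched rows are rejected or allowed.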

Future work includes tracking Flink CDC 2.0, optimizing compaction services, and integrating Iceberg with newer Flink versions (1.14).

The Q&A covers rollback, upsert performance, partitioning choices, schema evolution, and comparisons with Hudi, Alluxio, and StarRocks.

Tags: big data, Flink, Streaming, data lake, Spark, Iceberg, MERGE INTO
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
