Practical Use of Apache Iceberg in Microvision's Data Warehouse: Architecture, Real‑time Integration, and Table Maintenance
This article details why Microvision adopted Apache Iceberg, how it replaces parts of their Lambda‑architecture data pipeline, the real‑time and offline use cases, table‑maintenance practices such as snapshot cleanup and small‑file merging, and lessons learned from the implementation.
Microvision’s data platform combines offline and real‑time processing in a Lambda architecture. Data from client reports and backend databases arrives through a message queue, then flows along two paths: Hive for offline processing and Flink with Kafka for real‑time processing. The team identified high storage costs, duplicated pipelines, and consistency issues as the major pain points.
To address these, they evaluated Apache Iceberg as a unified storage layer that can serve both offline and real‑time workloads. Compared with Hive, Iceberg offers file‑level predicate push‑down, versioned reads, and lower write latency. Compared with Kafka, its storage is significantly cheaper because the data sits on HDFS‑like storage.
In the early stage, Iceberg replaced part of the real‑time path, feeding incremental data to downstream systems, while Hive continued to serve the existing T+1 offline aggregates. An incremental read interface pushed only the data added since the previous read, avoiding full‑load transfers. Upsert was attempted at first but dropped because of copy‑on‑write overhead; a merge‑on‑read approach is planned instead.
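The incremental read described above can be pictured with a toy model. This is only a sketch in plain Python, not the Iceberg API: the `ToyTable` and `Snapshot` classes, their methods, and the file names are hypothetical, and the model assumes append-only commits. It shows why a reader that remembers the last snapshot it consumed needs to fetch only the files added by later snapshots.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One commit to the table (hypothetical, append-only model)."""
    snapshot_id: int
    added_files: tuple  # data files appended by this commit


@dataclass
class ToyTable:
    """A toy stand-in for a table's snapshot log; not the Iceberg API."""
    snapshots: list = field(default_factory=list)

    def append(self, files):
        """Commit a new snapshot that adds `files`; return its id."""
        snap = Snapshot(len(self.snapshots) + 1, tuple(files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def incremental_files(self, after_snapshot_id):
        """Files added by snapshots committed after `after_snapshot_id`."""
        return [
            f
            for s in self.snapshots
            if s.snapshot_id > after_snapshot_id
            for f in s.added_files
        ]

    def full_scan_files(self):
        """Every file in the table, as a full load would read."""
        return self.incremental_files(0)


table = ToyTable()
table.append(["a.parquet", "b.parquet"])
last_read = table.append(["c.parquet"])      # downstream has consumed up to here
table.append(["d.parquet", "e.parquet"])     # new data arrives afterwards

new_files = table.incremental_files(last_read)
print("incremental read:", new_files)
print("full load would read:", len(table.full_scan_files()), "files")
```

In this model the downstream job persists `last_read` after each run, so the cost of a run scales with the newly committed data and not with the table size.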
The team also tackled table‑maintenance challenges: periodic snapshot deletion, small‑file compaction, and orphan‑file cleanup. Small‑file merging is performed with Spark using one of two strategies: BinPack, which simply packs small files into larger ones, and Sort, which groups rows by a key and sorts them before rewriting. The Sort strategy reduced table size by 40 to 70% and dramatically improved file‑level pruning, cutting the number of files scanned by a point query from dozens to a handful.
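The pruning effect of the two strategies can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the Spark rewrite procedure: files are modeled as dictionaries holding only the min and max of a single integer key, BinPack is modeled as rewriting rows into larger files in arrival order, Sort as sorting by the key before rewriting, and all names are hypothetical. It illustrates only the pruning behavior, not the reduction in table size.

```python
import random


def write_files(rows, rows_per_file):
    """Split rows into files and record each file's min/max key,
    similar in spirit to per-file column statistics."""
    files = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        files.append({"min": min(chunk), "max": max(chunk), "rows": len(chunk)})
    return files


def files_to_scan(files, key):
    """Count files whose [min, max] range could contain `key`."""
    return sum(1 for f in files if f["min"] <= key <= f["max"])


rng = random.Random(7)
keys = [rng.randrange(10_000) for _ in range(5_000)]  # arrival order, unsorted

small_files = write_files(keys, 50)             # 100 small files before compaction
binpacked = write_files(keys, 500)              # 10 files, arrival order preserved
sorted_files = write_files(sorted(keys), 500)   # 10 files, clustered by key

probe = 5_000  # key used by a point query
print("small files scanned:   ", files_to_scan(small_files, probe))
print("bin-packed files scanned:", files_to_scan(binpacked, probe))
print("sorted files scanned:    ", files_to_scan(sorted_files, probe))
```

With unsorted data, nearly every file's key range spans the whole domain, so min/max statistics cannot rule files out even after packing. After sorting, each file covers a narrow range and a point query touches at most the one or two files whose range includes the key.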
For reliable streaming‑batch integration, they enhanced Flink sources with partition‑aware state and checkpoint alignment, ensuring exactly‑once semantics and enabling downstream jobs to detect when a table’s data is complete via snapshot timestamps.
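One way a downstream job might use snapshot timestamps to decide that a partition is complete is sketched below. This is a hypothetical check in plain Python, not the team's implementation: the function name, the hourly partition length, and the `allowed_lag` parameter are assumptions, and the check is only sound if the writer commits in event-time order so that a commit timestamped after the partition's end implies no further data for that partition.

```python
from datetime import datetime, timedelta


def partition_is_complete(snapshot_timestamps, partition_start,
                          partition_length=timedelta(hours=1),
                          allowed_lag=timedelta(0)):
    """Return True once some snapshot was committed at or after the
    partition's end time plus an optional lag for late data.
    Assumes commits advance in event-time order (hypothetical rule)."""
    threshold = partition_start + partition_length + allowed_lag
    return any(ts >= threshold for ts in snapshot_timestamps)


hour = datetime(2021, 6, 1, 10)                  # partition covering 10:00 to 11:00
commits = [datetime(2021, 6, 1, 10, 20),
           datetime(2021, 6, 1, 10, 40)]

before = partition_is_complete(commits, hour)    # still inside the hour
commits.append(datetime(2021, 6, 1, 11, 0, 30))  # first commit after 11:00
after = partition_is_complete(commits, hour)

print("complete before 11:00 commit:", before)
print("complete after 11:00 commit: ", after)
```

A scheduler could poll this check and start the hourly aggregation only once it returns true, instead of waiting for a fixed wall-clock delay.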
The Q&A highlighted scheduling frequencies for Iceberg‑based aggregations, storage cluster considerations, and the impact of model depth on real‑time latency.
DataFunSummit
