
Exploring Real-Time Lakehouse Architecture with Apache Paimon

This article presents Xiaomi's real-time lakehouse architecture, outlines its current challenges, and introduces Apache Paimon through several use-case scenarios, including stream join optimization, streaming upserts, and lookup joins, before discussing expected benefits and future directions for a more efficient, unified data platform.


The presentation begins with an overview of Xiaomi's existing real‑time lakehouse stack, which primarily relies on Flink, Talos, and Iceberg, and highlights three major pain points: high computation cost due to limited streaming support in Iceberg, architectural complexity and stability issues, and elevated storage costs caused by data duplication across real‑time and offline pipelines.

Three typical cases are then examined. The first case demonstrates a Flink streaming job that consumes two Talos event streams, performs filtering and transformation, and executes a primary‑key based dual‑stream join, requiring external KV stores such as HBase or Pegasus for delayed data handling.
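The dual-stream join pattern described above can be sketched in plain Python. This is an illustrative simplification: the dicts below stand in for Flink keyed state or the external KV store (HBase/Pegasus) that buffers delayed data, and the event names and fields are made up for the example.

```python
from collections import defaultdict

# Minimal sketch of a primary-key dual-stream join. An event arriving before
# its counterpart on the other stream is buffered by key, standing in for the
# external KV store (e.g. HBase or Pegasus) that holds delayed data.
left_buffer = defaultdict(list)   # key -> pending left events
right_buffer = defaultdict(list)  # key -> pending right events

def on_event(side, key, payload):
    """Join an incoming event against the opposite buffer, or buffer it."""
    own, other = (left_buffer, right_buffer) if side == "left" else (right_buffer, left_buffer)
    if other[key]:
        # The counterpart already arrived: emit one joined record per match.
        return [(key, payload, match) if side == "left" else (key, match, payload)
                for match in other.pop(key)]
    own[key].append(payload)  # Buffer until the other side arrives.
    return []

# Usage: the right-stream event for key "u1" arrives after the left one.
assert on_event("left", "u1", {"click": 1}) == []
assert on_event("right", "u1", {"order": 9}) == [("u1", {"click": 1}, {"order": 9})]
```

The point of the sketch is the cost the article highlights: every unmatched key must be held somewhere until its counterpart arrives, which is what forces the extra KV infrastructure in the Iceberg-based pipeline.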

The second case explores streaming upserts, showing that Iceberg's upsert mechanism generates excessive small files and costly compactions, whereas Paimon’s layered LSM structure enables efficient incremental merges and avoids full‑table compactions.
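The contrast between full-table compaction and layered LSM merging can be illustrated with a toy model. The level layout and merge policy below are simplifying assumptions for the example, not Paimon's actual implementation.

```python
# Toy model of LSM-style upsert handling: each level maps key -> value,
# newest level first. A read takes the first hit (merge-on-read), so an
# upsert never forces an eager full-table rewrite; compaction can merge
# just the small upper levels and leave the large lower runs untouched.
levels = [
    {"k1": "v1-new"},              # level 0: freshest writes
    {"k1": "v1-old", "k2": "v2"},  # level 1: older, larger sorted run
]

def read(key):
    """Merge-on-read: the newest level containing the key wins."""
    for level in levels:
        if key in level:
            return level[key]
    return None

def compact_top_levels():
    """Incremental merge of the two upper levels only."""
    merged = {**levels[1], **levels[0]}  # newer entries overwrite older ones
    return [merged] + levels[2:]

assert read("k1") == "v1-new"
assert compact_top_levels() == [{"k1": "v1-new", "k2": "v2"}]
```

This is the structural reason the article cites for Paimon avoiding Iceberg's small-file explosion: merges are incremental and local to adjacent levels rather than table-wide.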

The third case focuses on lookup joins, where Paimon can serve as a lookup source using a three‑tier storage hierarchy (memory, local disk, remote file system), reducing the need for costly external KV systems while still supporting large‑scale joins.
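The three-tier lookup hierarchy can be sketched as a cascading probe. The tiers here are plain dicts and the promote-on-hit policy is an assumption for illustration; it is not Paimon's actual cache implementation.

```python
# Illustrative three-tier lookup (memory -> local disk -> remote file system)
# for a lookup join. Probing fastest tier first keeps hot keys cheap while
# the remote tier provides full coverage without an external KV system.
memory = {"a": 1}
local_disk = {"b": 2}
remote_fs = {"c": 3}

def lookup(key):
    """Probe tiers from fastest to slowest, promoting hits into memory."""
    for tier in (memory, local_disk, remote_fs):
        if key in tier:
            memory[key] = tier[key]  # promote so the next probe is cheap
            return tier[key]
    return None

assert lookup("c") == 3   # found in the slowest tier...
assert "c" in memory      # ...and promoted to the in-memory tier
```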

Apache Paimon is introduced as a lakehouse solution that integrates LSM storage, offering strong support for both streaming and batch workloads, changelog generation, and flexible data pruning. Expected benefits include lower compaction resource consumption, reduced small‑file overhead, and improved stream read performance.
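Changelog generation, one of the capabilities listed above, can be sketched as translating each primary-key upsert into Flink-style row kinds (+I for insert, -U/+U for update-before/after) so downstream streaming consumers can maintain correct aggregates. This is a conceptual simplification, not Paimon's changelog-producer implementation.

```python
# Sketch of changelog generation for a primary-key table: each upsert is
# expanded into row-kind events describing its effect on the table.
table = {}

def upsert(key, value):
    """Apply one upsert and emit the changelog rows it produces."""
    if key not in table:
        table[key] = value
        return [("+I", key, value)]            # first write: plain insert
    old, table[key] = table[key], value
    return [("-U", key, old), ("+U", key, value)]  # retract old, emit new

assert upsert("pk1", 10) == [("+I", "pk1", 10)]
assert upsert("pk1", 20) == [("-U", "pk1", 10), ("+U", "pk1", 20)]
```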

Finally, the article outlines future work: deeper integration of Paimon in CDC pipelines, automated maintenance services (snapshot expiration, data merging, partition TTL), and intelligent optimization such as automated clustering recommendations, all aimed at simplifying the data pipeline, cutting costs, and enhancing stability.
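Of the maintenance services mentioned, partition TTL is the simplest to picture: a periodic pass drops partitions older than a retention window. The day-based partition naming and the policy below are assumptions for illustration.

```python
from datetime import date, timedelta

# Sketch of a partition-TTL maintenance pass over date-partitioned data.
partitions = {"2024-01-01": "...", "2024-06-01": "...", "2024-06-10": "..."}

def expire(today, ttl_days):
    """Keep only partitions within the retention window."""
    cutoff = today - timedelta(days=ttl_days)
    return {p: d for p, d in partitions.items() if date.fromisoformat(p) >= cutoff}

kept = expire(date(2024, 6, 15), ttl_days=30)
assert "2024-01-01" not in kept
assert "2024-06-10" in kept
```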

Tags: Big Data, Flink, Streaming, Data Warehouse, Iceberg, Apache Paimon, real-time lakehouse
Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
