How Apache Paimon Transforms Real‑Time Lakehouse Architecture
This article analyzes the limitations of a traditional Flink + Talos + Iceberg real‑time lakehouse, introduces Apache Paimon's lakehouse table format and LSM storage, and demonstrates three practical use cases (partial‑update widening, streaming upsert, and lookup join) that deliver cost, stability, and performance improvements, then closes with the future roadmap.
Background
The original real‑time lakehouse architecture relied on Flink, an internally built message queue Talos, and Iceberg for storage. Data from online transaction systems and logs entered Talos, was transformed, and written to a real‑time warehouse for downstream OLAP or ad‑hoc queries. An offline pipeline complemented this to produce final correct results because the real‑time path discarded some attributes for stability and resource reasons.
Current Pain Points
High compute cost: Iceberg lacks full streaming semantics, forcing many operations (stream join, deduplication, column updates) into Flink jobs, which consume excessive resources and are less stable.
Complex architecture and poor job stability: The pipeline mixes real‑time tables, Iceberg, Talos, and external KV stores (e.g., HBase, Pegasus), increasing operational overhead and preventing direct OLAP queries.
High storage cost: Duplicate data across real‑time and offline paths and in KV stores leads to expensive storage.
Expectations for a Real‑Time Warehouse
Reduce compute cost by tightly coupling stream processing with the lakehouse.
Simplify architecture and improve stability, ideally using pure SQL on a lakehouse platform.
Unify data pipelines to eliminate redundant development, operations, and data duplication.
Apache Paimon Overview
Paimon is a lakehouse table format similar to Iceberg but with stronger streaming support. It stores data on HDFS or S3, supports ACID insert, update, and delete operations, and enables schema evolution.
Key Capabilities
LSM (Log‑Structured Merge‑Tree) storage, which turns writes into high‑throughput sequential appends, keeps data sorted within each bucket, and merges records with the same primary key during background compaction. These properties give Paimon efficient streaming semantics (upserts, changelogs, merge engines) while retaining batch read performance.
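As a minimal sketch of how this looks in practice, a Paimon table can be registered and queried from a Flink SQL session like any other catalog table. The catalog name, warehouse path, and table below are illustrative, not from the original article:

```sql
-- Register a Paimon catalog in a Flink SQL session (warehouse path is illustrative).
CREATE CATALOG paimon WITH (
    'type' = 'paimon',
    'warehouse' = 'hdfs:///lakehouse/warehouse'
);
USE CATALOG paimon;

-- A primary-key table: rows with the same key are merged via the LSM structure.
CREATE TABLE events (
    event_id BIGINT,
    payload  STRING,
    PRIMARY KEY (event_id) NOT ENFORCED
) WITH (
    'bucket' = '4'
);

-- The same table serves both batch scans and streaming reads.
SELECT * FROM events;
```

The key point is that one table definition backs both the batch and the streaming path, which is what lets the offline and real‑time pipelines converge.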
Use Case 1 – Data Widening with Partial‑Update
Before Paimon, a dual‑stream join consumed two event streams, filtered and transformed them, and performed a join. Large state (TB‑scale) required offloading long‑term data to external KV stores, causing high resource usage, random disk reads, and network overhead.
Paimon’s Partial‑Update merge engine moves the merge logic to a compaction task, turning random reads into sequential reads and eliminating the need for external KV stores. Reported benefits include complete removal of streaming join random reads, consolidation of storage into the lakehouse (saving ~¥50,000 per month on HBase), improved job stability, and simplified SQL‑only logic.
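The SQL‑only widening logic can be sketched as two streams writing disjoint columns into a single primary‑key table, with Paimon's compaction doing the merge. Table, column, and stream names here are hypothetical:

```sql
-- Wide table assembled from two streams via the partial-update merge engine.
CREATE TABLE wide_order (
    order_id BIGINT,
    amount   DECIMAL(10, 2),  -- owned by the payment stream
    status   STRING,          -- owned by the fulfillment stream
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update',
    -- partial-update ignores NULL values by default, so each writer
    -- overwrites only the columns it actually supplies
    'changelog-producer' = 'lookup'
);

-- Each stream fills only its own columns; the others stay NULL and are preserved.
INSERT INTO wide_order
SELECT order_id, amount, CAST(NULL AS STRING) FROM payment_stream;

INSERT INTO wide_order
SELECT order_id, CAST(NULL AS DECIMAL(10, 2)), status FROM fulfillment_stream;
```

Because the merge happens inside compaction rather than in Flink operator state, no TB‑scale join state or external KV store is needed.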
Use Case 2 – Streaming Upsert
Traditional solutions either store raw changelog logs and aggregate at query time (high latency, poor scalability) or perform offline batch imports (low freshness). Iceberg upsert suffers from lack of sorting, excessive file generation, and limited incremental read support.
Paimon’s LSM‑based compaction avoids rewriting historical files, resulting in far lower space amplification (≈6.7× vs ≈16.1× for Iceberg) and more efficient upsert handling. It also offers three changelog‑producer modes (INPUT, LOOKUP, FULL‑COMPACTION) to suit different latency and data‑volume requirements.
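The changelog producer is a per‑table option. A hedged sketch (the table is illustrative; the option values follow Paimon's documented modes):

```sql
-- Illustrative upsert table; the changelog-producer modes trade latency for cost.
CREATE TABLE user_actions (
    user_id BIGINT,
    action  STRING,
    ts      TIMESTAMP(3),
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    -- 'input':           forward the input changelog as-is (cheapest; requires
    --                    the input to already be a complete changelog)
    -- 'lookup':          generate the changelog at commit time (low latency)
    -- 'full-compaction': generate the changelog only on full compaction
    --                    (lowest cost, highest latency)
    'changelog-producer' = 'lookup'
);
```

Downstream streaming consumers then read a correct changelog directly from the table, without re‑aggregating raw logs at query time.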
Use Case 3 – Lookup Join for Dimension Tables
Traditional dimension‑table joins rely on HBase or Pegasus, which scale poorly and add cost. Paimon can serve as a lookup source, caching data locally on disk or RocksDB. However, full‑table loading per node and random‑read bottlenecks on HDDs limit scalability.
By applying the same bucket strategy used for streaming tables to the lookup tables, each node loads only its relevant bucket, dramatically reducing load time and disk‑read pressure.
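The bucket‑aligned lookup join can be sketched in Flink SQL as follows. The fact and dimension tables are hypothetical, and `orders` is assumed to carry a processing‑time attribute (`proc_time AS PROCTIME()`):

```sql
-- Dimension table bucketed on the join key; for primary-key tables the
-- bucket key is the primary key, so joins on user_id align with buckets.
CREATE TABLE dim_user (
    user_id BIGINT,
    city    STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
    'bucket' = '16'
);

-- FOR SYSTEM_TIME AS OF marks this as a lookup (processing-time temporal) join,
-- so each task caches only the dimension buckets it is responsible for.
SELECT o.order_id, o.user_id, d.city
FROM orders AS o
JOIN dim_user FOR SYSTEM_TIME AS OF o.proc_time AS d
    ON o.user_id = d.user_id;
```

Aligning the bucket count between the streaming side and the dimension table is what turns "every node loads the full table" into "every node loads 1/N of it".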
Future Outlook
Planned enhancements include deeper integration of Paimon into stream processing platforms, automated snapshot expiration and TTL scheduling, and a REST catalog API similar to Iceberg’s to broaden usage scenarios. The goal is to promote wider adoption of Paimon for real‑time lakehouse workloads.
Big Data Tech Team
Focuses on big data, data analysis, data warehousing, data middle platforms, data science, Flink, AI, interview experience, side income, and career planning.