
How Leading Companies Leverage Apache Paimon for Real‑Time Lakehouse Success

This article summarizes how major tech firms such as Vivo, Shopee, Alibaba, and TikTok adopt Apache Paimon to unify batch and streaming data pipelines, improve latency, reduce costs, and optimize storage, highlighting key challenges, architectural solutions, and real‑world performance gains.

Big Data Technology & Architecture

Background and Main Problems Solved by Introducing Paimon

Offline Timeliness Issue

Most of the internal use cases these companies shared follow a Lambda architecture, in which the offline batch layer suffers from storage limitations and poor timeliness; Hive typically writes with insert overwrite and exercises no fine-grained control over file organization.

Paimon, as a lake framework, can finely manage each file, offering strong ACID capabilities and streaming writes that enable minute‑level updates.

Real‑Time Link Issues

The real‑time pipeline, mainly based on Flink + MQ, faces several problems:

High cost and operational complexity: the Flink ecosystem involved is large, and because intermediate results are not persisted, many auxiliary dump tasks are needed for troubleshooting and data repair.

Task stability issues: large stateful computations lead to latency spikes.

Paimon addresses these problems by unifying the batch and streaming links, improving timeliness while reducing cost.

Core Scenarios and Solutions

Unified Data Lake Ingestion

Companies replace the traditional Hive ODS layer with Paimon, using it as a unified mirror table for the entire business database, which improves data link timeliness and optimizes storage space.

Benefits in production:

In the new pipeline, Paimon tables serve as ODS, supporting both stream and batch reads, whereas traditional ODS relies on separate Hive tables and MQ (usually Kafka).

Processing time is reduced from hour‑level to minute‑level, typically within ten minutes.

Paimon supports concurrent writes well and works with both primary‑key and non‑primary‑key tables.

Shopee developed a "daily cut" feature based on Paimon Branch, partitioning data by day to avoid full‑partition redundancy.
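The appeal of a branch-based daily cut is that a branch is only a metadata pointer to a snapshot, so capturing a day's table state costs nothing in data movement. A minimal Python sketch of this idea (not Paimon code; all names here are illustrative):

```python
# A branch is just a named pointer to a snapshot, so a "daily cut" records
# the table state for a day without copying any data files.
data_files = ["f1.parquet", "f2.parquet"]  # shared, immutable data files
snapshots = [list(data_files)]             # snapshot 0: initial table state
branches = {}                              # branch name -> snapshot index

def commit(new_file):
    # Each commit produces a new snapshot that reuses all earlier files.
    data_files.append(new_file)
    snapshots.append(list(data_files))

def daily_cut(day):
    # O(1) metadata operation: point the branch at the current snapshot.
    branches[day] = len(snapshots) - 1

daily_cut("2024-06-01")
commit("f3.parquet")
daily_cut("2024-06-02")
# branch "2024-06-01" still reads only f1/f2; nothing was duplicated
```

Contrast this with copying a full Hive partition per day, where every cut duplicates all the data it covers.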

The Paimon community also provides tools for schema evolution, allowing MySQL or Kafka data to sync into Paimon and automatically add new columns.
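The core of schema evolution during ingestion is that a new upstream column widens the table schema automatically instead of failing the sync job. A minimal Python sketch of that behavior, assuming a toy in-memory table (this is an illustration, not Paimon's implementation):

```python
def sync_record(table_schema, table_rows, record):
    """Apply one upstream record, evolving the schema when new columns appear."""
    # Schema evolution: any column not yet in the schema is added,
    # and existing rows are backfilled with None.
    for col in record:
        if col not in table_schema:
            table_schema.append(col)
            for row in table_rows:
                row[col] = None
    # Write the record against the full (possibly widened) schema.
    table_rows.append({col: record.get(col) for col in table_schema})

schema, rows = ["id", "name"], []
sync_record(schema, rows, {"id": 1, "name": "a"})
sync_record(schema, rows, {"id": 2, "name": "b", "city": "sz"})  # new column "city"
# schema is now ["id", "name", "city"]; older rows read the new column as None
```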

Dimension Table Lookup Join

Many companies use Paimon primary‑key tables as dimension tables, a pattern that has been proven in production.

Dimension table scenarios are divided into two categories: real‑time dimension tables updated via Flink tasks, and offline dimension tables updated by Spark batch tasks (T+1).

Paimon dimension tables support both Flink Streaming SQL and Flink Batch tasks.
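In both cases the lookup-join pattern is the same: each fact event probes the latest snapshot of a primary-key dimension table. A minimal Python sketch of the semantics (names like `upsert_dim` and `user_id` are illustrative, not Paimon APIs):

```python
# Dimension table keyed by primary key; a lookup join enriches each fact
# event with the latest dimension row seen at processing time.
dim_table = {}  # primary key -> dimension row

def upsert_dim(key, row):
    dim_table[key] = row  # primary-key semantics: the newest write wins

def lookup_join(fact_event, key_field="user_id"):
    dim_row = dim_table.get(fact_event[key_field], {})
    return {**fact_event, **dim_row}

upsert_dim(1, {"city": "Shenzhen"})
enriched = lookup_join({"user_id": 1, "amount": 9.9})
# enriched now carries the dimension column "city" alongside the fact fields
```

Whether the dimension side is refreshed by a streaming Flink job or a T+1 Spark batch job only changes how often `upsert_dim` runs, not the join itself.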

Paimon Wide‑Table Scenario

Paimon, like many frameworks, supports partial updates; its LSM‑Tree architecture provides high point‑lookup and merge performance. However, attention is needed for:

Performance bottlenecks when updating massive data scales or many columns, where background merge performance may degrade.

Sequence Group ordering: when multiple streams are combined into one wide table, each stream can be assigned its own Sequence Group, so the ordering field for each group must be chosen carefully, sometimes spanning multiple fields.
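The point of sequence groups is that each stream's columns are guarded by their own ordering field, so a late event from one stream cannot clobber newer values written by another. A minimal Python sketch of partial-update merging under this rule (field names `seq1`/`seq2`/`a`/`b`/`c` are made up for illustration):

```python
# Partial update with sequence groups: each group of columns is guarded by
# its own sequence field, so late events from one stream cannot overwrite
# newer values written by another stream.
SEQ_GROUPS = {"g1": ("seq1", ["a", "b"]), "g2": ("seq2", ["c"])}

def merge(current, incoming):
    merged = dict(current)
    for seq_field, cols in SEQ_GROUPS.values():
        new_seq = incoming.get(seq_field)
        if new_seq is None:
            continue  # this record does not touch the group at all
        if current.get(seq_field) is None or new_seq >= current[seq_field]:
            merged[seq_field] = new_seq
            for c in cols:
                if incoming.get(c) is not None:
                    merged[c] = incoming[c]
    return merged

row = merge({}, {"seq1": 1, "a": "x"})
row = merge(row, {"seq1": 0, "a": "stale"})  # late event: lower seq1, ignored
row = merge(row, {"seq2": 5, "c": "y"})      # the other stream's group applies
```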

PV/UV Scenario

In Ant Financial's PV/UV calculation, the original Flink pipeline kept full state in the engine and was difficult to migrate and operate, so it was replaced with Paimon.

Paimon's upsert mechanism handles deduplication, and downstream jobs consume its lightweight changelog to produce real-time PV and UV metrics.

The Paimon solution reduces overall CPU usage by 60%, improves checkpoint stability, and shortens rollback and reset times thanks to point‑to‑point writes, simplifying architecture and lowering development costs.
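The mechanism is simple: PV counts every event, while UV falls out of upsert semantics, since writing the same key twice keeps only one row. A minimal Python sketch of the deduplication idea (not the Ant Financial pipeline itself):

```python
# PV counts every event; UV relies on upsert semantics: writing the same
# user id twice collapses to one row, so distinct users fall out of dedup.
from collections import defaultdict

pv = defaultdict(int)        # page -> total page views
uv_keys = defaultdict(set)   # page -> deduplicated user ids (upsert by key)

def on_event(page, user_id):
    pv[page] += 1
    uv_keys[page].add(user_id)  # duplicate keys collapse, like an upsert

for page, user in [("home", "u1"), ("home", "u1"), ("home", "u2")]:
    on_event(page, user)

# pv["home"] is 3 (all events), UV for "home" is 2 (distinct users)
```

In the real pipeline, the upsert is the table write itself and downstream jobs read the changelog, so no full deduplication state has to live inside Flink.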

Lake‑Based OLAP

Thanks to tight integration with Spark and Flink, data can be written into Paimon, then Z‑order sorted, clustered, or indexed at the file level; downstream OLAP queries can be performed via Doris or StarRocks, achieving full‑link OLAP capabilities.
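Z-order sorting helps OLAP queries because it interleaves the bits of several sort keys, so rows that are close in the multi-dimensional key space land in the same data files and engines can skip files on predicates over either column. A minimal Python sketch of the two-key interleaving (an illustration of the technique, not Paimon's implementation):

```python
# Z-order interleaves the bits of two sort keys; sorting by the resulting
# value keeps rows that are close in (x, y) space close in file order.
def z_order(x, y, bits=16):
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x's bit i goes to position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # y's bit i goes to position 2i+1
    return z

rows = [(3, 7), (100, 2), (2, 6), (101, 3)]
rows.sort(key=lambda r: z_order(*r))
# nearby pairs such as (2, 6) and (3, 7) end up adjacent, so a range filter
# on either dimension touches fewer files
```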

Conclusion

The above summarizes the main scenarios where major companies have deployed Paimon; additional use cases will be continuously added.

Reference Documents:

Paimon-based data lake technology in Shopee's applications

Vivo's integrated lakehouse practice based on Paimon

Apache Paimon real‑time lakehouse storage foundation

Flink x Paimon practice in Douyin Group’s life services

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Batch Processing · Real-Time Data · Paimon · Lakehouse
Written by Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.