How Leading Companies Leverage Apache Paimon for Real‑Time Lakehouse Success
This article summarizes how major tech firms such as Vivo, Shopee, Alibaba, and TikTok adopt Apache Paimon to unify batch and streaming data pipelines, improve latency, reduce costs, and optimize storage, highlighting key challenges, architectural solutions, and real‑world performance gains.
Background and Main Problems Solved by Introducing Paimon
Offline Timeliness Issue
Most of the internal use cases these companies shared follow a Lambda architecture, in which the offline batch layer suffers from storage limitations and poor timeliness; Hive typically relies on insert overwrite and pays no attention to file organization.
Paimon, as a lake framework, can finely manage each file, offering strong ACID capabilities and streaming writes that enable minute‑level updates.
Real‑Time Link Issues
The real‑time pipeline, mainly based on Flink + MQ, faces several problems:
High cost and operational complexity from maintaining a large Flink ecosystem.
Task stability issues caused by stateful computations, which introduce latency.
Intermediate results are not persisted, so many auxiliary dump tasks are needed for troubleshooting and data repair.
Qualitatively, then, Paimon addresses these problems by unifying the batch and streaming links, improving timeliness while reducing cost.
Core Scenarios and Solutions
Unified Data Lake Ingestion
Companies replace the traditional Hive ODS layer with Paimon, using it as a unified mirror table for the entire business database, which improves data link timeliness and optimizes storage space.
Benefits in production:
In the new pipeline, Paimon tables serve as ODS, supporting both stream and batch reads, whereas traditional ODS relies on separate Hive tables and MQ (usually Kafka).
Processing time is reduced from hour‑level to minute‑level, typically within ten minutes.
Paimon supports concurrent writes well and works with both primary‑key and non‑primary‑key tables.
Shopee developed a "daily cut" feature based on Paimon Branch, partitioning data by day to avoid full‑partition redundancy.
The Paimon community also provides tools for schema evolution, allowing MySQL or Kafka data to sync into Paimon and automatically add new columns.
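To make the schema-evolution behavior concrete, here is a minimal Python sketch of the idea (this is an illustrative in-memory model, not the Paimon or Flink CDC API): when an incoming record carries a column the table has never seen, the column is appended to the schema and existing rows are back-filled with NULL.

```python
# Illustrative sketch (not the Paimon API): schema evolution during CDC
# ingestion. New columns in incoming records are added to the table schema
# automatically, and previously stored rows are back-filled with None.

class EvolvingTable:
    def __init__(self, columns):
        self.columns = list(columns)  # current schema
        self.rows = []                # stored rows, aligned to the schema

    def write(self, record):
        # Detect columns the current schema does not know about yet.
        for col in [c for c in record if c not in self.columns]:
            self.columns.append(col)
            for row in self.rows:     # back-fill existing rows with NULL
                row[col] = None
        # Store the record, using None for columns it does not carry.
        self.rows.append({c: record.get(c) for c in self.columns})

table = EvolvingTable(["id", "name"])
table.write({"id": 1, "name": "a"})
table.write({"id": 2, "name": "b", "city": "SG"})  # new column arrives

print(table.columns)          # ['id', 'name', 'city']
print(table.rows[0]["city"])  # None (back-filled)
```

In the real pipeline, the Paimon CDC ingestion tools perform this column addition on the lake table's schema rather than in memory, so downstream batch and stream readers see the evolved schema transparently.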
Dimension Table Lookup Join
Many companies use Paimon primary‑key tables as dimension tables, a pattern that has been proven in production.
Dimension table scenarios are divided into two categories: real‑time dimension tables updated via Flink tasks, and offline dimension tables updated by Spark batch tasks (T+1).
Paimon dimension tables support both Flink Streaming SQL and Flink Batch tasks.
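The lookup-join pattern can be sketched in a few lines of Python (an illustrative model, not Flink SQL or the Paimon lookup connector): each fact record performs a point lookup against the latest state of a primary-key dimension table.

```python
# Illustrative sketch (not Flink/Paimon code): a lookup join enriches a
# fact stream against a primary-key dimension table. The dimension table
# here is a plain dict keyed by primary key; a real Paimon dim table would
# be refreshed from the lake as new snapshots are committed.

def lookup_join(facts, dim_table, key="user_id"):
    """Enrich each fact with the current dimension row, or None if absent."""
    for fact in facts:
        dim_row = dim_table.get(fact[key])  # point lookup by primary key
        yield {**fact, "dim": dim_row}

dim = {1: {"name": "alice"}, 2: {"name": "bob"}}
facts = [{"user_id": 1, "amount": 10}, {"user_id": 3, "amount": 5}]

enriched = list(lookup_join(facts, dim))
print(enriched[0]["dim"])  # {'name': 'alice'}
print(enriched[1]["dim"])  # None -> unmatched key
```

In Flink Streaming SQL the same shape is expressed with a `FOR SYSTEM_TIME AS OF` lookup join, while Flink or Spark batch jobs simply join against the table's latest snapshot.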
Paimon Wide‑Table Scenario
Paimon, like many frameworks, supports partial updates; its LSM‑Tree architecture provides high point‑lookup and merge performance. However, attention is needed for:
Performance bottlenecks at large data volumes or with many columns, where background merge (compaction) throughput can degrade.
Sequence group ordering: when multiple streams are merged into one wide table, each stream is assigned its own sequence group, so the ordering fields must be chosen carefully and may span multiple fields.
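The sequence-group semantics above can be sketched as follows (an illustrative model of the merge rule, not Paimon's internal implementation): each group of columns carries its own ordering field, and a group is only overwritten when the incoming sequence value is at least as new as the stored one, so a late-arriving record from one stream cannot clobber fresher data from another.

```python
# Illustrative sketch (not Paimon internals): partial-update merging with
# sequence groups. Each column group has its own sequence field; a group is
# updated only when the incoming sequence value is not older than the stored
# one, which protects against out-of-order writes per stream.

def merge_partial_update(current, incoming, groups):
    """groups: {seq_field: [columns governed by that sequence field]}"""
    merged = dict(current)
    for seq_field, cols in groups.items():
        new_seq = incoming.get(seq_field)
        old_seq = current.get(seq_field)
        if new_seq is not None and (old_seq is None or new_seq >= old_seq):
            merged[seq_field] = new_seq
            for col in cols:
                if incoming.get(col) is not None:
                    merged[col] = incoming[col]  # update only this group
    return merged

groups = {"seq_a": ["price"], "seq_b": ["stock"]}
row = {"id": 1, "price": 9.9, "seq_a": 2, "stock": 100, "seq_b": 5}
late = {"id": 1, "price": 8.8, "seq_a": 1}   # out-of-order for group A
fresh = {"id": 1, "stock": 90, "seq_b": 6}   # newer for group B

row = merge_partial_update(row, late, groups)
row = merge_partial_update(row, fresh, groups)
print(row["price"], row["stock"])  # 9.9 90 -- the late price is rejected
```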
PV/UV Scenario
In Ant Financial's PV/UV computation, the original full‑state Flink pipeline, which was difficult to migrate, was replaced with Paimon.
Paimon’s upsert mechanism handles deduplication, and its lightweight changelog log is used to consume data, providing real‑time PV and UV metrics downstream.
The Paimon solution reduces overall CPU usage by 60%, improves checkpoint stability, and shortens rollback and reset times thanks to point‑to‑point writes, simplifying architecture and lowering development costs.
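A minimal Python sketch of the pattern (illustrative only; in production the upsert table is a Paimon primary-key table and the changelog is Paimon's lightweight changelog, consumed by a downstream Flink job): deduplication is pushed into the upsert table, and the downstream consumer derives PV/UV from insert and update-before/after changelog rows instead of holding a large dedup state itself.

```python
# Illustrative sketch (not Paimon/Flink code): PV/UV from an upsert table's
# changelog. The upsert table keeps one row per key (dedup), and downstream
# keeps only two counters while consuming +I / -U / +U changelog rows.

def pvuv_from_changelog(events):
    seen = {}          # the "upsert table": user_id -> last event
    pv, uv = 0, 0
    changelog = []
    for ev in events:
        user = ev["user_id"]
        pv += 1                         # every event counts toward PV
        if user not in seen:
            changelog.append(("+I", ev))  # insert: first sighting
            uv += 1                       # new distinct user -> UV grows
        else:
            changelog.append(("-U", seen[user]))  # retract old row
            changelog.append(("+U", ev))          # emit updated row
        seen[user] = ev
    return pv, uv, changelog

events = [{"user_id": 1}, {"user_id": 2}, {"user_id": 1}]
pv, uv, log = pvuv_from_changelog(events)
print(pv, uv)  # 3 2 -- three page views, two distinct users
```

Because the dedup state lives in the lake table rather than in Flink operator state, checkpoints are smaller and a rollback only needs to rewind the changelog consumer, which is what drives the stability and CPU savings described above.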
Lake‑Based OLAP
Thanks to tight integration with Spark and Flink, data can be written into Paimon, then Z‑order sorted, clustered, or indexed at the file level; downstream OLAP queries can be performed via Doris or StarRocks, achieving full‑link OLAP capabilities.
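To show why Z-order sorting helps OLAP data skipping, here is a small Python sketch of a Morton (Z-order) key (an illustrative bit-interleaving implementation, not Paimon's sort compaction code): rows that are close in both sort dimensions receive nearby keys, so they cluster into the same files and more files can be pruned by min/max statistics.

```python
# Illustrative sketch: Z-order (Morton order) interleaves the bits of two
# integer columns into one sort key, so rows close in both dimensions land
# in nearby files, letting engines such as Doris or StarRocks skip more data.

def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return key

rows = [(3, 5), (0, 0), (7, 1), (2, 2)]
rows.sort(key=lambda r: z_order_key(*r))
print(rows[0])  # (0, 0) -- smallest in both dimensions sorts first
```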
Conclusion
The above summarizes the main scenarios where major companies have deployed Paimon; additional use cases will be continuously added.
Reference Documents:
The Application of Paimon-Based Data Lake Technology at Shopee
Vivo's Lakehouse-Integrated Practice Based on Paimon
Apache Paimon: The Storage Foundation of the Real-Time Lakehouse
Flink x Paimon Practice in Douyin Group's Local Life Services
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.