Apache Paimon: Boosting Real-Time Data Lakes for Fraud Detection & Manufacturing
This article examines Apache Paimon's lakehouse architecture, covering its LSM‑Tree storage, flexible merge engines, and multi‑engine integration. It then walks through two real‑world deployments, a telecom operator's real‑time fraud‑prevention system and a precision manufacturer's unified data platform, and highlights the performance gains and cost reductions each achieved.
Introduction
Real‑time data processing is a critical capability for digital transformation. Traditional architectures separate streaming and batch workloads, leading to high latency, OOM failures, and duplicated storage. Apache Paimon provides a unified storage format that combines the low‑cost scalability of data lakes with the update and query efficiency of databases, enabling a true lakehouse where a single dataset serves both streaming and batch use cases.
Apache Paimon Overview
Apache Paimon is a table format built on an LSM‑Tree (Log‑Structured Merge‑Tree) storage engine. The LSM‑Tree optimizes high‑frequency writes and fast point queries. Data is stored in columnar file formats such as Parquet or ORC, and each file carries rich indexes (min‑max, Bloom filter, bitmap) that support data skipping and improve query performance.
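To make this concrete, here is a minimal sketch, using Flink's Python Table API, of registering a Paimon catalog and creating a table. The warehouse path, table name, and columns are illustrative, and the file‑format and bloom‑filter index options follow the Paimon documentation (verify the exact option names against your Paimon version).

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Requires the paimon-flink connector jar on the Flink classpath.
t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a Paimon catalog backed by a filesystem warehouse
# (the path is a placeholder).
t_env.execute_sql("""
    CREATE CATALOG paimon WITH (
        'type' = 'paimon',
        'warehouse' = 'file:///tmp/paimon'
    )
""")
t_env.execute_sql("USE CATALOG paimon")

# A primary-key table stored as Parquet, with a bloom-filter file
# index on the key column to speed up point lookups.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id   STRING,
        amount     DECIMAL(10, 2),
        updated_at TIMESTAMP(3),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'file.format' = 'parquet',
        'file-index.bloom-filter.columns' = 'order_id'
    )
""")
```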
Core Features
Unified streaming‑batch storage: No separate layers for stream and batch; all data lives in the same Paimon tables.
Flexible merge engine: Users define a merge strategy at table creation, and the engine pushes the logic down to the storage layer (see the sketch after this feature list). Supported strategies include:
Deduplicate merge (deduplicate) – keeps the latest record for a primary key.
Partial column‑update merge (partial-update) – applies incremental column updates, useful for multi‑stream joins.
Pre‑aggregation merge (aggregation) – aggregates fields with the same primary key at write time.
First‑row merge (first-row) – keeps the first record for a primary key.
Intelligent data management: Automatic small‑file merging, Z‑Order data layout optimization, and snapshot isolation (MVCC) to avoid read/write conflicts.
Multi‑engine ecosystem support: Deep integration with Apache Flink for streaming, full read/write support in Apache Spark, and query access from Hive, Trino, and Doris.
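As a hedged sketch of how a merge strategy is declared, the snippet below (reusing the t_env and catalog from the previous example) creates a hypothetical user_profile table with the partial-update engine; the four names in the comment are the documented Paimon merge-engine option values.

```python
# Continuing with the t_env and Paimon catalog from the previous sketch.
# The merge strategy is declared once in the WITH clause; Paimon then
# applies it in the storage layer on every write and compaction.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS user_profile (
        user_id    STRING,
        last_login TIMESTAMP(3),
        city       STRING,
        PRIMARY KEY (user_id) NOT ENFORCED
    ) WITH (
        -- one of: deduplicate | partial-update | aggregation | first-row
        'merge-engine' = 'partial-update',
        -- produce a changelog so streaming readers see fully merged rows
        'changelog-producer' = 'lookup'
    )
""")
```

With partial-update, several upstream streams can each write just their own columns for the same user_id, and Paimon assembles the complete row during reads and compactions, which is what makes the multi‑stream‑join pattern inexpensive.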
Best‑Practice Cases
1. Real‑time fraud‑prevention for a telecom operator
Background: The operator needed to shift from post‑event fraud analysis to real‑time blocking, processing 2.4 billion SMS records and 400 million voice records over a seven‑day window.
Challenges
Dual‑stream joins between the real‑time stream and massive historical tables caused OOM failures and latencies above 5 minutes.
Full scans could not meet the 5‑minute SLA for blocking.
Solution
Replace the high‑latency Hudi/Hive pipeline with a unified Flink + Paimon lakehouse.
Use the Aggregation Merge Engine (pre‑aggregation) to store only aggregated metrics (e.g., per‑caller daily send count and distinct counterpart set) instead of raw detail rows.
Define an aggregation table with caller and date as primary keys; the fields send_cnt (SUM) and counterpart_set (LIST) are merged on write (see the sketch after this list).
Write‑time aggregation automatically merges records for the same caller, reducing 2.1 billion detail rows to tens of millions of metric rows.
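A minimal sketch of such an aggregation table, reusing the t_env from the earlier snippets: the field names follow the case description, while the table name, the hypothetical source table sms_events, and the fields.*.distinct option are assumptions to check against your Paimon version.

```python
# Caller + date form the primary key; Paimon folds incoming rows into
# the existing metric row at write time instead of storing raw details.
t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS caller_daily_metrics (
        caller          STRING,
        dt              STRING,
        send_cnt        BIGINT,
        counterpart_set ARRAY<STRING>,
        PRIMARY KEY (caller, dt) NOT ENFORCED
    ) WITH (
        'merge-engine' = 'aggregation',
        -- per-field aggregate functions, as documented by Paimon
        'fields.send_cnt.aggregate-function' = 'sum',
        'fields.counterpart_set.aggregate-function' = 'collect',
        -- deduplicate collected counterparts (assumed option name)
        'fields.counterpart_set.distinct' = 'true'
    )
""")

# Each raw record arrives as (caller, day, 1, [callee]); the write-time
# merge turns billions of detail rows into a bounded set of metric rows.
t_env.execute_sql("""
    INSERT INTO caller_daily_metrics
    SELECT caller,
           DATE_FORMAT(event_time, 'yyyy-MM-dd'),
           CAST(1 AS BIGINT),
           ARRAY[callee]
    FROM sms_events
""")
```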
Results
Task stability: No OOM failures, because only compact aggregated metrics are processed.
Latency: End‑to‑end processing stabilized under 2 minutes, enabling near‑real‑time fraud interception.
2. Unified lakehouse for a precision‑manufacturing enterprise
Background: The manufacturer required millisecond‑level traceability across >160 MES systems, supporting 5‑minute near‑real‑time dashboards and a 30‑minute offline warehouse.
Challenges
Historical updates could modify months‑old orders, but Hive’s append‑only model caused data bloat and inaccurate traceability.
Separate Hive and Doris layers duplicated storage and introduced synchronization delays.
Hive batch jobs ran with >30‑minute latency and threw frequent FileNotFoundException errors caused by concurrent INSERT OVERWRITE operations.
Solution
Adopt Apache Paimon as the single storage layer, replacing Hive.
Expose Paimon tables as external Doris tables, eliminating data movement.
Key Paimon features used:
Primary‑key tables with the deduplicate merge engine for real‑time upserts, removing duplicate rows.
Snapshot isolation (MVCC) to prevent readers and writers from conflicting with each other.
Hive Catalog integration for seamless migration of existing Hive/Spark jobs.
Doris federated queries to read Paimon tables directly (see the sketch after this list).
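Under stated assumptions, the sketch below shows both halves: on the Flink side, a Paimon catalog backed by the existing Hive Metastore (hosts and paths are placeholders), and on the Doris side, a federated Paimon catalog created over Doris's MySQL protocol, with property names as given in Doris's multi‑catalog documentation (verify for your Doris version).

```python
# Flink side: register Paimon tables in the existing Hive Metastore so
# current Hive/Spark jobs keep finding them. Primary-key tables default
# to 'merge-engine' = 'deduplicate', so real-time upserts work as-is.
t_env.execute_sql("""
    CREATE CATALOG paimon_hive WITH (
        'type' = 'paimon',
        'metastore' = 'hive',
        'uri' = 'thrift://metastore-host:9083',
        'warehouse' = 'hdfs:///warehouse/paimon'
    )
""")

# Doris side: Doris speaks the MySQL protocol, so any MySQL client can
# issue the DDL. The catalog reads Paimon files in place, so no data
# is copied into Doris.
import pymysql

doris = pymysql.connect(host="doris-fe-host", port=9030, user="root")
with doris.cursor() as cur:
    cur.execute("""
        CREATE CATALOG IF NOT EXISTS paimon_lake PROPERTIES (
            "type" = "paimon",
            "paimon.catalog.type" = "hms",
            "hive.metastore.uris" = "thrift://metastore-host:9083",
            "warehouse" = "hdfs:///warehouse/paimon"
        )
    """)
```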
Results
Latency reduced from >30 minutes to ~5 minutes across the whole pipeline.
Storage cost halved by eliminating duplicate Hive/Doris data.
Task stability improved: the snapshot mechanism eliminated the FileNotFoundException errors.
Data accuracy: Full historical updates are reflected instantly, ensuring complete traceability.
Conclusion
Apache Paimon’s LSM‑Tree storage with rich file indexes, its versatile merge engines, and snapshot isolation overcome the performance bottlenecks that traditional lakehouses hit on real‑time updates and high‑concurrency queries. By pushing merge logic down to the storage layer, Paimon improves resource utilization, reduces latency, and provides a solid foundation for building high‑performance, low‑cost real‑time lakehouse systems.