Introduction to Apache Paimon: Architecture, Unified Storage, and Core Concepts

This article introduces Apache Paimon, an open‑source table format that supports batch and streaming reads and writes, explains its architecture, unified storage model, and core concepts such as file layout, snapshots, manifests, data files, partitions, and consistency guarantees.

Big Data Technology & Architecture

Oct 12, 2024

1. Understanding Paimon

Apache Paimon is an open-source lake table format that supports both batch and streaming reads and writes, so the same table can absorb real-time updates and serve OLAP queries over large-scale data.

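On the read side, Paimon can serve historical snapshots (batch mode), changes from the latest offset (streaming mode), or incremental snapshots in a hybrid of the two. On the write side, it accepts streaming synchronization from database changelogs (CDC) as well as batch inserts and overwrites of offline data.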

The ecosystem integrates with Apache Flink, Hive, Spark, Trino and other compute engines.

Internally, Paimon stores columnar files on a file system or object storage, keeps metadata in manifest files for efficient pruning, and uses an LSM‑tree for primary‑key tables to support high‑performance updates.
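To make the LSM-tree idea concrete, here is a minimal sketch of how sorted runs might be merged on a primary key. It is not Paimon's implementation: the record shape (key, sequence, value), the use of None as a delete marker, and the function name merge_sorted_runs are assumptions chosen for illustration.

```python
import heapq

def merge_sorted_runs(runs):
    """Merge sorted runs of (key, sequence, value) records.

    Each run must be sorted by (key, sequence). For every key the record with
    the highest sequence number wins; a value of None marks a delete.
    """
    latest = {}
    # heapq.merge streams the runs in (key, sequence) order without
    # concatenating and re-sorting them.
    ordered = heapq.merge(*runs, key=lambda rec: (rec[0], rec[1]))
    for key, _seq, value in ordered:
        latest[key] = value  # a later sequence number overwrites an earlier one
    # Drop keys whose most recent record is a delete marker.
    return {k: v for k, v in latest.items() if v is not None}

older_run = [(1, 1, "alice"), (2, 2, "bob"), (3, 3, "carol")]
newer_run = [(2, 4, "bob-v2"), (3, 5, None)]  # update key 2, delete key 3
print(merge_sorted_runs([older_run, newer_run]))
```

The point of the structure is that an update is just an append to a new sorted run; the cost of reconciling versions is deferred to read time or to compaction.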

2. Unified Storage

For stream engines like Flink, three connector types are typical: message queues (e.g., Kafka) for low‑latency ingestion, OLAP systems (e.g., ClickHouse) for ad‑hoc queries, and batch stores (e.g., Hive) for traditional batch operations.

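Paimon provides a single table abstraction that covers these roles. In batch mode it behaves like a Hive table: a query reads the latest snapshot. In streaming mode it behaves like a message queue whose history never expires: a query reads the table's changelog.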
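The two views of one table can be sketched with a toy model. Everything here is hypothetical (the ToyTable class, its methods, and the in-memory changelog); the row-kind strings "+I", "+U", and "-D" follow the changelog notation used by Flink, and the sketch only shows how a batch read and a streaming read can be derived from the same stored changes.

```python
class ToyTable:
    """In-memory stand-in for a table that keeps its full change history."""

    def __init__(self):
        self.changelog = []  # append-only list of (op, key, value)

    def write(self, op, key, value=None):
        self.changelog.append((op, key, value))

    def batch_read(self):
        """Batch view: fold the whole history into the current state."""
        state = {}
        for op, key, value in self.changelog:
            if op in ("+I", "+U"):
                state[key] = value
            elif op == "-D":
                state.pop(key, None)
        return state

    def stream_read(self, from_offset=0):
        """Streaming view: replay changes starting at a given offset."""
        yield from self.changelog[from_offset:]

table = ToyTable()
table.write("+I", 1, "a")
table.write("+U", 1, "b")
table.write("+I", 2, "c")
table.write("-D", 2)
print(table.batch_read())
print(list(table.stream_read(from_offset=2)))
```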

3. Core Concepts

1. File Layout

All files of a table are stored under one base directory and organized in layers. Starting from a snapshot file, a reader can recursively reach the manifest lists, manifest files, and data files that make up the table.

2. Snapshot

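Snapshot files are JSON files stored in the snapshot directory. Each one captures the state of the table at a point in time and records the schema file in use and the manifest lists that contain the changes of that snapshot, which is what makes point-in-time reads and time-travel queries possible.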
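A time-travel read boils down to picking the right snapshot. The sketch below assumes snapshots are already loaded as dictionaries sorted by commit time; the field names id, schemaId, and timeMillis are simplified for illustration and the helper snapshot_as_of is hypothetical.

```python
from bisect import bisect_right

def snapshot_as_of(snapshots, timestamp_ms):
    """Return the latest snapshot committed at or before timestamp_ms.

    `snapshots` must be sorted by ascending timeMillis. Returns None when the
    requested time is earlier than the first snapshot.
    """
    commit_times = [s["timeMillis"] for s in snapshots]
    idx = bisect_right(commit_times, timestamp_ms)
    return snapshots[idx - 1] if idx > 0 else None

snapshots = [
    {"id": 1, "schemaId": 0, "timeMillis": 1_000},
    {"id": 2, "schemaId": 0, "timeMillis": 2_000},
    {"id": 3, "schemaId": 1, "timeMillis": 3_000},
]
print(snapshot_as_of(snapshots, 2_500))
```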

3. Manifest Files

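All manifest lists and manifest files are stored in the manifest directory. A manifest list is a list of manifest file names. A manifest file records changes to LSM data files and changelog files, for example which data files were added and which were deleted in the corresponding snapshot.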
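Because manifests record changes rather than a full listing, a reader has to replay them to learn which data files are live. A minimal sketch, assuming each manifest is a list of ("ADD" or "DELETE", path) entries and that a manifest list is an ordered list of manifest names; the real on-disk encoding is different and all names below are hypothetical:

```python
def live_data_files(manifest_list, manifests):
    """Replay manifest entries in order and return the set of live data files."""
    live = set()
    for manifest_name in manifest_list:
        for kind, path in manifests[manifest_name]:
            if kind == "ADD":
                live.add(path)
            elif kind == "DELETE":
                live.discard(path)
            else:
                raise ValueError(f"unknown entry kind: {kind}")
    return live

manifests = {
    "manifest-a": [("ADD", "bucket-0/data-1"), ("ADD", "bucket-0/data-2")],
    # A compaction rewrote data-1 and data-2 into data-3.
    "manifest-b": [
        ("DELETE", "bucket-0/data-1"),
        ("DELETE", "bucket-0/data-2"),
        ("ADD", "bucket-0/data-3"),
    ],
}
print(live_data_files(["manifest-a", "manifest-b"], manifests))
```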

4. Data Files

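Data files are grouped by partition and can be stored in ORC, Parquet, or Avro format. ORC is given here as the default; the default may differ between Paimon releases, so check the file.format table option for the version in use.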

5. Partitions

Paimon adopts the same partition concept as Apache Hive. A table may define optional partition keys (for example, date or city), and partitioning lets a query operate on a slice of the table instead of scanning all of it.
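Partition pruning can be illustrated with Hive-style key=value directory names. The paths and helper names below are made up for the example; the sketch only shows why a filter on partition keys lets a reader skip whole directories.

```python
def parse_partition(path):
    """Extract partition key/value pairs from a Hive-style path."""
    spec = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            spec[key] = value
    return spec

def prune(paths, **wanted):
    """Keep only the paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]

paths = [
    "dt=2024-10-11/city=hangzhou/bucket-0/data-1",
    "dt=2024-10-12/city=hangzhou/bucket-0/data-2",
    "dt=2024-10-12/city=beijing/bucket-0/data-3",
]
print(prune(paths, dt="2024-10-12", city="hangzhou"))
```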

6. Consistency Guarantees

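Writers use a two-phase commit protocol to commit a batch of records to the table atomically, and each commit produces at most two snapshots. Two writers that modify different partitions can commit in parallel. If they modify the same partition, only snapshot isolation is guaranteed: the final table state may be a mix of the two commits, but no changes are lost.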
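The "mix of two commits, nothing lost" outcome can be sketched with an optimistic publish step. This is a toy under stated assumptions, not Paimon's commit protocol: ToyCatalog, try_commit, and commit_with_retry are hypothetical names, the lock stands in for whatever atomic operation the storage layer provides, and the toy serializes every publish instead of modeling partition-level parallelism.

```python
import threading

class ToyCatalog:
    """Keeps a list of immutable table states, one per snapshot id."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshots = [{}]  # snapshot 0: empty table, partition -> rows

    def latest_id(self):
        return len(self.snapshots) - 1

    def try_commit(self, base_id, changes):
        """Publish a new snapshot only if base_id is still the latest.

        Returns the new snapshot id, or None if another writer got in first.
        """
        with self._lock:
            if base_id != self.latest_id():
                return None
            state = {p: list(rows) for p, rows in self.snapshots[base_id].items()}
            for partition, rows in changes.items():
                state.setdefault(partition, []).extend(rows)
            self.snapshots.append(state)
            return self.latest_id()

def commit_with_retry(catalog, changes, max_attempts=5):
    """Prepare against the latest snapshot, then publish; retry on conflict."""
    for _ in range(max_attempts):
        new_id = catalog.try_commit(catalog.latest_id(), changes)
        if new_id is not None:
            return new_id
    raise RuntimeError("commit failed after retries")

catalog = ToyCatalog()
first = commit_with_retry(catalog, {"dt=2024-10-12": ["row-a"]})
stale = catalog.try_commit(0, {"dt=2024-10-12": ["row-b"]})  # based on an old snapshot
second = commit_with_retry(catalog, {"dt=2024-10-12": ["row-b"]})
print(first, stale, second, catalog.snapshots[second])
```

A writer whose base snapshot is stale is rejected and has to reapply its changes on top of the newer snapshot, so both writers' rows end up in the final state.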

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
