
How Paimon’s Column‑Separation Architecture Powers Real‑Time Multi‑Modal Lakehouse for AI

This article explains the challenges of frequent column changes in AI feature engineering, introduces Paimon’s column‑separation storage built on a globally unique, continuous Row ID, details its Blob data type for efficient multi‑modal handling, and outlines production results and the future roadmap for building an AI‑native data lakehouse.

Alibaba Cloud Big Data AI Platform

1. Column Change Challenges in Structured Scenarios

In AI applications such as recommendation and advertising, feature engineering evolves continuously, leading to frequent addition of new columns (e.g., "user click‑category distribution in the last 7 days" or "cross‑device behavior consistency score"). This causes a "column explosion" where table schemas expand rapidly while historical data must align with new features.

Existing solutions have drawbacks:

Primary‑key partial‑update: Supports updating specific columns by primary key, but the LSM‑tree implementation generates many small files under high write rates, sharply degrading query performance; compaction merges these files but incurs temporary storage overhead of several times the data size.

ODPS new‑feature table + Join: Writes new features to a separate table and joins on primary key at query time. While this avoids rewrites, the join becomes prohibitively expensive on petabyte‑scale data and is hard to optimize.

Append table + MERGE INTO: Offers simple SQL syntax, yet still rewrites entire data files. For daily PB‑scale training sets, full rewrites are costly and significantly slow feature rollout.

All these approaches fail to decouple the physical storage of columns, limiting flexibility and efficiency.

2. Paimon’s Column‑Separation Architecture with Global Row ID

Paimon introduces a column‑separation storage architecture centered on a globally unique, continuous Row ID. Each row receives an immutable ID at first write, Row IDs are sequential within each data file, and metadata records the starting Row ID of each file.

This design provides two key capabilities:

Precise row location: The Row ID directly maps to the specific file and offset, enabling fast access.

Cross‑file automatic association: When a query involves multiple columns, the engine uses Row ID ranges to merge column data stored in different files at the storage layer.

For example, adding a new "user interest tag" column only requires writing a new file containing that column and its Row IDs, without touching existing feature files. At query time, the engine transparently aligns files by Row ID, eliminating SQL‑level joins and rewrites of historical data; this cuts column‑change storage cost from O(N) (a full‑table rewrite) to O(ΔN) (only the new column's data) and dramatically accelerates feature iteration, as the sketch below illustrates.
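To make the mechanism concrete, here is a minimal Python sketch of the idea (not Paimon's actual API; ColumnFile, locate, and read_row are hypothetical names). File metadata records each file's starting Row ID, so a row can be located by binary search, and an old feature file can be aligned with a later‑added column file without a SQL join:

```python
import bisect
from dataclasses import dataclass

# Hypothetical stand-in for Paimon's file metadata: each data file
# holds a contiguous run of rows, and table metadata records the
# starting Row ID of every file.
@dataclass
class ColumnFile:
    first_row_id: int   # starting Row ID recorded in metadata
    rows: list          # column values, one per consecutive Row ID

def locate(files: list[ColumnFile], row_id: int):
    """Map a global Row ID to (file, in-file offset) via binary
    search over the per-file starting Row IDs -- no full scan."""
    starts = [f.first_row_id for f in files]
    i = bisect.bisect_right(starts, row_id) - 1
    f = files[i]
    return f, row_id - f.first_row_id

def read_row(base: list[ColumnFile], new_col: list[ColumnFile], row_id: int):
    """Align an existing feature file with a newly added column file
    by Row ID at the storage layer, instead of a SQL-level join."""
    bf, bo = locate(base, row_id)
    nf, no = locate(new_col, row_id)
    return (*bf.rows[bo], nf.rows[no])

# Existing feature files (user_id, clicks) plus a later-added
# "interest tag" column written as a separate file.
base = [ColumnFile(0, [("u1", 3), ("u2", 5)]),
        ColumnFile(2, [("u3", 1)])]
tags = [ColumnFile(0, ["sports", "books", "music"])]

print(read_row(base, tags, 2))   # ('u3', 1, 'music')
```

Note how adding the tags column touched none of the base files: the new file only needs to cover the same Row ID range.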

3. Blob Data Type for Multi‑Modal Data

AI training increasingly relies on unstructured data such as images, short videos, and long audio, which varies widely in size (from megabytes to tens of gigabytes) and is accessed sparsely (often only small fragments are read). Traditional columnar formats like Parquet intermix multi‑modal blobs with structured fields, forcing entire large files to be loaded even when only a user ID is needed, resulting in poor I/O efficiency.

Paimon’s Blob data type addresses this with three breakthroughs:

Physical separation storage: Blob columns are stored as independent files, completely decoupled from structured data, so queries on structured fields do not involve Blob I/O.

Unified engine abstraction: All compute engines (Spark, Flink, Java SDK, Python client) define Blob fields using the same type identifiers, e.g., BYTES or BINARY, simplifying integration.

Blob‑as‑descriptor mechanism: For very large unstructured objects (e.g., multi‑GB videos), the system records external storage metadata (location, path, offset, length) and streams the data on demand, avoiding memory overflow and enabling efficient lake ingestion.

These capabilities allow efficient handling of multi‑modal data without sacrificing query performance.
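As an illustration of the blob‑as‑descriptor mechanism, here is a minimal Python sketch assuming a local file stands in for external object storage; BlobDescriptor and read_fragment are hypothetical names, and the descriptor fields mirror the metadata listed above (location, offset, length):

```python
import os
import tempfile
from dataclasses import dataclass

@dataclass(frozen=True)
class BlobDescriptor:
    """Metadata recorded in the table instead of the blob bytes:
    where the object lives and which byte range belongs to this row."""
    uri: str      # external storage location, e.g. an object-store path
    offset: int   # start of this blob inside the stored object
    length: int   # blob size in bytes

def read_fragment(desc: BlobDescriptor, start: int, size: int) -> bytes:
    """Stream only the needed fragment (e.g. a video segment) on
    demand, so a multi-GB object is never loaded whole into memory."""
    assert 0 <= start and start + size <= desc.length
    with open(desc.uri, "rb") as f:   # a real client would issue a ranged GET
        f.seek(desc.offset + start)
        return f.read(size)

# Tiny demo: write an "object" and read back only a 5-byte fragment.
path = os.path.join(tempfile.mkdtemp(), "part-0.bin")
with open(path, "wb") as f:
    f.write(b"header" + b"hello-blob-bytes")

desc = BlobDescriptor(path, offset=6, length=16)
print(read_fragment(desc, 0, 5))   # b'hello'
```

A query that touches only structured fields never dereferences the descriptor at all, which is what keeps Blob I/O off the structured query path.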

4. Production Validation and Future Roadmap

Paimon Blob is already deployed at scale in core Alibaba businesses such as Taobao and Tmall, ingesting nearly 10 PB of multi‑modal data daily via the Blob Descriptor protocol, which prevents Flink or Spark from loading whole large files into memory.

Current operational challenges include:

Data duplication and deletion: Repeated uploads generate massive redundant data (estimated ~1 PB/day), requiring effective deduplication and deletion mechanisms.

Small‑file fragmentation: Frequent tiny writes produce a large number of micro‑Blob files, degrading read performance and storage efficiency.

Point‑lookup latency: Lack of fast primary‑key or vector‑based indexing hampers millisecond‑level real‑time queries.

Planned evolutions:

Point‑lookup performance optimization: Introduce a global indexing framework supporting both scalar (string, numeric) and vector indexes for AI recall; the scalar index is expected in the open‑source master branch this month.

Multi‑modal data management: Implement Deletion Vector + placeholder for safe logical deletion during compaction (see the conceptual sketch after this list), and develop a Blob Compaction mechanism to automatically merge small files, improving read performance and storage density.

Cross‑table Blob reuse: Enable multiple tables to reference the same physical Blob (e.g., a video), reducing storage duplication, though it introduces consistency challenges that are slated for long‑term optimization.
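To illustrate the planned Deletion Vector + placeholder approach, here is a conceptual Python sketch (the mechanism as described above, not the actual implementation): deleted rows are marked in a per‑file bitmap and filtered at read time, while their positions remain occupied as placeholders so the global Row ID sequence, and every column file written against it, stays aligned:

```python
# Conceptual sketch of logical deletion with a deletion vector:
# deleted slots stay occupied (placeholders), so global Row IDs of
# surviving rows never shift.
class DeletionVector:
    def __init__(self):
        self._deleted = set()   # a real system would use a compressed bitmap

    def delete(self, pos: int):
        self._deleted.add(pos)

    def is_deleted(self, pos: int) -> bool:
        return pos in self._deleted

def scan(values, dv: DeletionVector, first_row_id: int):
    """Yield (row_id, value) for live rows only; deleted slots are
    skipped at read time but never renumbered."""
    for pos, v in enumerate(values):
        if not dv.is_deleted(pos):
            yield first_row_id + pos, v

dv = DeletionVector()
dv.delete(1)   # logically drop the second row of this file
print(list(scan(["a", "b", "c"], dv, first_row_id=100)))
# [(100, 'a'), (102, 'c')]  -- Row IDs of surviving rows are unchanged
```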

Conclusion

Paimon’s evolution, from column separation for structured scenarios to the Blob abstraction for multi‑modal data, is driven by real business pain points and continuously improves engineering efficiency. It has become an efficient, flexible, and intelligent AI‑native data operating system, providing a unified foundation for large‑scale AIGC and recommendation workloads.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
