How OpenLake Redefines Data Lake Infrastructure for the AI Era
This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.
Background Introduction
Current data trends merge data lakes and warehouses into a lakehouse architecture comprising compute engines, a unified metadata layer, lake‑format storage and object storage (OSS). The industry is now moving beyond the basic lakehouse to fold AI data, search data and other modalities into a larger unified lake with consistent table, file, metadata and storage management.
Data Lake Infrastructure Evolution
Hive uses shared file storage for batch processing, managing folders for partitions and tables and storing files such as ORC, Parquet, CSV and JSON. Its management is loose, leading to a simple architecture but limited capabilities: only Insert Overwrite, poor consistency, and no fine‑grained updates.
From 2019‑2020 the Iceberg community focused on fixing Hive’s file‑management shortcomings, providing fine‑grained file handling. Other lake formats—Delta Lake, Hudi and Paimon—also manage files at a fine granularity.
Fine‑grained management enables ACID transactions, clear file addition/removal, atomic commits, conflict detection, and supports Delete and Update. Compared with CDC‑style full‑partition rewrites, lake formats allow lightweight updates by rewriting only the affected files, creating snapshots and new commits.
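The copy‑on‑write idea behind these lightweight updates can be sketched in a few lines of plain Python. This is an illustrative toy, not real Iceberg or Paimon internals: a snapshot is an immutable list of data files, and an update rewrites only the files that contain affected rows, then commits a new snapshot that reuses every untouched file.

```python
import itertools

_file_ids = itertools.count()

def make_file(rows):
    """A 'data file' is an immutable (file_id, rows) pair."""
    return (next(_file_ids), tuple(rows))

def update(snapshot, key, new_value):
    """Rewrite only the files containing `key`; share the rest unchanged."""
    new_files = []
    for f in snapshot:
        file_id, rows = f
        if any(k == key for k, _ in rows):
            rewritten = tuple((k, new_value if k == key else v) for k, v in rows)
            new_files.append(make_file(rewritten))  # new file replaces old one
        else:
            new_files.append(f)                     # untouched file is shared
    return new_files                                # the new snapshot

s0 = [make_file([(1, "a"), (2, "b")]), make_file([(3, "c"), (4, "d")])]
s1 = update(s0, key=2, new_value="B")
# Only the first file was rewritten; the second is shared by both snapshots,
# and s0 remains readable as-is (the basis of time travel).
```

Because the old snapshot still references the old files, conflict detection and atomic commits reduce to swapping one snapshot pointer for another.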
By 2022 Paimon (which originated as Flink Table Store) extended lake‑format capabilities beyond Iceberg, adding native streaming processing and supporting Merge Into, Delete, Update and schema evolution. Paimon can also act as a message queue, enabling near‑real‑time, minute‑level processing for both OLAP and AI workloads.
1. Hive to Lake‑Format Evolution: Batch Updates
Hive’s Insert Overwrite requires reading and writing all data, creating full copies and wasting resources. Lake formats such as Iceberg, Delta Lake, Hudi and Paimon support Merge Into, avoiding full rewrites, improving storage efficiency, and offering versioning and time‑travel capabilities.
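The contrast between Insert Overwrite and Merge Into, plus the versioning it enables, can be sketched with a toy versioned table (illustrative only, with dicts standing in for table data):

```python
class VersionedTable:
    """Toy versioned table: each commit appends an immutable snapshot,
    so any historical version can still be read (time travel)."""

    def __init__(self):
        self.snapshots = [{}]                 # version 0: empty table

    def merge_into(self, changes):
        """MERGE INTO semantics: upsert only the changed keys and commit
        a new snapshot; earlier versions stay readable."""
        new = dict(self.snapshots[-1])
        new.update(changes)
        self.snapshots.append(new)
        return len(self.snapshots) - 1        # new version id

    def read(self, version=-1):
        """Time travel: read the table as of any committed version."""
        return dict(self.snapshots[version])

t = VersionedTable()
v1 = t.merge_into({1: "a", 2: "b"})
v2 = t.merge_into({2: "B", 3: "c"})           # upsert + insert, no full rewrite
```

An Insert Overwrite, by contrast, would rewrite every row on each commit and keep no earlier version to travel back to.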
Lake formats also introduce overhead of many small files and version management, requiring maintenance to avoid storage bloat.
2. Iceberg to Paimon Evolution: Streaming Updates
Iceberg primarily supports batch updates; Paimon adds streaming updates from sources like Kafka or Flink CDC, enabling real‑time synchronization, schema evolution and low‑cost streaming sync.
Streaming updates bring challenges such as defining bucket numbers for performance; DLF provides adaptive bucketing and compaction to handle this.
3. Performance Testing on OpenLake
In OpenLake benchmarks, Paimon outperformed both Hudi and Iceberg in streaming‑update tests as well as batch TPC‑DS tests.
4. Metadata Management Evolution
Hive Metastore (HMS) is the industry standard but lacks AI capabilities and unified auditing. New catalogs such as Snowflake Polaris, Gravitino, Unity Catalog and others aim to fill these gaps.
DLF Data Lake Platform: OpenLake Storage Foundation
DLF builds on OpenLake OSS and uses Apache Paimon as the lake format, supporting Parquet, ORC, Avro and the upcoming Lance format for multimodal storage. It provides catalog, database, table, view, function and volume entities, and integrates with various compute engines (EMR, ECS, Python, PyArrow, Ray, DuckDB, etc.).
Metadata management offers a unified interface, role‑based access control and intelligent storage optimization (adaptive bucketing and compaction). Users do not need to manage buckets or compaction manually.
1. Data Lake Formation
DLF offers a data‑lake‑management platform with catalog, database, table, view, function and volume, supporting both internal (Paimon) and external (Hive‑compatible) tables.
2. Paimon REST Catalog API
Paimon 1.1 introduces a standard open‑source SDK for Java, C++ and Rust, enabling cross‑language integration and a REST API for table operations.
3. Catalog Hierarchy
Catalog → Database → Entities (Paimon Table, Format Table, View, Function). Paimon Table supports primary‑key and non‑primary‑key tables; Format Table maps to external Hive table formats.
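The hierarchy above can be modeled with a few data classes. Names and fields here are illustrative, not the actual DLF API:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str                 # "paimon_table" | "format_table" | "view" | "function"
    primary_key: tuple = ()   # only meaningful for primary-key Paimon tables

@dataclass
class Database:
    name: str
    entities: dict = field(default_factory=dict)

    def create(self, entity: Entity):
        self.entities[entity.name] = entity

@dataclass
class Catalog:
    name: str
    databases: dict = field(default_factory=dict)

    def create_database(self, name: str) -> Database:
        return self.databases.setdefault(name, Database(name))

cat = Catalog("dlf_catalog")
db = cat.create_database("sales")
db.create(Entity("orders", "paimon_table", primary_key=("order_id",)))
db.create(Entity("legacy_orders", "format_table"))  # maps to an external Hive table
```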
4. DLF Paimon Intelligent Storage Optimization
Compaction can run at write time or as an independent task; DLF disables write‑side compaction and handles it centrally via Kafka‑driven background services that generate compaction plans automatically. Key features include adaptive bucketing and adaptive merging.
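A background compaction planner of this kind can be sketched as simple bin packing. This is an illustrative heuristic, not DLF's actual algorithm: files below a small‑file threshold are grouped into rewrite tasks of roughly a target size each.

```python
def plan_compaction(file_sizes_mb, small_file_mb=32, target_mb=128):
    """Group small files into compaction (rewrite) tasks of ~target_mb each."""
    small = sorted(s for s in file_sizes_mb if s < small_file_mb)
    plans, current, current_size = [], [], 0
    for size in small:
        current.append(size)
        current_size += size
        if current_size >= target_mb:       # task is big enough: emit it
            plans.append(current)
            current, current_size = [], 0
    if len(current) > 1:                    # a lone leftover isn't worth rewriting
        plans.append(current)
    return plans

# Large files (>= 32 MB here) are left alone; only the small ones get merged.
plans = plan_compaction([4, 8, 8, 16, 16, 200, 30, 30, 300])
```

Running this as a centralized service, rather than at write time, keeps ingestion latency stable while still bounding small‑file growth.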
Intelligent Adaptive Bucketing
Recent Paimon versions support postponed bucketing: writes first land in a temporary area, after which DLF determines the bucket size and count and moves the data into place, freeing users from bucket configuration and compaction‑stability concerns.
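The adaptive part can be sketched as follows. The sizing heuristic and the staging flow here are illustrative assumptions, not DLF's real policy:

```python
import hashlib

def choose_bucket_count(total_mb, target_bucket_mb=256, max_buckets=1024):
    """Pick a bucket count from observed data size instead of asking the
    user up front (ceil division, clamped to a sane range)."""
    return max(1, min(max_buckets, -(-total_mb // target_bucket_mb)))

def assign_bucket(key, num_buckets):
    """Stable hash -> bucket id, so a given key always lands in one bucket."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Writes first accumulate in a temporary (postponed) area; once enough data
# is observed, the service fixes the bucket count and redistributes the rows.
staged_rows = [(k, f"value-{k}") for k in range(1000)]
total_mb = 900                         # pretend size of the staged data
n = choose_bucket_count(total_mb)      # 900 MB / 256 MB target -> 4 buckets
buckets = {}
for key, value in staged_rows:
    buckets.setdefault(assign_bucket(key, n), []).append((key, value))
```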
5. Performance Comparison: DLF Managed vs OSS Self‑Built
Compared with self‑built solutions on OSS, DLF reports ~10 ms metadata access latency (about 10× faster), over 15 % better query performance overall, over 30 % lower storage cost, and up to roughly 2× query speed in some scenarios.
Multimodal Data Lake
Parquet, while dominant for columnar storage, struggles with large unstructured data (audio, video) and random‑access workloads required by machine‑learning training. It can cause OOM for 100‑200 MB blobs and incurs ~90 % wasted I/O for random sampling.
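The ~90 % figure follows from a simplified model of row‑group reads (ignoring column pruning and page‑level indexes, which only partially mitigate the problem): if fetching one sampled row forces decoding a whole row group, everything else in that group is wasted I/O.

```python
def wasted_io_fraction(row_group_rows, rows_needed_per_group=1):
    """Fraction of decoded rows that random sampling throws away when a
    whole row group must be read per sampled row (simplified model)."""
    wasted = row_group_rows - rows_needed_per_group
    return wasted / row_group_rows

# Decoding ~10 rows per group to obtain 1 needed row wastes ~90% of the I/O;
# with larger groups the waste approaches 100%.
frac = wasted_io_fraction(row_group_rows=10)
```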
The Lance format addresses these challenges with three breakthroughs: native multimodal storage for large blobs, O(1) random‑access via a global index, and an append‑only column architecture with versioning for schema evolution.
Lance stores data in a single file without row‑group buffering, avoiding memory‑intensive operations and OOM risks. Random access uses offset and ID lookups, sacrificing some compression for point‑query speed.
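An offset‑based point read can be sketched with a toy fixed‑width offset table, in the spirit of Lance's design but not its real on‑disk encoding: row i is located with one index lookup and one byte slice, no row‑group scan.

```python
import struct

def write_column(values):
    """Serialize variable-length values plus a fixed-width uint64 offset table."""
    data = b"".join(values)
    offsets, pos = [], 0
    for v in values:
        offsets.append(pos)
        pos += len(v)
    offsets.append(pos)                               # end sentinel
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    return data, index

def read_row(data, index, i):
    """O(1) point read: two adjacent offset lookups, then one byte slice."""
    start, end = struct.unpack_from("<2Q", index, i * 8)
    return data[start:end]

data, index = write_column([b"cat", b"elephant", b"ox"])
```

Fixed‑width offsets are what make the lookup constant time; the trade‑off, as noted above, is weaker compression than block‑encoded formats.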
Lance enables column addition without rewriting the whole table by storing each column in separate files; a simple create‑append‑add‑column operation adds new columns efficiently.
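The column‑per‑file idea can be sketched with a toy in‑memory layout (illustrative, not Lance's actual format): each column lives in its own "file", so adding a column writes exactly one new file and never touches the existing ones.

```python
class ColumnarTable:
    """Toy column-per-file table: lists stand in for on-disk column files."""

    def __init__(self):
        self.column_files = {}          # column name -> list of values ("file")

    def create(self, name, values):
        self.column_files[name] = list(values)

    def add_column(self, name, values):
        """Schema evolution: only the new column's file is written."""
        before = {c: id(f) for c, f in self.column_files.items()}
        self.column_files[name] = list(values)
        # Every pre-existing column file object is left untouched.
        assert all(id(self.column_files[c]) == i for c, i in before.items())

    def row(self, i):
        return {c: f[i] for c, f in self.column_files.items()}

t = ColumnarTable()
t.create("id", [1, 2, 3])
t.create("image_uri", ["a.png", "b.png", "c.png"])
t.add_column("embedding_norm", [0.1, 0.7, 0.3])   # no rewrite of id/image_uri
```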
Integration of Paimon and Lance in DLF will provide multimodal support, fast random access, large‑field storage and AI‑friendly capabilities, with an initial release expected in July.
For more details, visit https://www.aliyun.com/product/bigdata/dlf .
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.