How OpenLake Redefines Data Lake Infrastructure for the AI Era
This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.
Background Introduction
Current data trends merge data lakes and warehouses into a lakehouse architecture comprising compute engines, a unified metadata layer, lake‑format storage and object storage (OSS). The industry is now moving beyond the basic lakehouse to fold AI data, search data and other modalities into a larger unified lake with consistent table, file, metadata and storage management.
Data Lake Infrastructure Evolution
Hive uses shared file storage for batch processing, managing folders for partitions and tables and storing files such as ORC, Parquet, CSV and JSON. Its management is loose, leading to a simple architecture but limited capabilities: only Insert Overwrite, poor consistency, and no fine‑grained updates.
From 2019‑2020 the Iceberg community focused on fixing Hive’s file‑management shortcomings, providing fine‑grained file handling. Other lake formats—Delta Lake, Hudi and Paimon—also manage files at a fine granularity.
Fine‑grained management enables ACID transactions, clear file addition/removal, atomic commits, conflict detection, and supports Delete and Update. Compared with CDC‑style full‑partition rewrites, lake formats allow lightweight updates by rewriting only the affected files, creating snapshots and new commits.
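The copy‑on‑write idea behind these lightweight updates can be sketched in a few lines of plain Python. This is an illustrative toy, not real Iceberg or Paimon internals: a snapshot is an immutable list of data files, and an update rewrites only the files that contain affected rows, then commits a new snapshot that reuses every untouched file.

```python
import itertools

_file_ids = itertools.count()

def make_file(rows):
    """A 'data file' is an immutable (file_id, rows) pair."""
    return (next(_file_ids), tuple(rows))

def update(snapshot, key, new_value):
    """Rewrite only the files containing `key`; share the rest unchanged."""
    new_files = []
    for f in snapshot:
        file_id, rows = f
        if any(k == key for k, _ in rows):
            rewritten = tuple((k, new_value if k == key else v) for k, v in rows)
            new_files.append(make_file(rewritten))  # new file replaces old one
        else:
            new_files.append(f)                     # untouched file is shared
    return new_files                                # the new snapshot

s0 = [make_file([(1, "a"), (2, "b")]), make_file([(3, "c"), (4, "d")])]
s1 = update(s0, key=2, new_value="B")
# Only the first file was rewritten; the second is shared by both snapshots,
# and s0 remains readable as-is (the basis of time travel).
```

Because the old snapshot still references the old files, conflict detection and atomic commits reduce to swapping one snapshot pointer for another.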
By 2022 Paimon (which originated as Flink Table Store) extended lake‑format capabilities beyond Iceberg, adding native streaming processing and supporting Merge Into, Delete, Update and schema evolution. Paimon can also act as a message queue, enabling near‑real‑time, minute‑level processing for both OLAP and AI workloads.
1. Hive to Lake‑Format Evolution: Batch Updates
Hive’s Insert Overwrite requires reading and writing all data, creating full copies and wasting resources. Lake formats such as Iceberg, Delta Lake, Hudi and Paimon support Merge Into, avoiding full rewrites, improving storage efficiency, and offering versioning and time‑travel capabilities.
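The contrast between Insert Overwrite and Merge Into, plus the versioning it enables, can be sketched with a toy versioned table (illustrative only, with dicts standing in for table data):

```python
class VersionedTable:
    """Toy versioned table: each commit appends an immutable snapshot,
    so any historical version can still be read (time travel)."""

    def __init__(self):
        self.snapshots = [{}]                 # version 0: empty table

    def merge_into(self, changes):
        """MERGE INTO semantics: upsert only the changed keys and commit
        a new snapshot; earlier versions stay readable."""
        new = dict(self.snapshots[-1])
        new.update(changes)
        self.snapshots.append(new)
        return len(self.snapshots) - 1        # new version id

    def read(self, version=-1):
        """Time travel: read the table as of any committed version."""
        return dict(self.snapshots[version])

t = VersionedTable()
v1 = t.merge_into({1: "a", 2: "b"})
v2 = t.merge_into({2: "B", 3: "c"})           # upsert + insert, no full rewrite
```

An Insert Overwrite, by contrast, would rewrite every row on each commit and keep no earlier version to travel back to.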
Lake formats also introduce overhead of many small files and version management, requiring maintenance to avoid storage bloat.
2. Iceberg to Paimon Evolution: Streaming Updates
Iceberg primarily supports batch updates; Paimon adds streaming updates from sources like Kafka or Flink CDC, enabling real‑time synchronization, schema evolution and low‑cost streaming sync.
Streaming updates bring challenges such as defining bucket numbers for performance; DLF provides adaptive bucketing and compaction to handle this.
3. Performance Testing on OpenLake
In OpenLake benchmarks, Paimon outperformed both Hudi and Iceberg in streaming‑update tests as well as batch TPC‑DS tests.
4. Metadata Management Evolution
Hive Metastore (HMS) is the industry standard but lacks AI capabilities and unified auditing. New catalogs such as Snowflake Polaris, Gravitino, Unity Catalog and others aim to fill these gaps.
DLF Data Lake Platform: OpenLake Storage Foundation
DLF builds on OpenLake OSS and uses Apache Paimon as the lake format, supporting Parquet, ORC, Avro and the upcoming Lance format for multimodal storage. It provides catalog, database, table, view, function and volume entities, and integrates with various compute engines (EMR, ECS, Python, PyArrow, Ray, DuckDB, etc.).
Metadata management offers a unified interface, role‑based access control and intelligent storage optimization (adaptive bucketing and compaction). Users do not need to manage buckets or compaction manually.
1. Data Lake Formation
DLF offers a data‑lake‑management platform with catalog, database, table, view, function and volume, supporting both internal (Paimon) and external (Hive‑compatible) tables.
2. Paimon REST Catalog API
Paimon 1.1 introduces a standard open‑source SDK for Java, C++ and Rust, enabling cross‑language integration and a REST API for table operations.
3. Catalog Hierarchy
Catalog → Database → Entities (Paimon Table, Format Table, View, Function). Paimon Table supports primary‑key and non‑primary‑key tables; Format Table maps to external Hive table formats.
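The hierarchy above can be modeled with a few data classes. Names and fields here are illustrative, not the actual DLF API:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str                 # "paimon_table" | "format_table" | "view" | "function"
    primary_key: tuple = ()   # only meaningful for primary-key Paimon tables

@dataclass
class Database:
    name: str
    entities: dict = field(default_factory=dict)

    def create(self, entity: Entity):
        self.entities[entity.name] = entity

@dataclass
class Catalog:
    name: str
    databases: dict = field(default_factory=dict)

    def create_database(self, name: str) -> Database:
        return self.databases.setdefault(name, Database(name))

cat = Catalog("dlf_catalog")
db = cat.create_database("sales")
db.create(Entity("orders", "paimon_table", primary_key=("order_id",)))
db.create(Entity("legacy_orders", "format_table"))  # maps to an external Hive table
```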
4. DLF Paimon Intelligent Storage Optimization
Compaction can run at write time or as an independent task; DLF disables write‑side compaction and handles it centrally via Kafka‑driven background services that generate compaction plans automatically. Key features include adaptive bucketing and adaptive merging.
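A background compaction planner of this kind can be sketched as simple bin packing. This is an illustrative heuristic, not DLF's actual algorithm: files below a small‑file threshold are grouped into rewrite tasks of roughly a target size each.

```python
def plan_compaction(file_sizes_mb, small_file_mb=32, target_mb=128):
    """Group small files into compaction (rewrite) tasks of ~target_mb each."""
    small = sorted(s for s in file_sizes_mb if s < small_file_mb)
    plans, current, current_size = [], [], 0
    for size in small:
        current.append(size)
        current_size += size
        if current_size >= target_mb:       # task is big enough: emit it
            plans.append(current)
            current, current_size = [], 0
    if len(current) > 1:                    # a lone leftover isn't worth rewriting
        plans.append(current)
    return plans

# Large files (>= 32 MB here) are left alone; only the small ones get merged.
plans = plan_compaction([4, 8, 8, 16, 16, 200, 30, 30, 300])
```

Running this as a centralized service, rather than at write time, keeps ingestion latency stable while still bounding small‑file growth.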
Intelligent Adaptive Bucketing
Recent Paimon versions support postponed bucketing: writes first land in a temporary area, after which DLF determines the bucket size and count and moves the data into place, freeing users from bucket configuration and compaction‑stability concerns.
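The adaptive part can be sketched as follows. The sizing heuristic and the staging flow here are illustrative assumptions, not DLF's real policy:

```python
import hashlib

def choose_bucket_count(total_mb, target_bucket_mb=256, max_buckets=1024):
    """Pick a bucket count from observed data size instead of asking the
    user up front (ceil division, clamped to a sane range)."""
    return max(1, min(max_buckets, -(-total_mb // target_bucket_mb)))

def assign_bucket(key, num_buckets):
    """Stable hash -> bucket id, so a given key always lands in one bucket."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_buckets

# Writes first accumulate in a temporary (postponed) area; once enough data
# is observed, the service fixes the bucket count and redistributes the rows.
staged_rows = [(k, f"value-{k}") for k in range(1000)]
total_mb = 900                         # pretend size of the staged data
n = choose_bucket_count(total_mb)      # 900 MB / 256 MB target -> 4 buckets
buckets = {}
for key, value in staged_rows:
    buckets.setdefault(assign_bucket(key, n), []).append((key, value))
```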
5. Performance Comparison: DLF Managed vs OSS Self‑Built
Compared with self‑built solutions on OSS, DLF reports ~10 ms metadata access latency (about 10× faster), over 15 % better query performance overall, over 30 % lower storage cost, and up to roughly 2× query speed in some scenarios.
Multimodal Data Lake
Parquet, while dominant for columnar storage, struggles with large unstructured data (audio, video) and random‑access workloads required by machine‑learning training. It can cause OOM for 100‑200 MB blobs and incurs ~90 % wasted I/O for random sampling.
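The ~90 % figure follows from a simplified model of row‑group reads (ignoring column pruning and page‑level indexes, which only partially mitigate the problem): if fetching one sampled row forces decoding a whole row group, everything else in that group is wasted I/O.

```python
def wasted_io_fraction(row_group_rows, rows_needed_per_group=1):
    """Fraction of decoded rows that random sampling throws away when a
    whole row group must be read per sampled row (simplified model)."""
    wasted = row_group_rows - rows_needed_per_group
    return wasted / row_group_rows

# Decoding ~10 rows per group to obtain 1 needed row wastes ~90% of the I/O;
# with larger groups the waste approaches 100%.
frac = wasted_io_fraction(row_group_rows=10)
```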
The Lance format addresses these challenges with three breakthroughs: native multimodal storage for large blobs, O(1) random‑access via a global index, and an append‑only column architecture with versioning for schema evolution.
Lance stores data in a single file without row‑group buffering, avoiding memory‑intensive operations and OOM risks. Random access uses offset and ID lookups, sacrificing some compression for point‑query speed.
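An offset‑based point read can be sketched with a toy fixed‑width offset table, in the spirit of Lance's design but not its real on‑disk encoding: row i is located with one index lookup and one byte slice, no row‑group scan.

```python
import struct

def write_column(values):
    """Serialize variable-length values plus a fixed-width uint64 offset table."""
    data = b"".join(values)
    offsets, pos = [], 0
    for v in values:
        offsets.append(pos)
        pos += len(v)
    offsets.append(pos)                               # end sentinel
    index = struct.pack(f"<{len(offsets)}Q", *offsets)
    return data, index

def read_row(data, index, i):
    """O(1) point read: two adjacent offset lookups, then one byte slice."""
    start, end = struct.unpack_from("<2Q", index, i * 8)
    return data[start:end]

data, index = write_column([b"cat", b"elephant", b"ox"])
```

Fixed‑width offsets are what make the lookup constant time; the trade‑off, as noted above, is weaker compression than block‑encoded formats.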
Lance enables column addition without rewriting the whole table by storing each column in separate files; a simple create‑append‑add‑column operation adds new columns efficiently.
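The column‑per‑file idea can be sketched with a toy in‑memory layout (illustrative, not Lance's actual format): each column lives in its own "file", so adding a column writes exactly one new file and never touches the existing ones.

```python
class ColumnarTable:
    """Toy column-per-file table: lists stand in for on-disk column files."""

    def __init__(self):
        self.column_files = {}          # column name -> list of values ("file")

    def create(self, name, values):
        self.column_files[name] = list(values)

    def add_column(self, name, values):
        """Schema evolution: only the new column's file is written."""
        before = {c: id(f) for c, f in self.column_files.items()}
        self.column_files[name] = list(values)
        # Every pre-existing column file object is left untouched.
        assert all(id(self.column_files[c]) == i for c, i in before.items())

    def row(self, i):
        return {c: f[i] for c, f in self.column_files.items()}

t = ColumnarTable()
t.create("id", [1, 2, 3])
t.create("image_uri", ["a.png", "b.png", "c.png"])
t.add_column("embedding_norm", [0.1, 0.7, 0.3])   # no rewrite of id/image_uri
```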
Integration of Paimon and Lance in DLF will provide multimodal support, fast random access, large‑field storage and AI‑friendly capabilities, with an initial release expected in July.
For more details, visit https://www.aliyun.com/product/bigdata/dlf .
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.