Revolutionizing AI Data Lakes: How Daft + Lance Enable Multimodal Processing
This article explores how the LAS team's AI‑driven data lake solution, built on Daft for lake computing and Lance for lake storage, tackles the emerging challenges of multimodal data handling, offering faster I/O, heterogeneous CPU‑GPU scheduling, and seamless integration for AI workloads.
AI‑Driven Data Lake Transformation
In the era of rapid AI advancement, data serves as the "fuel" for AI, and both its form and its processing methods are changing profoundly.
This article is based on the LAS team's presentation "Multimodal Data Processing in AI Scenarios" at the 2025 AICon Global AI Development and Application Conference, which introduced a multimodal data lake solution built on Daft and Lance.
Challenges and Shifts in Data Lake Scenarios
Traditional data lakes focus on structured data for big-data workloads, but AI scenarios impose new requirements across storage, compute, and data management. Storage must handle increasingly diverse sources, including multimodal data such as images and videos, and support unified storage of structured and unstructured data. And whereas big data emphasizes cost reduction, AI prioritizes data read I/O speed to keep model training and inference efficient.
On the compute side, AI introduces heterogeneous CPU‑GPU workloads driven by models. As business logic shifts toward AI, data processing moves from SQL to Dataframe‑centric approaches. Data management now covers files, functions, and other entities beyond traditional databases and tables, and applications are expanding toward agents and embodied intelligence.
LAS: LakeHouse AI Service
LAS proposes a comprehensive AI data lake comprising three modules: lake compute, lake storage, and lake management. The focus here is on lake compute (Daft) and lake storage (Lance).
Daft: Lake Compute Engine for Multimodal Data
Daft, built on Ray, addresses four core needs in AI scenarios:
Scalable from single‑node to distributed execution: Quickly expand from debugging on a single machine to large‑scale distributed processing.
Unified handling of multimodal and structured data: Process images, videos, and tabular data within a single framework.
Heterogeneous CPU‑GPU scheduling: Coordinate CPU and GPU operators in one workflow, maximizing hardware utilization.
Bridging big‑data and AI teams: Enable collaboration between SQL‑oriented big‑data engineers and Dataframe‑focused AI developers.
Daft’s architecture leverages Ray for distributed scaling, integrates big‑data optimizations, and implements a native Rust execution engine while preserving Python ecosystem familiarity (e.g., Pandas, Polars).
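To make this concrete, here is a minimal sketch of what a Daft pipeline of this shape could look like, scaling from local debugging to a Ray cluster. The bucket path, column names, and Ray address are placeholders, and the exact APIs may differ between Daft versions.

```python
# Minimal sketch of a Daft pipeline that scales from a laptop to a Ray cluster.
# Paths, column names, and the Ray address are placeholders; exact APIs may
# vary by Daft version.
import daft

# For local debugging, the default (native) runner is enough.
# To scale out, point Daft at an existing Ray cluster instead, e.g.:
# daft.context.set_runner_ray(address="ray://head-node:10001")

# Read tabular metadata (e.g., image URLs plus labels) from Parquet.
df = daft.read_parquet("s3://my-bucket/metadata/*.parquet")

# Standard DataFrame-style transforms run lazily until collect()/show().
df = df.where(daft.col("label") == "vehicle")
df.show(5)
```

The same script runs unchanged on one machine or on the cluster; only the runner configuration differs.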
Key capabilities include:
Distributed execution of existing Python scripts, with stateless (task) and stateful (class) UDFs that reduce model‑loading overhead.
CPU‑GPU heterogeneous scheduling for seamless data preparation and model training.
Lazy multimodal computation: data is referenced by URL or row ID and only fetched when needed, dramatically cutting unnecessary I/O and memory usage.
Support for multimodal operations such as video key‑frame extraction and image resizing without loading full assets (sketched below).
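The sketch below illustrates the stateful-UDF and lazy-multimodal ideas together, assuming Daft's URL/image expression helpers and its UDF decorator. The bucket path, column names, and the trivial per-worker "model" are illustrative, and the decorator's exact signature may differ across Daft releases.

```python
import daft
from daft import col, DataType

# Rows carry only lightweight references (URLs); bytes are fetched lazily.
df = daft.read_parquet("s3://my-bucket/metadata/*.parquet")

# Download and decode images only for rows that survive earlier filters,
# then resize without materializing full-size assets everywhere.
df = df.with_column("image", col("image_url").url.download().image.decode())
df = df.with_column("thumb", col("image").image.resize(224, 224))

# A stateful (class-based) UDF initializes once per worker process rather
# than once per batch; in a real pipeline __init__ would load a model.
@daft.udf(return_dtype=DataType.int64())
class PixelCounter:
    def __init__(self):
        # Stand-in for expensive, load-once state (e.g., model weights).
        self.channels = 3

    def __call__(self, images):
        # Assumes decoded images arrive as HWC arrays with 3 channels.
        return [img.size // self.channels if img is not None else None
                for img in images.to_pylist()]

df = df.with_column("pixels", PixelCounter(col("thumb")))
df.collect()
```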
Lance: Lake Storage Solution for Multimodal Data
Lance addresses three core storage challenges:
Columnar storage for multimodal data: Achieves high compression (e.g., 100 GB of tensor data reduced to 2 GB).
Unified storage of large‑column data: Stores images and their metadata together, enabling fast point lookups for training.
Zero‑copy schema evolution: Allows schema changes without costly data copying (see the sketch after this list).
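A small sketch of how these properties surface in Lance's Python API. The table contents and paths are made up, and the merge-based column backfill shown is one documented schema-evolution path rather than the only one.

```python
import lance
import pyarrow as pa

# Write image bytes and their metadata side by side in one columnar dataset.
table = pa.table({
    "id": [0, 1, 2],
    "caption": ["a cat", "a dog", "a bird"],
    "image": [b"\x89PNG...", b"\x89PNG...", b"\x89PNG..."],  # placeholder bytes
})
lance.write_dataset(table, "/tmp/demo.lance")

# Fast point lookups by row index: fetch only the rows a training step needs.
ds = lance.dataset("/tmp/demo.lance")
batch = ds.take([0, 2], columns=["id", "image"])

# Schema evolution: attach a new column keyed on "id" without rewriting the
# existing image data.
embeddings = pa.table({"id": [0, 1, 2], "embedding_norm": [0.7, 0.3, 0.9]})
ds.merge(embeddings, left_on="id")
```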
Practical Validation
Autonomous Driving Scenario
The previous solution used Argo + K8s scheduling with LMDB storage and suffered from a lack of CPU‑GPU heterogeneous scheduling and heavy disk I/O. Replacing it with Daft on Ray and Lance reduced end‑to‑end processing time by 70%, improving both data handling and model‑iteration efficiency.
LLM Image‑Text Mixing Scenario
Customers processing web‑scraped image‑text data found Spark's large joins unstable at this scale. Daft + Lance eliminated the need for massive joins by linking data via row IDs and loading assets on demand, resolving the stability issues and significantly speeding up processing.
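A hedged sketch of that pattern: keep the heavy image bytes in a Lance dataset, carry only row ids through the text-processing DataFrame, and pull assets back by id only when a batch actually needs them. Dataset paths, column names, and the take-based lookup below are illustrative, and collecting ids on the driver is a simplification for the sketch.

```python
import daft
import lance

# Text records carry a lightweight reference (a Lance row id) instead of the
# image bytes, so the text pipeline never shuffles multi-MB values in a join.
texts = daft.read_parquet("s3://my-bucket/webpages/*.parquet")  # has "image_row_id"
texts = texts.where(daft.col("lang") == "en")

# Only after filtering, fetch the matching images by row id from Lance.
images = lance.dataset("s3://my-bucket/images.lance")
wanted_ids = texts.select("image_row_id").to_pydict()["image_row_id"]
image_batch = images.take(wanted_ids, columns=["image"])
```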
Future Roadmap
The LAS team will continue to enhance Daft with broader multimodal support (e.g., video processing for autonomous driving) and integrate industry‑specific data types such as LeRobot and MCP. Further synergy with Lance will strengthen the "compute + storage" collaboration, and the community is invited to join the Daft and Lance Chinese open‑source communities.
For more details, readers are encouraged to follow the official Daft and Lance Chinese public accounts.
ByteDance Data Platform
The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.