Why Daft, Ray, and Lance Are Redefining Multimodal Data Pipelines
This article examines how the Daft‑Ray‑Lance stack addresses multimodal AI workloads: a high‑performance Rust engine with adaptive back‑pressure, Ray‑based distributed scheduling, and a storage format optimized for random access, vector indexing, and zero‑copy schema evolution. Benchmark comparisons and practical deployment guidance are included.
Introduction
Data engineers are seeing a rapid shift from traditional tabular pipelines to multimodal workloads that include images, audio, video, PDFs, and high‑dimensional embeddings, which dramatically increase memory and I/O demands.
1. Daft – A Multimodal Data Engine
1.1 What is Daft?
Daft is an open‑source high‑performance data engine from Eventual, written in Rust with a Python DataFrame API. It targets AI and multimodal workloads rather than trying to replace Pandas or Spark.
1.2 Why not Spark / Pandas / Polars?
Traditional engines struggle with three core issues for multimodal data:
Memory management – decoding a 5 MB JPEG into an RGB tensor can inflate data more than 20×, causing OOM errors in Spark partitions.
Resource orchestration – multimodal pipelines need coordinated CPU, GPU, and network I/O; Spark has no native GPU scheduling.
API expressiveness – in traditional engines, custom UDFs cannot declare per‑task resource requirements, which limits complex pipelines.
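The memory‑inflation point is easy to verify with back‑of‑the‑envelope arithmetic. The image dimensions below are an assumption chosen to match a typical ~5 MB JPEG, not a figure from the article:

```python
# Why decoding inflates memory: compressed JPEG vs. raw pixel buffers.
# Assumes a hypothetical 5 MB JPEG that decodes to a 4000x3000 RGB image.
jpeg_bytes = 5 * 1024 * 1024                # ~5 MiB compressed on disk

width, height, channels = 4000, 3000, 3
uint8_bytes = width * height * channels     # raw RGB, 1 byte per channel
float32_bytes = uint8_bytes * 4             # as a float32 tensor for a model

print(f"uint8 decode:   {uint8_bytes / jpeg_bytes:.1f}x the JPEG size")
print(f"float32 tensor: {float32_bytes / jpeg_bytes:.1f}x the JPEG size")
```

A raw uint8 decode is already ~7× the compressed size; casting to float32 for model input pushes it past 27×, which is where the ">20×" inflation (and the resulting partition OOMs) comes from.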
1.3 Architecture
Daft’s execution engine, Swordfish, uses a push‑based, morsel‑level streaming model. Key design decisions include:
Adaptive back‑pressure that automatically throttles upstream I/O when downstream GPU inference becomes a bottleneck.
Single‑worker full‑machine control, allowing each worker to manage CPU, GPU, and network without cross‑process copying.
Native multimodal type system (Image, Tensor, Embedding) with expressions such as col("image_url").url.download().image.decode(mode=ImageMode.RGB) that avoid extra UDFs.
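The back‑pressure idea above can be sketched with a bounded queue: a fast producer (upstream I/O) blocks as soon as a slow consumer (e.g. GPU inference) falls behind, so buffered data stays capped. This is an illustrative stdlib sketch of the mechanism, not Daft's actual Swordfish implementation:

```python
import queue
import threading
import time

morsels = queue.Queue(maxsize=4)  # bounded buffer = the back-pressure valve

def producer():
    # Simulates fast upstream I/O producing small "morsels" of data.
    for i in range(16):
        morsels.put(f"morsel-{i}")  # blocks while the queue is full
    morsels.put(None)               # sentinel: end of stream

def consumer(results):
    # Simulates a slower downstream stage (e.g. GPU inference).
    while (m := morsels.get()) is not None:
        time.sleep(0.01)            # pretend work
        results.append(m)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
print(len(results))  # 16 -- all morsels processed, at most 4 ever buffered
```

Because `put()` blocks at `maxsize`, the producer's speed automatically matches the consumer's, which is the property that keeps decoded images from piling up in memory.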
1.4 Performance
Benchmarks (8 × AWS g6.xlarge, each with an L4 GPU) show Daft outperforming Spark EMR 7.10.0 and Ray Data 2.49.2 on four multimodal workloads:
| Workload | Daft | Ray Data | Spark EMR |
| --- | --- | --- | --- |
| Audio transcription | 6m 22s | 29m 20s (4.6× slower) | 25m 46s (4.0× slower) |
| Document embedding | 1m 54s | 14m 32s (7.6× slower) | 8m 4s (4.2× slower) |
| Image classification | 4m 23s | 23m 30s (5.4× slower) | 45m 7s (10.3× slower) |
| Video object detection | 11m 46s | 25m 54s (2.2× slower) | 3h 36m (18.4× slower) |

Daft also completed all workloads without failures, whereas Spark required extensive tuning and Ray Data failed on large batch sizes for document embedding.
2. Ray – The Distributed "Operating System"
2.1 Role in the Stack
Ray (UC Berkeley RISELab, maintained by Anyscale) provides the distributed scheduler. Daft runs in two modes:
```python
# Local mode – single-machine execution
import daft

df = daft.read_parquet("s3://my-bucket/data/")
df.show()
```

```python
# Ray mode – distributed execution
import daft

daft.context.set_runner_ray()
df = daft.read_parquet("s3://my-bucket/data/")
df.show()
```

Switching to Ray mode is a single line: daft.context.set_runner_ray(). In Ray mode, each Daft worker becomes a Ray actor that owns the full machine's resources.
2.2 Why Choose Ray?
Building a fault‑tolerant, resource‑aware scheduler from scratch is costly. Ray offers:
Heterogeneous resource scheduling (CPU, GPU, memory).
Elastic autoscaling with AWS, GCP, or Kubernetes.
Actor model for stateful tasks, useful for loading large model weights once.
Rich ecosystem (Ray Train, Ray Serve, Ray Tune) covering the full ML lifecycle.
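The actor point deserves a concrete illustration: expensive state (model weights) is built once in the constructor and reused across every subsequent call. The sketch below shows the pattern in plain Python; with Ray, such a class would be decorated with @ray.remote and each method call would become a remote invocation on a long‑lived worker. The "weights" and timings here are stand‑ins, not a real model:

```python
import time

class EmbeddingActor:
    """Stateful worker: loads (simulated) model weights once, reuses them."""

    def __init__(self):
        time.sleep(0.1)                 # stand-in for an expensive model load
        self.weights = [0.1, 0.2, 0.3]  # pretend weights

    def embed(self, texts):
        # Reuses self.weights on every call -- no reload per batch.
        return [[w * len(t) for w in self.weights] for t in texts]

actor = EmbeddingActor()                # pays the load cost exactly once
batch1 = actor.embed(["hello", "world"])
batch2 = actor.embed(["daft"])
print(len(batch1), len(batch2))  # 2 1
```

Without this pattern, a per‑task function would reload the weights for every batch, which is exactly the overhead the actor model eliminates.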
2.3 Future – Flotilla Engine
In 2025, Daft introduced Flotilla, a lightweight distributed execution layer that remains Ray‑compatible for teams that already operate a Ray cluster, but can also run without Ray.
3. Lance – An AI‑Optimized Storage Format
3.1 Why a New Format?
Parquet/ORC excel at columnar analytics but lack native support for random access, vector indexing, and multimodal data. Lance (LanceDB) provides a columnar file format, table format, and catalog spec in one, enabling AI‑heavy workloads.
3.2 Key Differences
| Feature | Parquet | Lance |
| --- | --- | --- |
| Random access | Slow (row‑group scan) | ~100× faster (no row groups) |
| Vector indexing | Not supported | Native ANN support |
| Full‑text search | Not supported | Native support |
| Data versioning | Requires Delta/Iceberg | Built‑in (manifest per commit) |
| Schema evolution | Rewrite whole table | Zero‑copy column addition |
| Multimodal storage | URLs only (external) | Inline image/tensor/embedding storage |

3.3 Practical Benefits
Random access is ~100× faster, crucial for ML training that samples rows.
Native ANN indexing removes the need for external vector databases.
Zero‑copy schema evolution lets you add a 1 GB embedding column to a 100 GB table by writing only the new column.
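The zero‑copy mechanism can be modeled in a few lines: each commit writes a new manifest that references the existing column files plus the new one, so no existing data is rewritten. This is a toy model of the concept, not the Lance on‑disk layout, and the file paths are hypothetical:

```python
# Toy model of manifest-based schema evolution:
# a dataset version is just a manifest mapping column name -> data file.
dataset_versions = []

def commit(manifest):
    dataset_versions.append(dict(manifest))
    return len(dataset_versions)  # version number

# v1: original table, two column files already on object storage
v1 = commit({"image_url": "s3://bucket/cols/image_url-0.bin",
             "caption":   "s3://bucket/cols/caption-0.bin"})

# v2: add an embedding column -- write ONLY the new column file,
# then commit a manifest that reuses the old files untouched.
new_manifest = dict(dataset_versions[-1])
new_manifest["embedding"] = "s3://bucket/cols/embedding-0.bin"
v2 = commit(new_manifest)

# Old versions stay readable (time travel); old files were never rewritten.
print(sorted(dataset_versions[v1 - 1]))  # ['caption', 'image_url']
print(sorted(dataset_versions[v2 - 1]))  # ['caption', 'embedding', 'image_url']
```

This is why adding a 1 GB embedding column to a 100 GB table costs roughly 1 GB of writes: the commit touches one new column file and one small manifest.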
3.4 Complementary Strategy
Lance is not a replacement for Iceberg or Parquet; a dual‑format approach uses Iceberg/Parquet for BI and Lance for AI workloads, sharing the same object storage and catalog.
4. End‑to‑End Architecture
4.1 Overall Stack
Daft DataFrame API ← User layer (Python)
Swordfish / Flotilla ← Execution engine (Rust)
Ray Cluster ← Distributed scheduler
Lance on S3/GCS ← Storage layer

Daft reads/writes Lance files directly; Swordfish executes the plan; Ray distributes workers across the cluster.
4.2 Sample Pipeline
```python
import daft
from daft import col

# 1. Switch to Ray distributed mode
daft.context.set_runner_ray()

# 2. Read from a Lance dataset
df = daft.read_lance("s3://my-bucket/multimodal-dataset/")

# 3. Basic filtering
df = df.where(
    (col("image_url").is_not_null()) &
    (col("caption").str.length() > 10)
)

# 4. Download and decode images
df = df.with_column(
    "image",
    col("image_url").url.download().image.decode(mode=daft.ImageMode.RGB)
)

# 5. Filter out low-resolution images
df = df.where(
    (col("image").image.width() >= 224) &
    (col("image").image.height() >= 224)
)

# 6. Generate embeddings via a stateful UDF
@daft.udf(return_dtype=daft.DataType.tensor(daft.DataType.float32()))
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, texts):
        return self.model.encode(texts.to_pylist())

df = df.with_column("embedding", EmbeddingModel(col("caption")))

# 7. Select final columns and write back to Lance
df = df.select("image_url", "caption", "embedding")
df.write_lance("s3://my-bucket/curated-dataset/")
```

Key points: url.download() performs asynchronous I/O, keeping CPUs busy, and image.decode() is a high‑performance Rust implementation.
The UDF class loads the model once per worker, reusing weights.
The pipeline streams data morsel‑by‑morsel, avoiding full‑dataset materialization.
4.3 Suitable Scenarios
Multimodal dataset curation (deduplication, quality filtering, annotation).
Large‑scale embedding generation and vector search.
Incremental feature engineering with zero‑copy column addition.
High‑throughput ML training data serving (500 M+ IOPS possible).
5. Comparative Overview
| Dimension | Spark + Parquet | Ray Data | Daft + Ray + Lance |
| --- | --- | --- | --- |
| Multimodal support | No | Partial | Native (Image/Tensor/Embedding) |
| Execution model | Batch (partition materialization) | Stream (object‑store spill) | Stream with morsel‑level back‑pressure |
| Vector search | External system | External system | Native ANN |
| GPU scheduling | Weak (manual) | Declarative | Strong (worker‑level auto) |
| Learning curve | High (JVM) | Medium (low‑level API) | Low (Pandas‑style API) |
| Data versioning | Requires Delta/Iceberg | None | Built‑in manifest versioning |
| Ecosystem maturity | Very high | High | Emerging (rapid growth) |
| Community size | Huge | Large | Medium (fast‑growing) |

6. Practical Recommendations & Caveats
6.1 Daft Maturity
Daft is still young; its community, documentation, and third‑party integrations lag behind Spark. For pure SQL/ETL workloads, switching may not be justified yet.
6.2 Ray Operational Cost
Running a Ray cluster adds deployment overhead. Small teams may stay with Daft’s local mode until data size demands scaling.
6.3 Lance Ecosystem Coverage
Current compute integrations focus on Daft, Spark (via lance‑spark), and Ray (via lance‑ray). Major OLAP engines (Trino, StarRocks, Flink) still have early‑stage support.
6.4 Suggested Adoption Path
Install Daft locally (pip install daft) and run a few multimodal pipelines.
Convert a subset of Parquet data to Lance and benchmark random‑access speed.
When data grows, enable Ray with daft.context.set_runner_ray().
Adopt incrementally: start with AI/ML pipelines before replacing the entire stack.
7. Conclusion
The multimodal AI era demands a stack that can handle massive, heterogeneous data without OOM failures. Daft offers a Rust‑based, back‑pressure‑aware engine; Ray provides scalable, GPU‑aware scheduling; and Lance delivers an AI‑centric storage format with fast random access, native vector indexing, and zero‑copy schema evolution. Together they form a compelling, open‑source alternative to traditional Spark‑centric pipelines for modern data engineering.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.