Industry Insights

Why Daft, Ray, and Lance Are Redefining Multimodal Data Pipelines

This article analyzes how the Daft‑Ray‑Lance stack tackles the challenges of multimodal AI workloads. The stack pairs a high‑performance Rust engine with adaptive back‑pressure, Ray‑based distributed scheduling, and a storage format optimized for random access, vector indexing, and zero‑copy schema evolution; the article closes with benchmark comparisons and practical deployment guidance.

Big Data Technology & Architecture

Introduction

Data engineers are seeing a rapid shift from traditional tabular pipelines to multimodal workloads that include images, audio, video, PDFs, and high‑dimensional embeddings, which dramatically increase memory and I/O demands.

1. Daft – A Multimodal Data Engine

1.1 What is Daft?

Daft is an open‑source high‑performance data engine from Eventual, written in Rust with a Python DataFrame API. It targets AI and multimodal workloads rather than trying to replace Pandas or Spark.

1.2 Why not Spark / Pandas / Polars?

Traditional engines struggle with three core issues for multimodal data:

Memory management – decoding a 5 MB JPEG to an RGB tensor can inflate data >20×, causing OOM in Spark partitions.

Resource orchestration – multimodal pipelines need CPU, GPU, and network I/O to be scheduled together, but Spark has no native GPU scheduling.

API expressiveness – custom UDFs cannot specify per‑task resources, limiting complex pipelines.
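The memory‑inflation point is simple arithmetic; a heavily compressed JPEG expands to width × height × 3 bytes once decoded to an uncompressed RGB tensor (the image dimensions below are hypothetical, chosen only for illustration):

```python
# Hypothetical arithmetic: a compressed 5 MB JPEG, decoded to an uncompressed
# RGB tensor, occupies width * height * 3 bytes (one byte per channel).
def decoded_rgb_bytes(width: int, height: int) -> int:
    return width * height * 3

jpeg_on_disk = 5 * 1024 * 1024            # 5 MB JPEG on disk
decoded = decoded_rgb_bytes(8000, 6000)   # a 48-megapixel image: 144 MB decoded
inflation = decoded / jpeg_on_disk        # well over the 20x cited above
```

A single Spark partition holding a few hundred such rows can therefore blow past executor memory even though the on‑disk data looked small.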

1.3 Architecture

Daft’s execution engine, Swordfish, uses a push‑based, morsel‑level streaming model. Key design decisions include:

Adaptive back‑pressure that automatically throttles upstream I/O when downstream GPU inference becomes a bottleneck.

Single‑worker full‑machine control, allowing each worker to manage CPU, GPU, and network without cross‑process copying.

Native multimodal type system (Image, Tensor, Embedding) with expressions such as

col("image_url").url.download().image.decode(mode=ImageMode.RGB)

that avoid extra UDFs.
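The back‑pressure idea can be sketched with a bounded queue between an I/O producer and a slower consumer; this is a pure‑Python analogy of the mechanism, not Swordfish's actual implementation:

```python
# Sketch of push-based streaming with back-pressure: a bounded queue sits
# between an I/O producer and a slow (e.g. GPU-bound) consumer. When the
# consumer lags, the queue fills and the producer's put() blocks -- upstream
# I/O is throttled automatically.
import queue
import threading

morsels = queue.Queue(maxsize=4)  # small bound = tight back-pressure

def producer(n: int) -> None:
    for i in range(n):
        morsels.put(f"morsel-{i}")  # blocks while the queue is full
    morsels.put(None)               # sentinel: end of stream

def consumer(out: list) -> None:
    while (m := morsels.get()) is not None:
        out.append(m)               # stand-in for slow GPU inference

results = []
t1 = threading.Thread(target=producer, args=(10,))
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start(); t1.join(); t2.join()
```

The key property is that no more than `maxsize` morsels are ever materialized at once, regardless of how fast the producer can read.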

1.4 Performance

Benchmarks (8 × AWS g6.xlarge, each with an L4 GPU) show Daft outperforming Spark EMR 7.10.0 and Ray Data 2.49.2 on four multimodal workloads:

Workload                 Daft      Ray Data                Spark EMR
Audio transcription      6m 22s    29m 20s (4.6× slower)   25m 46s (4.0× slower)
Document embedding       1m 54s    14m 32s (7.6× slower)   8m 4s (4.2× slower)
Image classification     4m 23s    23m 30s (5.4× slower)   45m 7s (10.3× slower)
Video object detection   11m 46s   25m 54s (2.2× slower)   3h 36m (18.4× slower)

Daft also completed all workloads without failures, whereas Spark required extensive tuning and Ray Data failed on large batch sizes for document embedding.
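As a quick sanity check, the slowdown multipliers in the table follow directly from the raw wall‑clock times; here is the arithmetic for the audio‑transcription row:

```python
# Derive the "4.6x slower" figure from the raw times in the benchmark table.
def seconds(minutes: int, secs: int = 0) -> int:
    return minutes * 60 + secs

daft_audio = seconds(6, 22)         # Daft:     6m 22s
ray_audio = seconds(29, 20)         # Ray Data: 29m 20s
slowdown = ray_audio / daft_audio   # ~4.6, matching the table
```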

2. Ray – The Distributed "Operating System"

2.1 Role in the Stack

Ray (UC Berkeley RISELab, maintained by Anyscale) provides the distributed scheduler. Daft runs in two modes:

# Local mode – single‑machine execution
import daft
df = daft.read_parquet("s3://my-bucket/data/")
df.show()

# Ray mode – distributed execution
import daft
daft.context.set_runner_ray()
df = daft.read_parquet("s3://my-bucket/data/")
df.show()

Switching to Ray mode is a single line: daft.context.set_runner_ray(). In Ray mode each Daft worker becomes a Ray actor that owns the full machine resources.

2.2 Why Choose Ray?

Building a fault‑tolerant, resource‑aware scheduler from scratch is costly. Ray offers:

Heterogeneous resource scheduling (CPU, GPU, memory).

Elastic autoscaling with AWS, GCP, or Kubernetes.

Actor model for stateful tasks, useful for loading large model weights once.

Rich ecosystem (Ray Train, Ray Serve, Ray Tune) covering the full ML lifecycle.
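The load‑once benefit of the actor model can be sketched in pure Python; this shows the pattern, not the Ray API, and the worker class and toy "weights" are invented for illustration:

```python
# Pure-Python sketch of the stateful-worker pattern that Ray actors enable:
# expensive state (model weights) is loaded once, then reused across calls.
class EmbeddingWorker:
    loads = 0  # class-level counter: how many times the "weights" were loaded

    def __init__(self):
        EmbeddingWorker.loads += 1
        self.weights = [0.5, 0.25]  # stand-in for large model weights

    def embed(self, text: str) -> list:
        return [len(text) * w for w in self.weights]

worker = EmbeddingWorker()  # one load per worker...
vectors = [worker.embed(t) for t in ["a", "bb", "ccc"]]  # ...many calls
```

With stateless tasks, every invocation would pay the load cost; an actor amortizes it across the whole stream of calls.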

2.3 Future – Flotilla Engine

Daft 2025 introduced Flotilla, a lightweight distributed layer that remains Ray‑compatible for teams that already run a Ray cluster, but can also operate without Ray.

3. Lance – An AI‑Optimized Storage Format

3.1 Why a New Format?

Parquet/ORC excel at columnar analytics but lack native support for random access, vector indexing, and multimodal data. Lance (LanceDB) provides a columnar file format, table format, and catalog spec in one, enabling AI‑heavy workloads.

3.2 Key Differences

Feature                Parquet                Lance
Random access          Slow (row‑group scan)  ~100× faster (no row groups)
Vector indexing        Not supported          Native ANN support
Full‑text search       Not supported          Native support
Data versioning        Requires Delta/Iceberg Built‑in (manifest per commit)
Schema evolution       Rewrite whole table    Zero‑copy column addition
Multimodal storage     URLs only (external)   Inline image/tensor/embedding storage
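The random‑access gap can be illustrated with a toy cost model (the sizes below are invented for illustration; the ~100× figure in the table is the source's benchmark claim): with row groups, fetching one row forces decoding the whole group that contains it, whereas a format addressable by row offset reads only that row.

```python
# Toy cost model: bytes read to fetch a single row, with vs. without
# row groups. Sizes are illustrative, not measured.
ROW_GROUP_ROWS = 100_000  # rows per row group
ROW_BYTES = 100           # bytes per row

def bytes_read_with_row_groups(row: int) -> int:
    # The entire row group containing `row` must be decoded.
    return ROW_GROUP_ROWS * ROW_BYTES

def bytes_read_by_offset(row: int) -> int:
    # Seek straight to the row's offset and read just that row.
    return ROW_BYTES

ratio = bytes_read_with_row_groups(42) // bytes_read_by_offset(42)
```

Real engines mitigate this with page-level statistics and caching, which is why the measured speedup (~100×) is far smaller than this worst‑case ratio.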

3.3 Practical Benefits

Random access is ~100× faster, crucial for ML training that samples rows.

Native ANN indexing removes the need for external vector databases.

Zero‑copy schema evolution lets you add a 1 GB embedding column to a 100 GB table by writing only the new column.
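A minimal sketch of the manifest idea behind zero‑copy column addition (the dictionaries below are invented for illustration, not Lance's actual manifest layout): each version's manifest lists column files by reference, so a new version reuses the old files untouched and writes only the new column.

```python
# Sketch of manifest-based, zero-copy column addition: the new version's
# manifest references existing column files and adds only the new one.
v1 = {"version": 1, "columns": {"image_url": "urls.bin", "caption": "caps.bin"}}

def add_column(manifest: dict, name: str, path: str) -> dict:
    cols = dict(manifest["columns"])  # old column files reused by reference
    cols[name] = path                 # only the new column file is written
    return {"version": manifest["version"] + 1, "columns": cols}

v2 = add_column(v1, "embedding", "emb.bin")
```

This is why adding a 1 GB embedding column to a 100 GB table costs roughly 1 GB of writes, not a full table rewrite.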

3.4 Complementary Strategy

Lance is not a replacement for Iceberg or Parquet; a dual‑format approach uses Iceberg/Parquet for BI and Lance for AI workloads, sharing the same object storage and catalog.

4. End‑to‑End Architecture

4.1 Overall Stack

Daft DataFrame API   ← User layer (Python)
Swordfish / Flotilla ← Execution engine (Rust)
Ray Cluster          ← Distributed scheduler
Lance on S3/GCS      ← Storage layer

Daft reads/writes Lance files directly; Swordfish executes the plan; Ray distributes workers across the cluster.

4.2 Sample Pipeline

import daft
from daft import col

# 1. Switch to Ray distributed mode
daft.context.set_runner_ray()

# 2. Read from Lance dataset
df = daft.read_lance("s3://my-bucket/multimodal-dataset/")

# 3. Basic filtering
df = df.where(
    (col("image_url").is_not_null()) &
    (col("caption").str.length() > 10)
)

# 4. Download and decode images
df = df.with_column(
    "image",
    col("image_url").url.download().image.decode(mode=daft.ImageMode.RGB)
)

# 5. Filter low‑resolution images
df = df.where(
    (col("image").image.width() >= 224) &
    (col("image").image.height() >= 224)
)

# 6. Generate embeddings via a UDF
@daft.udf(return_dtype=daft.DataType.tensor(daft.DataType.float32()))
class EmbeddingModel:
    def __init__(self):
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
    def __call__(self, texts):
        return self.model.encode(texts.to_pylist())

df = df.with_column("embedding", EmbeddingModel(col("caption")))

# 7. Select final columns and write back to Lance
df = df.select("image_url", "caption", "embedding")
df.write_lance("s3://my-bucket/curated-dataset/")

Key points:

url.download() performs asynchronous I/O, keeping CPUs busy, and image.decode() is a high‑performance Rust implementation.

The UDF class loads the model once per worker, reusing weights.

The pipeline streams data morsel‑by‑morsel, avoiding full‑dataset materialization.

4.3 Suitable Scenarios

Multimodal dataset curation (deduplication, quality filtering, annotation).

Large‑scale embedding generation and vector search.

Incremental feature engineering with zero‑copy column addition.

High‑throughput ML training data serving (500 M+ IOPS possible).

5. Comparative Overview

Dimension            Spark+Parquet                       Ray Data                      Daft+Ray+Lance
Multimodal support   No                                  Partial                       Native (Image/Tensor/Embedding)
Execution model      Batch (partition materialization)   Stream (object‑store spill)   Stream (morsel‑level back‑pressure)
Vector search        External system                     External system               Native ANN
GPU scheduling       Weak (manual)                       Declarative                   Strong (worker‑level, automatic)
Learning curve       High (JVM)                          Medium (low‑level API)        Low (Pandas‑style API)
Data versioning      Requires Delta/Iceberg              None                          Built‑in (manifest versioning)
Ecosystem maturity   Very high                           High                          Emerging (rapid growth)
Community size       Huge                                Large                         Medium (fast‑growing)

6. Practical Recommendations & Caveats

6.1 Daft Maturity

Daft is still young; its community, documentation, and third‑party integrations lag behind Spark. For pure SQL/ETL workloads, switching may not be justified yet.

6.2 Ray Operational Cost

Running a Ray cluster adds deployment overhead. Small teams may stay with Daft’s local mode until data size demands scaling.

6.3 Lance Ecosystem Coverage

Current compute integrations focus on Daft, Spark (via lance‑spark), and Ray (via lance‑ray). Major OLAP engines (Trino, StarRocks, Flink) still have early‑stage support.

6.4 Suggested Adoption Path

Install Daft locally (pip install daft) and run a few multimodal pipelines.

Convert a subset of Parquet data to Lance and benchmark random‑access speed.

When data grows, enable Ray with daft.context.set_runner_ray().

Adopt incrementally: start with AI/ML pipelines before replacing the entire stack.

7. Conclusion

The multimodal AI era demands a stack that can handle massive, heterogeneous data without OOM failures. Daft offers a Rust‑based, back‑pressure‑aware engine; Ray provides scalable, GPU‑aware scheduling; and Lance delivers an AI‑centric storage format with fast random access, native vector indexing, and zero‑copy schema evolution. Together they form a compelling, open‑source alternative to traditional Spark‑centric pipelines for modern data engineering.

Tags: data engineering, python, Rust, benchmark, Ray, multimodal data, Daft, Lance
Written by Big Data Technology & Architecture (Wang Zhiwu, a big data expert dedicated to sharing big data technology).