
How Ray Reinvents AI Data Pipelines for Massive Multimodal Inference

This article examines the shortcomings of traditional big‑data engines for AI workloads, presents a Ray‑based heterogeneous fusion architecture that unifies CPU/GPU scheduling, the Python ecosystem, and streaming‑batch processing, and details the fault‑tolerance, checkpointing, compute‑storage separation, resource‑utilization, scalability, and observability work that enables clusters of thousands of nodes and dramatically higher GPU efficiency.


Traditional Big Data Engine Limitations

With the rise of large language models and multimodal AI, processing unstructured multimodal data has become a core data‑processing task. Conventional big‑data engines such as Spark and Flink struggle with heterogeneous resource scheduling, fine‑grained operator‑level allocation, and Python‑native ecosystem compatibility, making them ill‑suited for batch inference workloads that require both CPU and GPU resources.

Traditional big‑data engine limitations diagram

Ray‑Based Next‑Generation Data Pipeline

Overall Architecture

To overcome the above constraints, we rebuilt the data pipeline on Ray, a distributed execution engine designed for AI. Ray provides unified heterogeneous resource scheduling, a Python‑native development experience, and five core capabilities: cloud‑native scheduling, fused compute, Python ecosystem integration, AI‑centric APIs, and high availability.

Ray‑based data pipeline architecture
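To make the scheduling model concrete, here is a minimal sketch (not from the talk) of how CPU work and GPU work are declared side by side in plain Python. The Inferencer class, resource numbers, and toy record are illustrative, and a cluster with at least one GPU node is assumed.

```python
import ray

ray.init()  # connect to an existing cluster, or start a local one

# CPU-only preprocessing task: the scheduler places it on a CPU node.
@ray.remote(num_cpus=2)
def preprocess(record):
    record["text"] = record["text"].strip().lower()
    return record

# GPU-backed inference actor: the same scheduler places it on a GPU node.
@ray.remote(num_gpus=1)
class Inferencer:
    def __init__(self):
        self.model = None  # hypothetical: load a model onto the GPU here

    def predict(self, record):
        record["label"] = "ok"  # stand-in for a real forward pass
        return record

inferencer = Inferencer.remote()
cleaned = [preprocess.remote(r) for r in [{"text": "  Ray unifies CPU and GPU "}]]
results = ray.get([inferencer.predict.remote(c) for c in cleaned])
print(results)
```

The point of the sketch is that both resource declarations live in one Python program and one scheduler, rather than splitting CPU ETL and GPU inference across separate systems.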

Streaming‑Batch Computing Paradigm

Ray Data adopts a streaming‑batch model that treats batch inference as a continuous pipeline of CPU‑bound preprocessing and GPU‑bound model inference. This moves away from Spark's bulk‑synchronous parallel (BSP), stage‑by‑stage execution: CPU preprocessing overlaps with GPU inference instead of waiting at stage barriers, yielding up to 3× higher throughput for multimodal inference workloads.

Streaming‑batch performance comparison
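A minimal Ray Data sketch of this pattern, assuming a recent Ray 2.x API; the paths, model, and batch sizes are invented. A stateless CPU stage feeds a pool of GPU actor replicas, and batches stream between the two stages rather than waiting for a full stage to finish.

```python
import ray

# CPU-bound preprocessing: a stateless function executed as tasks on CPU nodes.
def normalize(batch):
    batch["image"] = batch["image"].astype("float32") / 255.0
    return batch

# GPU-bound inference: a callable class, so each replica is a long-lived actor
# that loads the model once and is pinned to a GPU.
class VisionModel:
    def __init__(self):
        self.model = None  # hypothetical: load the model onto the GPU here

    def __call__(self, batch):
        batch["embedding"] = batch["image"].mean(axis=(1, 2, 3))  # stand-in for a forward pass
        return batch

ds = (
    ray.data.read_parquet("s3://bucket/frames/")      # hypothetical input path
    .map_batches(normalize, batch_size=64)            # CPU stage
    .map_batches(VisionModel, batch_size=64,
                 num_gpus=1, concurrency=4)           # GPU stage, 4 actor replicas
)
ds.write_parquet("s3://bucket/embeddings/")           # hypothetical output path
```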

Key Technical Optimizations

Fault Tolerance

Ray’s core already supports task retry and actor rescheduling. We added timeout‑based spill‑back for pending tasks, a plugin‑based node‑failure detector, and a blacklist for repeatedly failing actors. At the Ray Data layer we introduced actor blacklists and operator‑level timeouts to prevent long‑running stalls.

Fault‑tolerance architecture
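For reference, the built‑in retry knobs the first sentence refers to look roughly like this; the spill‑back, node‑failure‑detector plugin, and blacklists are internal extensions and are not shown, and the function and class names are illustrative.

```python
import ray

# Tasks: retry on worker or node failure and, optionally, on application exceptions.
@ray.remote(max_retries=3, retry_exceptions=True)
def transcode(path):
    ...  # flaky I/O or preprocessing work
    return path

# Actors: restart the actor process if its node dies, and retry in-flight
# method calls up to the given limit.
@ray.remote(max_restarts=2, max_task_retries=2, num_gpus=1)
class Embedder:
    def embed(self, batch):
        ...
```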

Checkpoint & Resume

We introduced a checkpoint mechanism that periodically persists intermediate dataset state to a database, enabling job‑level resume and intermediate result visibility, which is essential for long‑running Iceberg‑based pipelines.

Checkpoint workflow
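As a rough, hypothetical sketch of the resume semantics (the production mechanism persists Ray Data's intermediate dataset state to a database rather than simple partition IDs): record each finished unit of work durably and skip finished units when the job restarts. All names below are invented.

```python
import sqlite3

class CheckpointStore:
    """Hypothetical durable record of which partitions have completed."""

    def __init__(self, path="checkpoints.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS done (partition_id TEXT PRIMARY KEY)")

    def is_done(self, partition_id):
        row = self.conn.execute(
            "SELECT 1 FROM done WHERE partition_id = ?", (partition_id,)
        ).fetchone()
        return row is not None

    def mark_done(self, partition_id):
        self.conn.execute("INSERT OR IGNORE INTO done VALUES (?)", (partition_id,))
        self.conn.commit()

store = CheckpointStore()
for pid in list_partitions():          # hypothetical helper returning partition ids
    if store.is_done(pid):
        continue                       # resume: skip work finished before the failure
    process_partition(pid)             # hypothetical processing function
    store.mark_done(pid)
```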

Compute‑Storage Separation

To protect against node failures on unstable low‑priority resources, we decoupled the actor (compute) process from the object store (storage). Stable CPU nodes host all actor pools, while GPU‑intensive actors run as Ray Serve deployments on elastic resources. Users only need to switch from map_batches to map_batches_plus and set an environment variable to toggle between fused and separated modes.

Compute‑storage separation architecture
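The map_batches_plus wrapper and its environment‑variable toggle are internal, but the separated mode can be approximated with stock Ray APIs: GPU inference lives behind a Ray Serve deployment on elastic nodes, while the pipeline on stable CPU nodes only forwards batches to it. A hedged sketch with invented names and sizes, assuming a recent Ray where serve.run returns a deployment handle:

```python
import ray
from ray import serve

# GPU inference as a Serve deployment on elastic (preemptible) resources.
@serve.deployment(num_replicas=4, ray_actor_options={"num_gpus": 1})
class GpuInference:
    def __init__(self):
        self.model = None  # hypothetical: load the model here

    async def __call__(self, batch):
        return {"embedding": [0.0] * len(batch["text"])}  # stand-in for a forward pass

handle = serve.run(GpuInference.bind())

# CPU-side map function: it holds no model state, so losing a preemptible
# GPU node never takes the pipeline's actors or object store with it.
def infer_remote(batch):
    return handle.remote(batch).result()

ds = ray.data.from_items([{"text": "hello"}, {"text": "world"}])
print(ds.map_batches(infer_remote, batch_size=2).take_all())
```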

Resource Utilization Enhancements

We reduced cold‑start time through eager environment preparation, merged serialization payloads to cut setup time by 90 %, and implemented exponential autoscaling for actor pools. Additional improvements include bubble elimination in autoscaling, job‑level cluster reuse, checkpoint‑level deployment reuse, operator timeout enforcement, automatic unhealthy‑operator detection, and aggressive scale‑down (shrinkage) policies.
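As a hypothetical illustration of exponential actor‑pool scale‑up (the real policy and its parameters are internal): double the pool while work is queued, capped at a maximum, so a large job reaches full parallelism in logarithmically many scheduling rounds instead of linearly many.

```python
def next_pool_size(current, queued_batches, max_size):
    """Hypothetical exponential scale-up rule for an actor pool."""
    if queued_batches == 0:
        return current                      # nothing pending: hold steady
    return min(max(current * 2, 1), max_size)

size = 1
for _ in range(6):
    size = next_pool_size(size, queued_batches=1000, max_size=40)
    print(size)   # 2, 4, 8, 16, 32, 40
```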

Scalability to Thousands of Nodes

Through systematic load‑balancing of the GCS, driver, and Serve controller components, multi‑level elasticity (platform‑level, TKE‑level, and Ray‑level autoscaling), and hot‑update support for cluster size, we expanded Ray Data from 500 nodes / 10,000 actors to more than 2,000 nodes / 40,000 actors, supporting up to 20 operators per job.

Observability Stack

We integrated three UI layers: the internal "峰峦" dashboard for a unified view of cloud‑native workloads, the open‑source Ray Dashboard with extended Grafana metrics, and Ray Flow Insight for multi‑dimensional runtime visualization. A Ray History Server prototype provides complete job‑level tracing and post‑mortem analysis.

Resulting benefits include up to 90 % lower GPU cost for audio pipelines, more than 3× higher GPU utilization for multimodal inference, 40 % higher inference success rates for text pipelines, and 1‑3× faster R&D cycles.

Tags: cloud-native, big data, resource optimization, distributed computing, Ray, AI data pipeline
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
