Boosting Large-Model Offline Inference with Ray and Cloud-Native Architecture
Large-model offline (batch) inference runs massive datasets through billion-parameter models and faces GPU memory and distributed-scheduling challenges. This article explains how Ray's cloud-native framework, model parallelism, and Ray Datasets pipelines address these issues, improve throughput, and enable elastic, efficient GPU utilization.
What Is Large-Model Offline Inference?
Large model offline inference (also called batch inference) refers to distributed inference on models with billions to hundreds of billions of parameters, where a batch of data is processed offline. It typically combines data processing and model inference, runs at large scale, and prioritizes throughput and resource utilization over latency.
Key Challenges
GPU Memory Wall: Model sizes have grown rapidly while GPU memory and compute improvements lag, so a large model can exceed a single GPU's memory and must be partitioned across devices.
Distributed Scheduling: Inference jobs need heterogeneous resources (CPUs for data processing, GPUs for inference) and elastic allocation, because different stages have different compute demands.
Model Partitioning
Two common partitioning methods are:
Pipeline Parallelism – layer-wise splitting across GPUs.
Tensor Parallelism – weight-wise splitting within the same layer across GPUs.
Benefits include supporting larger models on existing hardware, reducing cost by running parts of a model on smaller GPUs, and enabling GPU-sharing techniques such as NVIDIA Multi-Process Service (MPS).
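As a conceptual illustration (plain Python, no GPU framework; the layer functions, weight matrix, and "device" assignments below are hypothetical), pipeline parallelism assigns contiguous layers to different devices, while tensor parallelism splits one layer's weight matrix across devices:

```python
# Conceptual sketch of the two partitioning schemes (pure Python, no GPUs).

# A "model" as a list of layer functions; here just simple arithmetic layers.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]

# --- Pipeline parallelism: contiguous layers per device ---
# Device 0 holds layers 0-1, device 1 holds layers 2-3.
stages = [layers[:2], layers[2:]]

def pipeline_forward(x):
    for stage in stages:          # each stage would run on its own GPU
        for layer in stage:
            x = layer(x)          # activations are shipped between devices
    return x

# --- Tensor parallelism: split one layer's weight matrix column-wise ---
# y = x @ W, with W split into [W_left | W_right] across two devices.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                # a 2x4 weight matrix

W_parts = [[row[:2] for row in W], [row[2:] for row in W]]

def matvec(x, w):
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def tensor_parallel_forward(x):
    # Each device computes its slice of the output; slices are concatenated.
    partials = [matvec(x, part) for part in W_parts]
    return partials[0] + partials[1]

print(pipeline_forward(3))              # same result as running all layers in order
print(tensor_parallel_forward([1, 1]))  # same result as x @ W on a single device
```

In both cases the partitioned computation matches the single-device result; only the placement of weights and the communication pattern change.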
Distributed Scheduling
Existing big-data frameworks such as Spark and Flink lack flexible scheduling for heterogeneous resources, making them a poor fit for this workload.
Performance Goals
Offline inference aims for high throughput and GPU utilization, minimizing data transfer overhead, avoiding disk I/O, and releasing underutilized GPUs.
Case Study: Vision‑Transformer + Albert
A multimodal model splits its ViT and ALBERT layers across GPUs, forming three stages with differing resource needs and illustrating why elastic allocation matters.
Ray Overview
Ray, which originated at UC Berkeley's RISELab, is a Python-first distributed programming framework. Its architecture consists of a Head node (hosting the GCS and dashboard), Worker nodes that run tasks, a Raylet (local scheduler) and shared-memory Object Store on each node, plus Drivers and Actors. Ray powers many large-scale ML workloads, including the training of OpenAI's ChatGPT.
Building a Large‑Model Inference Framework with Ray
Ray Datasets provides rich data source integration, parallel operators, and pipeline execution, making it suitable for batch inference.
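Ray Datasets exposes this as parallel operators (such as a batched map) over distributed blocks of data. The shape of that pattern can be sketched with the standard library alone; here the batching and the thread pool are stand-ins for Ray's blocks and its task/actor pool, and `infer_batch` is a hypothetical model:

```python
from concurrent.futures import ThreadPoolExecutor

def batched(items, batch_size):
    """Split a dataset into fixed-size batches (Ray splits data into blocks)."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def infer_batch(batch):
    """Stand-in for model inference on one batch (hypothetical model)."""
    return [x * 10 for x in batch]

def map_batches(items, fn, batch_size=4, parallelism=4):
    """Apply fn to each batch in parallel and concatenate the results,
    mimicking the shape of a batched-map operator over data blocks."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        results = pool.map(fn, batched(items, batch_size))
    return [y for batch in results for y in batch]

predictions = map_batches(list(range(10)), infer_batch)
print(predictions)
```

The real operator additionally handles data locality, object-store transfer, and GPU placement, but the dataflow is the same: partition, map in parallel, concatenate.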
Version 1 – Native Ray Dataset Pipeline
This version builds a Dataset pipeline in which each window launches a fresh Actor pool to run a model partition. It improves heterogeneous resource scheduling, but suffers from high Actor startup cost, limited GPU utilization, lack of elasticity, difficult debugging, and no fault tolerance.
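The per-window startup cost can be seen in a toy model of this behavior (pure Python; the worker initialization counter is a stand-in for the expensive step of loading model weights onto a GPU):

```python
# Toy model of version 1: every window tears down and recreates its workers,
# so expensive initialization (loading model weights) repeats per window.

INIT_COUNT = 0  # how many times a worker was (re)initialized

class Worker:
    def __init__(self):
        global INIT_COUNT
        INIT_COUNT += 1        # stand-in for loading a model partition onto a GPU

    def infer(self, x):
        return x * 2           # stand-in for running the partition

def run_windowed(windows, pool_size=2):
    out = []
    for window in windows:
        pool = [Worker() for _ in range(pool_size)]  # fresh pool per window
        out.extend(pool[i % pool_size].infer(x) for i, x in enumerate(window))
        # The pool goes out of scope here: workers die with the window.
    return out

windows = [[1, 2], [3, 4], [5, 6]]
print(run_windowed(windows))   # inference results
print(INIT_COUNT)              # grows as pool_size * number_of_windows
```

With long-lived pools the initialization cost would be paid once per worker; here it is paid once per worker per window, which is exactly the overhead version 2 removes.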
Version 2 – Streaming Execution Semantics
This version introduces a stable Actor pool per stage, with queues of Ray object references between stages. That enables back-pressure, elastic scaling, and concurrent I/O and inference within each actor, improving GPU utilization and reducing overhead.
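The streaming structure can be sketched with standard-library threads and a bounded queue (the two stage functions are hypothetical stand-ins for CPU preprocessing and GPU inference; the queue bound is what provides back-pressure, just as bounded queues of object references do between Ray stages):

```python
import queue
import threading

# Toy model of version 2: one long-lived worker per stage, connected by a
# bounded queue. A fast upstream stage blocks on put() instead of flooding
# memory, which is the back-pressure behavior described above.
SENTINEL = object()

def preprocess_stage(source, out_q):
    for item in source:              # stand-in for CPU-side data loading
        out_q.put(item + 1)          # blocks when the queue is full
    out_q.put(SENTINEL)              # signal end of stream

def inference_stage(in_q, results):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * 10)    # stand-in for GPU inference

q = queue.Queue(maxsize=4)           # small bound => back-pressure
results = []
producer = threading.Thread(target=preprocess_stage, args=(range(8), q))
consumer = threading.Thread(target=inference_stage, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)
```

Because both stage workers live for the whole job, model-loading cost is paid once, and both stages run concurrently: preprocessing for item k overlaps with inference on item k-1.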
Ray Cloud‑Native Deployment (KubeRay)
KubeRay manages Ray clusters on Kubernetes, handling lifecycle, autoscaling, and resource metrics. It is used internally at ByteDance and supported by companies like Microsoft and Ant Group.
Conclusion
The article discussed the challenges of large‑model offline inference and demonstrated how Ray’s cloud‑native capabilities and evolving pipeline designs address GPU memory limits, scheduling flexibility, and performance, with ongoing collaboration with the open‑source community.
Volcano Engine Developer Services
The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
