How Ray and Cloud‑Native Tech Supercharge Large‑Model Offline Inference
This article explains the challenges of large‑model offline (batch) inference, such as GPU memory limits and distributed scheduling, and shows how Ray’s cloud‑native architecture, model partitioning, and Ray Datasets can be used to build efficient, elastic inference frameworks deployed with KubeRay.
Key Challenges of Large‑Model Offline Inference
This article is based on a talk by senior infrastructure engineer Wang Wanxing at the Volcano Engine Cloud-Native Meetup. It introduces how to leverage Ray and cloud-native technology for large-model offline inference (batch inference): running massive batches of data through models with billions or tens of billions of parameters.
Characteristics of Offline (Batch) Inference
Inference is performed on a whole batch of data, usually massive, so the computation is offline.
The job combines data processing and model inference.
Jobs are large‑scale, distributed, and consume a lot of compute resources.
Unlike online inference, latency is not critical; throughput and resource utilization are the main concerns.
GPU Memory Wall
Model sizes have been growing exponentially (hundreds of times every two years), while a single GPU’s memory only grows about 1.7× every two years, creating a widening gap that forces model partitioning.
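To make the gap concrete, here is a back-of-envelope check with illustrative numbers: the weights alone of a 10B-parameter model stored in fp16 already exceed a typical 16 GiB GPU, before accounting for activations or KV caches.

```python
# Back-of-envelope memory check (numbers are illustrative, not from the talk).
params = 10_000_000_000                       # 10B-parameter model
bytes_per_param = 2                           # fp16 weights
weights_gib = params * bytes_per_param / 2**30  # ~18.6 GiB for weights alone

gpu_gib = 16                                  # memory of a typical inference GPU
fits = weights_gib <= gpu_gib                 # False: partitioning is required
```

Activations, optimizer state (for training), and framework overhead only widen the shortfall, which is why partitioning is unavoidable for this model class.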
Model Partitioning Strategies
Two common ways to split a model:
Pipeline parallelism (layer-wise): different layers are placed on different GPUs (e.g., L0-L3 on GPU 0, L4-L7 on GPU 1). Layer sizes vary, so the split may be unbalanced.
Tensor parallelism (weight-wise): the weights of a single layer are divided across GPUs (e.g., part of L0's weights on GPU 0, the rest on GPU 1). Hybrid and ZeRO-style splits also exist.
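The two strategies can be sketched in pure Python, with no real GPUs involved; the "model" below is four toy layers that each double their input, and the comments mark where device placement and communication would happen in a real system.

```python
# Toy illustration of the two partitioning strategies (pure Python; all
# names are illustrative, not from the talk).

def make_layer(scale):
    return lambda x: [v * scale for v in x]

layers = [make_layer(2), make_layer(2), make_layer(2), make_layer(2)]

# Pipeline parallelism: whole layers are assigned to devices.
gpu0_stage = layers[:2]   # L0-L1 would live on GPU 0
gpu1_stage = layers[2:]   # L2-L3 would live on GPU 1

def run_pipeline(x):
    # In reality an inter-GPU transfer happens between the two stages.
    for layer in gpu0_stage + gpu1_stage:
        x = layer(x)
    return x

# Tensor parallelism: one layer's weights are split across devices.
# Each shard computes half of the same layer's output elements.
def sharded_layer(x):
    half = len(x) // 2
    shard0 = [v * 2 for v in x[:half]]   # would run on GPU 0
    shard1 = [v * 2 for v in x[half:]]   # would run on GPU 1
    return shard0 + shard1               # gather of the partial results
```

Note the asymmetry the article points out: the pipeline split is only as balanced as the layer sizes allow, while the tensor split divides a single layer evenly but requires communication inside every layer.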
Advantages of Model Partitioning
Support larger models: enables offline inference of models that exceed a single GPU's memory.
Cost reduction: smaller-memory GPUs can run partitioned models, freeing high-end GPUs for training.
GPU-memory reuse: techniques such as NVIDIA Multi-Process Service (MPS) let multiple processes share a GPU, and partitioning makes that sharing more efficient.
Distributed Scheduling Challenges
Two main requirements:
Heterogeneous resource support: data preprocessing should run on CPUs while inference runs on GPUs, so the framework must schedule across different resource types.
Elastic resource scheduling: after partitioning, each stage (group of layers) has different compute needs, so the framework must dynamically shift resources from fast stages to slower ones.
Traditional batch/stream engines like Spark or Flink lack the flexibility to meet these needs.
Performance Goals
For offline jobs, maximize throughput and GPU utilization, minimize data‑to‑disk serialization, keep data in memory, and release under‑utilized GPUs.
Case Study: ViT + ALBERT Dual‑Tower Model
The model is split into three stages: one stage holds a large embedding layer, the other two hold parts of ViT and ALBERT respectively. Stages have different resource demands, illustrating the need for elastic scheduling.
Ray Architecture
Ray consists of three layers:
Infrastructure layer: abstracts the underlying clouds, VMs, containers, and Kubernetes pods.
Ray Core layer: provides low-level, language-agnostic distributed programming primitives (tasks, actors, and the @ray.remote API).
High-level ML libraries: Ray Train, Ray Datasets, and others enable end-to-end ML pipelines.
Ray Datasets offers rich data source connectors, common operators, and a pipeline API that can process data in parallel blocks.
Building an Inference Framework – Version 1
Using the native Ray Datasets pipeline, the model is split into two layer groups (ModelLayers1, ModelLayers2). The Window API creates a pipeline, and map_batches runs parallel inference for each group via actors. The number of GPUs per actor is configurable, allowing heterogeneous resource usage.
Compared with Spark, Ray avoids repeated model loading and external storage writes, leading to higher execution efficiency.
Limitations of Version 1
Each Window creates and destroys an actor pool, incurring heavy model‑loading overhead.
IO and inference are not overlapped, reducing GPU utilization.
Actor pools lack elasticity; stages with different compute needs waste resources.
Debugging the API parameters is difficult.
No built‑in fault tolerance or speculative execution.
Inference Framework – Version 2 (Streaming Execution)
Version 2 adds streaming semantics to the Ray Datasets pipeline. Stages are linked by bounded Queues that pass Ray object references, not raw data. A stable actor pool per stage is created once and kept alive, enabling elastic scaling: busy stages request more actors, idle stages release them.
The scheduling policy uses “Most Recently Used” to keep busy actors busy while freeing idle ones. Inside each actor, multithreading overlaps IO and inference, improving GPU usage. Queue length limits provide back‑pressure to avoid OOM.
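The core of the Version-2 design, stages linked by bounded queues that provide back-pressure, can be sketched with the standard library alone. In the real framework the queue carries Ray object references and each stage is an elastic actor pool; here both stages are plain threads and the values stand in for references.

```python
import queue
import threading

# Two pipeline stages linked by a bounded queue. maxsize caps the number
# of in-flight items, so a fast producer blocks instead of exhausting
# memory (back-pressure). All names are illustrative.
SENTINEL = object()

def stage1(inputs, out_q):
    for item in inputs:
        out_q.put(item * 2)      # blocks while the queue is full
    out_q.put(SENTINEL)          # signal end of stream

def stage2(in_q, results):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item + 1)

q = queue.Queue(maxsize=4)       # bounded: the back-pressure mechanism
results = []
t1 = threading.Thread(target=stage1, args=(range(10), q))
t2 = threading.Thread(target=stage2, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

Because both stages run concurrently, stage 2 starts consuming before stage 1 finishes producing, which is the same overlap of IO and inference that raises GPU utilization in the real framework.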
Community Collaboration & New Executor Architecture
The Ray community has proposed a REP (Ray Enhancement Proposal) that separates Operators and Executors under the Datasets API, offering more flexibility. Our implementation will become an executor in that new architecture.
Ray Cloud‑Native Deployment with KubeRay
KubeRay is an open‑source operator that manages Ray clusters on Kubernetes (head and worker pods). It supports automatic horizontal scaling based on metrics, creating or deleting pods as needed.
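A RayCluster manifest for the KubeRay operator looks roughly like the following; the field values, image tag, and group names are illustrative, and the CRD `apiVersion` differs across KubeRay releases.

```yaml
# Minimal RayCluster sketch for the KubeRay operator (values illustrative).
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.9.0
  workerGroupSpecs:
  - groupName: gpu-workers
    replicas: 2
    minReplicas: 0        # autoscaling can remove all workers when idle
    maxReplicas: 8
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:2.9.0
          resources:
            limits:
              nvidia.com/gpu: 1
```

The operator reconciles this spec into head and worker pods and, with autoscaling enabled, adds or deletes worker pods between minReplicas and maxReplicas as load changes.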
Within ByteDance, users submit Ray jobs or notebooks via the internal platform, which interacts with KubeRay through YAML or REST APIs.
Conclusion
The article discussed the challenges of large‑model offline inference and demonstrated how Ray’s cloud‑native stack, model partitioning, and Ray Datasets can be combined to build efficient, elastic inference frameworks, with deployment handled by KubeRay. Future work will deepen community collaboration and explore more Ray‑based scenarios.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.