
How Ray and Cloud‑Native Tech Supercharge Large‑Model Offline Inference

This article explains the challenges of large‑model offline (batch) inference, such as GPU memory limits and distributed scheduling, and shows how Ray’s cloud‑native architecture, model partitioning, and Ray Datasets can be used to build efficient, elastic inference frameworks deployed with KubeRay.

ByteDance Cloud Native

Key Challenges of Large‑Model Offline Inference

This article is based on a talk by senior infrastructure engineer Wang Wanxing at the Volcano Engine Cloud-Native Meetup. It introduces how to leverage Ray and cloud-native infrastructure for large-model offline inference (batch inference), which runs massive batches of data through models with billions or tens of billions of parameters.

Characteristics of Offline (Batch) Inference

Inference is performed over an entire batch of data, usually a massive one, so the computation runs offline rather than serving live requests.

The job combines data processing and model inference.

Jobs are large‑scale, distributed, and consume a lot of compute resources.

Unlike online inference, latency is not critical; throughput and resource utilization are the main concerns.

GPU Memory Wall

Model sizes have been growing exponentially (hundreds of times every two years), while a single GPU’s memory only grows about 1.7× every two years, creating a widening gap that forces model partitioning.

Model Partitioning Strategies

Two common ways to split a model:

Pipeline Parallelism (layer-wise): Different layers are placed on different GPUs (e.g., L0-L3 on GPU0, L4-L7 on GPU1). Layer sizes vary, so the split may be unbalanced.

Tensor Parallelism (weight-wise): Weights of the same layer are divided across GPUs (e.g., part of L0's weights on GPU0, the rest on GPU1). Hybrid or ZeRO-style splits also exist.
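To make tensor parallelism concrete, here is a minimal NumPy sketch (illustrative only, not the talk's implementation): one layer's weight matrix is split column-wise across two simulated GPUs, each device computes a partial output from the same input, and the partial outputs are concatenated to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal((4, 8))   # a batch of 4 inputs, hidden size 8
w = rng.standard_normal((8, 6))   # full weight matrix of one layer

# Tensor parallelism: split the weight columns across two "GPUs".
w_gpu0, w_gpu1 = np.split(w, 2, axis=1)

# Each device computes a partial output from the same input ...
y_gpu0 = x @ w_gpu0
y_gpu1 = x @ w_gpu1

# ... and the partial outputs are concatenated to recover the full result.
y = np.concatenate([y_gpu0, y_gpu1], axis=1)

assert np.allclose(y, x @ w)      # identical to the unsplit layer
```

The split halves the per-device weight memory at the cost of either replicating the input or communicating activations between devices, which is the usual tensor-parallel trade-off.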

Advantages of Model Partitioning

Support larger models: Enables offline inference of models that exceed a single GPU's memory.

Cost reduction: Smaller-memory GPUs can run split models, freeing high-end GPUs for training.

GPU-memory reuse: Techniques like NVIDIA Multi-Process Service can share GPU memory among processes, and partitioning allows more efficient use.

Distributed Scheduling Challenges

Two main requirements:

Heterogeneous resource support: Data preprocessing should run on CPUs while inference runs on GPUs, requiring a framework that can schedule across different resource types.

Elastic resource scheduling: After partitioning, each stage (group of layers) has different compute needs. The framework must dynamically adjust compute allocation so fast stages can release resources to slower ones.

Traditional batch/stream engines like Spark or Flink lack the flexibility to meet these needs.

Performance Goals

For offline jobs, maximize throughput and GPU utilization, minimize data‑to‑disk serialization, keep data in memory, and release under‑utilized GPUs.

Case Study: ViT + ALBERT Dual‑Tower Model

The model is split into three stages: one stage holds a large embedding layer, the other two hold parts of ViT and ALBERT respectively. Stages have different resource demands, illustrating the need for elastic scheduling.

Ray Architecture

Ray consists of three layers:

Infrastructure layer: abstracts underlying clouds, VMs, containers, or Kubernetes pods.

Ray Core layer: provides low-level, language-agnostic distributed programming APIs (e.g., @ray.remote, actors, tasks).

High-level ML libraries: Ray Train, Ray Datasets, etc., enable end-to-end ML pipelines.

Ray Datasets offers rich data source connectors, common operators, and a pipeline API that can process data in parallel blocks.

Building an Inference Framework – Version 1

Using the native Ray Datasets pipeline, the model is split into two layer groups (ModelLayers1, ModelLayers2). The Window API creates a pipeline, and map_batches runs parallel inference on each group via actors. The number of GPUs per actor is configurable, allowing heterogeneous resource usage.

Compared with Spark, Ray avoids repeated model loading and external storage writes, leading to higher execution efficiency.
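The map_batches pattern can be illustrated without a cluster: each layer group is a stateful callable that loads its "weights" once in __init__ and is then applied batch by batch. This is a simplified single-process sketch; in the real pipeline these callables are instantiated once per actor in a pool, which is exactly what avoids Spark's repeated model loading.

```python
# Simplified stand-ins for the two layer groups. In the real framework,
# __init__ would load that group's slice of the model onto a GPU, and
# Ray Datasets' map_batches would create one instance per pool actor.

class ModelLayers1:
    def __init__(self):
        self.scale = 2              # pretend weights, loaded once per actor

    def __call__(self, batch):
        return [x * self.scale for x in batch]

class ModelLayers2:
    def __init__(self):
        self.offset = 1

    def __call__(self, batch):
        return [x + self.offset for x in batch]

# One "window" of the dataset flowing through both stages in order.
stage1, stage2 = ModelLayers1(), ModelLayers2()
batch = [1, 2, 3]
result = stage2(stage1(batch))      # [3, 5, 7]
```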

Limitations of Version 1

Each Window creates and destroys an actor pool, incurring heavy model‑loading overhead.

IO and inference are not overlapped, reducing GPU utilization.

Actor pools lack elasticity; stages with different compute needs waste resources.

Tuning the pipeline's parameters (window size, batch size, actor-pool size) is difficult to debug.

No built‑in fault tolerance or speculative execution.

Inference Framework – Version 2 (Streaming Execution)

Version 2 adds streaming semantics to the Ray Datasets pipeline. Stages are linked by bounded Queues that pass Ray object references, not raw data. A stable actor pool per stage is created once and kept alive, enabling elastic scaling: busy stages request more actors, idle stages release them.

The scheduling policy uses “Most Recently Used” to keep busy actors busy while freeing idle ones. Inside each actor, multithreading overlaps IO and inference, improving GPU usage. Queue length limits provide back‑pressure to avoid OOM.
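The stage-to-stage wiring can be sketched with standard-library pieces: a bounded queue links a producer stage to a consumer stage, and the queue's capacity limit is what provides back-pressure. This is a single-process analogy; Version 2 passes Ray object references between long-lived actor pools instead of raw data between threads.

```python
import queue
import threading

SENTINEL = object()
q = queue.Queue(maxsize=2)   # bounded: a full queue blocks the producer
results = []

def producer():
    for i in range(6):
        # In Version 2 this would enqueue a Ray object reference, not data.
        q.put(i)             # blocks while 2 items are already in flight
    q.put(SENTINEL)          # signal end of stream

def consumer():
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * 10)   # stand-in for stage-2 inference

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()

# results == [0, 10, 20, 30, 40, 50]
```

Because the producer blocks as soon as the queue is full, a slow downstream stage automatically throttles the upstream one, which is what prevents unbounded buffering and OOM.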

Community Collaboration & New Executor Architecture

The Ray community has proposed a REP (Ray Enhancement Proposal) that separates Operators and Executors under the Datasets API, offering more flexibility. Our implementation will become an executor in that new architecture.

Ray Cloud‑Native Deployment with KubeRay

KubeRay is an open‑source operator that manages Ray clusters on Kubernetes (head and worker pods). It supports automatic horizontal scaling based on metrics, creating or deleting pods as needed.

Within ByteDance, users submit Ray jobs or notebooks via the internal platform, which interacts with KubeRay through YAML or REST APIs.
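An illustrative RayCluster manifest of the kind such a platform might submit (the field names follow the KubeRay CRD; the cluster name, image tags, resource sizes, and worker-group name here are placeholder values, not ByteDance's actual configuration):

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: inference-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.9.0
            resources:
              limits:
                cpu: "4"
                memory: 8Gi
  workerGroupSpecs:
    - groupName: gpu-workers
      replicas: 2
      minReplicas: 1        # the operator scales pods between min and max
      maxReplicas: 8
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:2.9.0-gpu
              resources:
                limits:
                  nvidia.com/gpu: "1"
```

The head and worker groups map directly onto the head and worker pods described above, and the min/max replica bounds are what allow the operator to scale the cluster on metrics.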

Conclusion

The article discussed the challenges of large‑model offline inference and demonstrated how Ray’s cloud‑native stack, model partitioning, and Ray Datasets can be combined to build efficient, elastic inference frameworks, with deployment handled by KubeRay. Future work will deepen community collaboration and explore more Ray‑based scenarios.

Tags: cloud native, distributed computing, Ray, large model, offline inference, GPU memory
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
