How Volcano Engine’s New GPU Sharing Scheduler Boosts AI Workloads by 500%
This article explains Volcano Engine's next‑generation GPU sharing scheduling technology: the two‑layer scheduler, card‑level Binpack/Spread strategies, system architecture, API definitions, and optimization algorithms that together increase GPU deployment density by more than 500% and improve utilization by more than 50% for AI workloads.
In the AI era, deploying large models requires infrastructure that can provide massive AI compute power. Modern cloud‑native platforms now need to manage heterogeneous devices such as GPUs and RDMA, along with fine‑grained device management.
Problem Analysis
Native Kubernetes only supports whole‑GPU scheduling, which can waste expensive GPU resources in several scenarios:
AI inference often processes only a single input or a small batch at a time.
High‑performance computing may be CPU‑bound, leaving GPU utilization low.
Development environments (e.g., Jupyter notebooks) sometimes need only low‑spec machines.
CI/CD pipelines usually require limited GPU resources for test cases.
Existing GPU sharing solutions (time‑slicing, MPS, MIG) have limitations in memory and compute isolation, fault isolation, and flexibility.
Two‑Layer Scheduling
Volcano Engine VKE extends the Kubernetes Scheduling Framework with a custom GPUShare plugin that supports 1% compute granularity and 1 MiB memory granularity. This two‑layer scheduler first selects a suitable node, then assigns containers to specific GPU combinations on that node.
Card‑Level Binpack/Spread Strategy
The native scheduler offers node‑level Binpack (fill nodes to increase allocation rate) and Spread (distribute pods for fault isolation). After adding the second scheduling layer, GPU cards become a scheduling domain, requiring both node‑level and card‑level Binpack/Spread strategies to reduce fragmentation or improve fault isolation.
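As a minimal sketch of this idea (the function names and data layout are assumptions for illustration, not VKE's actual implementation), card‑level Binpack prefers the busiest GPU that still fits the request, while card‑level Spread prefers the idlest one:

```python
# Sketch of card-level Binpack vs. Spread GPU selection (illustrative only).
# Each GPU is described by its free compute (percent) and free memory (MiB).

def pick_gpu(gpus, need_core, need_mem, strategy="binpack"):
    """Return the index of the GPU to place a container on, or None if none fits."""
    fits = [(i, g) for i, g in enumerate(gpus)
            if g["free_core"] >= need_core and g["free_mem"] >= need_mem]
    if not fits:
        return None
    # Binpack: fill the busiest feasible card to reduce fragmentation.
    # Spread: use the idlest feasible card to improve fault isolation.
    key = lambda item: (item[1]["free_core"], item[1]["free_mem"])
    chosen = min(fits, key=key) if strategy == "binpack" else max(fits, key=key)
    return chosen[0]

gpus = [{"free_core": 20, "free_mem": 4096},
        {"free_core": 80, "free_mem": 16384}]
pick_gpu(gpus, need_core=10, need_mem=1024, strategy="binpack")  # -> 0
pick_gpu(gpus, need_core=10, need_mem=1024, strategy="spread")   # -> 1
```

The same preference order can be applied at the node level first, then at the card level within the chosen node.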
System Architecture
The overall mGPU architecture consists of the following components:
Scheduler: Central scheduler built on the Scheduling Framework with the GPUShare plugin. It (1) schedules Pods to appropriate nodes and (2) schedules each container to a suitable GPU combination, recording results in Pod annotations.
mGPU Device Plugin: Manages mGPU resources on each node. It (1) publishes mGPU resources to the node object and (2) injects environment variables into containers based on the scheduler’s allocation.
API Definition
Nodes report available mGPU resources as extended resources, with separate dimensions for compute and memory. Example Node object:
<code>apiVersion: v1
kind: Node
metadata:
  name: 10.xx.yy.zz
spec:
  ...
status:
  allocatable:
    vke.volcengine.com/mgpu-core: "400"      # compute, percent
    vke.volcengine.com/mgpu-memory: "130040" # memory, MiB
  capacity:
    vke.volcengine.com/mgpu-core: "400"
    vke.volcengine.com/mgpu-memory: "130040"
  ...</code>
Pods request mGPU resources in .spec.containers[i].resources. Example Pod requesting 30% compute and 1 GiB memory:
<code>apiVersion: v1
kind: Pod
metadata:
  name: test-mgpu
  namespace: default
spec:
  containers:
  - name: app
    resources:
      limits:
        vke.volcengine.com/mgpu-core: "30"
        vke.volcengine.com/mgpu-memory: "1024"
      requests:
        vke.volcengine.com/mgpu-core: "30"
        vke.volcengine.com/mgpu-memory: "1024"
  ...</code>
After successful scheduling, the results are stored in Pod annotations, e.g., the container app is assigned to GPU index 3 on node 10.xx.yy.zz.
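For illustration only, such an annotation might look like the following; the annotation key names here are hypothetical, since the article does not show VKE's actual annotation format:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-mgpu
  annotations:
    # Hypothetical keys; the real VKE annotation names may differ.
    vke.volcengine.com/assigned-node: "10.xx.yy.zz"
    vke.volcengine.com/gpu-index-app: "3"
```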
Scheduling Algorithm
The problem is formulated as an optimization problem. The scheduler evaluates each possible GPU combination on a node, applying both node‑level and card‑level Binpack/Spread strategies.
Objective Function
Score = 0.7 × memory‑dimension score + 0.3 × compute‑dimension score. Memory weight is higher because memory cannot be compressed.
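A worked sketch of this weighted score (the per‑dimension scoring used here, an allocated/capacity ratio that rewards fuller cards under Binpack, is an assumption about how each dimension might be normalized):

```python
# Sketch of the combined score: 0.7 * memory score + 0.3 * compute score.
# Each dimension's score is the allocation ratio after placement
# (higher = fuller); this normalization is illustrative, not VKE's exact formula.

MEM_WEIGHT, CORE_WEIGHT = 0.7, 0.3

def combination_score(gpus):
    """Score a candidate GPU combination from per-card allocation state."""
    mem_ratio = sum(g["used_mem"] for g in gpus) / sum(g["cap_mem"] for g in gpus)
    core_ratio = sum(g["used_core"] for g in gpus) / sum(g["cap_core"] for g in gpus)
    return MEM_WEIGHT * mem_ratio + CORE_WEIGHT * core_ratio

gpus = [{"used_mem": 8192, "cap_mem": 16384, "used_core": 30, "cap_core": 100}]
combination_score(gpus)  # 0.7 * 0.5 + 0.3 * 0.3 = 0.44
```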
Constraints
All GPUs in a combination must reside on the same node.
The combination must satisfy each container’s compute and memory requests.
Other scheduling constraints are applied after the optimal node is selected.
Search Algorithm
A depth‑first search (DFS) with backtracking and pruning explores all feasible GPU combinations. The search tree’s depth equals the number of containers; each level represents assigning a container to a GPU. Pruning occurs when a partial assignment violates resource constraints. When a leaf node is reached, the combination is scored, and the best‑scoring combination is retained.
Example: a Pod with three containers on a node with three GPUs. The DFS explores assignments, pruning infeasible paths, and finally selects the optimal GPU set.
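The search described above can be sketched as follows; the data shapes and the scoring over the cards a plan touches are illustrative assumptions, not the actual VKE plugin code:

```python
# Illustrative DFS with backtracking and pruning over container -> GPU assignments.
# containers: per-container requests; gpus: per-card capacity and free resources.

def plan_score(plan, gpus):
    """Score the cards touched by a plan: 0.7 * memory ratio + 0.3 * compute ratio."""
    cards = set(plan)
    cap_core = sum(gpus[i]["cap_core"] for i in cards)
    cap_mem = sum(gpus[i]["cap_mem"] for i in cards)
    used_core = sum(gpus[i]["cap_core"] - gpus[i]["free_core"] for i in cards)
    used_mem = sum(gpus[i]["cap_mem"] - gpus[i]["free_mem"] for i in cards)
    return 0.7 * used_mem / cap_mem + 0.3 * used_core / cap_core

def best_assignment(containers, gpus, score=plan_score):
    best = {"score": -1.0, "plan": None}

    def dfs(idx, plan):
        if idx == len(containers):          # leaf: every container placed
            s = score(plan, gpus)
            if s > best["score"]:
                best["score"], best["plan"] = s, plan[:]
            return
        c = containers[idx]
        for gi, g in enumerate(gpus):
            # Prune: skip cards that cannot satisfy this container's request.
            if g["free_core"] < c["core"] or g["free_mem"] < c["mem"]:
                continue
            g["free_core"] -= c["core"]; g["free_mem"] -= c["mem"]
            plan.append(gi)
            dfs(idx + 1, plan)
            plan.pop()                      # backtrack
            g["free_core"] += c["core"]; g["free_mem"] += c["mem"]

    dfs(0, [])
    return best["plan"], best["score"]

containers = [{"core": 30, "mem": 1024}, {"core": 50, "mem": 2048}]
gpus = [{"cap_core": 100, "free_core": 60, "cap_mem": 16384, "free_mem": 4096},
        {"cap_core": 100, "free_core": 100, "cap_mem": 16384, "free_mem": 16384}]
plan, _ = best_assignment(containers, gpus)
```

With the inputs above, the partial assignment of both containers to the first card is pruned (only 60% compute is free), and the remaining feasible leaves are scored to pick the best combination.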
Summary and Outlook
The GPU sharing scheduler and mGPU virtualization are now available in Volcano Engine’s VKE service. Real‑world tests show a GPU deployment density increase of over 500% and a utilization improvement exceeding 50%.
VKE currently supports scheduling a single container across multiple GPUs and integrates with major batch schedulers. Future work includes GPU topology‑aware scheduling, mixed‑workload placement, and further enhancements to boost AI model training efficiency.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.