Enabling Shared GPU Scheduling in Kubernetes with Extender and Device Plugin
This article explains how to design and implement a Kubernetes extension that lets multiple AI workloads share a single NVIDIA GPU. It defines new extended resources, adds a scheduler extender and a device plugin, and covers deployment steps, demos, and open-source references.
Traditional Kubernetes GPU scheduling assigns an entire GPU card to a single container, which leads to low GPU utilization for AI workloads. The solution introduces fine‑grained GPU resource definitions based on memory (MiB) and card count, enabling multiple pods to share a GPU.
Design Overview
Two new Extended Resources are defined:
gpu-mem – GPU memory in MiB.
gpu-count – Number of GPU cards.
The design reuses Kubernetes extensibility (Extended Resources, Scheduler Extender, Device Plugin, kubelet) without modifying core components, ensuring portability across Kubernetes versions.
Key Design Principles
Focus on scheduling and deployment; runtime memory control is left to the application (e.g., TensorFlow gpu_options.per_process_gpu_memory_fraction).
Avoid invasive changes to the Kubernetes core; leverage existing APIs.
Support either memory‑based or card‑based scheduling per node, but not both simultaneously.
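Because runtime memory control is left to the application, a container has to translate its granted memory into a framework setting itself. The sketch below shows one way to derive a TensorFlow memory fraction. The function name, the sample values, and the way the two numbers reach the application are assumptions for illustration and are not taken from the project.

```python
def memory_fraction(requested_mib: int, device_mib: int) -> float:
    """Return the share of one GPU's memory that the container was granted."""
    if device_mib <= 0:
        raise ValueError("device_mib must be positive")
    if not 0 < requested_mib <= device_mib:
        raise ValueError("requested_mib must be in (0, device_mib]")
    return requested_mib / device_mib


# Illustrative values: 1024 MiB requested on a card exposing 16276 MiB.
fraction = memory_fraction(1024, 16276)

# With TensorFlow 1.x the value would be applied roughly as follows
# (kept as a comment so the sketch runs without TensorFlow installed):
#   config = tf.ConfigProto()
#   config.gpu_options.per_process_gpu_memory_fraction = fraction
#   session = tf.Session(config=config)
```

The scheduler only decides placement. If the application ignores the fraction, nothing in this design stops it from using more memory than it requested.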
Architecture
GPU Share Scheduler Extender: Implements filter and bind extensions. During filtering it checks per‑GPU memory availability; during binding it selects the GPU with the smallest sufficient remaining memory (bin‑packing) and records the GPU ID and memory request in pod annotations.
GPU Share Device Plugin: Uses the NVML library to query GPU count and memory, reports gpu-mem and gpu-count as Extended Resources to the kubelet, and performs actual allocation based on scheduler decisions.
Scheduling Workflow
1. Resource Reporting
The device plugin queries the GPU count and per‑GPU memory through NVML and reports them to the kubelet through ListAndWatch(); the kubelet then publishes them to the API server. Two aggregated resources are reported:
gpu-mem – total memory (GPU count × per‑GPU memory).
gpu-count – number of GPU cards.
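Example: a node with two 16 GiB GPUs reports gpu-mem=32552 (MiB) and gpu-count=2. The total is 2 × 16276 MiB, since the memory each card reports is slightly below its nominal 16384 MiB.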
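The aggregation can be sketched in a few lines. This is a simplified illustration, not the plugin's code: the function name is hypothetical, and the per‑card value of 16276 MiB is an assumption chosen so that the total matches the example.

```python
from typing import Dict, List


def aggregate_resources(per_gpu_mib: List[int]) -> Dict[str, int]:
    """Collapse per-card memory into the two node-level extended resources."""
    return {
        "gpu-mem": sum(per_gpu_mib),    # total memory across all cards, in MiB
        "gpu-count": len(per_gpu_mib),  # number of cards on the node
    }


# Two cards, each assumed to expose 16276 MiB of usable memory.
report = aggregate_resources([16276, 16276])
print(report)  # {'gpu-mem': 32552, 'gpu-count': 2}
```

The node-level total hides how memory is split across cards, which is why a second, per‑card filter is needed at scheduling time.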
2. Extended Scheduling
The default scheduler performs a coarse filter using the aggregated resources.
If a node passes, the Scheduler Extender runs a second filter that examines each GPU card to ensure enough free memory for the pod’s gpu-mem request.
During binding, the extender selects the GPU with the smallest sufficient remaining memory (bin‑packing) and stores the following annotations on the pod:
ALIYUN_COM_GPU_MEM_IDX – selected GPU index.
ALIYUN_COM_GPU_MEM_POD – requested memory.
ALIYUN_COM_GPU_MEM_ASSUME_TIME – timestamp of the assume operation.
ALIYUN_COM_GPU_MEM_ASSIGNED – initially false, set to true after allocation.
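The two filters and the bin‑packing choice can be sketched as follows. This is a minimal model under stated assumptions: the function names are hypothetical, free memory is passed in as a plain list indexed by GPU, and the annotation value formats (strings, a nanosecond timestamp) are guesses rather than the extender's confirmed encoding.

```python
import time
from typing import Dict, List, Optional


def coarse_filter(free_per_gpu: List[int], request_mib: int) -> bool:
    """Default-scheduler view: only the node-wide total is compared."""
    return sum(free_per_gpu) >= request_mib


def fine_filter(free_per_gpu: List[int], request_mib: int) -> bool:
    """Extender view: at least one single card must fit the request."""
    return any(free >= request_mib for free in free_per_gpu)


def pick_gpu(free_per_gpu: List[int], request_mib: int) -> Optional[int]:
    """Bin-packing: pick the card with the least free memory that still fits."""
    candidates = [(free, idx) for idx, free in enumerate(free_per_gpu)
                  if free >= request_mib]
    if not candidates:
        return None
    return min(candidates)[1]  # ties resolve to the lower GPU index


def bind_annotations(gpu_idx: int, request_mib: int,
                     now_ns: Optional[int] = None) -> Dict[str, str]:
    """Build the annotations recorded on the pod at bind time."""
    if now_ns is None:
        now_ns = time.time_ns()
    return {
        "ALIYUN_COM_GPU_MEM_IDX": str(gpu_idx),
        "ALIYUN_COM_GPU_MEM_POD": str(request_mib),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(now_ns),  # format is an assumption
        "ALIYUN_COM_GPU_MEM_ASSIGNED": "false",
    }


# A node with 8000 MiB free in total, but no single card can hold 6000 MiB.
print(coarse_filter([4000, 4000], 6000))  # True
print(fine_filter([4000, 4000], 6000))    # False

# Bin-packing prefers the tightest fit: GPU 1 with 2048 MiB free.
print(pick_gpu([8000, 2048, 3000], 2000))  # 1
```

Choosing the tightest fit keeps the emptier cards available for later pods with larger requests.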
3. Node Execution
When the pod is bound, the kubelet invokes the device plugin’s Allocate method with the requested gpu-mem. The plugin:
Lists pending pods on the node whose ALIYUN_COM_GPU_MEM_ASSIGNED annotation is false.
Selects the pod whose ALIYUN_COM_GPU_MEM_POD matches the allocation request (preferring the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME if multiple match).
Marks the pod as assigned (ALIYUN_COM_GPU_MEM_ASSIGNED=true) and injects GPU information (GPU index, memory) as environment variables for the container runtime.
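The matching steps above can be sketched as below. Pods are modelled as plain dictionaries instead of Kubernetes API objects, the function names are hypothetical, and the environment variable names EXAMPLE_GPU_INDEX and EXAMPLE_GPU_MEM_MIB are placeholders, since the names the plugin actually injects are not given here.

```python
from typing import Dict, List, Optional

IDX = "ALIYUN_COM_GPU_MEM_IDX"
MEM_POD = "ALIYUN_COM_GPU_MEM_POD"
ASSUME_TIME = "ALIYUN_COM_GPU_MEM_ASSUME_TIME"
ASSIGNED = "ALIYUN_COM_GPU_MEM_ASSIGNED"


def match_pending_pod(pods: List[dict], request_mib: int) -> Optional[dict]:
    """Find the unassigned pod whose recorded request equals this allocation."""
    candidates = [
        pod for pod in pods
        if pod["annotations"].get(ASSIGNED) == "false"
        and int(pod["annotations"].get(MEM_POD, -1)) == request_mib
    ]
    if not candidates:
        return None
    # Several pods may request the same amount: take the earliest assumed one.
    return min(candidates, key=lambda pod: int(pod["annotations"][ASSUME_TIME]))


def allocate(pods: List[dict], request_mib: int) -> Dict[str, str]:
    """Mark the matched pod as assigned and return the env vars to inject."""
    pod = match_pending_pod(pods, request_mib)
    if pod is None:
        raise LookupError(f"no pending pod requests {request_mib} MiB")
    pod["annotations"][ASSIGNED] = "true"
    return {
        "EXAMPLE_GPU_INDEX": pod["annotations"][IDX],  # placeholder name
        "EXAMPLE_GPU_MEM_MIB": str(request_mib),       # placeholder name
    }


pods = [
    {"name": "a", "annotations": {IDX: "1", MEM_POD: "1024",
                                  ASSUME_TIME: "200", ASSIGNED: "false"}},
    {"name": "b", "annotations": {IDX: "0", MEM_POD: "1024",
                                  ASSUME_TIME: "100", ASSIGNED: "false"}},
]
env = allocate(pods, 1024)  # matches pod "b", which was assumed first
```

The Allocate request carries only an amount of memory, not a pod identity, so the annotations written by the extender are what tie an allocation back to a specific pod and GPU.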
Deployment and Usage
The components are open‑source:
gpushare-scheduler-extender – https://github.com/AliyunContainerService/gpushare-scheduler-extender
gpushare-device-plugin – https://github.com/AliyunContainerService/gpushare-device-plugin
Installation and usage instructions are provided in the repository documentation (install guide, user guide).
Sample Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: binpack-1
  template:
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # memory in MiB
            aliyun.com/gpu-mem: 1024
Roadmap
Add optional NVIDIA MPS support in the device plugin.
Enable automated deployment on kubeadm‑initialized clusters.
Improve high‑availability of the Scheduler Extender.
Extend the approach to other devices such as RDMA and elastic network interfaces.
Alibaba Cloud Native
