Enabling Shared GPU Scheduling in Kubernetes with Extender and Device Plugin

This article explains how to design and implement a Kubernetes extension that lets multiple AI workloads share a single NVIDIA GPU. It defines new extended resources, adds a scheduler extender and a device plugin, and covers deployment steps, a sample workload, and the open‑source repositories.

Alibaba Cloud Native

Traditional Kubernetes GPU scheduling assigns an entire GPU card to a single container, which leaves the GPU underused whenever an AI workload needs only a fraction of the card. The approach described here introduces fine‑grained GPU resource definitions based on memory (MiB) and card count, so that multiple pods can share a GPU.

Design Overview

Two new Extended Resources are defined:

gpu-mem – GPU memory in MiB.

gpu-count – Number of GPU cards.

The design reuses Kubernetes extensibility (Extended Resources, Scheduler Extender, Device Plugin, kubelet) without modifying core components, ensuring portability across Kubernetes versions.

Key Design Principles

Focus on scheduling and deployment; runtime memory control is left to the application (e.g., TensorFlow gpu_options.per_process_gpu_memory_fraction).

Avoid invasive changes to the Kubernetes core; leverage existing APIs.

Support either memory‑based or card‑based scheduling per node, but not both simultaneously.
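As a rough illustration of the first principle, an application can turn its granted share of a card into the fraction that TensorFlow expects. The sketch below rests on assumptions: the environment variable names GPU_MEM_REQUESTED_MIB and GPU_MEM_DEVICE_TOTAL_MIB are hypothetical placeholders rather than names defined in this article, and the TensorFlow calls appear only as comments.

```python
import os


def gpu_memory_fraction(requested_mib, device_total_mib):
    """Return the share of one GPU's memory granted to this container, capped at 1.0."""
    if device_total_mib <= 0:
        raise ValueError("device_total_mib must be positive")
    if requested_mib < 0:
        raise ValueError("requested_mib must not be negative")
    return min(1.0, requested_mib / device_total_mib)


def fraction_from_env(env=None):
    """Compute the fraction from two hypothetical environment variables."""
    env = os.environ if env is None else env
    requested = int(env["GPU_MEM_REQUESTED_MIB"])      # assumed name
    total = int(env["GPU_MEM_DEVICE_TOTAL_MIB"])       # assumed name
    return gpu_memory_fraction(requested, total)


# With TensorFlow 1.x the result would be passed roughly like this:
#   gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=fraction)
#   session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```

Nothing in the scheduler or device plugin enforces this cap, so a process that ignores it can still exhaust the card and disturb its neighbors.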

Architecture

GPU Share Scheduler Extender: Implements filter and bind extensions. During filtering it checks per‑GPU memory availability; during binding it selects the GPU with the smallest sufficient remaining memory (bin‑packing) and records the GPU ID and memory request in pod annotations.

GPU Share Device Plugin: Uses the NVML library to query GPU count and memory, reports gpu-mem and gpu-count as Extended Resources to the kubelet, and performs actual allocation based on scheduler decisions.

Architecture diagram

Scheduling Workflow

1. Resource Reporting

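The device plugin uses NVML to discover the GPU count and per‑GPU memory, then advertises them to the kubelet through its ListAndWatch() stream; the kubelet in turn publishes them to the API server as part of the node status. Two aggregated resources are reported:

gpu-mem – total memory (GPU count × per‑GPU memory).

gpu-count – number of GPU cards.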

Example: a node with two nominally 16 GiB GPUs, each exposing 16276 MiB, reports gpu-mem=32552 (2 × 16276 MiB) and gpu-count=2.
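The aggregation can be sketched in a few lines. This is an illustrative example only: the function name aggregate_gpu_resources is hypothetical, and the hard-coded list stands in for values a real plugin would obtain from NVML. Summing the per-card values equals GPU count × per‑GPU memory when all cards are identical.

```python
def aggregate_gpu_resources(per_gpu_mem_mib):
    """Collapse per-card memory (MiB) into the two node-level extended resources."""
    return {
        "gpu-mem": sum(per_gpu_mem_mib),     # total MiB across all cards
        "gpu-count": len(per_gpu_mem_mib),   # number of cards
    }


# Two cards that each expose 16276 MiB.
node = aggregate_gpu_resources([16276, 16276])
print(node)  # {'gpu-mem': 32552, 'gpu-count': 2}
```

Because only the totals reach the API server, the default scheduler cannot tell how free memory is distributed across cards.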

2. Extended Scheduling

The default scheduler performs a coarse filter using the aggregated resources.

If a node passes, the Scheduler Extender runs a second filter that examines each GPU card to ensure enough free memory for the pod’s gpu-mem request.

During binding, the extender selects the GPU with the smallest sufficient remaining memory (bin‑packing) and stores the following annotations on the pod:

ALIYUN_COM_GPU_MEM_IDX – selected GPU index.

ALIYUN_COM_GPU_MEM_POD – requested memory.

ALIYUN_COM_GPU_MEM_ASSUME_TIME – timestamp of the assume operation.

ALIYUN_COM_GPU_MEM_ASSIGNED – initially false, set to true after allocation.
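A minimal sketch of the two filters and the bin‑packing choice follows. It is not the extender's actual code: the function names are hypothetical, free memory is modeled as a plain list of MiB values per card, and the nanosecond timestamp format is an assumption. Annotation values are written as strings because Kubernetes annotations are string-valued.

```python
import time


def passes_coarse_filter(free_mib_per_gpu, request_mib):
    """Default-scheduler view: only the node-wide total is visible."""
    return sum(free_mib_per_gpu) >= request_mib


def pick_gpu_binpack(free_mib_per_gpu, request_mib):
    """Return the index of the card with the least free memory that still fits, or None."""
    fitting = [
        (free, idx)
        for idx, free in enumerate(free_mib_per_gpu)
        if free >= request_mib
    ]
    if not fitting:
        return None          # extender filter rejects the node
    return min(fitting)[1]   # ties resolve to the lowest index


def bind_annotations(gpu_idx, request_mib, now_ns=None):
    """Build the annotations recorded on the pod at bind time."""
    if now_ns is None:
        now_ns = time.time_ns()  # assumed timestamp format
    return {
        "ALIYUN_COM_GPU_MEM_IDX": str(gpu_idx),
        "ALIYUN_COM_GPU_MEM_POD": str(request_mib),
        "ALIYUN_COM_GPU_MEM_ASSUME_TIME": str(now_ns),
        "ALIYUN_COM_GPU_MEM_ASSIGNED": "false",
    }
```

For example, a node with 8000 MiB free on each of two cards passes the coarse filter for a 10000 MiB request (16000 MiB in total), yet pick_gpu_binpack returns None because no single card can hold it. With free memory of 8000, 2000, and 4000 MiB and a 1024 MiB request, the card at index 1 is chosen, which keeps the larger gaps available for bigger pods.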

Scheduling example

3. Node Execution

When the pod is bound, the kubelet invokes the device plugin’s Allocate method with the requested gpu-mem. The plugin:

Lists pending pods on the node whose ALIYUN_COM_GPU_MEM_ASSIGNED annotation is false.

Selects the pod whose ALIYUN_COM_GPU_MEM_POD matches the allocation request (preferring the earliest ALIYUN_COM_GPU_MEM_ASSUME_TIME if multiple match).

Marks the pod as assigned (ALIYUN_COM_GPU_MEM_ASSIGNED=true) and injects GPU information (GPU index, memory) as environment variables for the container runtime.
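The three steps above can be sketched as follows. This is a simplified model under stated assumptions: pods are plain dictionaries rather than API objects, the function names are hypothetical, and the returned environment variable names GPU_INDEX and GPU_MEM_MIB are placeholders, not the names the real plugin injects.

```python
ASSIGNED = "ALIYUN_COM_GPU_MEM_ASSIGNED"
POD_MEM = "ALIYUN_COM_GPU_MEM_POD"
ASSUME_TIME = "ALIYUN_COM_GPU_MEM_ASSUME_TIME"
GPU_IDX = "ALIYUN_COM_GPU_MEM_IDX"


def select_pending_pod(pods, requested_mib):
    """Pick the unassigned pod whose request matches, preferring the earliest assume time."""
    matches = [
        pod for pod in pods
        if pod["annotations"].get(ASSIGNED) == "false"
        and pod["annotations"].get(POD_MEM) == str(requested_mib)
    ]
    if not matches:
        return None
    return min(matches, key=lambda pod: int(pod["annotations"][ASSUME_TIME]))


def allocate(pods, requested_mib):
    """Mark the matching pod as assigned and return the variables to inject."""
    pod = select_pending_pod(pods, requested_mib)
    if pod is None:
        raise LookupError(f"no pending pod requests {requested_mib} MiB")
    pod["annotations"][ASSIGNED] = "true"
    return {
        "GPU_INDEX": pod["annotations"][GPU_IDX],    # hypothetical variable name
        "GPU_MEM_MIB": pod["annotations"][POD_MEM],  # hypothetical variable name
    }
```

Matching on the memory amount is needed because the Allocate request does not identify the pod; the assume time only breaks ties between pods that request the same amount.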

Allocation flow

Deployment and Usage

The components are open‑source:

gpushare-scheduler-extender – https://github.com/AliyunContainerService/gpushare-scheduler-extender

gpushare-device-plugin – https://github.com/AliyunContainerService/gpushare-device-plugin

Installation and usage instructions are provided in the repository documentation (install guide, user guide).

Sample Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: binpack-1
  labels:
    app: binpack-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: binpack-1
  template:
    metadata:
      labels:
        app: binpack-1
    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # memory in MiB
            aliyun.com/gpu-mem: 1024

Roadmap

Add optional NVIDIA MPS support in the device plugin.

Enable automated deployment on kubeadm‑initialized clusters.

Improve high‑availability of the Scheduler Extender.

Extend the approach to other devices such as RDMA and elastic network interfaces.
