Koordinator v1.6: Enhancing Heterogeneous GPU Scheduling for Cloud‑Native Clusters
Koordinator v1.6 introduces GPU topology‑aware scheduling, end‑to‑end GPU & RDMA joint allocation, fine‑grained GPU sharing, differentiated scoring for GPU vs CPU resources, advanced reservation and mixed‑workload support, plus numerous scheduler and rescheduler optimizations to improve resource utilization and performance in Kubernetes clusters.
Background
Large‑model AI services and high‑performance computing increasingly rely on heterogeneous accelerators such as GPUs, NPUs and RDMA devices. Efficient allocation and scheduling of these resources is a critical challenge for cloud‑native clusters.
Key Features in Koordinator v1.6
1. GPU Topology‑Aware Scheduling
Koordinator detects GPU, CPU and memory topology across NUMA nodes and can enforce placement policies (e.g., same‑NUMA, same‑PCIe, custom partitions). Pods request the policy via annotations.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: '{"numaTopologyPolicy":"Restricted", "singleNUMANodeExclusive":"Preferred"}'
spec:
  containers:
  - resources:
      limits:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi
      requests:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi
Additional scopes are expressed with scheduling.koordinator.sh/device-allocate-hint to require GPUs on the same PCIe or NUMA node.
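As a hedged illustration of such a hint, the sketch below asks for two GPUs that share one PCIe scope. The field name requiredTopologyScope and the value PCIe are assumptions based on the Device Allocate Hint proposal and are not confirmed by this article. The pod name, container name, and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pcie-demo              # hypothetical name
  annotations:
    # Assumed hint shape: require all allocated GPUs to share one PCIe scope.
    scheduling.koordinator.sh/device-allocate-hint: |-
      {"gpu": {"requiredTopologyScope": "PCIe"}}
spec:
  schedulerName: koord-scheduler
  containers:
  - name: trainer                  # hypothetical name
    image: busybox                 # placeholder image
    resources:
      limits:
        koordinator.sh/gpu: 200    # read here as two full GPUs (100 per device)
      requests:
        koordinator.sh/gpu: 200
```

Replacing PCIe with a NUMA-level scope would express the same-NUMA variant, again assuming the proposal's naming.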
2. End‑to‑End GPUDirect RDMA (GDR) Support
Koordinator integrates GPU and RDMA devices, allowing joint allocation of both resources. This reduces inter‑node communication latency for distributed training.
apiVersion: v1
kind: Pod
metadata:
  name: pod-vf01
  namespace: kubeflow
  annotations:
    scheduling.koordinator.sh/device-joint-allocate: |-
      {"deviceTypes":["gpu","rdma"]}
    scheduling.koordinator.sh/device-allocate-hint: |-
      {"rdma": {"vfSelector": {}}}
spec:
  schedulerName: koord-scheduler
  containers:
  - name: container-vf
    resources:
      requests:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
      limits:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
3. GPU Sharing with Strong Isolation (HAMi‑Core)
Through the CNCF‑sandbox HAMi project, Koordinator enables multiple pods to share a single GPU while guaranteeing isolation via partition policies. Deploy the HAMi‑Core library with a DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distribute
  namespace: default
spec:
  selector:
    matchLabels:
      koord-app: hami-core-distribute
  template:
    metadata:
      labels:
        koord-app: hami-core-distribute
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - "gpu"
      containers:
      - name: hami
        image: docker.m.daocloud.io/projecthami/hami:v2.4.0
        command: ["/bin/sh","-c","cp -f /k8s-vgpu/lib/nvidia/libvgpu.so /usl/local/vgpu && sleep 3600000"]
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
        volumeMounts:
        - name: vgpu-hook
          mountPath: /usl/local/vgpu
        - name: vgpu-lock
          mountPath: /tmp/vgpulock
      tolerations:
      - operator: Exists
      volumes:
      - name: vgpu-hook
        hostPath:
          path: /usl/local/vgpu
          type: DirectoryOrCreate
      - name: vgpu-lock
        hostPath:
          path: /tmp/vgpulock
          type: DirectoryOrCreate
Pods then request a share of a GPU through the extended resource names koordinator.sh/gpu-shared, koordinator.sh/gpu-core, and koordinator.sh/gpu-memory-ratio.
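A minimal sketch of a pod that asks for half of one GPU follows. It assumes the three names above are used as extended resources in the container's requests and limits, and that HAMi‑Core isolation is selected with a koordinator.sh/gpu-isolation-provider label. That label, its value, and the pod name, container name, and image are assumptions not stated in this article.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-demo                                # hypothetical name
  labels:
    koordinator.sh/gpu-isolation-provider: HAMi-core   # assumed label and value
spec:
  schedulerName: koord-scheduler
  containers:
  - name: worker                                       # hypothetical name
    image: busybox                                     # placeholder image
    command: ["sleep", "365d"]
    resources:
      limits:
        koordinator.sh/gpu-shared: 1          # share one physical GPU
        koordinator.sh/gpu-core: 50           # 50% of its compute
        koordinator.sh/gpu-memory-ratio: 50   # 50% of its memory
      requests:
        koordinator.sh/gpu-shared: 1
        koordinator.sh/gpu-core: 50
        koordinator.sh/gpu-memory-ratio: 50
```

Two such pods could then be packed onto the same device, each limited to its own half.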
4. Differential GPU Scheduling Strategies
The new NodeResourcesFitPlus plugin allows distinct scoring strategies per resource type (e.g., MostAllocated for GPUs, LeastAllocated for CPU/Memory) to reduce GPU fragmentation when GPU‑heavy and CPU‑heavy workloads coexist.
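To make the per-resource strategies concrete, the scoring can be sketched as follows. This assumes NodeResourcesFitPlus uses the conventional kube-scheduler definitions of MostAllocated and LeastAllocated and combines them as a weighted average. The symbols are introduced here for illustration: r_i is the requested amount of resource i on the node including the incoming pod, c_i is the node's allocatable amount, and w_i is the configured weight.

```latex
s_i =
\begin{cases}
100 \cdot \dfrac{r_i}{c_i} & \text{if the strategy for } i \text{ is MostAllocated} \\[6pt]
100 \cdot \dfrac{c_i - r_i}{c_i} & \text{if the strategy for } i \text{ is LeastAllocated}
\end{cases}
\qquad
S_{\text{node}} = \frac{\sum_i w_i \, s_i}{\sum_i w_i}
```

For example, with weights 2, 1, 1 for GPU, CPU, and memory, a node that would have 6 of 8 GPUs, 32 of 64 CPU cores, and 128 of 512 GiB of memory in use after placement gets s_gpu = 75, s_cpu = 50, s_mem = 75, so S_node = (2·75 + 50 + 75) / 4 = 68.75. A fuller GPU node scores higher on the GPU term, which is what packs GPU pods together.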
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFitPlus
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitPlusArgs
      resources:
        nvidia.com/gpu:
          type: MostAllocated
          weight: 2
        cpu:
          type: LeastAllocated
          weight: 1
        memory:
          type: LeastAllocated
          weight: 1
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
  schedulerName: koord-scheduler
The ScarceResourceAvoidance plugin steers pods that do not request GPUs away from GPU nodes, keeping those nodes free for GPU workloads.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: ScarceResourceAvoidance
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: ScarceResourceAvoidanceArgs
      resources:
      - nvidia.com/gpu
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
      - name: ScarceResourceAvoidance
        weight: 2
      disabled:
      - name: "*"
  schedulerName: koord-scheduler
5. Fine‑Grained Resource Reservation
Reservation APIs now support exact‑match reservations, optional ignoring of reservations, and affinity/toleration rules.
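For context, these features act on pods that consume a Reservation object. A minimal sketch of such an object follows, assuming the scheduling.koordinator.sh/v1alpha1 Reservation API. The name test-reservation matches the affinity example in this section. The owner label, resource amounts, TTL, container name, and image are hypothetical.

```yaml
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Reservation
metadata:
  name: test-reservation
spec:
  template:                        # pod template describing what to reserve
    namespace: default
    spec:
      schedulerName: koord-scheduler
      containers:
      - name: placeholder          # hypothetical container, used only for sizing
        image: busybox             # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
  owners:                          # pods allowed to consume the reservation
  - labelSelector:
      matchLabels:
        app: training-job          # hypothetical label
  ttl: 1h                          # reservation expires after one hour
```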
# Exact‑match reservation
scheduling.koordinator.sh/exact-match-reservation: '{"resourceNames":["cpu","memory","nvidia.com/gpu"]}'
# Ignore reservation
scheduling.koordinator.sh/reservation-ignored: "true"
# Reservation affinity by name
scheduling.koordinator.sh/reservation-affinity: '{"name":"test-reservation"}'
# Reservation affinity with tolerations
scheduling.koordinator.sh/reservation-affinity: '{"tolerations":[{"key":"test-taint-key","operator":"Equal","value":"test-taint-value","effect":"NoSchedule"}]}'
Reservation preemption is enabled via the Reservation plugin:
apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: Reservation
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: ReservationArgs
      enablePreemption: true
  plugins:
    postFilter:
      disabled:
      - name: DefaultPreemption
      enabled:
      - name: Reservation
6. Mixed‑Workload (Mid‑Tier) Enhancements
Improvements include over‑commit handling, node‑profile‑based calculations, and QoS extensions such as Resctrl (LLC & memory‑bandwidth isolation) and per‑pod CPU QoS.
# Example Resctrl annotation
node.koordinator.sh/resctrl: '{"llc":{"schemata":{"range":[0,30]}},"mb":{"schemata":{"percent":20}}}'
# Example CPU QoS annotation
koordinator.sh/cpuQOS: '{"groupIdentity":1}'
7. Scheduler & Rescheduler Optimizations
Performance improvements:
PodGroup size checks moved earlier (PreEnqueue).
Reservation resource return deferred to AfterPreFilter.
Reduced CycleState memory for NodeNUMAResource, DeviceShare, and Reservation plugins.
Latency metrics added for new extension points (BeforePreFilter, AfterPreFilter).
Rescheduling gains:
LowNodeLoad scoring now supports ProdHighThresholds/ProdLowThresholds and smarter eviction ordering.
MigrationController gains namespace‑level eviction rate limiting, ObjectLimiter integration, EvictAllBarePods option, and MaxMigratingGlobally limit.
Global MaxNoOfPodsToEvictTotal limits total evictions per cycle.
Future Plans
Upcoming proposals focus on:
Fine‑grained device scheduling for Huawei NPU (https://github.com/koordinator-sh/koordinator/issues/2335).
A dedicated rescheduling plugin to address resource imbalance (https://github.com/koordinator-sh/koordinator/issues/2332).
Extending reservations to bind already‑assigned pods (https://github.com/koordinator-sh/koordinator/issues/2150).
Long‑term goals include an end‑to‑end evolvable device‑management framework (https://github.com/koordinator-sh/koordinator/issues/2181).
References
NUMA Topology Scheduling: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20230415-numa-topology-scheduling.md
Device Allocate Hint API: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20230803-device-allocate-hint-apis.md
GPU Partition APIs: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20241008-gpu-partition-api.md
Best‑practice guide for GPU & RDMA joint allocation: https://koordinator.sh/docs/next/best-practices/gpu-and-rdma-joint-allocation/
Alibaba Cloud Native