Koordinator v1.7.0 Brings Network‑Aware Scheduling and Job‑Level Preemption for AI Workloads
Koordinator v1.7.0, the open‑source Kubernetes scheduling system, adds network‑topology‑aware scheduling, job‑level preemption, and support for Ascend NPU and Cambricon MLU. The release also delivers unified heterogeneous device management, enhanced GPU sharing, comprehensive API documentation, and best‑practice guides, all aimed at improving large‑scale AI training efficiency and cluster operations.
Background
As AI models grow, large‑language‑model (LLM) and distributed AI training put unprecedented pressure on cluster resource scheduling, requiring efficient cross‑node communication, intelligent resource preemption, and unified heterogeneous device management.
Since its open‑source launch in April 2022, Koordinator has released 15 major versions, with contributions from Alibaba, Ant Group, Intel, Xiaohongshu, Xiaomi, iQIYI, 360, Youzan and others.
Release Highlights – Koordinator v1.7.0
Network‑topology‑aware scheduling
Job‑level preemption
Support for Ascend NPU and Cambricon MLU
Comprehensive API reference and developer guide
Network‑Topology‑Aware Scheduling
In large‑scale AI training, especially for LLMs, tensor, pipeline and data parallelism require high‑bandwidth cross‑GPU communication, so the physical network hierarchy (NVLink, block, spine) becomes a bottleneck. Koordinator v1.7.0 adds network‑topology‑aware scheduling that places Pods with topology constraints into optimal topology domains and, when resources are scarce, performs job‑level preemption, reserving nodes via .status.nominatedNode.
Configuration example (Node label):
apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    network.topology.nvidia.com/block: b1
    network.topology.nvidia.com/spine: s1
ClusterNetworkTopology CR example:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
  - labelKey:
    - network.topology.nvidia.com/spine
    topologyLayer: SpineLayer
  - labelKey:
    - network.topology.nvidia.com/block
    parentTopologyLayer: SpineLayer
    topologyLayer: BlockLayer
  - parentTopologyLayer: BlockLayer
    topologyLayer: NodeTopologyLayer
PodGroup with topology annotation:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {"layer": "BlockLayer", "strategy": "PreferGather"}
        ]
      }
spec:
  minMember: 8
  scheduleTimeoutSeconds: 300
Job‑Level Preemption
Traditional pod‑level preemption cannot guarantee that all members of a distributed job are scheduled together. Koordinator v1.7.0 introduces job‑level preemption that triggers only when the entire GangGroup can be placed, using status.nominatedNode to hold resources.
The preemption workflow: (1) detect an unschedulable Pod; (2) identify its GangGroup; (3) check preemption eligibility; (4) select candidate nodes by simulating victim removal; (5) apply a job‑aware cost model; (6) evict the victims and set nominatedNode.
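The gating condition behind steps (4) and (6) can be sketched as follows. This is an illustrative model only, not Koordinator's actual Go implementation; the names (`Pod`, `Node`, `can_place_gang`) and the single‑resource, first‑fit simulation are simplifying assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pod:
    name: str
    priority: int
    cpu: int          # requested CPU, in millicores

@dataclass
class Node:
    name: str
    capacity: int     # allocatable CPU, in millicores
    running: List[Pod] = field(default_factory=list)

def can_place_gang(gang: List[Pod], nodes: List[Node]) -> bool:
    """Job-level check: preemption is allowed only if, after simulating
    the eviction of strictly lower-priority victims, *every* member of
    the gang can be placed somewhere."""
    gang_prio = min(p.priority for p in gang)
    # Simulated free capacity per node: higher- or equal-priority pods
    # are not eligible victims and keep their allocation.
    free = []
    for n in nodes:
        kept = sum(p.cpu for p in n.running if p.priority >= gang_prio)
        free.append(n.capacity - kept)
    # Greedy first-fit placement of gang members, largest request first.
    for pod in sorted(gang, key=lambda p: p.cpu, reverse=True):
        for i, cap in enumerate(free):
            if cap >= pod.cpu:
                free[i] -= pod.cpu
                break
        else:
            return False  # one member cannot fit -> no preemption at all
    return True
```

This captures the key difference from pod‑level preemption: a partial fit returns False, so victims are never evicted for a job that could not fully start anyway.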
PriorityClass examples:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Critical AI training jobs that may preempt others."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "Non-critical workloads."
High‑priority gang job example:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: hp-training-job
  namespace: default
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
  - name: worker
    resources:
      limits:
        cpu: 3
        memory: 4Gi
      requests:
        cpu: 3
        memory: 4Gi
Heterogeneous Device Scheduling
v1.7.0 extends device scheduling to support Ascend NPU and Cambricon MLU, providing unified device management and scheduling.
Ascend NPU is supported via koord-device-daemon and koordlet, with features such as device reporting, partition‑aware scheduling, and PCIe/NUMA topology placement. Example Device CR:
apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  name: node-1
  labels:
    node.koordinator.sh/gpu-model: Ascend-910B3
  annotations:
    scheduling.koordinator.sh/gpu-partitions: |
      {
        "4": [
          {"minors": [0, 1, 2, 3], "gpuLinkType": "HCCS", "allocationScore": "1"}
        ]
      }
spec:
  devices:
  - health: true
    id: GPU-fd971b33-4891-fd2e-ed42-ce6adf324615
    minor: 0
    resources:
      koordinator.sh/gpu-memory: 64Gi
      koordinator.sh/gpu-memory-ratio: "100"
    topology:
      busID: 0000:3b:00.0
      nodeID: 0
      pcieID: pci0000:3a
    type: gpu
Cambricon MLU is supported in full‑card and dynamic‑smlu modes, exposed through the unified koordinator.sh/gpu-* resources. Example Pod requesting a virtual MLU:
apiVersion: v1
kind: Pod
metadata:
  name: test-cambricon-partial
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - name: demo-sleep
    image: ubuntu:18.04
    resources:
      limits:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"
      requests:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"
Other Enhancements
GPU sharing upgraded to HAMi v2.6.0, with support for NVIDIA driver ≥ 570
Helm‑based hami‑daemon chart (0.1.0) for easier deployment
vGPUmonitor component exposing Prometheus metrics (memory usage, core utilization, etc.)
Load‑aware scheduling optimizations (PreFilter cache, dominantResourceWeight, prodUsageIncludeSys, and more)
ElasticQuota Hook Plugin framework improvements (custom quota validation, pod update hooks)
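As a rough illustration of where the new load‑aware knobs plug in, the fragment below follows the standard KubeSchedulerConfiguration plugin‑args pattern. The args API version, field placement, and values are all assumptions; only the parameter names dominantResourceWeight and prodUsageIncludeSys come from the release notes, so consult the v1.7.0 API reference for the real LoadAwareSchedulingArgs schema:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: koord-scheduler
  pluginConfig:
  - name: LoadAwareScheduling
    args:
      # Hypothetical values; verify names and defaults against the API reference.
      dominantResourceWeight: 1    # weight the dominant resource when scoring nodes
      prodUsageIncludeSys: true    # count system usage toward prod usage accounting
```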
API Reference & Developer Guide
v1.7.0 ships a full API reference covering all custom resource definitions (Recommendation, ClusterColocationProfile, ElasticQuota, Reservation, Device, NodeMetric, …) and client libraries for Go, Python, and other languages, plus a developer guide that details component architecture, metric collection, extensibility points, plugin development, custom scheduling strategies, and webhook extensions.
Best Practice – Batch Colocation Quick Start
The new guide walks users through deploying Koordinator, configuring mixed‑workload profiles, observing resource‑utilization improvements via batch over‑commit, and troubleshooting.
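A minimal sketch of the kind of workload the quick start deploys, assuming Koordinator's usual colocation convention (a best‑effort pod labeled `koordinator.sh/qosClass: BE` consuming over‑committed `kubernetes.io/batch-*` extended resources); the guide itself is the authoritative walkthrough:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-demo
  labels:
    koordinator.sh/qosClass: BE        # best-effort, colocated workload
spec:
  schedulerName: koord-scheduler
  containers:
  - name: worker
    image: ubuntu:22.04
    command: ["sleep", "infinity"]
    resources:
      requests:
        kubernetes.io/batch-cpu: "1000"    # over-committed CPU, in millicores
        kubernetes.io/batch-memory: 2Gi
      limits:
        kubernetes.io/batch-cpu: "1000"
        kubernetes.io/batch-memory: 2Gi
```

Such pods run on capacity reclaimed from the gap between allocated and actually used resources, which is where the utilization improvements the guide demonstrates come from.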
Contributors
Fourteen new developers contributed to v1.7.0, including @ditingdapeng, @Rouzip, @ClanEver, @zheng-weihao, @cntigers, @LennonChin, @ZhuZhezz, @dabaooline, @bobsongplus, @yccharles, @qingyuanz, @yyrdl, @hwenwur, and @hkttty2009.
Future Plans
Upcoming work includes queue and quota management integration, task scheduling enhancements, heterogeneous scheduling strategies (GPU DRA, broader device support), Kubernetes 1.33 upgrade, pre‑allocation support, and pod‑schedule audit tools.
For more details, see the linked documentation and release notes.