
Koordinator v1.7.0 Brings Network‑Aware Scheduling and Job‑Level Preemption for AI Workloads

Koordinator v1.7.0, the latest release of the open-source Kubernetes scheduling system, adds network-topology-aware scheduling, job-level preemption, and support for Ascend NPUs and Cambricon MLUs. The release also delivers unified heterogeneous device management, enhanced GPU sharing, comprehensive API documentation, and best-practice guides, improving large-scale AI training efficiency and cluster operations.


Background

As AI models grow, large-language-model (LLM) training and other distributed AI workloads put unprecedented pressure on cluster resource scheduling: they demand efficient cross-node communication, intelligent resource preemption, and unified management of heterogeneous devices.

Since its open‑source launch in April 2022, Koordinator has released 15 major versions, with contributions from Alibaba, Ant Group, Intel, Xiaohongshu, Xiaomi, iQIYI, 360, Youzan and others.

Release Highlights – Koordinator v1.7.0

Network‑topology‑aware scheduling

Job‑level preemption

Support for Ascend NPU and Cambricon MLU

Comprehensive API reference and developer guide

Network‑Topology‑Aware Scheduling

In large-scale AI training, and LLM training in particular, tensor, pipeline, and data parallelism all depend on high-bandwidth cross-GPU communication, so the physical network hierarchy (NVLink, block, spine) becomes the bottleneck. Koordinator v1.7.0 adds network-topology-aware scheduling that places Pods with topology constraints into the best-matching topology domain and, when resources are scarce, performs job-level preemption, reserving the chosen nodes via .status.nominatedNode.

Configuration example (Node label):

apiVersion: v1
kind: Node
metadata:
  name: node-0
  labels:
    network.topology.nvidia.com/block: b1
    network.topology.nvidia.com/spine: s1

ClusterNetworkTopology CR example:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: ClusterNetworkTopology
metadata:
  name: default
spec:
  networkTopologySpec:
    - labelKey:
        - network.topology.nvidia.com/spine
      topologyLayer: SpineLayer
    - labelKey:
        - network.topology.nvidia.com/block
      parentTopologyLayer: SpineLayer
      topologyLayer: BlockLayer
    - parentTopologyLayer: BlockLayer
      topologyLayer: NodeTopologyLayer

PodGroup with topology annotation:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job
  namespace: default
  annotations:
    gang.scheduling.koordinator.sh/network-topology-spec: |
      {
        "gatherStrategy": [
          {"layer":"BlockLayer","strategy":"PreferGather"}
        ]
      }
spec:
  minMember: 8
  scheduleTimeoutSeconds: 300
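
Individual worker Pods join this gang through the standard PodGroup label, the same convention used in the preemption example later in this post. A minimal sketch (the image and GPU request are illustrative; 100 units of koordinator.sh/gpu are assumed to represent one full card):

apiVersion: v1
kind: Pod
metadata:
  name: training-worker-0
  namespace: default
  labels:
    # Ties this Pod to the "training-job" PodGroup above.
    pod-group.scheduling.sigs.k8s.io: training-job
spec:
  schedulerName: koord-scheduler
  containers:
  - name: worker
    image: ubuntu:18.04   # placeholder image
    resources:
      limits:
        koordinator.sh/gpu: "100"   # one full GPU per worker (assumption)
      requests:
        koordinator.sh/gpu: "100"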

Job‑Level Preemption

Traditional pod‑level preemption cannot guarantee that all members of a distributed job are scheduled together. Koordinator v1.7.0 introduces job‑level preemption that triggers only when the entire GangGroup can be placed, using status.nominatedNode to hold resources.

The preemption workflow has six steps: (1) detect an unschedulable Pod, (2) identify its GangGroup, (3) check preemption eligibility, (4) select candidate nodes by simulating victim removal, (5) apply a job-aware cost model, and (6) evict the victims and set nominatedNode.

PriorityClass examples:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
preemptionPolicy: PreemptLowerPriority
description: "Critical AI training jobs that may preempt others."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
preemptionPolicy: PreemptLowerPriority
description: "Non‑critical workloads."

High‑priority gang job example:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: hp-training-job
  namespace: default
spec:
  minMember: 2
  scheduleTimeoutSeconds: 300
---
apiVersion: v1
kind: Pod
metadata:
  name: hp-worker-1
  namespace: default
  labels:
    pod-group.scheduling.sigs.k8s.io: hp-training-job
spec:
  schedulerName: koord-scheduler
  priorityClassName: high-priority
  preemptionPolicy: PreemptLowerPriority
  containers:
  - name: worker
    image: ubuntu:18.04   # placeholder image
    resources:
      limits:
        cpu: 3
        memory: 4Gi
      requests:
        cpu: 3
        memory: 4Gi
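
For contrast, a low-priority Pod that could be selected as a preemption victim might look like the sketch below (name and resource sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: lp-worker-1
  namespace: default
spec:
  schedulerName: koord-scheduler
  priorityClassName: low-priority   # the PriorityClass defined above
  containers:
  - name: worker
    image: ubuntu:18.04   # placeholder image
    resources:
      limits:
        cpu: 3
        memory: 4Gi
      requests:
        cpu: 3
        memory: 4Gi

When hp-training-job cannot fit, the scheduler simulates removing Pods like this one; only if the entire GangGroup then fits does it evict the victims and record the chosen nodes in status.nominatedNode.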

Heterogeneous Device Scheduling

v1.7.0 extends device scheduling to Ascend NPUs and Cambricon MLUs, bringing them under the same unified device management and scheduling model.

Ascend NPUs are supported via koord-device-daemon and koordlet, with device reporting, partition-aware scheduling, and PCIe/NUMA topology placement. Example Device CR:

apiVersion: scheduling.koordinator.sh/v1alpha1
kind: Device
metadata:
  labels:
    node.koordinator.sh/gpu-model: Ascend-910B3
  annotations:
    scheduling.koordinator.sh/gpu-partitions: |
      {
        "4": [
          {"minors":[0,1,2,3],"gpuLinkType":"HCCS","allocationScore":"1"}
        ]
      }
  name: node-1
spec:
  devices:
  - health: true
    id: GPU-fd971b33-4891-fd2e-ed42-ce6adf324615
    minor: 0
    resources:
      koordinator.sh/gpu-memory: 64Gi
      koordinator.sh/gpu-memory-ratio: "100"
    topology:
      busID: 0000:3b:00.0
      nodeID: 0
      pcieID: pci0000:3a
      type: gpu
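
Workloads can then target these nodes with an ordinary Pod spec. A minimal sketch, assuming that the unified koordinator.sh/gpu resource also covers NPUs, that node.koordinator.sh/gpu-model is reported as a node label, and that 100 units represent one full card:

apiVersion: v1
kind: Pod
metadata:
  name: test-ascend-full
  namespace: default
spec:
  schedulerName: koord-scheduler
  nodeSelector:
    node.koordinator.sh/gpu-model: Ascend-910B3   # matches the Device CR label above
  containers:
  - name: demo-sleep
    image: ubuntu:18.04
    resources:
      limits:
        koordinator.sh/gpu: "100"   # one full NPU (assumption: 100 = one card)
      requests:
        koordinator.sh/gpu: "100"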

Cambricon MLUs are supported in both full-card and dynamic-smlu modes through the unified koordinator.sh/gpu-* resources. Example Pod requesting a virtual MLU:

apiVersion: v1
kind: Pod
metadata:
  name: test-cambricon-partial
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - name: demo-sleep
    image: ubuntu:18.04
    resources:
      limits:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"
      requests:
        koordinator.sh/gpu.shared: "1"
        koordinator.sh/gpu-memory: "1Gi"
        koordinator.sh/gpu-core: "10"
        cambricon.com/mlu.smlu.vcore: "10"
        cambricon.com/mlu.smlu.vmemory: "4"
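
In full-card mode, a whole MLU can be requested without the smlu virtual resources; a sketch under the same assumption that 100 units of koordinator.sh/gpu represent one full card:

apiVersion: v1
kind: Pod
metadata:
  name: test-cambricon-full
  namespace: default
spec:
  schedulerName: koord-scheduler
  containers:
  - name: demo-sleep
    image: ubuntu:18.04
    resources:
      limits:
        koordinator.sh/gpu: "100"   # one full MLU (assumption: 100 = one card)
      requests:
        koordinator.sh/gpu: "100"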

Other Enhancements

GPU sharing upgraded to HAMi v2.6.0, with support for NVIDIA driver 570 and later

Helm‑based hami‑daemon chart (0.1.0) for easier deployment

vGPUmonitor component exposing Prometheus metrics (memory usage, core utilization, etc.)

Load‑aware scheduling optimizations (PreFilter cache, dominantResourceWeight, prodUsageIncludeSys, and more); a configuration sketch follows this list

ElasticQuota Hook Plugin framework improvements (custom quota validation, pod update hooks)
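
For the load-aware items, tuning happens through the LoadAwareScheduling plugin arguments in the koord-scheduler configuration. A minimal sketch: the dominantResourceWeight and prodUsageIncludeSys fields come from this release's notes, while the surrounding structure and values are illustrative:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: koord-scheduler
    pluginConfig:
      - name: LoadAwareScheduling
        args:
          apiVersion: kubescheduler.config.koordinator.sh/v1beta2
          kind: LoadAwareSchedulingArgs
          filterExpiredNodeMetrics: true
          nodeMetricExpirationSeconds: 300
          resourceWeights:
            cpu: 1
            memory: 1
          usageThresholds:
            cpu: 65
            memory: 95
          dominantResourceWeight: 1    # new in v1.7.0; value illustrative
          prodUsageIncludeSys: true    # new in v1.7.0; value illustrative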

API Reference & Developer Guide

v1.7.0 ships a full API reference covering all custom resource definitions (Recommendation, ClusterColocationProfile, ElasticQuota, Reservation, Device, NodeMetric, …) along with client libraries for Go, Python, and other languages. A companion developer guide details component architecture, metric collection, extensibility points, plugin development, custom scheduling strategies, and webhook extensions.

Best Practice – Batch Colocation Quick Start

The new guide walks users through deploying Koordinator, configuring mixed‑workload profiles, observing resource‑utilization improvements via batch over‑commit, and troubleshooting.
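
As a flavor of what the guide covers: a best-effort batch Pod opts into colocation via the BE QoS class and the batch over-commit resources that Koordinator reclaims from prod Pods' unused allocation. A minimal sketch (sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: batch-demo
  namespace: default
  labels:
    koordinator.sh/qosClass: BE   # run as a best-effort colocated workload
spec:
  schedulerName: koord-scheduler
  priorityClassName: koord-batch
  containers:
  - name: worker
    image: ubuntu:18.04   # placeholder image
    resources:
      limits:
        kubernetes.io/batch-cpu: 1000    # milli-cores of reclaimed batch CPU
        kubernetes.io/batch-memory: 2Gi
      requests:
        kubernetes.io/batch-cpu: 1000
        kubernetes.io/batch-memory: 2Gi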

Contributors

Fourteen new developers contributed to v1.7.0: @ditingdapeng, @Rouzip, @ClanEver, @zheng-weihao, @cntigers, @LennonChin, @ZhuZhezz, @dabaooline, @bobsongplus, @yccharles, @qingyuanz, @yyrdl, @hwenwur, and @hkttty2009.

Future Plans

Upcoming work includes queue and quota management integration, task scheduling enhancements, heterogeneous scheduling strategies (GPU DRA, broader device support), Kubernetes 1.33 upgrade, pre‑allocation support, and pod‑schedule audit tools.

For more details, see the linked documentation and release notes.
