
Koordinator v1.6: Enhancing Heterogeneous GPU Scheduling for Cloud‑Native Clusters

Koordinator v1.6 introduces GPU topology‑aware scheduling, end‑to‑end GPU & RDMA joint allocation, fine‑grained GPU sharing, differentiated scoring for GPU vs CPU resources, advanced reservation and mixed‑workload support, plus numerous scheduler and rescheduler optimizations to improve resource utilization and performance in Kubernetes clusters.

Alibaba Cloud Native

Background

Large‑model AI services and high‑performance computing increasingly rely on heterogeneous accelerators such as GPUs, NPUs and RDMA devices. Efficient allocation and scheduling of these resources is a critical challenge for cloud‑native clusters.

Key Features in Koordinator v1.6

1. GPU Topology‑Aware Scheduling

Koordinator detects GPU, CPU and memory topology across NUMA nodes and can enforce placement policies (e.g., same‑NUMA, same‑PCIe, custom partitions). Pods request the policy via annotations.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling.koordinator.sh/numa-topology-spec: '{"numaTopologyPolicy":"Restricted", "singleNUMANodeExclusive":"Preferred"}'
spec:
  containers:
  - resources:
      limits:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi
      requests:
        koordinator.sh/gpu: 200
        cpu: 64
        memory: 500Gi

Narrower placement scopes, such as requiring all GPUs to sit on the same PCIe or NUMA node, are expressed with the scheduling.koordinator.sh/device-allocate-hint annotation.
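
The same-NUMA constraint can be pictured with a small Python sketch. It is a toy model rather than Koordinator's allocator: the device names, the node layout, and the "prefer the tightest fit" tie-break are assumptions made for illustration, and koordinator.sh/gpu: 200 is read as two whole GPUs (100 per device).

```python
from collections import defaultdict
from typing import Dict, List, Optional


def pick_same_numa_gpus(gpu_numa: Dict[str, int], free: List[str], count: int) -> Optional[List[str]]:
    """Return `count` free GPUs that share one NUMA node, or None if no node has enough."""
    by_node: Dict[int, List[str]] = defaultdict(list)
    for gpu in free:
        by_node[gpu_numa[gpu]].append(gpu)
    # Keep only the NUMA nodes that can satisfy the whole request.
    candidates = [gpus for gpus in by_node.values() if len(gpus) >= count]
    if not candidates:
        return None
    # Assumption: prefer the node with the fewest free GPUs to limit fragmentation.
    best = min(candidates, key=len)
    return sorted(best)[:count]


# Hypothetical 8-GPU node: gpu0-gpu3 on NUMA node 0, gpu4-gpu7 on NUMA node 1.
GPU_NUMA = {f"gpu{i}": (0 if i < 4 else 1) for i in range(8)}
FREE = ["gpu1", "gpu2", "gpu3", "gpu4", "gpu5", "gpu6", "gpu7"]

# A request of 200 is treated here as two whole devices.
print(pick_same_numa_gpus(GPU_NUMA, FREE, 200 // 100))
```

A request for five GPUs returns None in this layout, because neither NUMA node has five free devices; that is the situation in which a strict same-NUMA policy would reject the node.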

2. End‑to‑End GPUDirect RDMA (GDR) Support

Koordinator integrates GPU and RDMA devices, allowing joint allocation of both resources. This reduces inter‑node communication latency for distributed training.

apiVersion: v1
kind: Pod
metadata:
  name: pod-vf01
  namespace: kubeflow
  annotations:
    scheduling.koordinator.sh/device-joint-allocate: |-
      {"deviceTypes":["gpu","rdma"]}
    scheduling.koordinator.sh/device-allocate-hint: |-
      {"rdma": {"vfSelector": {}}}
spec:
  schedulerName: koord-scheduler
  containers:
  - name: container-vf
    resources:
      requests:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
      limits:
        koordinator.sh/gpu: 100
        koordinator.sh/rdma: 100
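
To make the idea of joint allocation concrete, the following Python sketch pairs each GPU with an RDMA NIC that hangs off the same PCIe root. It is a simplified illustration under stated assumptions: the device names, the PCIe root identifiers, and the rule "one NIC per root, shared by its GPUs" are hypothetical and do not describe Koordinator's actual data structures.

```python
from typing import Dict, List, Optional, Tuple


def joint_allocate(
    gpus: Dict[str, str], nics: Dict[str, str], want_gpus: int
) -> Optional[List[Tuple[str, str]]]:
    """Pair each chosen GPU with an RDMA NIC under the same PCIe root.

    `gpus` and `nics` map a device name to a PCIe root identifier.
    Returns None when fewer than `want_gpus` GPUs have a co-located NIC.
    """
    nic_by_root: Dict[str, str] = {}
    for nic, root in sorted(nics.items()):
        # Toy assumption: one NIC per root is enough, since GPUs can share it.
        nic_by_root.setdefault(root, nic)
    pairs = [
        (gpu, nic_by_root[root])
        for gpu, root in sorted(gpus.items())
        if root in nic_by_root
    ]
    return pairs[:want_gpus] if len(pairs) >= want_gpus else None


# Hypothetical node: gpu3 sits under a PCIe root that has no RDMA NIC.
GPUS = {"gpu0": "pcie0", "gpu1": "pcie0", "gpu2": "pcie1", "gpu3": "pcie2"}
NICS = {"mlx5_0": "pcie0", "mlx5_1": "pcie1"}

print(joint_allocate(GPUS, NICS, 2))
```

Allocating the two device types independently could hand out gpu3 together with a NIC on another PCIe root, which is the cross-root path that joint allocation is meant to avoid.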

3. GPU Sharing with Strong Isolation (HAMi‑Core)

Through the CNCF‑sandbox HAMi project, Koordinator enables multiple pods to share a single GPU while guaranteeing isolation via partition policies. Deploy the HAMi‑Core library with a DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: hami-core-distribute
  namespace: default
spec:
  selector:
    matchLabels:
      koord-app: hami-core-distribute
  template:
    metadata:
      labels:
        koord-app: hami-core-distribute
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-type
                operator: In
                values:
                - "gpu"
      containers:
      - name: hami
        image: docker.m.daocloud.io/projecthami/hami:v2.4.0
        command: ["/bin/sh","-c","cp -f /k8s-vgpu/lib/nvidia/libvgpu.so /usr/local/vgpu && sleep 3600000"]
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
        volumeMounts:
        - name: vgpu-hook
          mountPath: /usr/local/vgpu
        - name: vgpu-lock
          mountPath: /tmp/vgpulock
      tolerations:
      - operator: Exists
      volumes:
      - name: vgpu-hook
        hostPath:
          path: /usr/local/vgpu
          type: DirectoryOrCreate
      - name: vgpu-lock
        hostPath:
          path: /tmp/vgpulock
          type: DirectoryOrCreate

Pods request a share of a GPU through extended resources such as koordinator.sh/gpu-shared, koordinator.sh/gpu-core, and koordinator.sh/gpu-memory-ratio, which are set in the container's resource requests and limits.
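
The bookkeeping behind GPU sharing can be sketched in a few lines of Python. This is an illustrative model only: it assumes that gpu-core and gpu-memory-ratio are both percentages of a single device and that a pod is admitted only while both totals stay at or below 100. The class and pod names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SharedGPU:
    """Tracks the percentage of compute and memory already handed out on one GPU."""

    core_used: int = 0      # percent of GPU cores allocated
    memory_used: int = 0    # percent of GPU memory allocated
    pods: List[str] = field(default_factory=list)

    def try_allocate(self, pod: str, core: int, memory_ratio: int) -> bool:
        """Admit the pod only if both dimensions stay within 100 percent."""
        if self.core_used + core > 100 or self.memory_used + memory_ratio > 100:
            return False
        self.core_used += core
        self.memory_used += memory_ratio
        self.pods.append(pod)
        return True


gpu = SharedGPU()
results = [
    gpu.try_allocate("pod-a", core=50, memory_ratio=50),
    gpu.try_allocate("pod-b", core=30, memory_ratio=40),
    gpu.try_allocate("pod-c", core=30, memory_ratio=10),  # would push cores to 110 percent
]
print(results, gpu.pods)
```

The scheduler-side accounting only decides which pods may land on a device; enforcing the limits at runtime is the part that the HAMi-Core library provides.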

4. Differentiated GPU Scheduling Strategies

The new NodeResourcesFitPlus plugin allows distinct scoring strategies per resource type (e.g., MostAllocated for GPUs, LeastAllocated for CPU/Memory) to reduce GPU fragmentation when GPU‑heavy and CPU‑heavy workloads coexist.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFitPlus
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: NodeResourcesFitPlusArgs
      resources:
        nvidia.com/gpu:
          type: MostAllocated
          weight: 2
        cpu:
          type: LeastAllocated
          weight: 1
        memory:
          type: LeastAllocated
          weight: 1
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
  schedulerName: koord-scheduler
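
The effect of mixing strategies can be worked through numerically. The Python sketch below is modeled on the MostAllocated and LeastAllocated formulas of the upstream Kubernetes NodeResourcesFit plugin and combines them as a weighted average; whether NodeResourcesFitPlus uses exactly this arithmetic, and whether it skips resources the pod does not request, are assumptions. The node figures are invented for the example.

```python
from typing import Dict, Tuple

MAX_SCORE = 100


def resource_score(strategy: str, allocated: float, requested: float, capacity: float) -> float:
    """Score one resource on a node after hypothetically placing the pod."""
    if capacity <= 0:
        return 0.0
    used = min(allocated + requested, capacity)
    if strategy == "MostAllocated":
        return used / capacity * MAX_SCORE
    if strategy == "LeastAllocated":
        return (capacity - used) / capacity * MAX_SCORE
    raise ValueError(f"unknown strategy: {strategy}")


def node_score(
    policy: Dict[str, Tuple[str, int]],
    node: Dict[str, Tuple[float, float]],
    pod: Dict[str, float],
) -> float:
    """Weighted average over the resources the pod actually requests (assumption)."""
    total, weights = 0.0, 0
    for name, (strategy, weight) in policy.items():
        requested = pod.get(name, 0)
        if requested == 0:
            continue
        allocated, capacity = node[name]
        total += weight * resource_score(strategy, allocated, requested, capacity)
        weights += weight
    return total / weights if weights else 0.0


# Mirrors the weights in the configuration above.
POLICY = {
    "nvidia.com/gpu": ("MostAllocated", 2),
    "cpu": ("LeastAllocated", 1),
    "memory": ("LeastAllocated", 1),
}
# Each entry is (allocated, capacity); both nodes differ only in GPU usage.
NODE_BUSY_GPU = {"nvidia.com/gpu": (6, 8), "cpu": (32, 64), "memory": (128, 512)}
NODE_IDLE_GPU = {"nvidia.com/gpu": (0, 8), "cpu": (32, 64), "memory": (128, 512)}
POD = {"nvidia.com/gpu": 2, "cpu": 8, "memory": 64}

print(node_score(POLICY, NODE_BUSY_GPU, POD), node_score(POLICY, NODE_IDLE_GPU, POD))
```

Under these numbers the node whose GPUs are already mostly allocated wins, so the two-GPU pod fills the remaining gap instead of breaking up an empty eight-GPU node.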

The ScarceResourceAvoidance plugin steers pods that do not request GPUs away from GPU nodes whenever other nodes can host them, leaving GPU nodes free for pods that do request GPUs.

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: ScarceResourceAvoidance
    args:
      apiVersion: kubescheduler.config.k8s.io/v1
      kind: ScarceResourceAvoidanceArgs
      resources:
      - nvidia.com/gpu
  plugins:
    score:
      enabled:
      - name: NodeResourcesFitPlus
        weight: 2
      - name: ScarceResourceAvoidance
        weight: 2
      disabled:
      - name: "*"
  schedulerName: koord-scheduler
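
One way to express this preference as a score is sketched below in Python. The formula is an assumption chosen for illustration, not a statement of the plugin's implementation: a node loses points for every configured scarce resource it offers that the pod does not request.

```python
from typing import Iterable, Set

MAX_SCORE = 100


def avoidance_score(node_resources: Set[str], pod_requests: Set[str], scarce: Iterable[str]) -> float:
    """Lower the score for each scarce resource the node offers but the pod does not request."""
    if not node_resources:
        return 0.0
    unused_scarce = (node_resources - pod_requests) & set(scarce)
    return (len(node_resources) - len(unused_scarce)) * MAX_SCORE / len(node_resources)


SCARCE = ["nvidia.com/gpu"]                      # matches the configuration above
GPU_NODE = {"cpu", "memory", "nvidia.com/gpu"}   # hypothetical GPU node
CPU_NODE = {"cpu", "memory"}                     # hypothetical CPU-only node
CPU_POD = {"cpu", "memory"}
GPU_POD = {"cpu", "memory", "nvidia.com/gpu"}

print(avoidance_score(GPU_NODE, CPU_POD, SCARCE), avoidance_score(CPU_NODE, CPU_POD, SCARCE))
```

Because this is a score rather than a filter, a CPU-only pod can still land on a GPU node when no other node fits; it is merely ranked lower there.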

5. Fine‑Grained Resource Reservation

Reservation APIs now support exact‑match reservations, optional ignoring of reservations, and affinity/toleration rules.

# Exact‑match reservation
scheduling.koordinator.sh/exact-match-reservation: '{"resourceNames":["cpu","memory","nvidia.com/gpu"]}'

# Ignore reservation
scheduling.koordinator.sh/reservation-ignored: "true"

# Reservation affinity by name
scheduling.koordinator.sh/reservation-affinity: '{"name":"test-reservation"}'

# Reservation affinity with tolerations
scheduling.koordinator.sh/reservation-affinity: '{"tolerations":[{"key":"test-taint-key","operator":"Equal","value":"test-taint-value","effect":"NoSchedule"}]}'
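
Because these annotation values are JSON embedded in strings, generating them programmatically avoids hand-written syntax mistakes. The Python sketch below serializes the exact-match spec and adds a toy check of what "exact match" could mean; the helper names are hypothetical, resourceNames is written as a JSON list, and the equality rule is an assumption about the semantics rather than the scheduler's code.

```python
import json
from typing import Dict, List

EXACT_MATCH_KEY = "scheduling.koordinator.sh/exact-match-reservation"


def exact_match_annotation(resource_names: List[str]) -> Dict[str, str]:
    """Serialize the exact-match spec so the annotation value is valid JSON."""
    spec = {"resourceNames": resource_names}
    return {EXACT_MATCH_KEY: json.dumps(spec, separators=(",", ":"))}


def exactly_matches(pod_requests: Dict[str, int], reserved: Dict[str, int], resource_names: List[str]) -> bool:
    """Toy check: the pod's request must equal the reserved amount for every named resource."""
    return all(pod_requests.get(name, 0) == reserved.get(name, 0) for name in resource_names)


NAMES = ["cpu", "memory", "nvidia.com/gpu"]
annotations = exact_match_annotation(NAMES)
print(annotations[EXACT_MATCH_KEY])

# A pod asking for fewer CPUs than the reservation holds is not an exact match.
reserved = {"cpu": 4, "memory": 16, "nvidia.com/gpu": 1}
print(exactly_matches({"cpu": 2, "memory": 16, "nvidia.com/gpu": 1}, reserved, NAMES))
```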

Reservation preemption is enabled via the Reservation plugin:

apiVersion: kubescheduler.config.k8s.io/v1beta3
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: Reservation
    args:
      apiVersion: kubescheduler.config.k8s.io/v1beta3
      kind: ReservationArgs
      enablePreemption: true
  plugins:
    postFilter:
      disabled:
      - name: DefaultPreemption
      enabled:
      - name: Reservation

6. Mixed‑Workload (Mid‑Tier) Enhancements

Improvements include over‑commit handling, node‑profile‑based calculations, and QoS extensions such as Resctrl (LLC & memory‑bandwidth isolation) and per‑pod CPU QoS.

# Example Resctrl annotation
node.koordinator.sh/resctrl: '{"llc":{"schemata":{"range":[0,30]}},"mb":{"schemata":{"percent":20}}}'

# Example CPU QoS annotation
koordinator.sh/cpuQOS: '{"groupIdentity":1}'
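
For readers unfamiliar with resctrl, the LLC range in the annotation has to end up as a contiguous bitmask of cache ways in the kernel's L3 schemata. How koordlet performs that translation is not described here, so the Python sketch below only shows one plausible mapping; the rounding rule and the 20-way cache are assumptions.

```python
import math
from typing import Sequence


def llc_range_to_mask(range_percent: Sequence[int], total_ways: int) -> int:
    """Map a [start, end] percent range onto a contiguous bitmask of cache ways."""
    start_pct, end_pct = range_percent
    if not 0 <= start_pct < end_pct <= 100:
        raise ValueError("range must satisfy 0 <= start < end <= 100")
    start_way = math.floor(start_pct * total_ways / 100)
    # Always keep at least one way so the mask is never empty.
    end_way = max(start_way + 1, math.ceil(end_pct * total_ways / 100))
    return ((1 << (end_way - start_way)) - 1) << start_way


# Hypothetical CPU with a 20-way L3 cache; the range matches the annotation above.
mask = llc_range_to_mask([0, 30], total_ways=20)
print(f"L3 mask: {mask:x}")  # the six lowest ways
```

The memory-bandwidth part of the annotation needs no such conversion in this model, since it is already a percentage.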

7. Scheduler & Rescheduler Optimizations

Performance improvements:

PodGroup size checks moved earlier (PreEnqueue).

Reservation resource return deferred to AfterPreFilter.

Reduced CycleState memory for NodeNUMAResource, DeviceShare, and Reservation plugins.

Latency metrics added for new extension points (BeforePreFilter, AfterPreFilter).
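
The benefit of checking PodGroup size at PreEnqueue is that pods of an incomplete group never reach the scheduling cycle at all. The Python sketch below is a minimal model of such a gate; the class, the group name, and the rule "admit once the group has at least min_member known pods" are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


class GangGate:
    """Admits a pod to the active queue only when its group is large enough."""

    def __init__(self, min_member: Dict[str, int]) -> None:
        self.min_member = min_member
        self.members: Dict[str, Set[str]] = defaultdict(set)

    def pre_enqueue(self, group: str, pod: str) -> bool:
        """Record the pod and report whether the group has reached its minimum size."""
        self.members[group].add(pod)
        return len(self.members[group]) >= self.min_member[group]


gate = GangGate({"training-job": 3})  # hypothetical group needing three workers
decisions = [
    gate.pre_enqueue("training-job", "worker-0"),
    gate.pre_enqueue("training-job", "worker-1"),
    gate.pre_enqueue("training-job", "worker-2"),
    gate.pre_enqueue("training-job", "worker-0"),  # retried after the group is complete
]
print(decisions)
```

Rejecting the first two pods this early spares the filter and score phases from work that could not have produced a placement anyway.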

Rescheduling gains:

LowNodeLoad scoring now supports ProdHighThresholds/ProdLowThresholds and smarter eviction ordering.

MigrationController gains namespace‑level eviction rate limiting, ObjectLimiter integration, EvictAllBarePods option, and MaxMigratingGlobally limit.

Global MaxNoOfPodsToEvictTotal limits total evictions per cycle.
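
The interplay between a per-namespace rate limit and a global per-cycle cap can be sketched as follows. This Python model is not the MigrationController's implementation: the sliding-window approach, the class, and its parameter names are assumptions, with max_total playing the role that MaxNoOfPodsToEvictTotal plays in the text.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class EvictionLimiter:
    """Allows at most `per_namespace` evictions per namespace within
    `window_seconds`, and at most `max_total` evictions per cycle."""

    def __init__(self, per_namespace: int, window_seconds: float, max_total: int) -> None:
        self.per_namespace = per_namespace
        self.window_seconds = window_seconds
        self.max_total = max_total
        self.total_this_cycle = 0
        self.history: Dict[str, Deque[float]] = defaultdict(deque)

    def start_cycle(self) -> None:
        """Reset the global counter at the beginning of each descheduling cycle."""
        self.total_this_cycle = 0

    def allow(self, namespace: str, now: float) -> bool:
        recent = self.history[namespace]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()  # drop evictions that fell out of the window
        if self.total_this_cycle >= self.max_total or len(recent) >= self.per_namespace:
            return False
        recent.append(now)
        self.total_this_cycle += 1
        return True


limiter = EvictionLimiter(per_namespace=2, window_seconds=60, max_total=3)
limiter.start_cycle()
decisions = [
    limiter.allow("team-a", now=0),
    limiter.allow("team-a", now=1),
    limiter.allow("team-a", now=2),  # namespace limit reached
    limiter.allow("team-b", now=3),
    limiter.allow("team-b", now=4),  # global cap reached
]
print(decisions)
```

Timestamps are passed in explicitly so the behavior is deterministic; a real controller would read them from a clock.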

Future Plans

Upcoming proposals focus on:

Fine‑grained device scheduling for Huawei NPU (https://github.com/koordinator-sh/koordinator/issues/2335).

A dedicated rescheduling plugin to address resource imbalance (https://github.com/koordinator-sh/koordinator/issues/2332).

Extending reservations to bind already‑assigned pods (https://github.com/koordinator-sh/koordinator/issues/2150).

Long‑term goals include an end‑to‑end evolvable device‑management framework (https://github.com/koordinator-sh/koordinator/issues/2181).

References

NUMA Topology Scheduling: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20230415-numa-topology-scheduling.md

Device Allocate Hint API: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20230803-device-allocate-hint-apis.md

GPU Partition APIs: https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/scheduling/20241008-gpu-partition-api.md

Best‑practice guide for GPU & RDMA joint allocation: https://koordinator.sh/docs/next/best-practices/gpu-and-rdma-joint-allocation/
