What’s New in Koordinator v1.4.0? A Deep Dive into Mixed‑Workload Scheduling and Resource Optimizations
Koordinator v1.4.0 introduces mixed K8s/YARN workloads, NUMA‑aware scheduling, CPU‑normalization, enhanced ElasticQuota with tree structures and non‑preemptible pods, cold‑memory reporting, QoS for non‑containerized applications, and a suite of bug‑fixes and performance improvements for enterprise Kubernetes clusters.
Background
Koordinator is an open‑source project launched in April 2022 that provides a comprehensive solution for mixed‑workload orchestration, resource scheduling, isolation, and performance tuning in Kubernetes clusters. The community includes contributors from Alibaba, Ant Group, Intel, Xiaomi, and other enterprises.
Release v1.4.0 Overview
The v1.4.0 release adds several major capabilities:
Kubernetes + YARN mixed‑workload support
NUMA topology alignment policies
CPU normalization across heterogeneous nodes
Cold‑memory reporting
ElasticQuota enhancements (tree structure, multi‑quota trees, non‑preemptible pods, namespace annotations, overhead‑ignore feature)
Improved rescheduling protection for PodMigrationJob
QoS management for non‑containerized host applications
Various bug‑fixes and performance optimizations
1. Mixed K8s & YARN Workloads
Koordinator YARN Copilot enables Hadoop NodeManager to run inside a Kubernetes cluster, allowing offline Hadoop/YARN tasks to share nodes with containerized workloads. Key characteristics:
Built on the open‑source Hadoop YARN version without invasive changes
Unified resource priority and QoS policies between YARN and Kubernetes
Node‑level resource sharing so both Pods and YARN tasks can consume the same resources
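To illustrate node-level sharing: the reclaimed capacity that both Batch pods and YARN tasks draw from is advertised as the extended kubernetes.io/batch-cpu and kubernetes.io/batch-memory resources (the same resource names used later in the scheduler configuration). A simplified sketch — actual quantities are computed dynamically by koord-manager from node utilization, and the values below are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: node-0
status:
  allocatable:
    cpu: "32"
    memory: 128Gi
    kubernetes.io/batch-cpu: "10000"   # milli-cores reclaimed for Batch pods and YARN tasks
    kubernetes.io/batch-memory: 30Gi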
2. NUMA Topology Alignment
High‑performance workloads (e.g., machine learning) often require CPU, GPU, and RDMA resources to reside on the same NUMA node. The native Kubelet Topology Manager lacks a global view, leading to scheduling failures. Koordinator moves NUMA selection to the central scheduler and currently offers an alpha implementation for both CPU and GPU resources.
Supported policies (set via a node label node.koordinator.sh/numa-topology-policy):
None : No alignment is enforced (the default).
BestEffort : Prefers aligned allocation but admits the Pod as long as total node capacity satisfies it.
Restricted : Every requested resource must be NUMA‑aligned, though an allocation may span multiple NUMA nodes.
SingleNUMANode : Like Restricted, but all resources must come from a single NUMA node.
Example node configuration:
apiVersion: v1
kind: Node
metadata:
  labels:
    node.koordinator.sh/numa-topology-policy: "SingleNUMANode"
  name: node-0
spec:
  ...

Scoring strategies for the NodeNUMAResource plugin can be set to LeastAllocated (the default) or MostAllocated via the scheduler configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeNUMAResource
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
          numaScoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1

3. ElasticQuota Evolution
ElasticQuota now supports a hierarchical tree structure. The root quota is exposed as an ElasticQuota object named koordinator-root-quota so users can inspect it directly.
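A child quota attaches to the tree via labels. The sketch below assumes the scheduler-plugins ElasticQuota API group and the quota.scheduling.koordinator.sh/parent and .../is-parent labels used by Koordinator's ElasticQuota extension; the quota name and amounts are illustrative:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/parent: "koordinator-root-quota"
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Gi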
3.1 Multi‑QuotaTree
Clusters with heterogeneous node types (e.g., amd64 vs. arm64) can be partitioned into separate quota trees. Users define an ElasticQuotaProfile that selects nodes via nodeSelector and binds them to a specific root quota.
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: amd64
  name: amd64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: amd64
  quotaName: amd64-root-quota
---
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: arm64
  name: arm64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: arm64
  quotaName: arm64-root-quota

3.2 Non‑Preemptible Pods
Pods can declare quota.scheduling.koordinator.sh/preemptible: false to prevent them from being evicted by ElasticQuota borrowing mechanisms.
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
  ...

3.3 Additional Improvements
Namespace‑scoped ElasticQuota via quota.scheduling.koordinator.sh/namespaces annotation (JSON array).
Optimized rebuild of the quota tree on changes.
Feature gate ElasticQuotaIgnorePodOverhead to ignore pod overhead in quota calculations.
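For example, binding an ElasticQuota to specific namespaces uses the annotation described above with a JSON array value (the sketch assumes the scheduler-plugins ElasticQuota API group; quota and namespace names are illustrative):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  annotations:
    quota.scheduling.koordinator.sh/namespaces: '["namespace-1", "namespace-2"]'
spec:
  max:
    cpu: 40
    memory: 40Gi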
4. CPU Normalization
Heterogeneous CPUs exhibit different performance. Koordinator introduces a normalization layer that scales the allocatable CPU on each node so that a “CPU unit” provides comparable compute power across architectures.
Normalization coefficients are configured in the slo-controller-config ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-normalization-config: |
    {
      "enable": true,
      "ratioModel": {
        "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
          "baseRatio": 1.29,
          "hyperThreadEnabledRatio": 0.82,
          "turboEnabledRatio": 1.52,
          "hyperThreadTurboEnabledRatio": 1.0
        },
        "Intel Xeon Platinum 8369B CPU @ 2.90GHz": {
          "baseRatio": 1.69,
          "hyperThreadEnabledRatio": 1.06,
          "turboEnabledRatio": 1.91,
          "hyperThreadTurboEnabledRatio": 1.20
        }
      }
    }
  # ...

Koordinator’s webhook intercepts Kubelet updates to Node.Status.Allocatable and rewrites the CPU values according to the configured ratios.
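The effect of the rewrite is a simple multiplication. Assuming the ratio scales allocatable CPU multiplicatively, a node with 96 raw CPUs whose model matches the hyperThreadEnabledRatio of 0.82 above (hyper‑threading on, turbo off) would advertise roughly:

  96 CPUs × 0.82 ≈ 78.7 normalized CPUs

so a Pod requesting 1 CPU receives comparable compute power regardless of which node it lands on.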
5. Improved Rescheduling Protection
PodMigrationJob now uses an arbitration mechanism that sorts and filters jobs before execution.
Sorting criteria
Shortest time since migration start.
Lower Pod priority ranks higher.
Jobs from the same workload are grouped together.
Jobs belonging to workloads that already have active migrations rank higher.
Filtering criteria
Group by workload, node, namespace and apply limits.
Reject jobs if the number of running migrations for a workload exceeds a threshold.
Reject if the number of unavailable replicas exceeds the allowed maximum.
Reject if the target node already exceeds its maximum concurrent migrations.
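The thresholds behind these filters are tunable in the koord-descheduler configuration. The sketch below assumes the MigrationController plugin arguments of the Koordinator descheduler API; the field names follow that API and the values are illustrative:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
profiles:
  - name: koord-descheduler
    pluginConfig:
      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          maxMigratingPerNode: 1          # per-node concurrent migration cap
          maxMigratingPerNamespace: 10    # per-namespace concurrent migration cap
          maxMigratingPerWorkload: 1      # per-workload concurrent migration cap
          maxUnavailablePerWorkload: 2    # reject if unavailable replicas exceed this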
6. Cold‑Memory Reporting
Koordinator adds a collector plugin that reads the memory.idle_stat cgroup file exported by kernel idle‑page trackers such as kidled, kstaled, or DAMON. The plugin stores hot‑page and cold‑page metrics in NodeMetric CRs, which the planned cold‑memory reclamation feature can consume (future work).
Supported collection strategies: usageWithHotPageCache, usageWithoutPageCache, usageWithPageCache.
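Collection is switched on in the koordlet agent via a feature gate; the sketch below assumes a gate named ColdPageCollector, which is an assumption about the flag name rather than a confirmed interface:

# koordlet startup flag (assumed gate name; illustrative only)
--feature-gates=ColdPageCollector=true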
7. QoS for Non‑Containerized Host Applications
Koordinator provides resource reservation for legacy host processes, exposing two QoS classes:
LS (Latency Sensitive) : CPU QoS Group Identity and full CPUSet allocation.
BE (Best‑effort) : CPU QoS Group Identity only.
Users must place processes into the appropriate cgroup; Koordinator does not yet automate cgroup migration.
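Declaring a host application can be sketched with a node annotation; the example assumes the node.koordinator.sh/hostApplications annotation and its name/qos/cgroupPath fields as used by Koordinator's host-application support, and the application name and cgroup path are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: node-0
  annotations:
    node.koordinator.sh/hostApplications: |-
      [{
        "name": "nginx",
        "qos": "LS",
        "cgroupPath": {
          "base": "CgroupRoot",
          "relativePath": "host-latency-sensitive/nginx/"
        }
      }]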
8. Other Notable Features & Bugfixes
RequiredCPUBindPolicy : Strict CPU binding enforcement.
CI/CD : New e2e test pipeline and ARM64 images.
Batch resource calculation : Added maxUsageRequest strategy, improved short‑lived Pod handling, and refined edge‑case accounting.
Performance improvements in CPI collection, SystemResourceCollector, BE pod eviction logic, RDT sandbox task ID support, etc.
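The RequiredCPUBindPolicy mentioned above can be illustrated with Koordinator's resource-spec pod annotation; the sketch assumes the requiredCPUBindPolicy field added in this release, and the policy value and container spec are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  annotations:
    scheduling.koordinator.sh/resource-spec: '{"requiredCPUBindPolicy": "FullPCPUs"}'
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 4
        limits:
          cpu: 4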
Future Plans
Core Scheduling : Exploration of Linux Core Scheduling to enhance CPU QoS isolation (see Issue #1728).
Device Joint Allocation : Joint scheduling of GPUs and high‑performance NICs for large‑model training.
These features are targeted for the upcoming v1.5.0 milestone.
Alibaba Cloud Native