What’s New in Koordinator v1.4.0? A Deep Dive into Mixed‑Workload Scheduling and Resource Optimizations
Koordinator v1.4.0 introduces mixed K8s/YARN workloads, NUMA‑aware scheduling, CPU‑normalization, enhanced ElasticQuota with tree structures and non‑preemptible pods, cold‑memory reporting, QoS for non‑containerized applications, and a suite of bug‑fixes and performance improvements for enterprise Kubernetes clusters.
Background
Koordinator is an open‑source project launched in April 2022 that provides a comprehensive solution for mixed‑workload orchestration, resource scheduling, isolation, and performance tuning in Kubernetes clusters. The community includes contributors from Alibaba, Ant Group, Intel, Xiaomi, and other enterprises.
Release v1.4.0 Overview
The v1.4.0 release adds several major capabilities:
Kubernetes + YARN mixed‑workload support
NUMA topology alignment policies
CPU normalization across heterogeneous nodes
Cold‑memory reporting
ElasticQuota enhancements (tree structure, multi‑quota trees, non‑preemptible pods, namespace annotations, overhead‑ignore feature)
Improved rescheduling protection for PodMigrationJob
QoS management for non‑containerized host applications
Various bug‑fixes and performance optimizations
1. Mixed K8s & YARN Workloads
Koordinator YARN Copilot enables Hadoop NodeManager to run inside a Kubernetes cluster, allowing offline Hadoop/YARN tasks to share nodes with containerized workloads. Key characteristics:
Built on the open‑source Hadoop YARN version without invasive changes
Unified resource priority and QoS policies between YARN and Kubernetes
Node‑level resource sharing so both Pods and YARN tasks can consume the same resources
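To illustrate node-level sharing: the reclaimed capacity that both Batch pods and YARN tasks draw from is advertised as the extended kubernetes.io/batch-cpu and kubernetes.io/batch-memory resources (the same resource names used later in the scheduler configuration). A simplified sketch — actual quantities are computed dynamically by koord-manager from node utilization, and the values below are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: node-0
status:
  allocatable:
    cpu: "32"
    memory: 128Gi
    kubernetes.io/batch-cpu: "10000"   # milli-cores reclaimed for Batch pods and YARN tasks
    kubernetes.io/batch-memory: 30Gi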
2. NUMA Topology Alignment
High‑performance workloads (e.g., machine learning) often require CPU, GPU, and RDMA resources to reside on the same NUMA node. The native Kubelet Topology Manager lacks a global view, leading to scheduling failures. Koordinator moves NUMA selection to the central scheduler and currently offers an alpha implementation for both CPU and GPU resources.
Supported policies (set via a node label node.koordinator.sh/numa-topology-policy):
None : No alignment is enforced (the default).
BestEffort : Prefers aligned allocation but admits the Pod as long as total node capacity satisfies it.
Restricted : Every requested resource must be NUMA‑aligned, though an allocation may span multiple NUMA nodes.
SingleNUMANode : Like Restricted, but all resources must come from a single NUMA node.
Example node configuration:
apiVersion: v1
kind: Node
metadata:
  labels:
    node.koordinator.sh/numa-topology-policy: "SingleNUMANode"
  name: node-0
spec:
  ...

Scoring strategies for the NodeNUMAResource plugin can be set to LeastAllocated (the default) or MostAllocated via the scheduler configuration:
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - pluginConfig:
      - name: NodeNUMAResource
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1
          numaScoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
              - name: "kubernetes.io/batch-cpu"
                weight: 1
              - name: "kubernetes.io/batch-memory"
                weight: 1

3. ElasticQuota Evolution
ElasticQuota now supports a hierarchical tree structure. The root quota is exposed as an ElasticQuota object named koordinator-root-quota so users can inspect it directly.
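A child quota attaches to the tree via labels. The sketch below assumes the scheduler-plugins ElasticQuota API group and the quota.scheduling.koordinator.sh/parent and .../is-parent labels used by Koordinator's ElasticQuota extension; the quota name and amounts are illustrative:

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/parent: "koordinator-root-quota"
    quota.scheduling.koordinator.sh/is-parent: "false"
spec:
  max:
    cpu: 40
    memory: 40Gi
  min:
    cpu: 10
    memory: 20Gi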
3.1 Multi‑QuotaTree
Clusters with heterogeneous node types (e.g., amd64 vs. arm64) can be partitioned into separate quota trees. Users define an ElasticQuotaProfile that selects nodes via nodeSelector and binds them to a specific root quota.
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: amd64
  name: amd64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: amd64
  quotaName: amd64-root-quota
---
apiVersion: quota.koordinator.sh/v1alpha1
kind: ElasticQuotaProfile
metadata:
  labels:
    kubernetes.io/arch: arm64
  name: arm64-profile
  namespace: kube-system
spec:
  nodeSelector:
    matchLabels:
      kubernetes.io/arch: arm64
  quotaName: arm64-root-quota

3.2 Non‑Preemptible Pods
Pods can declare quota.scheduling.koordinator.sh/preemptible: false to prevent them from being evicted by ElasticQuota borrowing mechanisms.
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  namespace: default
  labels:
    quota.scheduling.koordinator.sh/name: "quota-example"
    quota.scheduling.koordinator.sh/preemptible: "false"
spec:
  ...

3.3 Additional Improvements
Namespace‑scoped ElasticQuota via quota.scheduling.koordinator.sh/namespaces annotation (JSON array).
Optimized rebuild of the quota tree on changes.
Feature gate ElasticQuotaIgnorePodOverhead to ignore pod overhead in quota calculations.
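For example, binding an ElasticQuota to specific namespaces uses the annotation described above with a JSON array value (the sketch assumes the scheduler-plugins ElasticQuota API group; quota and namespace names are illustrative):

apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: quota-example
  namespace: default
  annotations:
    quota.scheduling.koordinator.sh/namespaces: '["namespace-1", "namespace-2"]'
spec:
  max:
    cpu: 40
    memory: 40Gi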
4. CPU Normalization
Heterogeneous CPUs exhibit different performance. Koordinator introduces a normalization layer that scales the allocatable CPU on each node so that a “CPU unit” provides comparable compute power across architectures.
Normalization coefficients are configured in the slo-controller-config ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-controller-config
  namespace: koordinator-system
data:
  cpu-normalization-config: |
    {
      "enable": true,
      "ratioModel": {
        "Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz": {
          "baseRatio": 1.29,
          "hyperThreadEnabledRatio": 0.82,
          "turboEnabledRatio": 1.52,
          "hyperThreadTurboEnabledRatio": 1.0
        },
        "Intel Xeon Platinum 8369B CPU @ 2.90GHz": {
          "baseRatio": 1.69,
          "hyperThreadEnabledRatio": 1.06,
          "turboEnabledRatio": 1.91,
          "hyperThreadTurboEnabledRatio": 1.20
        }
      }
    }
  # ...

Koordinator’s webhook intercepts Kubelet updates to Node.Status.Allocatable and rewrites the CPU values according to the configured ratios.
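The effect of the rewrite is a simple multiplication. Assuming the ratio scales allocatable CPU multiplicatively, a node with 96 raw CPUs whose model matches the hyperThreadEnabledRatio of 0.82 above (hyper‑threading on, turbo off) would advertise roughly:

  96 CPUs × 0.82 ≈ 78.7 normalized CPUs

so a Pod requesting 1 CPU receives comparable compute power regardless of which node it lands on.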
5. Improved Rescheduling Protection
PodMigrationJob now uses an arbitration mechanism that sorts and filters jobs before execution.
Sorting criteria
Shortest time since migration start.
Lower Pod priority ranks higher.
Jobs from the same workload are grouped together.
Jobs belonging to workloads that already have active migrations rank higher.
Filtering criteria
Group by workload, node, namespace and apply limits.
Reject jobs if the number of running migrations for a workload exceeds a threshold.
Reject if the number of unavailable replicas exceeds the allowed maximum.
Reject if the target node already exceeds its maximum concurrent migrations.
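The thresholds behind these filters are tunable in the koord-descheduler configuration. The sketch below assumes the MigrationController plugin arguments of the Koordinator descheduler API; the field names follow that API and the values are illustrative:

apiVersion: descheduler/v1alpha2
kind: DeschedulerConfiguration
profiles:
  - name: koord-descheduler
    pluginConfig:
      - name: MigrationController
        args:
          apiVersion: descheduler/v1alpha2
          kind: MigrationControllerArgs
          maxMigratingPerNode: 1          # per-node concurrent migration cap
          maxMigratingPerNamespace: 10    # per-namespace concurrent migration cap
          maxMigratingPerWorkload: 1      # per-workload concurrent migration cap
          maxUnavailablePerWorkload: 2    # reject if unavailable replicas exceed this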
6. Cold‑Memory Reporting
Koordinator adds a collector plugin that reads the memory.idle_stat cgroup file exported by kernel idle‑page trackers such as kidled, kstaled, or DAMON. The plugin stores hot‑page and cold‑page metrics in NodeMetric CRs, which the planned cold‑memory reclamation feature can consume (future work).
Supported collection strategies: usageWithHotPageCache, usageWithoutPageCache, usageWithPageCache.
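Collection is switched on in the koordlet agent via a feature gate; the sketch below assumes a gate named ColdPageCollector, which is an assumption about the flag name rather than a confirmed interface:

# koordlet startup flag (assumed gate name; illustrative only)
--feature-gates=ColdPageCollector=true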
7. QoS for Non‑Containerized Host Applications
Koordinator provides resource reservation for legacy host processes, exposing two QoS classes:
LS (Latency Sensitive) : CPU QoS Group Identity and full CPUSet allocation.
BE (Best‑effort) : CPU QoS Group Identity only.
Users must place processes into the appropriate cgroup; Koordinator does not yet automate cgroup migration.
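Declaring a host application can be sketched with a node annotation; the example assumes the node.koordinator.sh/hostApplications annotation and its name/qos/cgroupPath fields as used by Koordinator's host-application support, and the application name and cgroup path are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: node-0
  annotations:
    node.koordinator.sh/hostApplications: |-
      [{
        "name": "nginx",
        "qos": "LS",
        "cgroupPath": {
          "base": "CgroupRoot",
          "relativePath": "host-latency-sensitive/nginx/"
        }
      }]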
8. Other Notable Features & Bugfixes
RequiredCPUBindPolicy : Strict CPU binding enforcement.
CI/CD : New e2e test pipeline and ARM64 images.
Batch resource calculation : Added maxUsageRequest strategy, improved short‑lived Pod handling, and refined edge‑case accounting.
Performance improvements in CPI collection, SystemResourceCollector, BE pod eviction logic, RDT sandbox task ID support, etc.
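The RequiredCPUBindPolicy mentioned above can be illustrated with Koordinator's resource-spec pod annotation; the sketch assumes the requiredCPUBindPolicy field added in this release, and the policy value and container spec are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  annotations:
    scheduling.koordinator.sh/resource-spec: '{"requiredCPUBindPolicy": "FullPCPUs"}'
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: 4
        limits:
          cpu: 4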
Future Plans
Core Scheduling : Exploration of Linux Core Scheduling to enhance CPU QoS isolation (see Issue #1728).
Device Joint Allocation : Joint scheduling of GPUs and high‑performance NICs for large‑model training.
These features are targeted for the upcoming v1.5.0 milestone.
Alibaba Cloud Native