Explore Koordinator v1.1: Load‑Aware Scheduling, cgroup v2, and Descheduler Updates
Koordinator v1.1 introduces load‑aware scheduling with workload‑type awareness, percentile‑based resource aggregation, cgroup v2 support, a new LowNodeLoad descheduler plugin for load‑aware rebalancing, expanded performance collectors, ServiceMonitor integration, and detailed configuration examples, aiming to improve latency‑sensitive workloads and overall cluster resource efficiency.
Background
Koordinator provides mixed‑workload orchestration, resource scheduling, isolation and performance tuning for cloud‑native clusters. Version 1.1 adds load‑aware scheduling, load‑aware re‑scheduling, cgroup v2 support, interference detection collectors and Prometheus ServiceMonitor integration.
Load‑Aware Scheduling Enhancements
Workload‑type thresholds
Configuration fields:
prodUsageThresholds – safety thresholds for online (Prod) pods.
scoreAccordingProdUsage – optional flag to score nodes based only on Prod utilization.
During the Filter phase, if the pod being scheduled has priorityClassName: "koord-prod", the scheduler sums the utilization of all Prod pods on a node and filters out the node when the sum exceeds prodUsageThresholds. Batch pods continue to be filtered on whole‑node utilization.
During the Score phase, enabling scoreAccordingProdUsage makes the score calculation use only Prod utilization; otherwise whole‑node utilization is used.
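The Filter and Score behavior described above can be sketched in Go. This is a minimal illustration, not the plugin's implementation: the types NodeState and PodUsage, the helper names, and the nodeThreshold parameter are hypothetical, and only CPU is modeled.

```go
package main

import "fmt"

// PodUsage is a hypothetical record of one pod's measured usage on a node.
type PodUsage struct {
	Name          string
	PriorityClass string // e.g. "koord-prod" or "koord-batch"
	CPUMilli      int64  // CPU used by the pod, in millicores
}

// NodeState is a hypothetical snapshot of a node's capacity and usage.
type NodeState struct {
	AllocatableCPUMilli int64
	TotalUsedCPUMilli   int64 // whole-node usage, all pods plus system overhead
	Pods                []PodUsage
}

// utilizationPercent returns used/allocatable as an integer percentage.
func utilizationPercent(used, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	return used * 100 / allocatable
}

// prodCPUPercent sums the usage of Prod pods only.
func prodCPUPercent(n NodeState) int64 {
	var sum int64
	for _, p := range n.Pods {
		if p.PriorityClass == "koord-prod" {
			sum += p.CPUMilli
		}
	}
	return utilizationPercent(sum, n.AllocatableCPUMilli)
}

// filterNode reports whether the node remains a candidate. Prod pods are
// checked against the Prod threshold; other pods against whole-node usage.
func filterNode(n NodeState, priorityClass string, prodThreshold, nodeThreshold int64) bool {
	if priorityClass == "koord-prod" {
		return prodCPUPercent(n) <= prodThreshold
	}
	return utilizationPercent(n.TotalUsedCPUMilli, n.AllocatableCPUMilli) <= nodeThreshold
}

// scoringCPUPercent picks the utilization that feeds the score calculation.
func scoringCPUPercent(n NodeState, scoreAccordingProdUsage bool) int64 {
	if scoreAccordingProdUsage {
		return prodCPUPercent(n)
	}
	return utilizationPercent(n.TotalUsedCPUMilli, n.AllocatableCPUMilli)
}

func main() {
	node := NodeState{
		AllocatableCPUMilli: 10000,
		TotalUsedCPUMilli:   7000,
		Pods: []PodUsage{
			{Name: "web-1", PriorityClass: "koord-prod", CPUMilli: 3000},
			{Name: "web-2", PriorityClass: "koord-prod", CPUMilli: 3000},
			{Name: "job-1", PriorityClass: "koord-batch", CPUMilli: 1000},
		},
	}
	// Prod usage is 60%, above the 55% Prod threshold, so a Prod pod is rejected.
	fmt.Println(filterNode(node, "koord-prod", 55, 75)) // false
	// Whole-node usage is 70%, below the 75% node threshold, so a Batch pod passes.
	fmt.Println(filterNode(node, "koord-batch", 55, 75)) // true
	fmt.Println(scoringCPUPercent(node, true), scoringCPUPercent(node, false))
}
```

The point of the split is that a node busy with Batch work can still accept Prod pods, while a node already saturated by Prod pods is protected even if its whole‑node threshold has not been reached.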
Percentile‑based resource aggregation
Aggregated usage can be evaluated by percentile (p99, p95, p90, p50) instead of average. Example configuration:
aggregated:
  usageThresholds:
    cpu: 65
    memory: 75
  usageAggregationType: "p99"
  scoreAggregationType: "p99"
  usageAggregatedDuration: "5m"
  scoreAggregatedDuration: "5m"
If aggregated.usageThresholds and an aggregation type are set, the scheduler filters nodes using the selected percentile value; the same applies to scoring.
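To see why a percentile can filter a node that an average would let through, here is a small Go sketch. It uses the nearest‑rank percentile method, which is an assumption made for illustration; the source does not say how Koordinator computes its percentiles, and the sample values are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank percentile of samples for an integer
// p in 1..100. It copies the input so the caller's slice is left unsorted.
func percentile(samples []float64, p int) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	// Nearest rank: ceil(p/100 * n), computed with integer arithmetic.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// average returns the arithmetic mean of samples.
func average(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

// exceeds reports whether the aggregated value is above the threshold.
func exceeds(value, threshold float64) bool {
	return value > threshold
}

func main() {
	// Hypothetical CPU utilization samples (%) collected over the window.
	cpu := []float64{40, 42, 41, 43, 90, 44, 42, 41, 40, 43}
	const threshold = 65

	avg := average(cpu)
	p99 := percentile(cpu, 99)

	// The single spike to 90% barely moves the average but dominates p99.
	fmt.Println("average filtered:", exceeds(avg, threshold))
	fmt.Println("p99 filtered:", exceeds(p99, threshold))
}
```

With these samples the average stays below the 65% threshold while p99 does not, so only the percentile‑based check treats the node as unsafe.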
Load‑Aware Re‑scheduling (LowNodeLoad)
The new descheduler plugin LowNodeLoad evicts pods from nodes whose utilization exceeds a high threshold so that they can be rescheduled onto nodes below a low threshold.
highThresholds – safety watermark; nodes above this are considered hotspots.
lowThresholds – idle watermark; nodes below this are safe destinations.
Nodes are classified as Idle (< low), Normal (between), or Hotspot (> high). The plugin respects optional namespace and label filters and performs capacity checks before migration.
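The three‑way classification can be sketched in Go as follows. The multi‑resource rule used here (a node is a Hotspot if any resource exceeds its high threshold, and Idle only if every resource is below its low threshold) is an assumption for illustration; the type and function names are hypothetical and not taken from the plugin.

```go
package main

import "fmt"

// NodeClass is a hypothetical label for the three node states.
type NodeClass string

const (
	Idle    NodeClass = "Idle"
	Normal  NodeClass = "Normal"
	Hotspot NodeClass = "Hotspot"
)

// classify compares per-resource utilization (%) with the low and high
// thresholds. Resources without a configured threshold are ignored.
func classify(usage, low, high map[string]int64) NodeClass {
	allBelowLow := true
	for res, used := range usage {
		if h, ok := high[res]; ok && used > h {
			return Hotspot // one overloaded resource is enough
		}
		if l, ok := low[res]; ok && used >= l {
			allBelowLow = false
		}
	}
	if allBelowLow {
		return Idle
	}
	return Normal
}

func main() {
	low := map[string]int64{"cpu": 20, "memory": 30}
	high := map[string]int64{"cpu": 50, "memory": 60}

	nodes := map[string]map[string]int64{
		"node-a": {"cpu": 10, "memory": 20},
		"node-b": {"cpu": 35, "memory": 20},
		"node-c": {"cpu": 70, "memory": 20},
	}
	for _, name := range []string{"node-a", "node-b", "node-c"} {
		fmt.Println(name, classify(nodes[name], low, high))
	}
}
```

Under this rule only pods on Hotspot nodes are eviction candidates, and only Idle nodes count as spare capacity when the plugin checks whether a migration is worthwhile.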
cgroup v2 Support
Koordlet now works with Linux cgroup v2. A refactored ResourceExecutor abstracts file operations for both cgroup versions. Example code for reading a pod’s CPU set and updating its CFS quota:
var (
	cgroupReader = resourceexecutor.NewCgroupReader()
	executor     = resourceexecutor.NewResourceUpdateExecutor()
)

func readPodCPUSet(podMeta *statesinformer.PodMeta) (string, error) {
	podParentDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
	cpus, err := cgroupReader.ReadCPUSet(podParentDir)
	if err != nil {
		return "", err
	}
	return cpus.String(), nil
}

func updatePodCFSQuota(podMeta *statesinformer.PodMeta, cfsQuotaValue int64) error {
	podDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
	cfsQuotaStr := strconv.FormatInt(cfsQuotaValue, 10)
	updater, err := resourceexecutor.DefaultCgroupUpdaterFactory.New(system.CPUCFSQuotaName, podDir, cfsQuotaStr)
	if err != nil {
		return err
	}
	_, err = executor.Update(true, updater)
	return err
}
Performance Collectors & Interference Detection
Optional collectors gated by feature flags:
CPICollector – collects Cycles‑Per‑Instruction metrics.
PSICollector – collects Pressure Stall Information.
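For readers unfamiliar with Pressure Stall Information, the Linux kernel reports it as text lines such as "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456" in pressure files. The Go sketch below parses one such line. It illustrates the kernel's format only and is not koordlet's collector; the PSILine type and parsePSILine function are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PSILine holds one line of a kernel pressure file.
type PSILine struct {
	Kind   string  // "some" or "full"
	Avg10  float64 // % of time stalled over the last 10 seconds
	Avg60  float64 // % of time stalled over the last 60 seconds
	Avg300 float64 // % of time stalled over the last 300 seconds
	Total  uint64  // cumulative stall time in microseconds
}

// parsePSILine parses a line like
// "some avg10=1.50 avg60=0.75 avg300=0.20 total=123456".
func parsePSILine(line string) (PSILine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return PSILine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	out := PSILine{Kind: fields[0]}
	for _, kv := range fields[1:] {
		parts := strings.SplitN(kv, "=", 2)
		if len(parts) != 2 {
			return PSILine{}, fmt.Errorf("malformed field %q", kv)
		}
		var err error
		switch parts[0] {
		case "avg10":
			out.Avg10, err = strconv.ParseFloat(parts[1], 64)
		case "avg60":
			out.Avg60, err = strconv.ParseFloat(parts[1], 64)
		case "avg300":
			out.Avg300, err = strconv.ParseFloat(parts[1], 64)
		case "total":
			out.Total, err = strconv.ParseUint(parts[1], 10, 64)
		default:
			err = fmt.Errorf("unknown key %q", parts[0])
		}
		if err != nil {
			return PSILine{}, err
		}
	}
	return out, nil
}

func main() {
	psi, err := parsePSILine("some avg10=1.50 avg60=0.75 avg300=0.20 total=123456")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: avg10=%.2f%% total=%dus\n", psi.Kind, psi.Avg10, psi.Total)
}
```

A rising avg10 on the "some" line indicates that tasks have recently been stalled waiting for the resource, which is the kind of signal interference detection can build on.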
Metrics are exposed via Prometheus. Example metric:
# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://...",container_name="koordlet",cpi_field="cycles",node="node1",pod_name="koordlet-xyz",pod_namespace="koordinator-system",pod_uid="..."} 2.228e+09
ServiceMonitor Integration
Setting koordlet.enableServiceMonitor=true creates a ServiceMonitor so Prometheus can scrape the metrics.
apiVersion: v1
kind: Service
metadata:
  name: koordlet
  namespace: koordinator-system
spec:
  ports:
    - name: koordlet-service
      port: 9316
      targetPort: 9316
  selector:
    koord-app: koordlet
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: koordlet
  namespace: koordinator-system
spec:
  endpoints:
    - interval: 30s
      port: koordlet-service
      scheme: http
  selector:
    matchLabels:
      koord-app: koordlet
Configuration Examples
Scheduler ConfigMap enabling the new features:
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-scheduler-config
  namespace: koordinator-system
data:
  koord-scheduler-config: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: koord-scheduler
        plugins:
          filter:
            enabled:
              - name: LoadAwareScheduling
          score:
            enabled:
              - name: LoadAwareScheduling
                weight: 1
        pluginConfig:
          - name: LoadAwareScheduling
            args:
              prodUsageThresholds:
                cpu: 55
                memory: 65
              scoreAccordingProdUsage: true
              aggregated:
                usageThresholds:
                  cpu: 65
                  memory: 75
                usageAggregationType: "p99"
                scoreAggregationType: "p99"
Descheduler ConfigMap enabling LowNodeLoad:
apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: koordinator-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    deschedulingInterval: 60s
    profiles:
      - name: koord-descheduler
        plugins:
          balance:
            enabled:
              - name: LowNodeLoad
        pluginConfig:
          - name: LowNodeLoad
            args:
              lowThresholds:
                cpu: 20
                memory: 30
              highThresholds:
                cpu: 50
                memory: 60
              evictableNamespaces:
                exclude:
                  - "kube-system"
                  - "koordinator-system"
Demo Workflow
Deploy a stress‑test pod (example Deployment manifest).
Deploy several Prod‑type nginx pods with priorityClassName: "koord-prod" and schedulerName: koord-scheduler.
Observe node utilization with kubectl top node and verify that Prod pods avoid overloaded nodes.
Enable the LowNodeLoad plugin and watch the descheduler evict pods from hotspot nodes to idle nodes.
Future Plans
The community plans to extend mixed‑workload support to additional big‑data frameworks, enrich interference detection metrics (memory, disk I/O) and continue standardising mixed‑workload capabilities across vendors.
Alibaba Cloud Native
