
Explore Koordinator v1.1: Load‑Aware Scheduling, cgroup v2, and Descheduler Updates

Koordinator v1.1 extends load‑aware scheduling with workload‑type awareness and percentile‑based resource aggregation. It also adds cgroup v2 support, a new LowNodeLoad descheduler plugin for load‑aware rebalancing, expanded performance collectors, and ServiceMonitor integration. This article walks through each feature with configuration examples. Together, the changes aim to improve the performance of latency‑sensitive workloads and overall cluster resource efficiency.

Alibaba Cloud Native

Background

Koordinator provides mixed‑workload orchestration, resource scheduling, isolation and performance tuning for cloud‑native clusters. Version 1.1 enhances load‑aware scheduling and adds load‑aware re‑scheduling, cgroup v2 support, interference detection collectors and Prometheus ServiceMonitor integration.

Load‑Aware Scheduling Enhancements

Workload‑type thresholds

Configuration fields:

prodUsageThresholds – safety thresholds for online (Prod) pods.

scoreAccordingProdUsage – optional flag to score nodes based only on Prod utilization.

During the Filter phase, if a pod has priorityClassName: "koord-prod", the scheduler sums the utilization of all Prod pods on a node and filters out the node when that sum exceeds prodUsageThresholds. Batch pods continue to be checked against whole‑node utilization.

During the Score phase, enabling scoreAccordingProdUsage makes the score calculation use only Prod utilization; otherwise whole‑node utilization is used.
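The two phases described above can be sketched in a few lines of Go. This is only an illustration of the decision logic as the text describes it, not the plugin's actual code: the types Usage and NodeLoad and the functions exceeds, filterNode and scoringUsage are hypothetical names introduced here, and utilization is assumed to be expressed as a percentage per resource.

```go
package main

import "fmt"

// Usage holds utilization percentages (0-100) keyed by resource name.
// Hypothetical type, introduced only for this sketch.
type Usage map[string]float64

// NodeLoad is a hypothetical snapshot of one node's load.
type NodeLoad struct {
	Whole Usage // utilization of every pod on the node
	Prod  Usage // summed utilization of Prod pods only
}

// exceeds reports whether any resource with a positive threshold is above it.
func exceeds(u, thresholds Usage) bool {
	for res, limit := range thresholds {
		if limit > 0 && u[res] > limit {
			return true
		}
	}
	return false
}

// filterNode returns true when the node remains a candidate for the pod.
// Prod pods are checked against Prod utilization, other pods against
// whole-node utilization.
func filterNode(isProd bool, n NodeLoad, prodThresholds, usageThresholds Usage) bool {
	if isProd {
		return !exceeds(n.Prod, prodThresholds)
	}
	return !exceeds(n.Whole, usageThresholds)
}

// scoringUsage picks the utilization that the Score phase evaluates.
func scoringUsage(scoreAccordingProdUsage bool, n NodeLoad) Usage {
	if scoreAccordingProdUsage {
		return n.Prod
	}
	return n.Whole
}

func main() {
	node := NodeLoad{
		Whole: Usage{"cpu": 60, "memory": 50},
		Prod:  Usage{"cpu": 58, "memory": 40},
	}
	prodThresholds := Usage{"cpu": 55, "memory": 65}
	usageThresholds := Usage{"cpu": 65, "memory": 75}

	fmt.Println(filterNode(true, node, prodThresholds, usageThresholds))  // false: Prod CPU 58 > 55
	fmt.Println(filterNode(false, node, prodThresholds, usageThresholds)) // true: whole-node CPU 60 <= 65
	fmt.Println(scoringUsage(true, node)["cpu"])                          // 58
}
```

In this example the same node is rejected for a Prod pod but accepted for a Batch pod, because the two pod types are measured against different utilization figures and different thresholds.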

Percentile‑based resource aggregation

Aggregated usage can be evaluated by percentile (p99, p95, p90, p50) instead of average. Example configuration:

aggregated:
  usageThresholds:
    cpu: 65
    memory: 75
  usageAggregationType: "p99"
  scoreAggregationType: "p99"
  usageAggregatedDuration: "5m"
  scoreAggregatedDuration: "5m"

If aggregated.usageThresholds and an aggregation type are set, the scheduler filters nodes using the selected percentile value; the same applies to scoring.
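To see why a percentile can be a safer signal than the average, consider the following sketch. It uses the nearest‑rank method to compute a percentile over a window of CPU samples. The method and the function name percentile are choices made for this illustration, and the sample values are invented; how Koordinator aggregates the samples internally is not described here.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 < p <= 100) of samples.
// It returns 0 for an empty slice.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Invented CPU utilization samples (percent) with one short burst.
	cpu := []float64{30, 32, 31, 35, 33, 34, 36, 30, 31, 90}

	sum := 0.0
	for _, v := range cpu {
		sum += v
	}
	avg := sum / float64(len(cpu))
	p99 := percentile(cpu, 99)
	threshold := 65.0 // matches aggregated.usageThresholds.cpu in the example above

	fmt.Println(avg <= threshold) // true: the average hides the burst
	fmt.Println(p99 <= threshold) // false: p99 exposes it, so the node is filtered
}
```

With these samples the average stays well under the threshold, while the p99 value reflects the burst and would cause the node to be filtered out.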

Load‑Aware Re‑scheduling (LowNodeLoad)

The new descheduler plugin LowNodeLoad evicts pods from nodes whose utilization exceeds a high threshold so that they can be rescheduled onto nodes below a low threshold.

highThresholds – safety water‑mark; nodes above this are considered hotspots.

lowThresholds – idle water‑mark; nodes below this are safe destinations.

Nodes are classified as Idle (< low), Normal (between), or Hotspot (> high). The plugin respects optional namespace and label filters and performs capacity checks before migration.
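The classification can be sketched as follows. The text does not spell out how multiple resources are combined, so this sketch assumes that a node is a Hotspot when any resource exceeds its high threshold and Idle only when every resource is below its low threshold. The type NodeClass and the function classify are hypothetical names used for illustration.

```go
package main

import "fmt"

// NodeClass is a hypothetical label for the three node states.
type NodeClass string

const (
	Idle    NodeClass = "Idle"
	Normal  NodeClass = "Normal"
	Hotspot NodeClass = "Hotspot"
)

// classify compares utilization percentages with the low and high thresholds.
// Assumption: any resource above its high threshold makes the node a Hotspot;
// the node is Idle only if all resources are below their low thresholds.
func classify(usage, low, high map[string]float64) NodeClass {
	for res, h := range high {
		if usage[res] > h {
			return Hotspot
		}
	}
	for res, l := range low {
		if usage[res] >= l {
			return Normal
		}
	}
	return Idle
}

func main() {
	low := map[string]float64{"cpu": 20, "memory": 30}
	high := map[string]float64{"cpu": 50, "memory": 60}

	fmt.Println(classify(map[string]float64{"cpu": 10, "memory": 20}, low, high)) // Idle
	fmt.Println(classify(map[string]float64{"cpu": 35, "memory": 40}, low, high)) // Normal
	fmt.Println(classify(map[string]float64{"cpu": 70, "memory": 40}, low, high)) // Hotspot
}
```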

cgroup v2 Support

Koordlet now works with Linux cgroup v2. A refactored ResourceExecutor abstracts file operations for both cgroup versions. Example code for reading a pod’s CPU set and updating its CFS quota:

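// Requires strconv plus the koordlet packages resourceexecutor,
// statesinformer, system and koordletutil.
var (
    // The reader resolves the right cgroup files for the cgroup version in use.
    cgroupReader = resourceexecutor.NewCgroupReader()
    // The executor applies updates to system resources such as cgroups.
    executor     = resourceexecutor.NewResourceUpdateExecutor()
)

// readPodCPUSet returns the CPU set of the pod's cgroup as a string.
func readPodCPUSet(podMeta *statesinformer.PodMeta) (string, error) {
    podParentDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
    cpus, err := cgroupReader.ReadCPUSet(podParentDir)
    if err != nil {
        return "", err
    }
    return cpus.String(), nil
}

// updatePodCFSQuota writes a new CFS quota value to the pod's cgroup.
func updatePodCFSQuota(podMeta *statesinformer.PodMeta, cfsQuotaValue int64) error {
    podDir := koordletutil.GetPodCgroupDirWithKube(podMeta.CgroupDir)
    cfsQuotaStr := strconv.FormatInt(cfsQuotaValue, 10)
    // Build an updater for the CFS quota file of this cgroup directory.
    updater, err := resourceexecutor.DefaultCgroupUpdaterFactory.New(system.CPUCFSQuotaName, podDir, cfsQuotaStr)
    if err != nil {
        return err
    }
    // Passing true lets the executor use its cache to skip writing an unchanged value.
    _, err = executor.Update(true, updater)
    return err
}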

Performance Collectors & Interference Detection

Optional collectors gated by feature flags:

CPICollector – collects Cycles‑Per‑Instruction metrics.

PSICollector – collects Pressure Stall Information.

Metrics are exposed via Prometheus. Example metric:

# HELP koordlet_container_cpi Container cpi collected by koordlet
# TYPE koordlet_container_cpi gauge
koordlet_container_cpi{container_id="containerd://...",container_name="koordlet",cpi_field="cycles",node="node1",pod_name="koordlet-xyz",pod_namespace="koordinator-system",pod_uid="..."} 2.228e+09
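
The sample above carries a raw cycle count under cpi_field="cycles". Assuming a companion sample with cpi_field="instructions" is exported for the same container (an assumption, since only the cycles sample is shown here), the Cycles‑Per‑Instruction ratio could be derived as in this sketch. The function name cpi and the instruction count are hypothetical.

```go
package main

import "fmt"

// cpi derives Cycles-Per-Instruction from two raw counter values.
// The second return value is false when the ratio is undefined.
func cpi(cycles, instructions float64) (float64, bool) {
	if instructions <= 0 {
		return 0, false
	}
	return cycles / instructions, true
}

func main() {
	cycles := 2.228e+09       // value from the example metric
	instructions := 1.114e+09 // hypothetical value for illustration

	if v, ok := cpi(cycles, instructions); ok {
		fmt.Println(v) // 2
	}
}
```

A rising ratio for the same workload means each instruction takes more cycles, which is one signal that the container may be suffering from interference.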

ServiceMonitor Integration

Setting koordlet.enableServiceMonitor=true creates a ServiceMonitor, together with the Service it selects, so that Prometheus can scrape koordlet's metrics:

apiVersion: v1
kind: Service
metadata:
  name: koordlet
  namespace: koordinator-system
spec:
  ports:
  - name: koordlet-service
    port: 9316
    targetPort: 9316
  selector:
    koord-app: koordlet
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: koordlet
  namespace: koordinator-system
spec:
  endpoints:
  - interval: 30s
    port: koordlet-service
    scheme: http
  selector:
    matchLabels:
      koord-app: koordlet

Configuration Examples

Scheduler ConfigMap enabling the new features:

apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-scheduler-config
  namespace: koordinator-system
data:
  koord-scheduler-config: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    profiles:
    - schedulerName: koord-scheduler
      plugins:
        filter:
          enabled:
          - name: LoadAwareScheduling
        score:
          enabled:
          - name: LoadAwareScheduling
            weight: 1
      pluginConfig:
      - name: LoadAwareScheduling
        args:
          prodUsageThresholds:
            cpu: 55
            memory: 65
          scoreAccordingProdUsage: true
          aggregated:
            usageThresholds:
              cpu: 65
              memory: 75
            usageAggregationType: "p99"
            scoreAggregationType: "p99"

Descheduler ConfigMap enabling LowNodeLoad:

apiVersion: v1
kind: ConfigMap
metadata:
  name: koord-descheduler-config
  namespace: koordinator-system
data:
  koord-descheduler-config: |
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerConfiguration
    deschedulingInterval: 60s
    profiles:
    - name: koord-descheduler
      plugins:
        balance:
          enabled:
          - name: LowNodeLoad
      pluginConfig:
      - name: LowNodeLoad
        args:
          lowThresholds:
            cpu: 20
            memory: 30
          highThresholds:
            cpu: 50
            memory: 60
          evictableNamespaces:
            exclude:
            - "kube-system"
            - "koordinator-system"

Demo Workflow

Deploy a stress‑test pod (example Deployment manifest).

Deploy several Prod‑type nginx pods with priorityClassName: "koord-prod" and schedulerName: koord-scheduler.

Observe node utilization with kubectl top node and verify that Prod pods avoid overloaded nodes.

Enable the LowNodeLoad plugin and watch the descheduler evict pods from hotspot nodes to idle nodes.

Future Plans

The community plans to extend mixed‑workload support to additional big‑data frameworks, enrich interference detection with further metrics (memory, disk I/O), and continue standardizing mixed‑workload capabilities across vendors.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
