Mixed Workload Scheduling (混部) in Kubernetes: Challenges, Core Technologies, and Koordinator Enhancements
The article analyzes low CPU utilization in pure online Kubernetes clusters, introduces mixed‑workload (online + offline) scheduling to improve resource efficiency, explains core techniques, kernel QoS features, and details Koordinator‑based implementations such as node resource reservation and scheduling adjustments.
Problem & Thoughts
We observed that clusters running only online business workloads have low average utilization, with CPU usage often below 10%. The main causes are coarse resource estimation, workload burstiness, limited consolidation across clusters, and siloed deployments.
Company Situation
As our Kubernetes scale grows, overall cluster utilization remains low, leading to serious resource waste. To address cost reduction and efficiency, we introduced a mixed‑workload (混部) feature that allows idle resources from online clusters to be used by offline tasks.
We classify applications into two types: online services and offline jobs.
Online and offline workloads are complementary in both timing and tolerance: their load peaks rarely coincide, online services retain scheduling priority, and offline jobs tolerate preemption and delay.
Mixed‑Workload Definition
From a cluster perspective, mixed‑workload deploys multiple application types in the same cluster, using predictive analysis to smooth resource peaks and valleys, thereby improving utilization.
From a node perspective, it places both online and offline containers on the same node.
Overall, mixing online services with offline tasks on shared physical resources, while applying isolation and scheduling controls, constitutes the “mixed‑workload” technique.
Problems to Solve
Cluster management: increase utilization, reduce IT cost, and provide clear insight into resource allocation.
Online services: mitigate interference between co‑located containers, manage differing sensitivity to resource contention, and avoid latency spikes.
Offline jobs: enable reliable oversubscription, meet differentiated QoS, and quickly detect interference sources.
Core Technologies
1. Node‑level granularity:
Container isolation (cgroup, RunC, etc.)
Node‑level scheduling (load awareness, policies, thresholds, priority, CPU share)
Central scheduler feedback and policies
Differentiated SLO settings for priority and resource guarantees
2. Cluster‑level granularity (built on node granularity):
High‑performance central scheduling, multi‑load coordination, GPU topology awareness
Node‑level task start‑stop control, hardware adaptation, CPU normalization
K8s ecosystem optimizations (scale, stability, ops support)
Kernel‑Supported QoS Features
CPU Competition & Isolation
Group Identity – assigns per-cgroup scheduling identities so the CPU scheduler favors online tasks, protecting online services from offline-induced scheduling delays.
SMT Expeller – prevents low-priority tasks on a sibling hyper-thread from interfering with high-priority workloads on the same physical core.
CPU Burst – elastic CFS quota control that lets a container briefly burst above its CPU limit while keeping average usage within it.
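Conceptually, CPU Burst behaves like a token bucket on CFS quota: quota left unused in earlier periods accumulates (up to a cap) and can be spent on short bursts. The following is a toy sketch of that idea, not the kernel implementation; all names and numbers are illustrative:

```python
def run_periods(quota, burst_cap, demands):
    """Simulate CFS periods: each period grants `quota` (ms of CPU time);
    unused quota accumulates up to `burst_cap` and can cover short bursts."""
    saved = 0
    granted = []
    for demand in demands:
        budget = quota + saved          # this period's quota plus saved burst budget
        used = min(demand, budget)      # demand beyond the budget gets throttled
        granted.append(used)
        saved = min(budget - used, burst_cap)
    return granted

# quota 100 ms/period, burst cap 100 ms; an idle period lets a 180 ms burst through:
print(run_periods(100, 100, [50, 180, 100]))  # -> [50, 150, 100]
```

With the burst cap set to zero the second period is throttled back to the plain quota, which is exactly the latency spike CPU Burst is designed to avoid.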
Memory Competition & Isolation
When a container's memory usage (including page cache) approaches its limit, the kernel's memory-reclamation path is triggered, degrading allocation performance for that container.
Node-level memory oversubscription (containers' Memory Limit exceeding Request, so the sum of limits can exceed node capacity) can trigger global reclamation, heavily impacting performance for all containers.
Container memory fairness is provided via memcg; Anolis adds cgroup v1 support for this feature.
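The oversubscription condition above reduces to a simple check: global reclamation becomes possible once the sum of container memory limits exceeds node capacity. A trivial sketch (names hypothetical):

```python
def memory_oversubscribed(node_capacity_bytes, container_limits_bytes):
    # If the sum of container memory limits exceeds node capacity, memory
    # pressure can trigger global reclamation that hurts every container.
    return sum(container_limits_bytes) > node_capacity_bytes

GiB = 1 << 30
# Three containers with limits 16 + 24 + 32 = 72 GiB on a 64 GiB node:
print(memory_oversubscribed(64 * GiB, [16 * GiB, 24 * GiB, 32 * GiB]))  # -> True
```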
Koordinator
Koordinator (https://koordinator.sh) is a QoS‑aware scheduling system designed for efficient micro‑service, AI, and big‑data workloads on Kubernetes, increasing deployment density and reducing resource cost.
Key mechanisms:
Resource priority and QoS model for mixed‑workload scenarios
Stable oversubscription
Fine‑grained container resource orchestration and isolation
Enhanced scheduling for diverse workload types
Rapid onboarding of complex workloads
Koordinator Architecture
Advantages
Focus on mixed‑workload (scheduling, quota management, elasticity, cost control)
Zero‑intrusion, low‑cost integration for workloads and Kubernetes
Mixed‑Workload Feature Enhancements
We built on the open‑source Koordinator and added company‑specific features, such as node resource reservation.
Node Resource Reservation
Some legacy applications run as host processes alongside K8s containers. To avoid competition, we reserve a portion of CPU and memory that is excluded from the scheduler.
Reservation is declared via node annotations, for example:
apiVersion: v1
kind: Node
metadata:
name: fake-node
annotations: # 5 specific cores will be selected, e.g. 0, 1, 2, 3, 4, and those cores will be reserved.
node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
name: fake-node
annotations: # the cores 0, 1, 2, 3 will be reserved.
node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

The Koordlet component reports the reserved CPU IDs to the NodeResourceTopology object.
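The two annotation forms can be resolved to concrete core IDs as follows. This is a minimal sketch of the resolution logic, not Koordinator's actual code; the function name and the fallback of taking the first N cores are assumptions:

```python
import json

RESERVATION_KEY = "node.koordinator.sh/reservation"

def reserved_cpus(annotations, node_cpu_ids):
    """Return the CPU IDs reserved on a node, per its reservation annotation.

    If 'reservedCPUs' gives an explicit set (e.g. "0-3"), use it; otherwise
    reserve the first N cores implied by resources.cpu (assumed fallback).
    """
    raw = annotations.get(RESERVATION_KEY)
    if not raw:
        return []
    spec = json.loads(raw)
    explicit = spec.get("reservedCPUs")
    if explicit:  # e.g. "0-3" or "0-1,4"
        ids = []
        for part in explicit.split(","):
            lo, _, hi = part.partition("-")
            ids.extend(range(int(lo), int(hi or lo) + 1))
        return ids
    count = int(spec.get("resources", {}).get("cpu", 0))
    return node_cpu_ids[:count]

# The two annotation forms from the examples above, on an 8-core node:
print(reserved_cpus({RESERVATION_KEY: '{"resources":{"cpu":"5"}}'}, list(range(8))))
# -> [0, 1, 2, 3, 4]
print(reserved_cpus({RESERVATION_KEY: '{"reservedCPUs":"0-3"}'}, list(range(8))))
# -> [0, 1, 2, 3]
```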
Scheduling & Rescheduling Adaptation
During allocation, the scheduler must subtract reserved resources from node capacity:
cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)

For batch oversubscription, the calculation also excludes reserved resources and considers system usage:
reserveRatio = (100 - thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used

Rescheduling plugins must be aware of reserved capacity and may need to evict containers that occupy reserved slots.
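Putting the two calculations together, a minimal numeric sketch of the formulas above (function names and all values are hypothetical):

```python
def node_allocatable(total, allocated, kubelet_reserved, anno_reserved):
    # cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)
    return total - allocated - kubelet_reserved - anno_reserved

def batch_allocatable(node_alloc, node_used, pod_used, anno_reserved,
                      ls_pod_used, threshold_percent):
    # reserveRatio = (100 - thresholdPercent) / 100.0
    reserve_ratio = (100 - threshold_percent) / 100.0
    node_reserved = node_alloc * reserve_ratio
    # system.used = max(node.used - pod.used, node.anno.reserved)
    system_used = max(node_used - pod_used, anno_reserved)
    # Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used
    return node_alloc - node_reserved - system_used - ls_pod_used

# 64-core node: 2 cores kubelet-reserved, 4 cores annotation-reserved:
alloc = node_allocatable(total=64, allocated=0, kubelet_reserved=2, anno_reserved=4)
print(alloc)  # -> 58

# Batch (BE) allocatable with a 50% safety threshold, 16 cores used by LS pods:
print(batch_allocatable(node_alloc=58, node_used=20, pod_used=18,
                        anno_reserved=4, ls_pod_used=16, threshold_percent=50))
# -> 9.0
```

Note how `system_used` falls back to the annotation-reserved amount when the non-containerized processes happen to be idle, so the reservation is honored either way.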
Single‑Node Resource Management
For LS‑type Pods, Koordlet dynamically computes a shared CPU pool while excluding reserved cores, ensuring isolation from non‑containerized processes. QoS policies such as CPUSuppress also factor in reserved resources:
suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)

Other enhancements and bug fixes are tracked in the community pull-request list.
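The CPUSuppress formula above can be sketched the same way (a hypothetical helper, not Koordlet's code; all values illustrative):

```python
def be_suppress_quota(node_total, slo_percent, ls_pod_used,
                      system_used, anno_reserved):
    # suppress(BE) := node.Total * SLOPercent - pod(LS).Used
    #                 - max(system.Used, node.anno.reserved)
    return (node_total * slo_percent / 100.0
            - ls_pod_used
            - max(system_used, anno_reserved))

# 64-core node, 50% SLO, LS pods using 20 cores, 4 cores annotation-reserved:
print(be_suppress_quota(64, 50, 20, 3, 4))  # -> 8.0
```

As LS usage rises, the BE quota shrinks, which is how the node protects latency-sensitive pods while still letting best-effort jobs soak up the slack.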
Benefits and Future Plans
Deploying mixed‑workload technology has freed an additional 23,000 offline CPU cores for big‑data processing and noticeably improved CPU utilization on DB servers.
Future work includes extending QoS to network bandwidth, adding blkio considerations, enhancing scheduling/rescheduling with bandwidth‑aware load perception, and defining more QoS metrics.
We will continue collaborating with the open‑source community to drive these improvements.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.