Mixed Workload Scheduling (混部) in Kubernetes: Challenges, Core Technologies, and Koordinator Enhancements
The article analyzes low CPU utilization in pure online Kubernetes clusters, introduces mixed‑workload (online + offline) scheduling to improve resource efficiency, explains core techniques, kernel QoS features, and details Koordinator‑based implementations such as node resource reservation and scheduling adjustments.
Problem & Thoughts
We observed that clusters running only online business workloads have low average utilization, with CPU usage often below 10%. The main causes are coarse resource estimation, workload burstiness, limited consolidation across clusters, and siloed deployments.
Company Situation
As our Kubernetes scale grows, overall cluster utilization remains low, leading to serious resource waste. To address cost reduction and efficiency, we introduced a mixed‑workload (混部) feature that allows idle resources from online clusters to be used by offline tasks.
We classify applications into two types: online services and offline jobs.
Online and offline workloads are complementary in both timing and tolerance: their load peaks rarely coincide, online services retain scheduling priority, and offline jobs tolerate preemption and delay.
Mixed‑Workload Definition
From a cluster perspective, mixed‑workload deploys multiple application types in the same cluster, using predictive analysis to smooth resource peaks and valleys, thereby improving utilization.
From a node perspective, it places both online and offline containers on the same node.
Overall, mixing online services with offline tasks on shared physical resources, while applying isolation and scheduling controls, constitutes the “mixed‑workload” technique.
Problems to Solve
Cluster management: increase utilization, reduce IT cost, and provide clear insight into resource allocation.
Online services: mitigate interference between co‑located containers, manage differing sensitivity to resource contention, and avoid latency spikes.
Offline jobs: enable reliable oversubscription, meet differentiated QoS, and quickly detect interference sources.
Core Technologies
1. Node‑level granularity:
Container isolation (cgroup, RunC, etc.)
Node‑level scheduling (load awareness, policies, thresholds, priority, CPU share)
Central scheduler feedback and policies
Differentiated SLO settings for priority and resource guarantees
2. Cluster‑level granularity (built on node granularity):
High‑performance central scheduling, multi‑load coordination, GPU topology awareness
Node‑level task start‑stop control, hardware adaptation, CPU normalization
K8s ecosystem optimizations (scale, stability, ops support)
Kernel‑Supported QoS Features
CPU Competition & Isolation
Group Identity – assigns per-cgroup scheduling identities so the CPU scheduler favors online tasks, protecting online services from offline-induced scheduling delays.
SMT Expeller – prevents low-priority tasks on a sibling hyper-thread from interfering with high-priority workloads on the same physical core.
CPU Burst – elastic CFS quota control that lets a container briefly burst above its CPU limit while keeping average usage within it.
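Conceptually, CPU Burst behaves like a token bucket on CFS quota: quota left unused in earlier periods accumulates (up to a cap) and can be spent on short bursts. The following is a toy sketch of that idea, not the kernel implementation; all names and numbers are illustrative:

```python
def run_periods(quota, burst_cap, demands):
    """Simulate CFS periods: each period grants `quota` (ms of CPU time);
    unused quota accumulates up to `burst_cap` and can cover short bursts."""
    saved = 0
    granted = []
    for demand in demands:
        budget = quota + saved          # this period's quota plus saved burst budget
        used = min(demand, budget)      # demand beyond the budget gets throttled
        granted.append(used)
        saved = min(budget - used, burst_cap)
    return granted

# quota 100 ms/period, burst cap 100 ms; an idle period lets a 180 ms burst through:
print(run_periods(100, 100, [50, 180, 100]))  # -> [50, 150, 100]
```

With the burst cap set to zero the second period is throttled back to the plain quota, which is exactly the latency spike CPU Burst is designed to avoid.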
Memory Competition & Isolation
When a container's memory usage (including page cache) approaches its limit, the kernel's memory-reclamation path is triggered, degrading allocation performance for that container.
Node-level memory oversubscription (containers' Memory Limit exceeding Request, so the sum of limits can exceed node capacity) can trigger global reclamation, heavily impacting performance for all containers.
Container memory fairness is provided via memcg; Anolis adds cgroup v1 support for this feature.
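The oversubscription condition above reduces to a simple check: global reclamation becomes possible once the sum of container memory limits exceeds node capacity. A trivial sketch (names hypothetical):

```python
def memory_oversubscribed(node_capacity_bytes, container_limits_bytes):
    # If the sum of container memory limits exceeds node capacity, memory
    # pressure can trigger global reclamation that hurts every container.
    return sum(container_limits_bytes) > node_capacity_bytes

GiB = 1 << 30
# Three containers with limits 16 + 24 + 32 = 72 GiB on a 64 GiB node:
print(memory_oversubscribed(64 * GiB, [16 * GiB, 24 * GiB, 32 * GiB]))  # -> True
```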
Koordinator
Koordinator (https://koordinator.sh) is a QoS‑aware scheduling system designed for efficient micro‑service, AI, and big‑data workloads on Kubernetes, increasing deployment density and reducing resource cost.
Key mechanisms:
Resource priority and QoS model for mixed‑workload scenarios
Stable oversubscription
Fine‑grained container resource orchestration and isolation
Enhanced scheduling for diverse workload types
Rapid onboarding of complex workloads
Koordinator Architecture
Advantages
Focus on mixed‑workload (scheduling, quota management, elasticity, cost control)
Zero‑intrusion, low‑cost integration for workloads and Kubernetes
Mixed‑Workload Feature Enhancements
We built on the open‑source Koordinator and added company‑specific features, such as node resource reservation.
Node Resource Reservation
Some legacy applications run as host processes alongside K8s containers. To avoid competition, we reserve a portion of CPU and memory that is excluded from the scheduler.
Reservation is declared via node annotations, for example:
apiVersion: v1
kind: Node
metadata:
name: fake-node
annotations: # 5 specific cores will be selected, e.g. 0, 1, 2, 3, 4, and those cores will be reserved.
node.koordinator.sh/reservation: '{"resources":{"cpu":"5"}}'
---
apiVersion: v1
kind: Node
metadata:
name: fake-node
annotations: # the cores 0, 1, 2, 3 will be reserved.
node.koordinator.sh/reservation: '{"reservedCPUs":"0-3"}'

The Koordlet component reports the reserved CPU IDs to the NodeResourceTopology object.
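The two annotation forms can be resolved to concrete core IDs as follows. This is a minimal sketch of the resolution logic, not Koordinator's actual code; the function name and the fallback of taking the first N cores are assumptions:

```python
import json

RESERVATION_KEY = "node.koordinator.sh/reservation"

def reserved_cpus(annotations, node_cpu_ids):
    """Return the CPU IDs reserved on a node, per its reservation annotation.

    If 'reservedCPUs' gives an explicit set (e.g. "0-3"), use it; otherwise
    reserve the first N cores implied by resources.cpu (assumed fallback).
    """
    raw = annotations.get(RESERVATION_KEY)
    if not raw:
        return []
    spec = json.loads(raw)
    explicit = spec.get("reservedCPUs")
    if explicit:  # e.g. "0-3" or "0-1,4"
        ids = []
        for part in explicit.split(","):
            lo, _, hi = part.partition("-")
            ids.extend(range(int(lo), int(hi or lo) + 1))
        return ids
    count = int(spec.get("resources", {}).get("cpu", 0))
    return node_cpu_ids[:count]

# The two annotation forms from the examples above, on an 8-core node:
print(reserved_cpus({RESERVATION_KEY: '{"resources":{"cpu":"5"}}'}, list(range(8))))
# -> [0, 1, 2, 3, 4]
print(reserved_cpus({RESERVATION_KEY: '{"reservedCPUs":"0-3"}'}, list(range(8))))
# -> [0, 1, 2, 3]
```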
Scheduling & Rescheduling Adaptation
During allocation, the scheduler must subtract reserved resources from node capacity:
cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)

For batch oversubscription, the calculation also excludes reserved resources and considers system usage:
reserveRatio = (100 - thresholdPercent) / 100.0
node.reserved = node.alloc * reserveRatio
system.used = max(node.used - pod.used, node.anno.reserved)
Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used

Rescheduling plugins must be aware of reserved capacity and may need to evict containers that occupy reserved slots.
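Putting the two calculations together, a minimal numeric sketch of the formulas above (function names and all values are hypothetical):

```python
def node_allocatable(total, allocated, kubelet_reserved, anno_reserved):
    # cpus(alloc) = cpus(total) - cpus(allocated) - cpus(kubeletReserved) - cpus(nodeAnnoReserved)
    return total - allocated - kubelet_reserved - anno_reserved

def batch_allocatable(node_alloc, node_used, pod_used, anno_reserved,
                      ls_pod_used, threshold_percent):
    # reserveRatio = (100 - thresholdPercent) / 100.0
    reserve_ratio = (100 - threshold_percent) / 100.0
    node_reserved = node_alloc * reserve_ratio
    # system.used = max(node.used - pod.used, node.anno.reserved)
    system_used = max(node_used - pod_used, anno_reserved)
    # Node(BE).Alloc = Node.Alloc - Node.Reserved - System.Used - Pod(LS).Used
    return node_alloc - node_reserved - system_used - ls_pod_used

# 64-core node: 2 cores kubelet-reserved, 4 cores annotation-reserved:
alloc = node_allocatable(total=64, allocated=0, kubelet_reserved=2, anno_reserved=4)
print(alloc)  # -> 58

# Batch (BE) allocatable with a 50% safety threshold, 16 cores used by LS pods:
print(batch_allocatable(node_alloc=58, node_used=20, pod_used=18,
                        anno_reserved=4, ls_pod_used=16, threshold_percent=50))
# -> 9.0
```

Note how `system_used` falls back to the annotation-reserved amount when the non-containerized processes happen to be idle, so the reservation is honored either way.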
Single‑Node Resource Management
For LS‑type Pods, Koordlet dynamically computes a shared CPU pool while excluding reserved cores, ensuring isolation from non‑containerized processes. QoS policies such as CPUSuppress also factor in reserved resources:
suppress(BE) := node.Total * SLOPercent - pod(LS).Used - max(system.Used, node.anno.reserved)

Other enhancements and bug fixes are tracked in the community pull-request list.
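The CPUSuppress formula above can be sketched the same way (a hypothetical helper, not Koordlet's code; all values illustrative):

```python
def be_suppress_quota(node_total, slo_percent, ls_pod_used,
                      system_used, anno_reserved):
    # suppress(BE) := node.Total * SLOPercent - pod(LS).Used
    #                 - max(system.Used, node.anno.reserved)
    return (node_total * slo_percent / 100.0
            - ls_pod_used
            - max(system_used, anno_reserved))

# 64-core node, 50% SLO, LS pods using 20 cores, 4 cores annotation-reserved:
print(be_suppress_quota(64, 50, 20, 3, 4))  # -> 8.0
```

As LS usage rises, the BE quota shrinks, which is how the node protects latency-sensitive pods while still letting best-effort jobs soak up the slack.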
Benefits and Future Plans
Deploying mixed‑workload technology has freed an additional 23,000 offline CPU cores for big‑data processing and noticeably improved CPU utilization on DB servers.
Future work includes extending QoS to network bandwidth, adding blkio considerations, enhancing scheduling/rescheduling with bandwidth‑aware load perception, and defining more QoS metrics.
We will continue collaborating with the open‑source community to drive these improvements.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.