How Koordinator Improves Efficiency and Stability for Cloud‑Native Mixed Workloads
This article explains how Alibaba Cloud's open‑source Koordinator system tackles mixed‑workload challenges by introducing priority and QoS models, resource overcommit, load‑aware scheduling, fine‑grained CPU orchestration, and upcoming features such as GPU scheduling and resource recommendation, all illustrated with architecture diagrams and code examples.
Background and Motivation
In April 2022 Alibaba Cloud released the open‑source Koordinator project, which has since evolved through four versions to help enterprises improve efficiency and stability, and lower the cost, of running mixed online and offline workloads on Kubernetes clusters. Mixed‑workload deployment (or "co‑location") means running multiple container types, such as online services and batch jobs, on the same node or within a single cluster to increase overall resource utilization.
Data‑center utilization has historically been low (below 10% on average in 2011), and big‑data workloads have grown rapidly; together these drive the need for better resource management. Surveys indicated that more than 77% of enterprises planned to migrate half of their big‑data applications to Kubernetes by the end of 2021, making mixed‑workload placement a common practice.
Koordinator Architecture
Koordinator extends the native Kubernetes control plane with two layers: a central control layer (scheduler extensions, SLO‑Controller, Recommender, Colocation Profile Webhook) and a node‑side layer (Koordlet and Koord Runtime Proxy). The node‑side layer handles fine‑grained resource management and QoS enforcement.
Key Mechanisms
Priority Model : Four levels (Prod, Mid, Batch, Free) defined via standard PriorityClass objects, with resource capacity reported as extended resources on nodes.
QoS Model : Three QoS classes (System, Latency Sensitive, Best Effort) with sub‑classes for latency‑sensitive workloads, applied via pod labels.
Resource Overcommit : Reclaims unused CPU/memory from online pods and reallocates it to lower‑priority batch jobs. Example node status and pod annotations are shown below.
# node info
allocatable:
  koordinator.sh/batch-cpu: 50k # milli‑core
  koordinator.sh/batch-memory: 50Gi
# pod info
annotations:
  koordinator.sh/resource-limit: '{"cpu": "5k"}' # annotation values must be strings
resources:
  requests:
    koordinator.sh/batch-cpu: 5k
    koordinator.sh/batch-memory: 5Gi
Load‑Aware Scheduling : Scheduler plugin filters out nodes with high load and prefers nodes with lower utilization, using metrics reported by Koordlet.
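The filter‑then‑prefer behavior of load‑aware scheduling can be pictured with a small Python sketch. It is only an illustration of the idea, not the scheduler plugin's code: the `NodeLoad` type, the threshold values, and the equal weighting of CPU and memory are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeLoad:
    name: str
    cpu_util: float  # fraction of CPU in use (0.0 to 1.0), as a node agent might report it
    mem_util: float  # fraction of memory in use (0.0 to 1.0)


# Hypothetical thresholds chosen for the example, not Koordinator defaults.
CPU_THRESHOLD = 0.65
MEM_THRESHOLD = 0.95


def filter_nodes(nodes: List[NodeLoad]) -> List[NodeLoad]:
    """Filter step: drop nodes whose load is already at or above a threshold."""
    return [n for n in nodes
            if n.cpu_util < CPU_THRESHOLD and n.mem_util < MEM_THRESHOLD]


def score(node: NodeLoad) -> float:
    """Score step: the less loaded the node, the higher the score (0 to 100)."""
    return 100 * (1 - (node.cpu_util + node.mem_util) / 2)


def pick_node(nodes: List[NodeLoad]) -> Optional[str]:
    """Return the name of the best remaining node, or None if all were filtered."""
    candidates = filter_nodes(nodes)
    if not candidates:
        return None
    return max(candidates, key=score).name


nodes = [NodeLoad("a", 0.9, 0.5), NodeLoad("b", 0.4, 0.6), NodeLoad("c", 0.2, 0.3)]
print(pick_node(nodes))
```

Node "a" is removed by the filter because its CPU utilization is above the assumed threshold, and "c" wins the scoring step because it is the least loaded of the remaining nodes.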
ClusterColocationProfile CRD : Enables one‑click activation of co‑location for selected namespaces or workloads via a mutating webhook.
apiVersion: config.koordinator.sh/v1alpha1
kind: ClusterColocationProfile
metadata:
  name: colocation-profile-example
spec:
  namespaceSelector:
    matchLabels:
      koordinator.sh/enable-colocation: "true"
  selector:
    matchLabels:
      sparkoperator.k8s.io/launched-by-spark-operator: "true"
  qosClass: BE
  priorityClassName: koord-batch
  koordinatorPriority: 1000
  schedulerName: koord-scheduler
  labels:
    koordinator.sh/mutated: "true"
  annotations:
    koordinator.sh/intercepted: "true"
  patch:
    spec:
      terminationGracePeriodSeconds: 30
Applying the profile and labeling a namespace enables Spark jobs submitted via Spark Operator to be automatically co‑located with latency‑sensitive pods.
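To make the matching step concrete, here is a minimal Python sketch of what a mutating webhook driven by such a profile might do: check the namespace selector, check the pod selector, then copy the profile's fields onto the pod. The dictionary shapes and the `apply_profile` function are hypothetical simplifications for illustration. In particular, how the real webhook records `qosClass` on the pod is not shown in this article, so the sketch stores it as a plain field.

```python
import copy

# Simplified stand-in for the profile's spec; keys mirror the YAML above.
PROFILE = {
    "namespaceSelector": {"koordinator.sh/enable-colocation": "true"},
    "selector": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"},
    "qosClass": "BE",
    "priorityClassName": "koord-batch",
    "schedulerName": "koord-scheduler",
    "labels": {"koordinator.sh/mutated": "true"},
    "annotations": {"koordinator.sh/intercepted": "true"},
}


def labels_match(match_labels, labels):
    """True when every key/value pair of the selector is present in labels."""
    return all(labels.get(k) == v for k, v in match_labels.items())


def apply_profile(pod, namespace_labels, profile=PROFILE):
    """Return a mutated copy of pod if both selectors match, else pod itself."""
    if not labels_match(profile["namespaceSelector"], namespace_labels):
        return pod
    if not labels_match(profile["selector"], pod.get("labels", {})):
        return pod
    mutated = copy.deepcopy(pod)  # leave the caller's object untouched
    mutated.setdefault("labels", {}).update(profile["labels"])
    mutated.setdefault("annotations", {}).update(profile["annotations"])
    for field in ("qosClass", "priorityClassName", "schedulerName"):
        mutated[field] = profile[field]
    return mutated


ns_labels = {"koordinator.sh/enable-colocation": "true"}
spark_pod = {"labels": {"sparkoperator.k8s.io/launched-by-spark-operator": "true"}}
print(apply_profile(spark_pod, ns_labels)["schedulerName"])
```

A pod in an unlabeled namespace, or a pod without the Spark Operator label, passes through unchanged, which is why labeling the namespace is the step that switches co‑location on.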
$ kubectl apply -f profile.yaml
$ kubectl label ns spark-job koordinator.sh/enable-colocation=true
$ # submit Spark job; Pods created by SparkOperator will be co‑located.
QoS Enhancements
CPU Suppress : Dynamically shares idle CPU from online pods with batch pods, throttling batch pods when online load rises.
Resource‑Satisfaction Eviction : Evicts low‑priority batch pods when their CPU satisfaction ratio falls below a threshold and utilization exceeds 90%.
CPU Burst : Accumulates unused CPU credits and allows latency‑sensitive pods to burst when needed, reducing tail latency.
Group Identity : Uses kernel‑level group identity to give online pods priority over batch pods sharing the same physical core.
Memory QoS : Adjusts cgroup memory settings to protect node stability while improving memory‑sensitive workloads.
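The arithmetic behind CPU Suppress and resource‑satisfaction eviction can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Koordlet's implementation: the function names, the 65% node threshold, and the 60% satisfaction threshold are invented for the example. Only the 90% utilization condition comes from the description above.

```python
def batch_cpu_budget(node_milli, online_used_milli, system_used_milli,
                     threshold_percent=65):
    """CPU (in milli-cores) that batch pods may use right now.

    Assumed formula: cap total node usage at threshold_percent of capacity,
    then subtract what online pods and system daemons are using. The budget
    shrinks as online load rises and never goes below zero.
    """
    cap = node_milli * threshold_percent // 100
    return max(cap - online_used_milli - system_used_milli, 0)


def should_evict(requested_milli, granted_milli, batch_util_percent,
                 satisfaction_threshold_percent=60):
    """Evict when batch pods get too little CPU and are using nearly all of it.

    satisfaction = granted / requested. The threshold is hypothetical; the
    90% utilization condition follows the description in the text.
    """
    if requested_milli <= 0:
        return False
    satisfaction_percent = granted_milli * 100 // requested_milli
    return (satisfaction_percent < satisfaction_threshold_percent
            and batch_util_percent > 90)


# A 32-core node: the budget for batch pods drops as online usage grows.
print(batch_cpu_budget(32000, 12000, 1000))
print(batch_cpu_budget(32000, 22000, 1000))
print(should_evict(10000, 4000, 95))
```

Requiring both conditions avoids evicting batch pods that were granted little CPU but are not actually starved for it.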
Fine‑Grained CPU Orchestration
Koordinator introduces detailed CPU orchestration policies (e.g., SameCore, Spread) tailored to the three LS sub‑classes (LSE, LSR, LS). These policies are compatible with Kubernetes CPUManager and NUMA Topology Manager, allowing safe gradual adoption.
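The difference between the two policy names mentioned above can be illustrated with a toy CPU topology. The sketch below is an assumption‑laden illustration, not Koordinator's allocation logic: the topology map, the function names, and the tie‑breaking order are all hypothetical, and it does not model which LS sub‑class uses which policy.

```python
# Hypothetical topology: physical core id -> logical CPU ids (hyper-thread siblings).
TOPOLOGY = {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}


def pick_same_core(topology, n):
    """Pack the request onto as few physical cores as possible."""
    ordered = [cpu for core in sorted(topology) for cpu in topology[core]]
    if n > len(ordered):
        raise ValueError("not enough logical CPUs")
    return ordered[:n]


def pick_spread(topology, n):
    """Take one logical CPU per physical core before reusing any core."""
    depth = max(len(cpus) for cpus in topology.values())
    ordered = [topology[core][i]
               for i in range(depth)
               for core in sorted(topology)
               if i < len(topology[core])]
    if n > len(ordered):
        raise ValueError("not enough logical CPUs")
    return ordered[:n]


print(pick_same_core(TOPOLOGY, 4))  # both siblings of cores 0 and 1
print(pick_spread(TOPOLOGY, 4))     # one logical CPU on each of the four cores
```

Packing onto whole cores keeps a workload from sharing a physical core with a neighbor, while spreading gives each thread a core of its own at the cost of leaving sibling CPUs open to other pods.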
Resource Reservation
The upcoming Reservation CRD lets users pre‑allocate resources for anticipated spikes, scaling events, or safe re‑scheduling, without modifying existing Kubernetes APIs.
kind: Reservation
metadata:
  name: my-reservation
  namespace: default
spec:
  template: ... # copy of the Pod spec
  resourceOwners:
    controller:
      apiVersion: apps/v1
      kind: Deployment
      name: deployment-5b8df84dd
  timeToLiveInSeconds: 300
  nodeName: node-1
status:
  phase: Available
Future Roadmap
Version 0.5 will add fine‑grained CPU orchestration and resource reservation. Planned features for later releases include GPU scheduling, gang scheduling, elastic quota, and a profile‑based resource recommendation engine that analyzes historical usage to suggest optimal request/limit settings.
Alibaba Cloud Native
