Mastering kube-scheduler: How Kubernetes Schedules Pods Efficiently
This article explains how kube-scheduler places Pods by applying pre-selection (predicate) and scoring (priority) strategies. It discusses the scheduler's goals of fairness, efficient resource use, performance, and flexibility, surveys the common predicate and priority algorithms, and walks through practical scenarios with YAML and command-line examples.
Introduction
Scheduling is a critical step in container orchestration. The kube-scheduler component of Kubernetes ensures that Pods are placed on suitable nodes while meeting various production constraints such as dedicated machines for certain services or disaster‑recovery distribution across nodes.
kube-scheduler acts as a caretaker for Pods, providing scheduling services based on mechanisms such as resource-fair scheduling, binding Pods to specific nodes, and co-locating Pods that communicate frequently.
The scheduler must achieve several goals:
Fairness – each node should have a chance to receive resources.
Efficient resource utilization – maximize CPU, memory, etc., across the cluster.
Performance – quickly schedule large numbers of Pods even as the cluster scales.
Flexibility – allow users to control scheduling policies, support multiple schedulers, and enable custom schedulers.
To meet these goals, kube-scheduler evaluates node resources, load, data locality, and other factors, influencing the overall availability and performance of a Kubernetes cluster, especially when thousands of nodes are involved.
Scheduling Process
The core task of kube-scheduler is to bind a Pod to the most appropriate node. The process consists of two stages: Predicates (pre-selection) and Priorities (scoring).
Predicates (Pre‑selection)
Input: all nodes. Output: nodes that satisfy pre‑selection conditions. Nodes that fail conditions such as insufficient resources or mismatched labels are filtered out.
Priorities (Scoring)
Input: nodes that passed the predicate stage. Each node receives a score based on priority functions; the node with the highest total score is selected.
In simple terms, scheduling answers two questions: 1) Which nodes are candidates? 2) Which candidate is the best?
If no node satisfies the predicates, the Pod remains in Pending state and the scheduler keeps retrying.
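When a Pod stays in Pending, the scheduler records the reason as events on the Pod object. A quick way to see which predicate failed is kubectl describe; the Pod name below is a placeholder and the exact event wording varies by Kubernetes version:
kubectl describe pod redis-master-xxxx
The Events section then typically contains a FailedScheduling entry such as "0/3 nodes are available: 3 Insufficient cpu.", which points directly at the condition that filtered out the nodes.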
Predicate Strategies
kube-scheduler supports many predicate algorithms. Common ones include:
Volume count limits : MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount – ensure the number of attached volumes does not exceed a configured maximum.
Resource pressure checks : CheckNodeMemoryPressure, CheckNodeDiskPressure – prevent scheduling to nodes under memory or disk pressure.
Volume conflict checks : NoDiskConflict, NoVolumeZoneConflict, NoVolumeNodeConflict – reject nodes where the Pod's volumes would conflict with volumes already in use on the node, or would violate the volume's zone or node constraints.
Constraint checks : MatchNodeSelector, MatchInterPodAffinity, PodToleratesNodeTaints – verify node labels, pod affinity, and taint‑toleration relationships.
Fit checks : PodFitsResources, PodFitsHostPorts, PodFitsHost – ensure sufficient CPU/memory, free host ports, and a matching node name (see the sample spec after this list).
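To make the fit checks concrete, the following minimal sketch (not from the original workloads; the name is hypothetical) shows the fields that PodFitsResources and PodFitsHostPorts read: the declared resource requests are compared against each node's unreserved capacity, and the hostPort must still be free on the node.
apiVersion: v1
kind: Pod
metadata:
  name: fit-check-demo        # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080          # PodFitsHostPorts: port 8080 must be free on the node
    resources:
      requests:
        cpu: 500m             # PodFitsResources: the node needs 0.5 CPU and
        memory: 256Mi         # 256Mi of memory not already requested by other Pods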
Priority Strategies
During the priority phase, each node receives a score from 0 to 10 from every priority function; the final score is the weighted sum across all functions (a worked example follows the list of strategies below).
LeastRequestedPriority (default weight 1): prefers nodes with the smallest amount of requested CPU and memory.
BalancedResourceAllocation (default weight 1): gives higher weight to nodes where CPU and memory usage are balanced.
SelectorSpreadPriority (default weight 1): spreads Pods of the same Service or ReplicationController across different nodes or zones.
NodeAffinityPriority (default weight 1): prefers nodes whose labels match the Pod's affinity requirements, similar to the MatchNodeSelector predicate.
InterPodAffinityPriority (default weight 1): adds weight based on existing Pods' affinity on a node; the node with the highest total weight wins.
NodePreferAvoidPodsPriority (default weight 10000): heavily penalizes nodes that should avoid certain Pods, effectively overriding other scores.
TaintTolerationPriority (default weight 1): scores nodes based on how many taints a Pod can tolerate.
ImageLocalityPriority (default weight 1): prefers nodes that already have the required container images cached.
EqualPriority (default weight 1): gives all nodes equal weight, mainly for testing.
MostRequestedPriority (default weight 1): prefers nodes with the highest resource utilization, useful for scaling scenarios.
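To make the weighted-sum scoring concrete, consider a hypothetical comparison of two candidate nodes using made-up per-function scores and the default weight of 1:
Node A: LeastRequestedPriority 8 + BalancedResourceAllocation 6 + ImageLocalityPriority 10 = 24
Node B: LeastRequestedPriority 9 + BalancedResourceAllocation 7 + ImageLocalityPriority 0 = 16
Node A wins because the image is already cached there, even though Node B has slightly more free resources. Weights shift this balance: raising the weight of LeastRequestedPriority to 10 would give Node A 8×10 + 6 + 10 = 96 and Node B 9×10 + 7 + 0 = 97, so Node B would be selected instead.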
Case Studies
Scenario 1 – Schedule to SSD‑backed Nodes
Label an SSD-backed node and add a matching nodeSelector to the Redis Deployment; the MatchNodeSelector predicate then filters out all other nodes.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis-master
  labels:
    name: redis
  namespace: default
spec:
  replicas: 4
  template:
    metadata:
      labels:
        name: redis
    spec:
      containers:
      - name: master
        image: 172.16.1.41:5000/redis:3.0.5
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
Label the node:
kubectl label node transwarp disk=ssd
Patch the Deployment to add the node selector:
kubectl patch deploy redis-master -p '{"spec":{"template":{"spec":{"nodeSelector":{"disk":"ssd"}}}}}'
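Assuming the label and patch above have been applied, placement can be verified with ordinary kubectl queries (node and Pod names will differ in practice):
kubectl get nodes -l disk=ssd
kubectl get pods -l name=redis -o wide
The NODE column of the second command should list only nodes that carry the disk=ssd label; if no node has the label, the Pods stay in Pending.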
Scenario 2 – Restrict to CentOS Nodes
Use node affinity with a required selector on the label key operation-system, restricted to the values centos-7.2 and centos-7.3.
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: operation-system
      operator: In
      values:
      - centos-7.2
      - centos-7.3
Optionally add a preferred rule to favor centos-7.2 nodes.
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
  preference:
    matchExpressions:
    - key: another-node-label-key-system
      operator: In
      values:
      - another-node-label-value-centos7.2
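Both stanzas are fragments: in a full manifest they sit under spec.affinity.nodeAffinity of the Pod (or of a Deployment's Pod template), and the referenced label must actually exist on the nodes. A minimal sketch, assuming a hypothetical node named node-centos72:
kubectl label node node-centos72 operation-system=centos-7.2

apiVersion: v1
kind: Pod
metadata:
  name: centos-only           # hypothetical Pod name for illustration
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: operation-system
            operator: In
            values:
            - centos-7.2
            - centos-7.3
  containers:
  - name: app
    image: nginx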
Scenario 3 – Co‑locate Pods in the Same Zone
Use InterPodAffinityPriority so that an API service and its authentication service run in the same availability zone.
First, a "flag" Pod carries the label that the affinity rule will reference:
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  labels:
    security: "cloud"
spec:
  containers:
  - name: nginx
    image: nginx
The authentication service YAML adds required pod affinity and a topology key:
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - cloud
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:                 # a container section is required for a valid Pod; nginx is a placeholder image
  - name: with-pod-affinity
    image: nginx
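The zone topology key only has an effect if the nodes carry that label; on clusters without a cloud provider it may have to be set by hand. The label can be checked and, where missing, added roughly as follows (node1 and zone-a are placeholders):
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone
kubectl label node node1 failure-domain.beta.kubernetes.io/zone=zone-a
Once both Pods are running, kubectl get pods -o wide should show them on nodes in the same zone.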
Scenario 4 – Isolate GPU‑Intensive Workloads
Mark GPU nodes with a taint and add a matching toleration to Pods that require GPU resources.
Taint the GPU node:
kubectl taint nodes node1 gpu=true:NoSchedule
The Pod spec then carries a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  generateName: redis-
  labels:
    app: redis
  namespace: default
spec:
  containers:
  - image: 172.16.1.41:5000/redis
    imagePullPolicy: Always
    name: redis
  schedulerName: default-scheduler
  tolerations:
  - effect: NoSchedule
    key: gpu
    operator: Equal
    value: "true"
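To check that the taint is in place, or to remove it when the node should accept general workloads again, commands along these lines work (node1 as above; the trailing minus removes the taint):
kubectl describe node node1 | grep Taints
kubectl taint nodes node1 gpu=true:NoSchedule-
While the taint is present, only Pods whose tolerations match gpu=true:NoSchedule, such as the Redis Pod above, can be scheduled onto node1.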
Conclusion and Outlook
kube-scheduler already offers a rich set of scheduling strategies that satisfy most common needs. Its plugin architecture enables users to customize or extend the scheduler for special resources such as local volumes or GPUs. Future improvements include caching to reduce repeated calculations in the predicate and priority phases, scheduler-extender extensions for more nuanced resource handling, and integration with custom-metrics APIs for real-time scheduling decisions.