Stop Pods From “Running Wild”: A Practical Guide to Kubernetes Scheduling Strategies
This guide explains why default Kubernetes scheduling often falls short in production, introduces nodeSelector, nodeAffinity, podAffinity/anti‑affinity, taints/tolerations, topologySpreadConstraints and PriorityClass, and provides step‑by‑step configuration examples, real‑world use cases, best‑practice recommendations, troubleshooting tips, and monitoring alerts to ensure reliable pod placement.
Overview
Pod scheduling is a core mechanism of Kubernetes that decides on which node a Pod will run. The default kube‑scheduler uses a two‑stage process of filtering (pre‑selection) and scoring (ranking). Typical production issues include database Pods landing on nodes without SSDs, several high‑load services sharing one node, and ordinary Pods occupying GPU nodes, which starves the workloads that need that hardware.
These problems are solved by applying scheduling policies such as nodeSelector, nodeAffinity, podAffinity/podAntiAffinity, taints/tolerations, and topologySpreadConstraints. The following sections show how to configure each mechanism.
Technical characteristics
Multi‑layer scheduling control: from simple nodeSelector to custom schedulers, offering different granularity.
Soft and hard constraints combined: requiredDuringSchedulingIgnoredDuringExecution rules are hard, preferredDuringSchedulingIgnoredDuringExecution rules are soft.
Topology awareness: topologySpreadConstraints enables distribution across zones, nodes, racks, etc.
Preemption mechanism: PriorityClass allows high‑priority Pods to preempt lower‑priority ones.
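The preemption mechanism in the last item is driven by a PriorityClass object that Pods reference by name. A minimal sketch follows; the class name critical-service, the value 1000000, and the payment-api Pod are illustrative assumptions, not settings taken from this guide.

```yaml
# Hypothetical PriorityClass; name and value are illustrative assumptions.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000            # a higher value means a higher priority
globalDefault: false      # Pods must opt in via priorityClassName
description: "Core services that may preempt lower-priority Pods."
---
# Placeholder Pod that opts in to the class above.
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  priorityClassName: critical-service
  containers:
  - name: app
    image: nginx:1.24
    resources:
      requests:
        cpu: 200m
        memory: 256Mi
```

When no node has room, the scheduler may evict lower‑priority Pods to place a Pod of this class, so the value should be chosen relative to the other classes in the cluster.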
Applicable scenarios
Schedule I/O‑intensive Pods (databases, caches) to SSD nodes and CPU‑intensive Pods to high‑CPU nodes.
Distribute multiple replicas of the same service across different nodes or availability zones.
Exclusive scheduling of special hardware (GPU, FPGA) to prevent ordinary Pods from occupying dedicated resources.
Isolate different teams in a multi‑tenant cluster by assigning each team its own node pool.
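For the multi‑tenant scenario, one common pattern pairs a taint on the team's nodes (to keep other teams out) with a nodeSelector on the team's Pods (to keep them in). The sketch below assumes the pool's nodes carry the label company.com/team=backend and a taint with the same key and value; the taint, the Deployment name, and the replica count are assumptions made for illustration.

```yaml
# Assumes the team's nodes were prepared with (hypothetical key/value):
#   label: company.com/team=backend
#   taint: company.com/team=backend:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api        # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      nodeSelector:
        company.com/team: backend   # pins the Pods to the team's pool
      tolerations:
      - key: "company.com/team"
        operator: "Equal"
        value: "backend"
        effect: "NoSchedule"        # lets the Pods past the pool's taint
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
```

The toleration alone only permits the Pods onto the pool; the nodeSelector is what keeps them from landing elsewhere.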
Environment requirements
Component | Version requirement | Notes
Kubernetes | 1.24+ | topologySpreadConstraints GA since 1.19, PodSecurity GA since 1.25
kube‑scheduler | Must match cluster version | Custom scheduler needs separate deployment
Node labels | Planned in advance | Scheduling policies depend on consistent labeling
metrics‑server | >=0.6 | Supplies usage data for kubectl top and for sizing requests; the default scheduler itself places Pods by requests, not live usage
Step‑by‑step guide
1. Prepare node labels
# View existing node labels
kubectl get nodes --show-labels
# Label nodes by hardware type
kubectl label node k8s-worker-01 disktype=ssd
kubectl label node k8s-worker-02 disktype=ssd
kubectl label node k8s-worker-03 disktype=hdd
# Label nodes by workload purpose
kubectl label node k8s-worker-01 workload-type=database
kubectl label node k8s-worker-02 workload-type=application
kubectl label node k8s-worker-03 workload-type=application
# Label nodes by availability zone (if multi‑datacenter)
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-a
kubectl label node k8s-worker-02 topology.kubernetes.io/zone=zone-b
kubectl label node k8s-worker-03 topology.kubernetes.io/zone=zone-c
# GPU node labels
kubectl label node k8s-gpu-01 accelerator=nvidia-tesla-v100
kubectl label node k8s-gpu-02 accelerator=nvidia-tesla-a100
# Verify labels
kubectl get nodes -L disktype,workload-type,topology.kubernetes.io/zone
Note: Use a domain‑prefixed key for custom labels, e.g. company.com/team=backend. Built‑in labels use kubernetes.io or k8s.io prefixes.
2. Understand scheduler workflow
Filtering: discards nodes that lack the requested resources, do not match the Pod's nodeSelector or required affinity rules, or carry taints the Pod does not tolerate.
Scoring: assigns a score to each remaining node and selects the highest‑scoring one.
# View recent scheduler logs (for more detail, add --v=4 to the kube-scheduler manifest)
kubectl logs -n kube-system -l component=kube-scheduler --tail=50
3. Policy priority
nodeName (highest: direct node assignment, bypasses the scheduler)
taints/tolerations
nodeSelector
nodeAffinity
podAffinity/podAntiAffinity
topologySpreadConstraints
Resource requests
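To illustrate the top entry, a Pod that sets nodeName is bound directly to that node, so the scheduler's filtering and scoring never run for it. A minimal sketch using the k8s-worker-01 node from step 1; the Pod name and the busybox image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-debug-pod      # placeholder name
spec:
  nodeName: k8s-worker-01     # direct assignment; kube-scheduler does not place this Pod
  containers:
  - name: shell
    image: busybox:1.36
    command: ["sleep", "3600"]
```

Because the scheduler is skipped, nothing checks nodeSelector, affinity, or NoSchedule taints for this Pod, which is why this field is best reserved for debugging.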
Core configurations
nodeSelector (simplest constraint)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ssd
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ssd
  template:
    metadata:
      labels:
        app: nginx-ssd
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi
Note: nodeSelector is a hard constraint; if no node matches, the Pod stays Pending. Combine with nodeAffinity for soft constraints.
nodeAffinity (more flexible)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-affinity
spec:
  replicas: 6
  selector:
    matchLabels:
      app: app-affinity
  template:
    metadata:
      labels:
        app: app-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-a
                - zone-b
              - key: disktype
                operator: In
                values:
                - ssd
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 20
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values:
                - application
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
Operators supported: In, NotIn, Exists, DoesNotExist, Gt, Lt.
podAffinity / podAntiAffinity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redis-cache
              topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
Warning: The calculation complexity of podAffinity/podAntiAffinity is O(N²) where N is the number of Pods. In clusters with >5,000 Pods, heavy use can increase scheduling latency from milliseconds to seconds.
Taints and tolerations
# Add a taint to GPU nodes so only GPU‑tolerant Pods can run there
kubectl taint nodes k8s-gpu-01 gpu=true:NoSchedule
kubectl taint nodes k8s-gpu-02 gpu=true:NoSchedule
# Add a maintenance taint to a node (evicts running Pods that do not tolerate it)
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute
# Verify taints
kubectl describe node k8s-gpu-01 | grep -A 5 Taints
# Remove a taint
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute-
Pod‑side tolerations example:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
spec:
  template:
    spec:
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: training
        image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
      restartPolicy: Never
Taint effects:
NoSchedule: new Pods that do not tolerate the taint are blocked; existing Pods are unaffected.
PreferNoSchedule: scheduler tries to avoid the node but may place Pods if needed.
NoExecute: new Pods that do not tolerate the taint are blocked, and running Pods that do not tolerate it are evicted immediately. Pods that tolerate it with tolerationSeconds set are evicted once that time elapses.
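As a sketch of how tolerationSeconds interacts with NoExecute, the Pod below tolerates the maintenance=true:NoExecute taint from the earlier command, but only for a bounded time. The 300‑second window and the Pod name are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: graceful-worker        # placeholder name
spec:
  tolerations:
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 300     # keep running for up to 5 minutes after the taint appears
  containers:
  - name: app
    image: nginx:1.24
```

Omitting tolerationSeconds from a NoExecute toleration lets the Pod stay on the tainted node indefinitely.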
topologySpreadConstraints (fine‑grained topology distribution)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spread
spec:
  replicas: 9
  selector:
    matchLabels:
      app: app-spread
  template:
    metadata:
      labels:
        app: app-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: app-spread
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: app-spread
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
Parameter notes:
maxSkew: maximum allowed difference in matching Pod count between any two topology domains; 1 enforces near‑perfect balance.
whenUnsatisfiable: DoNotSchedule (hard) or ScheduleAnyway (soft).
labelSelector: should match the Pod's own labels; otherwise the workload's replicas are not counted and the constraint has no balancing effect on them.
Best practices and caveats
Performance optimisation
Reduce podAffinity usage: in a 500‑node cluster a pod with podAffinity can increase scheduling time from 5 ms to 200 ms. Prefer topologySpreadConstraints where possible.
Set realistic requests based on the 95th‑percentile of actual usage; the scheduler bases decisions on requests, not limits.
Use the Descheduler to evict pods that violate current policies after node scaling or migrations.
Security hardening
Restrict direct nodeName usage with an admission policy to prevent bypassing scheduler checks; RBAC can limit who creates Pods but cannot filter on individual Pod fields.
Control creation of high‑priority PriorityClass objects; only privileged roles may create or use them.
Protect taint modifications with admission webhooks to avoid accidental removal of critical taints.
High‑availability patterns
Deploy at least three replicas of core services, enforce hard podAntiAffinity across nodes, and add topologySpreadConstraints across zones.
Use a PodDisruptionBudget (e.g., minAvailable: 2) to guarantee service continuity during node maintenance.
Store scheduling manifests (PriorityClass, PDB, node labels/taints) in a GitOps repository and apply with ArgoCD or FluxCD.
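The PodDisruptionBudget mentioned above can be sketched as follows for the three‑replica web-frontend Deployment from the podAffinity example. The object name is an assumption; minAvailable: 2 is the value suggested in the text.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-frontend-pdb     # hypothetical name
spec:
  minAvailable: 2            # voluntary evictions are refused if fewer than 2 Pods would remain
  selector:
    matchLabels:
      app: web-frontend      # must match the Pod labels of the protected workload
```

A budget like this only governs voluntary disruptions such as kubectl drain; it does not protect against node failures.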
Common pitfalls
Multiple nodeSelectorTerms are OR‑ed; multiple matchExpressions inside a term are AND‑ed.
Hard podAntiAffinity with topologyKey=kubernetes.io/hostname limits replica count to the number of nodes.
Incorrect labelSelector in topologySpreadConstraints leads to uneven distribution.
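The OR/AND rule in the first pitfall is easiest to see side by side. The fragment below is a sketch that reuses the labels from step 1; it belongs under a Pod template's spec and is not a complete manifest.

```yaml
# Fragment of a Pod spec, not a complete manifest.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Term 1: zone-a AND ssd (expressions inside one term are AND-ed)
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["zone-a"]
        - key: disktype
          operator: In
          values: ["ssd"]
      # Term 2: OR-ed with term 1, matches any node labeled workload-type=application
      - matchExpressions:
        - key: workload-type
          operator: In
          values: ["application"]
```

With the labels from step 1, k8s-worker-01 qualifies through the first term, while k8s-worker-02 and k8s-worker-03 qualify through the second. Moving the workload-type expression into the first term would instead require all three conditions at once, and no node from step 1 would match.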
Troubleshooting and monitoring
Debugging steps
Check scheduler logs:
kubectl logs -n kube-system -l component=kube-scheduler --tail=100
Inspect Pod events:
kubectl describe pod <pod> | grep -A 10 Events
Verify node labels:
kubectl get nodes --show-labels
Validate manifests with dry‑run:
kubectl apply --dry-run=server -f <file>.yaml
Key metrics
Scheduler latency (P99): scheduler_scheduling_algorithm_duration_seconds
Pending Pods count: scheduler_pending_pods
Preemption events: scheduler_preemption_attempts_total (counter of attempts) and scheduler_preemption_victims (histogram of victims per attempt)
Node CPU/Memory utilisation:
kubectl top nodes
Prometheus alerts (example)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-alerts
spec:
  groups:
  - name: kube-scheduler
    rules:
    - alert: SchedulerHighLatency
      expr: histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Scheduler P99 latency exceeds 500ms"
    - alert: PodsPendingTooLong
      expr: sum(scheduler_pending_pods{queue="active"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 10 pods pending for over 5 minutes"
    - alert: SchedulerUnhealthy
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "kube-scheduler is not running"
    - alert: NodeHighAllocation
      # Ratio of requested CPU to allocatable CPU per node
      expr: sum(kube_pod_container_resource_requests{resource="cpu"}) by (node) / sum(kube_node_status_allocatable{resource="cpu"}) by (node) > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} CPU allocation exceeds 85%"
    - alert: FrequentPreemption
      # scheduler_preemption_victims is a histogram, so the attempts counter is used here
      expr: increase(scheduler_preemption_attempts_total[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 5 preemption events in the last hour"
Backup and restore
Backup script (example)
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/data/scheduling-backup/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Backup PriorityClasses
kubectl get priorityclass -o yaml > "$BACKUP_DIR/priorityclasses.yaml"
# Backup PodDisruptionBudgets
kubectl get pdb -A -o yaml > "$BACKUP_DIR/pdbs.yaml"
# Backup node labels and taints
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels}' > "$BACKUP_DIR/${node}-labels.json"
  kubectl get node "$node" -o jsonpath='{.spec.taints}' > "$BACKUP_DIR/${node}-taints.json"
done
# Optional: backup Kyverno policies if used
kubectl get clusterpolicy -o yaml > "$BACKUP_DIR/kyverno-policies.yaml" 2>/dev/null || true
echo "[$(date)] Scheduling config backup completed: $BACKUP_DIR"
Restore procedure
Stop services to avoid scheduling conflicts.
Apply saved manifests:
kubectl apply -f "${BACKUP_DIR}/priorityclasses.yaml"
kubectl apply -f "${BACKUP_DIR}/pdbs.yaml"
Verify PriorityClasses:
kubectl get priorityclass
Re‑apply node labels and taints from the JSON files.
Deploy a test Pod to confirm that scheduling policies behave as expected.
Conclusion
Key takeaways
Use nodeSelector for simple hard constraints, nodeAffinity for flexible soft/hard constraints, and combine with taints/tolerations for node‑level isolation.
Prefer topologySpreadConstraints over heavy podAffinity in large clusters; it provides deterministic, low‑overhead distribution.
PriorityClass preemption must be used cautiously; batch jobs should set preemptionPolicy: Never to avoid disrupting critical services.
Monitor scheduler latency, pending Pods, and node resource utilisation to detect bottlenecks early.
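The preemptionPolicy: Never recommendation for batch jobs can be sketched as a dedicated PriorityClass. The class name and the value 1000 are illustrative assumptions.

```yaml
# Hypothetical non-preempting class for batch workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-no-preempt
value: 1000
preemptionPolicy: Never     # Pods wait in the queue rather than evict running Pods
globalDefault: false
description: "Batch jobs that must not preempt other workloads."
```

Pods using this class are still ordered ahead of lower‑priority Pods in the scheduling queue, and they can still be preempted by higher‑priority Pods; they simply never trigger evictions themselves.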
Further reading
Kubernetes Scheduler documentation – detailed description of scheduling mechanisms.
kube‑scheduler source code – for understanding algorithm implementation.
Descheduler project – tool for pod rebalancing.
Volcano project – batch scheduler with gang scheduling support.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
