
Stop Pods From “Running Wild”: A Practical Guide to Kubernetes Scheduling Strategies

This guide explains why default Kubernetes scheduling often falls short in production, introduces nodeSelector, nodeAffinity, podAffinity/anti‑affinity, taints/tolerations, topologySpreadConstraints and PriorityClass, and provides step‑by‑step configuration examples, real‑world use cases, best‑practice recommendations, troubleshooting tips, and monitoring alerts to ensure reliable pod placement.

Raymond Ops

Overview

Pod scheduling is a core mechanism of Kubernetes that decides on which node a Pod will run. The default kube‑scheduler uses a two‑stage process of filtering (pre‑selection) and scoring (ranking). Production issues often include database Pods landing on nodes without SSDs, high‑load services sharing a node, and GPU Pods being scheduled on regular nodes, causing resource starvation.

These problems are solved by applying scheduling policies such as nodeSelector, nodeAffinity, podAffinity/podAntiAffinity, taints/tolerations, and topologySpreadConstraints. The following sections show how to configure each mechanism.

Technical characteristics

Multi-layer scheduling control: from simple nodeSelector to custom schedulers, offering different levels of granularity.

Soft and hard constraints combined: requiredDuringSchedulingIgnoredDuringExecution rules are hard, preferredDuringSchedulingIgnoredDuringExecution rules are soft.

Topology awareness: topologySpreadConstraints distributes Pods across zones, nodes, racks, and other topology domains.

Preemption mechanism: PriorityClass allows high-priority Pods to preempt lower-priority ones.

Applicable scenarios

Schedule I/O‑intensive Pods (databases, caches) to SSD nodes and CPU‑intensive Pods to high‑CPU nodes.

Distribute multiple replicas of the same service across different nodes or availability zones.

Exclusive scheduling of special hardware (GPU, FPGA) to prevent ordinary Pods from occupying dedicated resources.

Isolate different teams in a multi‑tenant cluster by assigning each team its own node pool.
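
A minimal sketch of the multi-tenant scenario above, assuming a hypothetical key company.com/team with value backend that is used both as a node label and as a taint on the team's node pool. The Pod name and image are illustrative only. The toleration admits the Pod onto the tainted pool, and the nodeSelector keeps it from landing anywhere else:

```yaml
# Assumes the pool's nodes were prepared with (hypothetical key/value):
#   kubectl label node <node> company.com/team=backend
#   kubectl taint node <node> company.com/team=backend:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: backend-api            # illustrative name
  labels:
    app: backend-api
spec:
  # The taint keeps other teams' Pods off the pool; this toleration admits ours.
  tolerations:
  - key: "company.com/team"
    operator: "Equal"
    value: "backend"
    effect: "NoSchedule"
  # A toleration alone does not attract the Pod to the pool, so pin it with a selector.
  nodeSelector:
    company.com/team: backend
  containers:
  - name: api
    image: nginx:1.24
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
```

Both halves are needed: the taint without the selector lets the team's Pods drift onto shared nodes, and the selector without the taint lets other teams' Pods onto the pool.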

Environment requirements

Component        Version requirement          Notes
Kubernetes       1.24+                        topologySpreadConstraints GA since 1.19; PodSecurity admission GA since 1.25
kube-scheduler   Must match cluster version   A custom scheduler needs a separate deployment
Node labels      Planned in advance           Scheduling policies depend on consistent labeling
metrics-server   >=0.6                        Needed for kubectl top; the default scheduler decides on resource requests, not live usage

Step‑by‑step guide

1. Prepare node labels

# View existing node labels
kubectl get nodes --show-labels

# Label nodes by hardware type
kubectl label node k8s-worker-01 disktype=ssd
kubectl label node k8s-worker-02 disktype=ssd
kubectl label node k8s-worker-03 disktype=hdd

# Label nodes by workload purpose
kubectl label node k8s-worker-01 workload-type=database
kubectl label node k8s-worker-02 workload-type=application
kubectl label node k8s-worker-03 workload-type=application

# Label nodes by availability zone (if multi‑datacenter)
kubectl label node k8s-worker-01 topology.kubernetes.io/zone=zone-a
kubectl label node k8s-worker-02 topology.kubernetes.io/zone=zone-b
kubectl label node k8s-worker-03 topology.kubernetes.io/zone=zone-c

# GPU node labels
kubectl label node k8s-gpu-01 accelerator=nvidia-tesla-v100
kubectl label node k8s-gpu-02 accelerator=nvidia-tesla-a100

# Verify labels
kubectl get nodes -L disktype,workload-type,topology.kubernetes.io/zone

Note: Use a domain-prefixed key for custom labels, e.g. company.com/team=backend. Built-in labels use the kubernetes.io or k8s.io prefixes.

2. Understand scheduler workflow

Filtering: discards nodes that lack the requested resources, do not match the Pod's nodeSelector or required affinity rules, or carry taints the Pod does not tolerate.

Scoring: assigns a score to each remaining node and selects the highest-scoring one.

# View recent scheduler logs (for more detail, add --v=4 to the kube-scheduler manifest)
kubectl logs -n kube-system -l component=kube-scheduler --tail=50

3. Policy priority

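nodeName (highest: direct node assignment, bypasses the scheduler)
taints/tolerations
nodeSelector
nodeAffinity
podAffinity/podAntiAffinity
topologySpreadConstraints
Resource requests

Apart from nodeName, these do not override one another: every hard constraint that is set must be satisfied for a node to pass filtering.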

Core configurations

nodeSelector (simplest constraint)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ssd
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx-ssd
  template:
    metadata:
      labels:
        app: nginx-ssd
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 200m
            memory: 256Mi

Note: nodeSelector is a hard constraint; if no node matches, the Pod stays Pending. For a soft constraint, use nodeAffinity with preferredDuringSchedulingIgnoredDuringExecution.

nodeAffinity (more flexible)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-with-affinity
spec:
  replicas: 6
  selector:
    matchLabels:
      app: app-affinity
  template:
    metadata:
      labels:
        app: app-affinity
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-a
                - zone-b
              - key: disktype
                operator: In
                values:
                - ssd
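          # Soft rules only rank nodes that already passed the hard rule above;
          # the ssd preference is redundant here because ssd is already required.
          preferredDuringSchedulingIgnoredDuringExecution: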
          - weight: 80
            preference:
              matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
          - weight: 20
            preference:
              matchExpressions:
              - key: workload-type
                operator: In
                values:
                - application
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi

Operators supported: In, NotIn, Exists, DoesNotExist, Gt, Lt.
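
As a rough illustration of the less common operators, the sketch below assumes two hypothetical node labels, example.com/dedicated and example.com/cpu-cores (an integer stored as a label value). Neither label is defined elsewhere in this guide, and the Pod name is illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: operator-demo          # illustrative name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # NotIn: exclude nodes labelled disktype=hdd (nodes without the label still match)
          - key: disktype
            operator: NotIn
            values:
            - hdd
          # DoesNotExist: skip any node that carries this label at all (takes no values)
          - key: example.com/dedicated
            operator: DoesNotExist
          # Gt: the label value is parsed as an integer; exactly one value is allowed
          - key: example.com/cpu-cores
            operator: Gt
            values:
            - "16"
  containers:
  - name: app
    image: nginx:1.24
```

Because all three expressions sit in one term, a node must satisfy every one of them to be eligible.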

podAffinity / podAntiAffinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - redis-cache
              topologyKey: kubernetes.io/hostname
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-frontend
            topologyKey: kubernetes.io/hostname
      containers:
      - name: web
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi

Warning: The computational cost of podAffinity/podAntiAffinity grows as O(N²), where N is the number of Pods. In clusters with more than 5,000 Pods, heavy use can increase scheduling latency from milliseconds to seconds.

Taints and tolerations

# Add a taint to GPU nodes so only GPU‑tolerant Pods can run there
kubectl taint nodes k8s-gpu-01 gpu=true:NoSchedule
kubectl taint nodes k8s-gpu-02 gpu=true:NoSchedule

# Add a maintenance taint to a node (evicts running Pods that do not tolerate it)
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute

# Verify taints
kubectl describe node k8s-gpu-01 | grep -A 5 Taints

# Remove a taint
kubectl taint nodes k8s-worker-03 maintenance=true:NoExecute-

Pod‑side tolerations example:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job
spec:
  template:
    spec:
      tolerations:
      - key: "gpu"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"
      nodeSelector:
        accelerator: nvidia-tesla-v100
      containers:
      - name: training
        image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: "16Gi"
      restartPolicy: Never

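Taint effects:

NoSchedule: new Pods that do not tolerate the taint are blocked; Pods already running on the node are unaffected.

PreferNoSchedule: the scheduler tries to avoid the node but may still place Pods there if no better node is available.

NoExecute: new Pods are blocked, and running Pods that do not tolerate the taint are evicted immediately. Pods that tolerate it with tolerationSeconds set are evicted once that time elapses.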
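
To make the NoExecute timing concrete, here is a small sketch that reuses the maintenance and gpu taints from the commands above. The Pod name and the 300-second grace period are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: toleration-demo        # illustrative name
spec:
  tolerations:
  # Tolerate maintenance=true:NoExecute for 5 minutes, then be evicted.
  # Without tolerationSeconds the Pod would stay on the node indefinitely.
  - key: "maintenance"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
    tolerationSeconds: 300
  # Exists matches any value of the key, so no value is given.
  - key: "gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: app
    image: nginx:1.24
```

tolerationSeconds only has meaning for the NoExecute effect; on a NoSchedule toleration it would serve no purpose.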

topologySpreadConstraints (fine‑grained topology distribution)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-spread
spec:
  replicas: 9
  selector:
    matchLabels:
      app: app-spread
  template:
    metadata:
      labels:
        app: app-spread
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: app-spread
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: app-spread
      containers:
      - name: app
        image: nginx:1.24
        resources:
          requests:
            cpu: 100m
            memory: 128Mi

Parameter notes:

maxSkew: maximum allowed difference in matching Pod count between topology domains; 1 enforces near-perfect balance.

whenUnsatisfiable: DoNotSchedule (hard) or ScheduleAnyway (soft).

labelSelector: should match the Pod's own labels; otherwise the replicas being spread are not counted and the constraint does not balance them.
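
A small worked example may help with maxSkew. The Pod counts in the comments are made up for illustration. For each candidate zone, the scheduler compares the count that zone would have after placement with the current minimum across eligible zones.

```yaml
# Matching Pods per zone before the next replica is placed (illustrative numbers):
#   zone-a: 3   zone-b: 3   zone-c: 2      -> current minimum is 2
#
# Candidate zone-a: (3 + 1) - 2 = 2  -> exceeds maxSkew 1, filtered out
# Candidate zone-b: (3 + 1) - 2 = 2  -> exceeds maxSkew 1, filtered out
# Candidate zone-c: (2 + 1) - 2 = 1  -> within maxSkew 1, allowed
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule   # with ScheduleAnyway the zones would only be ranked, not filtered
  labelSelector:
    matchLabels:
      app: app-spread
```

With these numbers the next replica can only go to zone-c, which brings all three zones to 3.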

Best practices and caveats

Performance optimisation

Reduce podAffinity usage: in a 500‑node cluster a pod with podAffinity can increase scheduling time from 5 ms to 200 ms. Prefer topologySpreadConstraints where possible.

Set realistic requests based on the 95th‑percentile of actual usage; the scheduler bases decisions on requests, not limits.

Use the Descheduler to evict pods that violate current policies after node scaling or migrations.

Security hardening

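Restrict direct nodeName usage, which bypasses scheduler checks. RBAC cannot restrict individual Pod fields, so enforce this with an admission policy or webhook.

Control creation of high-priority PriorityClass objects through RBAC so that only privileged roles can create them, and limit which namespaces may use them, for example with a ResourceQuota scoped by PriorityClass.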

Protect taint modifications with admission webhooks to avoid accidental removal of critical taints.

High‑availability patterns

Deploy at least three replicas of core services, enforce hard podAntiAffinity across nodes, and add topologySpreadConstraints across zones.

Use a PodDisruptionBudget (e.g., minAvailable: 2) to guarantee service continuity during node maintenance.

Store scheduling manifests (PriorityClass, PDB, node labels/taints) in a GitOps repository and apply with ArgoCD or FluxCD.
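
The first two patterns above could be combined roughly as follows. This is a sketch under assumptions: the name core-api, the image, and the resource figures are placeholders, and it presumes at least three schedulable nodes with zone labels.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-api               # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: core-api
  template:
    metadata:
      labels:
        app: core-api
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never place two replicas on the same node.
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: core-api
            topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      # Keep replica counts per zone within 1 of each other.
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: core-api
      containers:
      - name: api
        image: nginx:1.24
        resources:
          requests:
            cpu: 200m
            memory: 256Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: core-api-pdb
spec:
  minAvailable: 2              # voluntary evictions (e.g. kubectl drain) must leave 2 Pods running
  selector:
    matchLabels:
      app: core-api
```

With three replicas and minAvailable: 2, a drain can evict only one replica at a time; the next eviction waits until a replacement is Ready.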

Common pitfalls

Multiple nodeSelectorTerms are OR‑ed; multiple matchExpressions inside a term are AND‑ed.

Hard podAntiAffinity with topologyKey=kubernetes.io/hostname limits replica count to the number of nodes.

Incorrect labelSelector in topologySpreadConstraints leads to uneven distribution.
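
The OR/AND pitfall is easiest to see side by side. The fragment below is a sketch that reuses the zone and disktype labels from step 1; it belongs under a Pod template's spec.

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Term 1: zone-a AND ssd (both expressions must hold)
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - zone-a
        - key: disktype
          operator: In
          values:
          - ssd
      # Term 2 (OR-ed with term 1): any node in zone-b, regardless of disk type
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - zone-b
```

With the labels applied in step 1, k8s-worker-01 matches term 1, k8s-worker-02 matches term 2, and k8s-worker-03 (zone-c, hdd) matches neither. Moving the zone-b expression into term 1 would instead require a node to be in both zones at once, which no node can satisfy.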

Troubleshooting and monitoring

Debugging steps

Check scheduler logs:

kubectl logs -n kube-system -l component=kube-scheduler --tail=100

Inspect Pod events:

kubectl describe pod <pod> | grep -A 10 Events

Verify node labels:

kubectl get nodes --show-labels

Validate manifests with dry-run:

kubectl apply --dry-run=server -f <file>.yaml

Key metrics

Scheduler latency (P99): scheduler_scheduling_algorithm_duration_seconds

Pending Pods count: scheduler_pending_pods

Preemption events: scheduler_preemption_attempts_total (attempts) and scheduler_preemption_victims (victims per attempt, a histogram)

Node CPU/Memory utilisation:

kubectl top nodes

Prometheus alerts (example)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: scheduler-alerts
spec:
  groups:
  - name: kube-scheduler
    rules:
    - alert: SchedulerHighLatency
      expr: histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket[5m])) by (le)) > 0.5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Scheduler P99 latency exceeds 500ms"
    - alert: PodsPendingTooLong
      expr: sum(scheduler_pending_pods{queue="active"}) > 10
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 10 pods pending for over 5 minutes"
    - alert: SchedulerUnhealthy
      expr: absent(up{job="kube-scheduler"} == 1)
      for: 3m
      labels:
        severity: critical
      annotations:
        summary: "kube-scheduler is not running"
    - alert: NodeHighAllocation
      # Share of allocatable CPU already claimed by Pod requests, per node
      expr: sum by (node) (kube_pod_container_resource_requests{resource="cpu"}) / sum by (node) (kube_node_status_allocatable{resource="cpu"}) > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Node {{ $labels.node }} CPU allocation exceeds 85%"
    - alert: FrequentPreemption
      # scheduler_preemption_victims is a histogram; count attempts with the counter instead
      expr: increase(scheduler_preemption_attempts_total[1h]) > 5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "More than 5 preemption events in the last hour"

Backup and restore

Backup script (example)

#!/bin/bash
set -euo pipefail
BACKUP_DIR="/data/scheduling-backup/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup PriorityClasses
kubectl get priorityclass -o yaml > "$BACKUP_DIR/priorityclasses.yaml"

# Backup PodDisruptionBudgets
kubectl get pdb -A -o yaml > "$BACKUP_DIR/pdbs.yaml"

# Backup node labels and taints
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  kubectl get node "$node" -o jsonpath='{.metadata.labels}' > "$BACKUP_DIR/${node}-labels.json"
  kubectl get node "$node" -o jsonpath='{.spec.taints}' > "$BACKUP_DIR/${node}-taints.json"
done

# Optional: backup Kyverno policies if used
kubectl get clusterpolicy -o yaml > "$BACKUP_DIR/kyverno-policies.yaml" 2>/dev/null || true

echo "[$(date)] Scheduling config backup completed: $BACKUP_DIR"

Restore procedure

Pause deployments and rollouts that create new Pods, so that nothing is scheduled against a half-restored configuration.

Apply saved manifests:

kubectl apply -f ${BACKUP_DIR}/priorityclasses.yaml
kubectl apply -f ${BACKUP_DIR}/pdbs.yaml

Verify PriorityClasses: kubectl get priorityclass.

Re‑apply node labels and taints from the JSON files.

Deploy a test Pod to confirm that scheduling policies behave as expected.

Conclusion

Key takeaways

Use nodeSelector for simple hard constraints, nodeAffinity for flexible soft/hard constraints, and combine with taints/tolerations for node‑level isolation.

Prefer topologySpreadConstraints over heavy podAffinity in large clusters; it provides deterministic, low‑overhead distribution.

PriorityClass preemption must be used cautiously; batch jobs should set preemptionPolicy: Never to avoid disrupting critical services.

Monitor scheduler latency, pending Pods, and node resource utilisation to detect bottlenecks early.
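
The preemption takeaway can be sketched as a pair of PriorityClass objects plus a Pod that uses the non-preempting one. All names, priority values, and the busybox workload are illustrative assumptions, not values from a real cluster.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service       # illustrative name
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority   # the default: may evict lower-priority Pods
description: "Latency-sensitive services that may preempt lower-priority Pods."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-no-preempt       # illustrative name
value: 1000
globalDefault: false
preemptionPolicy: Never        # queued ahead of lower priorities, but never evicts running Pods
description: "Batch jobs that wait for free capacity instead of preempting."
---
apiVersion: v1
kind: Pod
metadata:
  name: nightly-report         # illustrative name
spec:
  priorityClassName: batch-no-preempt
  restartPolicy: Never
  containers:
  - name: report
    image: busybox:1.36
    command: ["sh", "-c", "echo generating report && sleep 30"]
    resources:
      requests:
        cpu: 100m
        memory: 64Mi
```

A Pod in batch-no-preempt can still be preempted by a critical-service Pod; preemptionPolicy: Never only stops it from evicting others.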

Further reading

Kubernetes Scheduler documentation – detailed description of scheduling mechanisms.

kube‑scheduler source code – for understanding algorithm implementation.

Descheduler project – tool for pod rebalancing.

Volcano project – batch scheduler with gang scheduling support.

