Mastering kube-scheduler: How Kubernetes Schedules Pods Efficiently
This article explains how kube-scheduler places Pods by applying pre-selection (predicate) and scoring (priority) strategies. It discusses the scheduler's goals of fairness, efficient resource use, performance, and flexibility, surveys the common predicate and priority algorithms, and walks through practical scenarios with YAML and command-line examples.
Introduction
Scheduling is a critical step in container orchestration. The kube-scheduler component of Kubernetes ensures that Pods are placed on suitable nodes while meeting various production constraints such as dedicated machines for certain services or disaster‑recovery distribution across nodes.
kube-scheduler acts as a caretaker for Pods, providing scheduling services based on mechanisms such as resource-fair scheduling, binding Pods to specific nodes, and co-locating Pods that communicate frequently.
The scheduler must achieve several goals:
Fairness – each node should have a chance to receive resources.
Efficient resource utilization – maximize CPU, memory, etc., across the cluster.
Performance – quickly schedule large numbers of Pods even as the cluster scales.
Flexibility – allow users to control scheduling policies, support multiple schedulers, and enable custom schedulers.
To meet these goals, kube-scheduler evaluates node resources, load, data locality, and other factors, influencing the overall availability and performance of a Kubernetes cluster, especially when thousands of nodes are involved.
Scheduling Process
The core task of kube-scheduler is to bind a Pod to the most appropriate node. The process consists of two stages: Predicates (pre-selection) and Priorities (scoring).
Predicates (Pre‑selection)
Input: all nodes. Output: nodes that satisfy pre‑selection conditions. Nodes that fail conditions such as insufficient resources or mismatched labels are filtered out.
Priorities (Scoring)
Input: nodes that passed the predicate stage. Each node receives a score based on priority functions; the node with the highest total score is selected.
In simple terms, scheduling answers two questions: 1) Which nodes are candidates? 2) Which candidate is the best?
If no node satisfies the predicates, the Pod remains in Pending state and the scheduler keeps retrying.
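When a Pod stays in Pending, the scheduler records the reason as events on the Pod object. A quick way to see which predicate failed is kubectl describe; the Pod name below is a placeholder and the exact event wording varies by Kubernetes version:
kubectl describe pod redis-master-xxxx
The Events section then typically contains a FailedScheduling entry such as "0/3 nodes are available: 3 Insufficient cpu.", which points directly at the condition that filtered out the nodes.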
Predicate Strategies
kube-scheduler supports many predicate algorithms. Common ones include:
Volume count limits : MaxEBSVolumeCount, MaxGCEPDVolumeCount, MaxAzureDiskVolumeCount – ensure the number of attached volumes does not exceed a configured maximum.
Resource pressure checks : CheckNodeMemoryPressure, CheckNodeDiskPressure – prevent scheduling to nodes under memory or disk pressure.
Volume conflict checks : NoDiskConflict, NoVolumeZoneConflict, NoVolumeNodeConflict – reject nodes where the Pod's volumes would conflict with volumes already in use on the node, or would violate the volume's zone or node constraints.
Constraint checks : MatchNodeSelector, MatchInterPodAffinity, PodToleratesNodeTaints – verify node labels, pod affinity, and taint‑toleration relationships.
Fit checks : PodFitsResources, PodFitsHostPorts, PodFitsHost – ensure sufficient CPU/memory, free host ports, and a matching node name (see the sample spec after this list).
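To make the fit checks concrete, the following minimal sketch (not from the original workloads; the name is hypothetical) shows the fields that PodFitsResources and PodFitsHostPorts read: the declared resource requests are compared against each node's unreserved capacity, and the hostPort must still be free on the node.
apiVersion: v1
kind: Pod
metadata:
  name: fit-check-demo        # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: nginx
    ports:
    - containerPort: 80
      hostPort: 8080          # PodFitsHostPorts: port 8080 must be free on the node
    resources:
      requests:
        cpu: 500m             # PodFitsResources: the node needs 0.5 CPU and
        memory: 256Mi         # 256Mi of memory not already requested by other Pods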
Priority Strategies
During the priority phase, each node receives a score from 0 to 10 from every priority function; the final score is the weighted sum across all functions (a worked example follows the list of strategies below).
LeastRequestedPriority (default weight 1): prefers nodes with the smallest amount of requested CPU and memory.
BalancedResourceAllocation (default weight 1): gives higher weight to nodes where CPU and memory usage are balanced.
SelectorSpreadPriority (default weight 1): spreads Pods of the same Service or ReplicationController across different nodes or zones.
NodeAffinityPriority (default weight 1): prefers nodes whose labels match the Pod's affinity requirements, similar to the MatchNodeSelector predicate.
InterPodAffinityPriority (default weight 1): adds weight based on existing Pods' affinity on a node; the node with the highest total weight wins.
NodePreferAvoidPodsPriority (default weight 10000): heavily penalizes nodes that should avoid certain Pods, effectively overriding other scores.
TaintTolerationPriority (default weight 1): scores nodes based on how many taints a Pod can tolerate.
ImageLocalityPriority (default weight 1): prefers nodes that already have the required container images cached.
EqualPriority (default weight 1): gives all nodes equal weight, mainly for testing.
MostRequestedPriority (default weight 1): prefers nodes with the highest resource utilization, useful for scaling scenarios.
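To make the weighted-sum scoring concrete, consider a hypothetical comparison of two candidate nodes using made-up per-function scores and the default weight of 1:
Node A: LeastRequestedPriority 8 + BalancedResourceAllocation 6 + ImageLocalityPriority 10 = 24
Node B: LeastRequestedPriority 9 + BalancedResourceAllocation 7 + ImageLocalityPriority 0 = 16
Node A wins because the image is already cached there, even though Node B has slightly more free resources. Weights shift this balance: raising the weight of LeastRequestedPriority to 10 would give Node A 8×10 + 6 + 10 = 96 and Node B 9×10 + 7 + 0 = 97, so Node B would be selected instead.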
Case Studies
Scenario 1 – Schedule to SSD‑backed Nodes
Label an SSD-backed node and add a matching nodeSelector to the Redis Deployment; the MatchNodeSelector predicate then filters out all other nodes.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: redis-master
  labels:
    name: redis
  namespace: default
spec:
  replicas: 4
  template:
    metadata:
      labels:
        name: redis
    spec:
      containers:
      - name: master
        image: 172.16.1.41:5000/redis:3.0.5
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
Label the node:
kubectl label node transwarp disk=ssd
Patch the Deployment to add the node selector:
kubectl patch deploy redis-master -p '{"spec":{"template":{"spec":{"nodeSelector":{"disk":"ssd"}}}}}'
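Assuming the label and patch above have been applied, placement can be verified with ordinary kubectl queries (node and Pod names will differ in practice):
kubectl get nodes -l disk=ssd
kubectl get pods -l name=redis -o wide
The NODE column of the second command should list only nodes that carry the disk=ssd label; if no node has the label, the Pods stay in Pending.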
Scenario 2 – Restrict to CentOS Nodes
Use node affinity with a required selector on the label key operation-system, restricted to the values centos-7.2 and centos-7.3.
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    - key: operation-system
      operator: In
      values:
      - centos-7.2
      - centos-7.3
Optionally add a preferred rule to favor centos-7.2 nodes.
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
  preference:
    matchExpressions:
    - key: another-node-label-key-system
      operator: In
      values:
      - another-node-label-value-centos7.2
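Both stanzas are fragments: in a full manifest they sit under spec.affinity.nodeAffinity of the Pod (or of a Deployment's Pod template), and the referenced label must actually exist on the nodes. A minimal sketch, assuming a hypothetical node named node-centos72:
kubectl label node node-centos72 operation-system=centos-7.2

apiVersion: v1
kind: Pod
metadata:
  name: centos-only           # hypothetical Pod name for illustration
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: operation-system
            operator: In
            values:
            - centos-7.2
            - centos-7.3
  containers:
  - name: app
    image: nginx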
Scenario 3 – Co‑locate Pods in the Same Zone
Use InterPodAffinityPriority so that an API service and its authentication service run in the same availability zone.
First, a "flag" Pod carries the label that the affinity rule will reference:
apiVersion: v1
kind: Pod
metadata:
  name: pod-flag
  labels:
    security: "cloud"
spec:
  containers:
  - name: nginx
    image: nginx
The authentication service YAML adds required pod affinity and a topology key:
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: security
            operator: In
            values:
            - cloud
        topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:                 # a container section is required for a valid Pod; nginx is a placeholder image
  - name: with-pod-affinity
    image: nginx
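The zone topology key only has an effect if the nodes carry that label; on clusters without a cloud provider it may have to be set by hand. The label can be checked and, where missing, added roughly as follows (node1 and zone-a are placeholders):
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone
kubectl label node node1 failure-domain.beta.kubernetes.io/zone=zone-a
Once both Pods are running, kubectl get pods -o wide should show them on nodes in the same zone.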
Scenario 4 – Isolate GPU‑Intensive Workloads
Mark GPU nodes with a taint and add a matching toleration to Pods that require GPU resources.
Taint the GPU node:
kubectl taint nodes node1 gpu=true:NoSchedule
The Pod spec then carries a matching toleration:
apiVersion: v1
kind: Pod
metadata:
  generateName: redis-
  labels:
    app: redis
  namespace: default
spec:
  containers:
  - image: 172.16.1.41:5000/redis
    imagePullPolicy: Always
    name: redis
  schedulerName: default-scheduler
  tolerations:
  - effect: NoSchedule
    key: gpu
    operator: Equal
    value: "true"
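To check that the taint is in place, or to remove it when the node should accept general workloads again, commands along these lines work (node1 as above; the trailing minus removes the taint):
kubectl describe node node1 | grep Taints
kubectl taint nodes node1 gpu=true:NoSchedule-
While the taint is present, only Pods whose tolerations match gpu=true:NoSchedule, such as the Redis Pod above, can be scheduled onto node1.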
Conclusion and Outlook
kube-scheduler already offers a rich set of scheduling strategies that satisfy most common needs. Its plugin architecture enables users to customize or extend the scheduler for special resources such as local volumes or GPUs. Future improvements include caching to reduce repeated calculations in the predicate and priority phases, scheduler-extender extensions for more nuanced resource handling, and integration with custom-metrics APIs for real-time scheduling decisions.