Mastering Kubernetes Scheduling: From Basics to Performance Tuning
This article provides a comprehensive guide to the Kubernetes Scheduler, covering its architecture, extension points, API, node affinity, taints and tolerations, scheduling bottlenecks, priority classes, and performance-tuning techniques to optimize Pod placement in large clusters.
The Kubernetes Scheduler is a core component of the Kubernetes control plane that assigns Pods to nodes while balancing resource utilization across the cluster.
This article offers an in‑depth look at the Scheduler, starting with a general overview of scheduling, affinity, and taint‑based eviction, then discussing potential bottlenecks and production issues, and finally exploring how to fine‑tune Scheduler parameters for a given cluster.
Scheduling Overview
Scheduling is the process of assigning a Pod to a suitable node. The Scheduler watches for newly created Pods that have no node assigned and selects the best node for each one based on Kubernetes scheduling policies and configuration options. The simplest, least flexible approach is to set nodeName directly in the PodSpec, which bypasses the Scheduler entirely:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: node-01
Using nodeName has many limitations (the node name may not be known in advance in a cloud environment, the node may lack sufficient resources, or it may have network issues), so it is generally avoided outside of testing.
To run a Pod on a specific set of nodes, use nodeSelector with key‑value pairs:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd
This tells the Scheduler to place the Pod on a node labeled disktype=ssd.
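The disktype=ssd label itself can be applied with kubectl; the node name below is only an illustration:
kubectl label nodes node-01 disktype=ssd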
Node Affinity
Node affinity defines constraints on which nodes a Pod can be scheduled onto, supporting both hard (required) and soft (preferred) rules. Four affinity types are defined:
requiredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingRequiredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingRequiredDuringExecution
Required rules must be satisfied for a Pod to be scheduled; preferred rules are soft and only influence scoring. The "DuringScheduling" part applies when the Pod is first assigned to a node, while the "DuringExecution" part describes what happens if node labels change after the Pod is already running. Only the IgnoredDuringExecution variants are currently implemented; the RequiredDuringExecution variants, which would affect already-running Pods, are still planned.
Example of node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-east
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1
            - us-east-2
  containers:
  - name: nginx
    image: nginx
Taint and Toleration
Nodes can be tainted to repel Pods unless the Pod has a matching toleration. Example taint command:
kubectl taint nodes node1 test-environment=true:NoSchedule
Corresponding toleration in a PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "test-environment"
    operator: "Exists"
    effect: "NoSchedule"
Scheduling Bottlenecks
Even after a Pod is placed, resource usage and node conditions can cause issues such as “Noisy Neighbor” problems, system‑process resource exhaustion, and the need for preemption or eviction.
Noisy Neighbor (Resource Requests and Limits)
Setting resource requests and limits prevents one container from monopolizing CPU or memory, ensuring fair sharing in multi-tenant environments. Requests are what the Scheduler uses to find a node with enough free capacity, while limits are enforced at runtime and cap what the container can actually consume.
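A minimal sketch of a Pod that declares both requests and limits (the values are illustrative only):
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:        # used by the Scheduler to pick a node with enough capacity
        cpu: "250m"
        memory: "128Mi"
      limits:          # enforced at runtime; the container cannot exceed these
        cpu: "500m"
        memory: "256Mi"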
System Process Resource Exhaustion
Nodes also run their own operating-system processes (sshd, systemd, and so on); the kubelet's --system-reserved flag reserves CPU and memory for these system daemons so that Pods cannot starve them.
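The same reservation can also be expressed in the kubelet configuration file. A minimal sketch, with purely illustrative amounts:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# resources set aside for OS daemons; they are subtracted from the node's allocatable capacity
systemReserved:
  cpu: "500m"
  memory: "1Gi"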
Preemption and Priority Classes
When the Scheduler cannot find a fit, it may preempt lower‑priority Pods. Define a PriorityClass to control this behavior:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 100000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not preempt other pods."
Reference the class in a PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority-nonpreempting
Scheduler Framework
The Scheduler uses a pluggable framework with extension points such as QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, and PostBind. Plugins implement the Plugin API and are registered by name.
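Because plugins are registered by name, a scheduler profile can enable or disable them per extension point. The following is a minimal sketch assuming a newer kubescheduler.config.k8s.io API version than the v1alpha1 example shown later; the built-in NodeResourcesBalancedAllocation Score plugin is used purely as an example:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      # disable one built-in Score plugin by name for this profile
      - name: NodeResourcesBalancedAllocation
Every plugin, regardless of extension point, implements the framework's core Plugin interface; the QueueSort extension point adds one more method: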
// Plugin is the parent type for all scheduling framework plugins.
type Plugin interface {
    Name() string
}

// QueueSortPlugin is an interface that must be implemented by "QueueSort" plugins.
// These plugins are used to sort pods in the scheduling queue.
type QueueSortPlugin interface {
    Plugin
    Less(*QueuedPodInfo, *QueuedPodInfo) bool
}
Scheduler Performance Tuning
In large clusters, checking and scoring every node for every Pod can be expensive. The percentageOfNodesToScore setting limits the share of nodes the Scheduler examines: once it has found that percentage of feasible nodes, it stops searching and scores only those candidates. By default the value scales linearly from 50% for a 100-node cluster down to 10% for a 5,000-node cluster, with a floor of 5%.
Example configuration to set the percentage manually:
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
percentageOfNodesToScore: 50
Summary
The article covers most aspects of Kubernetes scheduling, from Pod and node configuration (nodeSelector, affinity rules, taints, tolerations) to the Scheduler framework, extension points, API, resource‑related bottlenecks, and performance‑tuning settings. Understanding these details and configuring the Scheduler appropriately is essential for reliable production‑grade Kubernetes deployments.