Mastering Kubernetes Scheduling: From Basics to Performance Tuning
This article provides a comprehensive guide to the Kubernetes Scheduler, covering its architecture, extension points, API, node affinity, taints and tolerations, scheduling bottlenecks, priority classes, and performance-tuning techniques to optimize Pod placement in large clusters.
The Kubernetes Scheduler is a core component of the Kubernetes control plane that assigns Pods to nodes while balancing resource utilization across the cluster.
This article offers an in‑depth look at the Scheduler, starting with a general overview of scheduling, affinity, and taint‑based eviction, then discussing potential bottlenecks and production issues, and finally exploring how to fine‑tune Scheduler parameters for a given cluster.
Scheduling Overview
Scheduling is the process of assigning a Pod to a suitable node. The Scheduler watches for newly created Pods that have no node assigned and selects the best node for each one based on Kubernetes scheduling policies and configuration options. The simplest, least flexible approach is to set nodeName directly in the PodSpec, which bypasses the Scheduler entirely:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: node-01
Using nodeName has many limitations (the node name may not be known in advance in a cloud environment, the node may lack sufficient resources, or it may have network issues), so it is generally avoided outside of testing.
To run a Pod on a specific set of nodes, use nodeSelector with key‑value pairs:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    disktype: ssd
This tells the Scheduler to place the Pod on a node labeled disktype=ssd.
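The disktype=ssd label itself can be applied with kubectl; the node name below is only an illustration:
kubectl label nodes node-01 disktype=ssd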
Node Affinity
Node affinity defines constraints on which nodes a Pod can be scheduled onto, supporting both hard (required) and soft (preferred) rules. Four affinity types are defined:
requiredDuringSchedulingIgnoredDuringExecution
requiredDuringSchedulingRequiredDuringExecution
preferredDuringSchedulingIgnoredDuringExecution
preferredDuringSchedulingRequiredDuringExecution
Required rules must be satisfied for a Pod to be scheduled; preferred rules are soft and only influence scoring. The "DuringScheduling" part applies when the Pod is first assigned to a node, while the "DuringExecution" part describes what happens if node labels change after the Pod is already running. Only the IgnoredDuringExecution variants are currently implemented; the RequiredDuringExecution variants, which would affect already-running Pods, are still planned.
Example of node affinity:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-east
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1
            - us-east-2
  containers:
  - name: nginx
    image: nginx
Taint and Toleration
Nodes can be tainted to repel Pods unless the Pod has a matching toleration. Example taint command:
kubectl taint nodes node1 test-environment=true:NoSchedule
Corresponding toleration in a PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "test-environment"
    operator: "Exists"
    effect: "NoSchedule"
Scheduling Bottlenecks
Even after a Pod is placed, resource usage and node conditions can cause issues such as “Noisy Neighbor” problems, system‑process resource exhaustion, and the need for preemption or eviction.
Noisy Neighbor (Resource Requests and Limits)
Setting resource requests and limits prevents one container from monopolizing CPU or memory, ensuring fair sharing in multi-tenant environments. Requests are what the Scheduler uses to find a node with enough free capacity, while limits are enforced at runtime and cap what the container can actually consume.
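A minimal sketch of a Pod that declares both requests and limits (the values are illustrative only):
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:        # used by the Scheduler to pick a node with enough capacity
        cpu: "250m"
        memory: "128Mi"
      limits:          # enforced at runtime; the container cannot exceed these
        cpu: "500m"
        memory: "256Mi"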
System Process Resource Exhaustion
Nodes also run their own operating-system processes (sshd, systemd, and so on); the kubelet's --system-reserved flag reserves CPU and memory for these system daemons so that Pods cannot starve them.
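The same reservation can also be expressed in the kubelet configuration file. A minimal sketch, with purely illustrative amounts:
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# resources set aside for OS daemons; they are subtracted from the node's allocatable capacity
systemReserved:
  cpu: "500m"
  memory: "1Gi"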
Preemption and Priority Classes
When the Scheduler cannot find a fit, it may preempt lower‑priority Pods. Define a PriorityClass to control this behavior:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 100000
preemptionPolicy: Never
globalDefault: false
description: "This priority class will not preempt other pods."
Reference the class in a PodSpec:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  priorityClassName: high-priority-nonpreempting
Scheduler Framework
The Scheduler uses a pluggable framework with extension points such as QueueSort, PreFilter, Filter, PostFilter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, and PostBind. Plugins implement the Plugin API and are registered by name.
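Because plugins are registered by name, a scheduler profile can enable or disable them per extension point. The following is a minimal sketch assuming a newer kubescheduler.config.k8s.io API version than the v1alpha1 example shown later; the built-in NodeResourcesBalancedAllocation Score plugin is used purely as an example:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      # disable one built-in Score plugin by name for this profile
      - name: NodeResourcesBalancedAllocation
Every plugin, regardless of extension point, implements the framework's core Plugin interface; the QueueSort extension point adds one more method: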
// Plugin is the parent type for all scheduling framework plugins.
type Plugin interface {
    Name() string
}

// QueueSortPlugin is an interface that must be implemented by "QueueSort" plugins.
// These plugins are used to sort pods in the scheduling queue.
type QueueSortPlugin interface {
    Plugin
    Less(*QueuedPodInfo, *QueuedPodInfo) bool
}
Scheduler Performance Tuning
In large clusters, checking and scoring every node for every Pod can be expensive. The percentageOfNodesToScore setting limits the share of nodes the Scheduler examines: once it has found that percentage of feasible nodes, it stops searching and scores only those candidates. By default the value scales linearly from 50% for a 100-node cluster down to 10% for a 5,000-node cluster, with a floor of 5%.
Example configuration to set the percentage manually:
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
algorithmSource:
  provider: DefaultProvider
percentageOfNodesToScore: 50
Summary
The article covers most aspects of Kubernetes scheduling, from Pod and node configuration (nodeSelector, affinity rules, taints, tolerations) to the Scheduler framework, extension points, API, resource‑related bottlenecks, and performance‑tuning settings. Understanding these details and configuring the Scheduler appropriately is essential for reliable production‑grade Kubernetes deployments.