How Kubernetes Schedules Pods: Deep Dive into Scheduling, QoS, and Resources
This article walks through the complete Kubernetes pod scheduling workflow and explains how resource requests, limits, and QoS classes influence placement. It then covers advanced features such as affinity, taints and tolerations, resource quotas, and priority‑based preemption, and shows how to configure each mechanism for optimal cluster utilization.
Kubernetes Scheduling Process
When a pod manifest (YAML) is submitted to the kube-apiserver, the request first passes through any configured admission webhooks for mutation and validation. After admission succeeds, the API server persists a pod object with an empty spec.nodeName and a Pending phase. The kube-scheduler watches for pods with an empty nodeName, treats them as unscheduled, and runs its filtering and scoring phases to select the most suitable node. Once a node is chosen, the scheduler binds the pod by writing the node's name into spec.nodeName. The kubelet on that node, which watches for pods bound to it, then creates the required containers, storage, and network resources, finally setting the pod status to Running.
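The scheduler only considers pods whose spec.nodeName is empty; as a quick illustration of that contract, a manifest that pre-sets nodeName (node name hypothetical) bypasses the scheduler entirely and is picked up directly by that node's kubelet:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bypass-scheduler-demo   # hypothetical name
spec:
  nodeName: worker-1            # pre-bound: the scheduler never handles this pod
  containers:
  - name: app
    image: nginx:1.25
```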
Pod Resource Requirements and QoS
Pod resources are defined in the spec.containers[].resources section and consist of two maps: requests (minimum guaranteed resources) and limits (maximum allowed resources). Four resource types are supported: CPU, memory, ephemeral-storage, and extended resources such as GPUs. CPU can be expressed as whole cores (e.g., 2) or millicores (e.g., 2000m, which equals two cores); memory and storage use binary units (e.g., 1Gi).
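For illustration, here is a minimal pod manifest (all names are hypothetical) that requests half a core and 256 MiB of memory but may burst up to one core and 512 MiB:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-demo              # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.25         # any image works; nginx is just an example
    resources:
      requests:
        cpu: 500m             # minimum guaranteed: half a core
        memory: 256Mi
      limits:
        cpu: "1"              # hard cap: one full core
        memory: 512Mi
```

Because requests and limits differ, this pod falls into the Burstable QoS class described below.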
Based on the relationship between requests and limits, Kubernetes assigns one of three QoS classes:
Guaranteed: requests equal limits for both CPU and memory in every container; the pod receives the highest scheduling priority and a fixed OOMScore of -998.
Burstable: requests and limits are set but differ; the pod gets a moderate OOMScore (2-999) computed from its memory request relative to node capacity.
BestEffort: no requests or limits specified; the pod receives the highest OOMScore (1000) and is the first candidate for eviction under memory pressure.
When the kubelet runs with cpu-manager-policy=static and a Guaranteed pod requests an integer number of CPUs, its containers can be bound to exclusive CPU cores (CPU pinning), as in the sketch below.
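A minimal sketch of a pod eligible for pinning, assuming the node's kubelet was started with cpu-manager-policy=static (names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-demo           # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"              # integer core count -> eligible for exclusive cores
        memory: 1Gi
      limits:
        cpu: "2"              # equal to requests -> Guaranteed QoS
        memory: 1Gi
```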
Resource Quota
To prevent a single namespace or user from exhausting cluster resources, Kubernetes provides ResourceQuota. A quota object defines hard limits for resources such as CPU, memory, and pod count, optionally restricted by quota scopes (e.g., NotBestEffort, Terminating). When creating a pod would exceed the quota, the API server rejects the request with a 403 Forbidden error and an "exceeded quota" message.
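A sketch of a quota that caps a namespace's aggregate compute resources and pod count (name and namespace are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # hypothetical name
  namespace: team-a           # hypothetical namespace
spec:
  scopes: ["NotBestEffort"]   # only non-BestEffort pods count against this quota
  hard:
    requests.cpu: "10"        # sum of all CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # at most 50 matching pods
```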
Pod‑Pod Affinity and Anti‑Affinity
Kubernetes supports two affinity mechanisms:
PodAffinity: schedules a pod into the same topology domain (defined by topologyKey, typically a single node) as pods matching the specified label selectors. It can be requiredDuringSchedulingIgnoredDuringExecution (hard) or preferredDuringSchedulingIgnoredDuringExecution (soft, with a weight).
PodAntiAffinity: keeps a pod out of topology domains that run pods matching the selector, also supporting hard and soft variants.
Affinity rules use operators such as In, NotIn, Exists, and DoesNotExist to express complex placement policies; both mechanisms appear in the sketch below.
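The following sketch (labels and names are hypothetical) co-locates a cache pod with web pods via a hard podAffinity rule, while a soft podAntiAffinity rule spreads cache replicas across nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-demo                    # hypothetical name
  labels:
    app: cache
spec:
  affinity:
    podAffinity:                      # hard: run where an app=web pod already runs
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["web"]}
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:                  # soft: avoid nodes already running app=cache
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["cache"]}
          topologyKey: kubernetes.io/hostname
  containers:
  - name: cache
    image: redis:7
```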
Pod‑Node Affinity, NodeSelector, and Taints/Tolerations
NodeSelector is a simple key‑value match that forces a pod onto nodes with the specified label. It only supports hard requirements.
NodeAffinity extends this with both required and preferred rules and richer operators (Gt, Lt) for numeric comparisons on label values.
Taints are applied to nodes to repel pods; each taint has a key, value, and effect (NoSchedule, PreferNoSchedule, NoExecute). Pods declare matching tolerations (with key, value, effect, and an operator of Exists or Equal) to be allowed onto tainted nodes. The sketch below combines a node affinity rule with a toleration.
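Assuming a node has been labeled disktype=ssd and tainted with dedicated=gpu:NoSchedule (both hypothetical), a pod could target it like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job-demo                  # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard rule
        nodeSelectorTerms:
        - matchExpressions:
          - {key: disktype, operator: In, values: ["ssd"]}
      preferredDuringSchedulingIgnoredDuringExecution:  # soft, numeric comparison
      - weight: 50
        preference:
          matchExpressions:
          - {key: cpu-cores, operator: Gt, values: ["8"]}  # hypothetical label
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule                # tolerates dedicated=gpu:NoSchedule
  containers:
  - name: job
    image: busybox:1.36
    command: ["sleep", "3600"]
```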
Priority Scheduling and Preemption
Kubernetes can assign a PriorityClass to pods. The scheduler first orders pods by priority; higher‑priority pods are considered before lower‑priority ones. If a high‑priority pod cannot be scheduled due to insufficient resources, the scheduler may preempt lower‑priority pods.
Preemption follows these steps:
Filter nodes that satisfy the pod’s constraints.
Simulate scheduling by temporarily removing lower‑priority pods.
Run ProcessPreemptionWithExtenders (optional custom logic).
Pick a node using a series of tie‑breakers: minimize PodDisruptionBudget violations, minimize the priority of the highest‑priority victim, minimize the sum of victim priorities, minimize the number of victims, and finally prefer the node whose highest‑priority victim started most recently.
Delete the selected lower‑priority pods; the chosen node is recorded in the high‑priority pod's nominatedNodeName, and the pod is scheduled there once the victims terminate.
Built‑in priority classes include system-cluster-critical (value 2000000000) and system-node-critical (value 2000001000). User‑defined classes must use values no greater than 1000000000 (one billion).
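A sketch of a user‑defined priority class and a pod that references it (names and the value are hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-demo            # hypothetical name
value: 1000000                        # user-defined values must not exceed one billion
globalDefault: false
description: "For latency-critical workloads"
preemptionPolicy: PreemptLowerPriority  # the default; set Never to opt out of preempting
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app-demo             # hypothetical name
spec:
  priorityClassName: high-priority-demo  # links the pod to the class above
  containers:
  - name: app
    image: nginx:1.25
```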
Key Takeaways
Define accurate requests and limits for CPU, memory, storage, and GPUs; the CPU and memory settings determine the pod's QoS class.
Use ResourceQuota to enforce namespace‑level resource caps.
Apply pod‑pod affinity/anti‑affinity and pod‑node affinity or nodeSelector to control placement.
Leverage taints and tolerations to isolate special nodes.
Create custom PriorityClass objects and assign them via priorityClassName to enable priority‑based scheduling and preemption.