How Kubernetes Schedules Pods: Deep Dive into Scheduling, QoS, and Resources
This article walks through the complete Kubernetes pod scheduling workflow and explains how resource requests, limits, and QoS classes influence placement. It then covers advanced features such as affinity, taints and tolerations, resource quotas, and priority‑based preemption, and shows how to configure each mechanism for optimal cluster utilization.
Kubernetes Scheduling Process
When a pod manifest (YAML) is submitted to the kube-apiserver, the request first passes through any configured admission webhooks for mutation and validation. After admission succeeds, the API server persists a pod object with an empty spec.nodeName and a Pending phase. The kube-scheduler watches for pods with an empty nodeName, treats them as unscheduled, and runs its filtering and scoring phases to select the most suitable node. Once a node is chosen, the scheduler binds the pod by writing the node's name into spec.nodeName. The kubelet on that node, which watches for pods bound to it, then creates the required containers, storage, and network resources, finally setting the pod status to Running.
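The scheduler only considers pods whose spec.nodeName is empty; as a quick illustration of that contract, a manifest that pre-sets nodeName (node name hypothetical) bypasses the scheduler entirely and is picked up directly by that node's kubelet:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bypass-scheduler-demo   # hypothetical name
spec:
  nodeName: worker-1            # pre-bound: the scheduler never handles this pod
  containers:
  - name: app
    image: nginx:1.25
```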
Pod Resource Requirements and QoS
Pod resources are defined in the spec.containers[].resources section and consist of two maps: requests (minimum guaranteed resources) and limits (maximum allowed resources). Four resource types are supported: CPU, memory, ephemeral-storage, and extended resources such as GPUs. CPU can be expressed as whole cores (e.g., 2) or millicores (e.g., 2000m, which equals two cores); memory and storage use binary units (e.g., 1Gi).
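For illustration, here is a minimal pod manifest (all names are hypothetical) that requests half a core and 256 MiB of memory but may burst up to one core and 512 MiB:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-demo              # hypothetical name
spec:
  containers:
  - name: web
    image: nginx:1.25         # any image works; nginx is just an example
    resources:
      requests:
        cpu: 500m             # minimum guaranteed: half a core
        memory: 256Mi
      limits:
        cpu: "1"              # hard cap: one full core
        memory: 512Mi
```

Because requests and limits differ, this pod falls into the Burstable QoS class described below.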
Based on the relationship between requests and limits, Kubernetes assigns one of three QoS classes:
Guaranteed: requests equal limits for both CPU and memory in every container; the pod receives the highest scheduling priority and a fixed OOMScore of -998.
Burstable: requests and limits are set but differ; the pod gets a moderate OOMScore (2-999) computed from its memory request relative to node capacity.
BestEffort: no requests or limits specified; the pod receives the highest OOMScore (1000) and is the first candidate for eviction under memory pressure.
When the kubelet runs with cpu-manager-policy=static and a Guaranteed pod requests an integer number of CPUs, its containers can be bound to exclusive CPU cores (CPU pinning), as in the sketch below.
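A minimal sketch of a pod eligible for pinning, assuming the node's kubelet was started with cpu-manager-policy=static (names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-demo           # hypothetical name
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "infinity"]
    resources:
      requests:
        cpu: "2"              # integer core count -> eligible for exclusive cores
        memory: 1Gi
      limits:
        cpu: "2"              # equal to requests -> Guaranteed QoS
        memory: 1Gi
```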
Resource Quota
To prevent a single namespace or user from exhausting cluster resources, Kubernetes provides ResourceQuota. A quota object defines hard limits for resources such as CPU, memory, and pod count, optionally restricted by quota scopes (e.g., NotBestEffort, Terminating). When creating a pod would exceed the quota, the API server rejects the request with a 403 Forbidden error and an "exceeded quota" message.
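A sketch of a quota that caps a namespace's aggregate compute resources and pod count (name and namespace are hypothetical):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # hypothetical name
  namespace: team-a           # hypothetical namespace
spec:
  scopes: ["NotBestEffort"]   # only non-BestEffort pods count against this quota
  hard:
    requests.cpu: "10"        # sum of all CPU requests in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"                # at most 50 matching pods
```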
Pod‑Pod Affinity and Anti‑Affinity
Kubernetes supports two affinity mechanisms:
PodAffinity: schedules a pod into the same topology domain (defined by topologyKey, typically a single node) as pods matching the specified label selectors. It can be requiredDuringSchedulingIgnoredDuringExecution (hard) or preferredDuringSchedulingIgnoredDuringExecution (soft, with a weight).
PodAntiAffinity: keeps a pod out of topology domains that run pods matching the selector, also supporting hard and soft variants.
Affinity rules use operators such as In, NotIn, Exists, and DoesNotExist to express complex placement policies; both mechanisms appear in the sketch below.
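The following sketch (labels and names are hypothetical) co-locates a cache pod with web pods via a hard podAffinity rule, while a soft podAntiAffinity rule spreads cache replicas across nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cache-demo                    # hypothetical name
  labels:
    app: cache
spec:
  affinity:
    podAffinity:                      # hard: run where an app=web pod already runs
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - {key: app, operator: In, values: ["web"]}
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:                  # soft: avoid nodes already running app=cache
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - {key: app, operator: In, values: ["cache"]}
          topologyKey: kubernetes.io/hostname
  containers:
  - name: cache
    image: redis:7
```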
Pod‑Node Affinity, NodeSelector, and Taints/Tolerations
NodeSelector is a simple key‑value match that forces a pod onto nodes with the specified label. It only supports hard requirements.
NodeAffinity extends this with both required and preferred rules and richer operators (Gt, Lt) for numeric comparisons on label values.
Taints are applied to nodes to repel pods; each taint has a key, value, and effect (NoSchedule, PreferNoSchedule, NoExecute). Pods declare matching tolerations (with key, value, effect, and an operator of Exists or Equal) to be allowed onto tainted nodes. The sketch below combines a node affinity rule with a toleration.
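Assuming a node has been labeled disktype=ssd and tainted with dedicated=gpu:NoSchedule (both hypothetical), a pod could target it like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job-demo                  # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:   # hard rule
        nodeSelectorTerms:
        - matchExpressions:
          - {key: disktype, operator: In, values: ["ssd"]}
      preferredDuringSchedulingIgnoredDuringExecution:  # soft, numeric comparison
      - weight: 50
        preference:
          matchExpressions:
          - {key: cpu-cores, operator: Gt, values: ["8"]}  # hypothetical label
  tolerations:
  - key: dedicated
    operator: Equal
    value: gpu
    effect: NoSchedule                # tolerates dedicated=gpu:NoSchedule
  containers:
  - name: job
    image: busybox:1.36
    command: ["sleep", "3600"]
```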
Priority Scheduling and Preemption
Kubernetes can assign a PriorityClass to pods. The scheduler first orders pods by priority; higher‑priority pods are considered before lower‑priority ones. If a high‑priority pod cannot be scheduled due to insufficient resources, the scheduler may preempt lower‑priority pods.
Preemption follows these steps:
Filter nodes that satisfy the pod’s constraints.
Simulate scheduling by temporarily removing lower‑priority pods.
Run ProcessPreemptionWithExtenders (optional custom logic).
Pick a node using a series of tie‑breakers: minimize PodDisruptionBudget violations, minimize the priority of the highest‑priority victim, minimize the sum of victim priorities, minimize the number of victims, and finally prefer the node whose highest‑priority victim started most recently.
Delete the selected lower‑priority pods; the chosen node is recorded in the high‑priority pod's nominatedNodeName, and the pod is scheduled there once the victims terminate.
Built‑in priority classes include system-cluster-critical (value 2000000000) and system-node-critical (value 2000001000). User‑defined classes must use values no greater than 1000000000 (one billion).
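A sketch of a user‑defined priority class and a pod that references it (names and the value are hypothetical):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-demo            # hypothetical name
value: 1000000                        # user-defined values must not exceed one billion
globalDefault: false
description: "For latency-critical workloads"
preemptionPolicy: PreemptLowerPriority  # the default; set Never to opt out of preempting
---
apiVersion: v1
kind: Pod
metadata:
  name: critical-app-demo             # hypothetical name
spec:
  priorityClassName: high-priority-demo  # links the pod to the class above
  containers:
  - name: app
    image: nginx:1.25
```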
Key Takeaways
Define accurate requests and limits for CPU, memory, storage, and GPUs; the CPU and memory settings determine the pod's QoS class.
Use ResourceQuota to enforce namespace‑level resource caps.
Apply pod‑pod affinity/anti‑affinity and pod‑node affinity or nodeSelector to control placement.
Leverage taints and tolerations to isolate special nodes.
Create custom PriorityClass objects and assign them via priorityClassName to enable priority‑based scheduling and preemption.