
How Kubernetes Scheduler Works: Inside the Core Scheduling Engine

This article explains the inner workings of the Kubernetes scheduler, covering its architecture, pod queue handling, filtering, prioritization, binding, preemption, and code-level details, while also discussing current limitations and future enhancements such as the V2 framework and gang scheduling extensions.

Alibaba Cloud Native

Overview

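Kubernetes runs the scheduler as an independent component, typically on the control‑plane nodes. When several scheduler instances run for high availability, they elect a leader through a lease‑style lock stored via the API server; the leader executes the main scheduling loop while the others stand by, ready to take over if the lease is not renewed.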
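
Leader election between replicas can be pictured as a renewable lease on a shared lock record. The following is a minimal sketch of that idea only: leaseLock and tryAcquire are hypothetical names, the lock lives in memory rather than behind the API server, time is a plain integer clock, and none of this is the client-go leader election API.

```go
package main

import "fmt"

// leaseLock is a hypothetical in-memory stand-in for the lock record that
// scheduler replicas would share. Times are seconds on a logical clock.
type leaseLock struct {
	holder  string
	expires int64
}

// tryAcquire grants the lease to id when it is free, expired, or already
// held by id (a renewal). It reports whether id is the leader afterwards.
// This sketch is not safe for concurrent use.
func (l *leaseLock) tryAcquire(id string, now, ttl int64) bool {
	if l.holder == "" || l.holder == id || now >= l.expires {
		l.holder = id
		l.expires = now + ttl
		return true
	}
	return false
}

func main() {
	lock := &leaseLock{}
	fmt.Println(lock.tryAcquire("scheduler-a", 0, 15))  // true: lease was free
	fmt.Println(lock.tryAcquire("scheduler-b", 5, 15))  // false: a's lease is still valid
	fmt.Println(lock.tryAcquire("scheduler-b", 30, 15)) // true: a stopped renewing
}
```

Only the replica for which tryAcquire keeps returning true would run the scheduling loop; the others simply retry on a timer.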

Scheduler Workflow

The scheduler maintains an internal podQueue and watches the API server for new Pod objects.

When a pod is created, its metadata is stored in etcd through the API server.

An informer adds the new pod to podQueue.

The main loop repeatedly dequeues a pod and runs the scheduling phase.

Scheduling consists of two steps:

Filter : run predicate functions to discard nodes that do not satisfy required conditions.

Prioritize : score the remaining nodes based on criteria such as resource usage, affinity, and custom weights.

If a node is selected, the scheduler calls the API server’s binding endpoint, setting pod.Spec.NodeName to the chosen node.

The kubelet on the target node watches the API server; when it sees the bound pod, it launches the containers.

If no node fits and preemption is enabled, the scheduler may evict lower‑priority pods to make room. In either case the unscheduled pod goes back into the queue and is retried later.
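
The workflow above can be sketched as a toy loop. This is an illustration under simplifying assumptions, not the kube-scheduler code: Pod, Node, scheduleOne, and runQueue are hypothetical types and functions, CPU is the only resource, the only score is "most free CPU", and binding is modeled by setting NodeName directly instead of calling the API server.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real API objects.
type Pod struct {
	Name     string
	CPU      int    // requested millicores
	NodeName string // empty until the pod is bound
}

type Node struct {
	Name    string
	FreeCPU int
}

// scheduleOne filters out nodes that cannot hold the pod, prefers the
// survivor with the most free CPU, and "binds" by setting NodeName.
func scheduleOne(pod *Pod, nodes []*Node) bool {
	var best *Node
	for _, n := range nodes {
		if n.FreeCPU < pod.CPU { // filter: not enough CPU
			continue
		}
		if best == nil || n.FreeCPU > best.FreeCPU { // prioritize
			best = n
		}
	}
	if best == nil {
		return false
	}
	best.FreeCPU -= pod.CPU
	pod.NodeName = best.Name
	return true
}

// runQueue drains the queue once and returns the pods that did not fit,
// which a real scheduler would re-queue.
func runQueue(queue []*Pod, nodes []*Node) []*Pod {
	var unschedulable []*Pod
	for _, p := range queue {
		if !scheduleOne(p, nodes) {
			unschedulable = append(unschedulable, p)
		}
	}
	return unschedulable
}

func main() {
	nodes := []*Node{{"node-a", 1000}, {"node-b", 2000}}
	queue := []*Pod{{Name: "web", CPU: 1500}, {Name: "db", CPU: 800}, {Name: "big", CPU: 4000}}
	pending := runQueue(queue, nodes)
	for _, p := range queue {
		fmt.Printf("%s -> %q\n", p.Name, p.NodeName)
	}
	fmt.Println("pending:", len(pending))
}
```

Here "web" lands on node-b, "db" on node-a, and "big" stays pending because no node has 4000 millicores free.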

Implementation Details

The open‑source component kube-scheduler implements the scheduler. Its entry point is the Run() method, which starts a goroutine that repeatedly invokes scheduleOne():

func (sched *Scheduler) Run() {
    if !sched.config.WaitForCacheSync() {
        return
    }
    go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}

scheduleOne() performs the following actions:

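func (sched *Scheduler) scheduleOne() {
    // Block until the next pending pod is available.
    pod := sched.config.NextPod()
    // pre‑check logic omitted
    scheduleResult, err := sched.schedule(pod)
    if err != nil {
        // A FitError means no node passed filtering, so preemption may help.
        if fitError, ok := err.(*core.FitError); ok {
            if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
                // log and continue
            } else {
                sched.preempt(pod, fitError)
            }
        }
        // No host was chosen, so there is nothing to bind in this cycle.
        return
    }
    // bind volumes, then bind pod to node (scheduleResult.SuggestedHost)
    err = sched.bind(pod, &v1.Binding{...})
}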
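
The preemption branch above can be illustrated with a simplified victim selection for a single node. This is a sketch under stated assumptions, not the logic behind sched.preempt: runningPod and pickVictims are hypothetical, CPU is the only resource, and victims are taken greedily from the lowest priority upward.

```go
package main

import (
	"fmt"
	"sort"
)

// runningPod is a hypothetical, reduced view of a pod already on a node.
type runningPod struct {
	Name     string
	Priority int
	CPU      int
}

// pickVictims returns the lowest-priority pods whose eviction raises freeCPU
// to at least needed. Only pods with priority below incoming are eligible.
// It returns nil, false when evicting every eligible pod is still not enough.
func pickVictims(pods []runningPod, freeCPU, needed, incoming int) ([]runningPod, bool) {
	candidates := make([]runningPod, 0, len(pods))
	for _, p := range pods {
		if p.Priority < incoming {
			candidates = append(candidates, p)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})

	var victims []runningPod
	for _, p := range candidates {
		if freeCPU >= needed {
			break
		}
		victims = append(victims, p)
		freeCPU += p.CPU
	}
	if freeCPU < needed {
		return nil, false
	}
	return victims, true
}

func main() {
	pods := []runningPod{{"api", 10, 400}, {"cache", 5, 300}, {"batch", 1, 500}}
	victims, ok := pickVictims(pods, 100, 700, 8)
	fmt.Println(ok)
	for _, v := range victims {
		fmt.Println("evict", v.Name)
	}
}
```

With 100 millicores free and 700 needed, the sketch evicts "batch" and then "cache"; "api" is never considered because its priority is higher than the incoming pod's.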

The core scheduling logic resides in genericScheduler.Schedule():

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (result ScheduleResult, err error) {
    nodes, err := nodeLister.List()
    if err != nil { return result, err }
    // Filter: keep only the nodes that pass every predicate.
    filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
    if err != nil { return result, err }
    // Prioritize: score each remaining node (construction of metaPrioritiesInterface omitted).
    priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
    if err != nil { return result, err }
    // SelectHost: pick one of the top-scored nodes.
    host, err := g.selectHost(priorityList)
    return ScheduleResult{SuggestedHost: host, EvaluatedNodes: len(filteredNodes)+len(failedPredicateMap), FeasibleNodes: len(filteredNodes)}, err
}

The scheduling process is divided into three phases:

Filter : evaluate the predicate functions against the nodes in parallel and discard every node that fails a predicate.

Prioritize : run scoring functions (Map phase) on the filtered nodes, then combine scores (Reduce phase) using configured weights.

SelectHost : pick the highest‑scoring node(s) and apply a round‑robin selection to choose the final target.
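
The Prioritize and SelectHost phases can be sketched as a weighted sum followed by a rotating pick among tied winners. The names hostScore, weighted, prioritize, and selector are hypothetical, and the per-function scores are made-up numbers; the sketch also skips any score normalization a real Reduce step might perform.

```go
package main

import "fmt"

type hostScore struct {
	Host  string
	Score int
}

// weighted pairs a hypothetical per-node scoring function (the Map step)
// with the weight used when scores are combined (the Reduce step).
type weighted struct {
	Fn     func(node string) int
	Weight int
}

// prioritize returns, for each node, the sum of weight*score over all
// scoring functions.
func prioritize(nodes []string, fns []weighted) []hostScore {
	out := make([]hostScore, 0, len(nodes))
	for _, n := range nodes {
		total := 0
		for _, w := range fns {
			total += w.Weight * w.Fn(n)
		}
		out = append(out, hostScore{n, total})
	}
	return out
}

// selector remembers how many picks it has made so that equally
// top-scored nodes are chosen in rotation across calls.
type selector struct{ picks int }

func (s *selector) selectHost(list []hostScore) (string, bool) {
	if len(list) == 0 {
		return "", false
	}
	best := list[0].Score
	for _, hs := range list[1:] {
		if hs.Score > best {
			best = hs.Score
		}
	}
	var top []string
	for _, hs := range list {
		if hs.Score == best {
			top = append(top, hs.Host)
		}
	}
	host := top[s.picks%len(top)]
	s.picks++
	return host, true
}

func main() {
	freeCPU := map[string]int{"a": 8, "b": 8, "c": 3}
	affinity := map[string]int{"a": 1, "b": 1, "c": 1}
	list := prioritize([]string{"a", "b", "c"}, []weighted{
		{func(n string) int { return freeCPU[n] }, 1},
		{func(n string) int { return affinity[n] }, 2},
	})
	s := &selector{}
	for i := 0; i < 3; i++ {
		host, _ := s.selectHost(list)
		fmt.Println(host) // a, then b, then a
	}
}
```

Nodes "a" and "b" both total 10 while "c" totals 5, so repeated calls alternate between the two tied nodes instead of always loading the first one.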

Since Kubernetes 1.13 the percentageOfNodesToScore configuration option lets the scheduler stop filtering once it has found enough feasible nodes, rather than examining every node. This improves performance on large clusters at the cost of potentially sub‑optimal placement.
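
To make the trade-off concrete, the sketch below computes how many feasible nodes a filter phase would collect before stopping. The function nodesToFind and the floor of 100 nodes are assumptions chosen for illustration; the exact rule in kube-scheduler may differ between versions.

```go
package main

import "fmt"

// minFeasibleNodes is an assumed floor so that small clusters are always
// searched completely.
const minFeasibleNodes = 100

// nodesToFind returns how many feasible nodes the filter phase should
// collect before it stops looking. percentage is expected in 1..99; any
// other value disables the limit.
func nodesToFind(totalNodes, percentage int) int {
	if totalNodes <= minFeasibleNodes || percentage <= 0 || percentage >= 100 {
		return totalNodes
	}
	n := totalNodes * percentage / 100
	if n < minFeasibleNodes {
		return minFeasibleNodes
	}
	return n
}

func main() {
	fmt.Println(nodesToFind(50, 50))    // 50: small cluster, check everything
	fmt.Println(nodesToFind(5000, 50))  // 2500
	fmt.Println(nodesToFind(150, 10))   // 100: the floor applies
	fmt.Println(nodesToFind(5000, 100)) // 5000: limit disabled
}
```

With 5000 nodes and a setting of 50, filtering stops after 2500 feasible nodes, so a better-scoring node among those never examined can be missed.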

Current Limitations

Pod‑by‑Pod scheduling scales poorly in clusters with thousands of nodes because each pod is evaluated against every node.

The default scheduler is tuned for online services and does not natively support gang scheduling required by batch‑oriented workloads such as distributed machine‑learning jobs.

Extensibility is limited; custom scheduling logic often requires hard‑coded changes in the core flow (e.g., volume binding), making it difficult to implement new algorithms without modifying the scheduler source.
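
The gang-scheduling gap mentioned above comes down to all-or-nothing placement: either every member of a job gets a node, or none does. The following is a minimal sketch of that rule, assuming CPU as the only resource and a first-fit search; placeGang is a hypothetical function and is not how kube-batch or any real scheduler implements it.

```go
package main

import "fmt"

// placeGang tries to place every member of a job and commits nothing unless
// all members fit. free maps node name to free CPU, nodeOrder fixes the
// search order, and requests holds one CPU request per gang member.
// On success it returns the node chosen for each member and updates free.
func placeGang(free map[string]int, nodeOrder []string, requests []int) ([]string, bool) {
	// Work on a copy so a failed attempt leaves free untouched.
	trial := make(map[string]int, len(free))
	for name, cpu := range free {
		trial[name] = cpu
	}

	assigned := make([]string, len(requests))
	for i, req := range requests {
		placed := false
		for _, name := range nodeOrder {
			if trial[name] >= req {
				trial[name] -= req
				assigned[i] = name
				placed = true
				break
			}
		}
		if !placed {
			return nil, false // one member does not fit: reject the whole gang
		}
	}

	// Every member fits, so commit the reservations.
	for name, cpu := range trial {
		free[name] = cpu
	}
	return assigned, true
}

func main() {
	free := map[string]int{"a": 2, "b": 2}
	order := []string{"a", "b"}

	_, ok := placeGang(free, order, []int{2, 2, 2})
	fmt.Println(ok, free["a"], free["b"]) // false 2 2

	got, ok := placeGang(free, order, []int{2, 1, 1})
	fmt.Println(ok, got, free["a"], free["b"]) // true [a b b] 0 0
}
```

A pod-by-pod scheduler would have bound the first two members of the rejected job and left the third pending, holding resources that the job cannot use until it is complete. First-fit is a simplification and can reject a gang that a smarter packing would place.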

Future Directions

The Scheduler V2 framework introduces a plug‑in architecture that improves extensibility and prepares the ground for native gang‑scheduling support.
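
The general shape of a plug-in architecture can be sketched as follows. The interface and types here (filterPlugin, fitsCPU, sameZone, runFilters) are hypothetical and only illustrate the idea of a core loop that calls registered plug-ins; they are not the interfaces defined by the actual scheduling framework.

```go
package main

import "fmt"

type pod struct {
	Name string
	CPU  int
	Zone string // empty means "any zone"
}

type node struct {
	Name    string
	Zone    string
	FreeCPU int
}

// filterPlugin is a hypothetical extension point: the core loop knows only
// this interface, so new rules can be added without touching the loop.
type filterPlugin interface {
	Name() string
	Filter(p pod, n node) bool
}

type fitsCPU struct{}

func (fitsCPU) Name() string              { return "FitsCPU" }
func (fitsCPU) Filter(p pod, n node) bool { return n.FreeCPU >= p.CPU }

type sameZone struct{}

func (sameZone) Name() string              { return "SameZone" }
func (sameZone) Filter(p pod, n node) bool { return p.Zone == "" || p.Zone == n.Zone }

// runFilters returns the feasible node names and, for each rejected node,
// the name of the first plug-in that rejected it.
func runFilters(p pod, nodes []node, plugins []filterPlugin) ([]string, map[string]string) {
	var feasible []string
	rejectedBy := make(map[string]string)
	for _, n := range nodes {
		ok := true
		for _, pl := range plugins {
			if !pl.Filter(p, n) {
				rejectedBy[n.Name] = pl.Name()
				ok = false
				break
			}
		}
		if ok {
			feasible = append(feasible, n.Name)
		}
	}
	return feasible, rejectedBy
}

func main() {
	nodes := []node{{"n1", "east", 4}, {"n2", "west", 8}, {"n3", "east", 1}}
	plugins := []filterPlugin{fitsCPU{}, sameZone{}}
	feasible, rejected := runFilters(pod{Name: "web", CPU: 2, Zone: "east"}, nodes, plugins)
	fmt.Println(feasible) // [n1]
	fmt.Println(rejected) // map[n2:SameZone n3:FitsCPU]
}
```

Adding a new rule means appending another value to the plugins slice, which is the kind of change that previously required editing the core scheduling flow.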

External projects provide reference implementations:

Kube‑batch – a gang‑scheduling implementation (https://github.com/kubernetes-sigs/kube-batch)

Poseidon – integration of the Firmament graph‑based scheduler (https://github.com/kubernetes-sigs/poseidon)

References

https://medium.com/jorgeacetozi/kubernetes-master-components-etcd-api-server-controller-manager-and-scheduler-3a0179fc8186

https://jvns.ca/blog/2017/07/27/how-does-the-kubernetes-scheduler-work/

Figure: Kubernetes scheduler diagram
Figure: Scheduler workflow
Figure: Future architecture