
Inside Kubernetes: How kube-scheduler Works and Its Source Code Explained

This article dissects the kube-scheduler component of Kubernetes v1.21, detailing its design, initialization, main scheduling loop, pre‑selection (Predicates) and prioritization (Priorities) algorithms, and key source‑code functions such as scheduler.New(), Run(), scheduleOne(), and the scheduling algorithm that binds Pods to Nodes.


1. Design of kube-scheduler

The scheduler acts as the bridge between the controller manager and the kubelet: it picks up newly created Pods that have no Node assigned, selects a suitable Node for each, and writes the binding information back through the apiserver into etcd. The end-to-end workflow: a Pod is created via the apiserver REST API and stored in etcd; the scheduler, watching the apiserver, notices the unbound Pod and binds it to a Node; the kubelet on that Node then launches the containers.

The most critical components are the apiserver watch API and the scheduler's scheduling strategy.

In short, kube-scheduler selects the most appropriate Node for each unbound Pod, using two steps: pre‑selection (Predicates) and scoring (Priorities).

Pre‑selection (Predicates): filter out Nodes that do not satisfy the criteria, e.g., insufficient resources.

Prioritization (Priorities): rank the remaining Nodes using scoring rules such as resource headroom and low load, and pick the best.
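
Conceptually, the whole cycle boils down to a filter pass followed by a scoring pass. The sketch below is illustrative only; the Pod and Node types, fits(), and score() are simplified stand-ins for the real framework plugins, not actual kube-scheduler APIs.

package sketch

import "errors"

// Hypothetical, simplified types standing in for the real API objects.
type Pod struct{ CPU, Mem int64 }
type Node struct {
    Name             string
    FreeCPU, FreeMem int64
}

// fits is a stand-in Predicate: can the Pod run on this Node at all?
func fits(p Pod, n Node) bool { return p.CPU <= n.FreeCPU && p.Mem <= n.FreeMem }

// score is a stand-in Priority: more spare capacity scores higher.
func score(p Pod, n Node) int64 { return (n.FreeCPU - p.CPU) + (n.FreeMem - p.Mem) }

// schedule filters Nodes (Predicates), then picks the best scorer (Priorities).
func schedule(p Pod, nodes []Node) (Node, error) {
    var feasible []Node
    for _, n := range nodes {
        if fits(p, n) {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return Node{}, errors.New("no feasible node")
    }
    best := feasible[0]
    for _, n := range feasible[1:] {
        if score(p, n) > score(p, best) {
            best = n
        }
    }
    return best, nil
}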

2. Source code analysis of kube-scheduler

Kubernetes version: v1.21

2.1 scheduler.New() – initializing the scheduler

The entry point is runCommand(), which calls Setup(); Setup builds a CompletedConfig and creates a Scheduler through the New function.

The Scheduler struct's fields include SchedulerCache, Algorithm, NextPod, Error, StopEverything, SchedulingQueue, Profiles, and the API client.
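
In v1.21 the struct lives in pkg/scheduler/scheduler.go; lightly abridged here, with the comments paraphrased:

// Abridged from pkg/scheduler/scheduler.go (v1.21).
type Scheduler struct {
    // SchedulerCache holds the scheduler's in-memory view of Nodes and Pods.
    SchedulerCache internalcache.Cache

    // Algorithm is the pluggable scheduling algorithm (genericScheduler).
    Algorithm core.ScheduleAlgorithm

    // NextPod blocks until the next Pod to schedule is available.
    NextPod func() *framework.QueuedPodInfo

    // Error is called on scheduling failure; it requeues the Pod.
    Error func(*framework.QueuedPodInfo, error)

    // StopEverything is closed to shut down the scheduler.
    StopEverything <-chan struct{}

    // SchedulingQueue holds Pods waiting to be scheduled.
    SchedulingQueue internalqueue.SchedulingQueue

    // Profiles maps scheduler names to configured framework profiles.
    Profiles profile.Map

    client clientset.Interface
}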

The New method sets up the default configuration, initializes the default schedulerAlgorithmSource, creates the genericScheduler with the default predicate and priority algorithms (the in-tree plugins), and optionally overrides them with a policy config.

func New(client clientset.Interface, informerFactory informers.SharedInformerFactory, recorderFactory profile.RecorderFactory, stopCh <-chan struct{}, opts ...Option) (*Scheduler, error) {
    // apply the functional Options, build the scheduler cache and the
    // scheduling queue, construct the Scheduler from the configured
    // algorithm source (Provider or Policy), then register event handlers
}

Two initialization paths are provided, source.Provider and source.Policy; both end by assigning the constructed scheduler via sched = sc.
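
Lightly abridged, that switch in New() looks like this (error handling for the policy-loading branch trimmed):

// Abridged from New() in pkg/scheduler/scheduler.go (v1.21).
var sched *Scheduler
source := options.schedulerAlgorithmSource
switch {
case source.Provider != nil:
    // Path 1: a registered algorithm provider (the default path).
    sc, err := configurator.createFromProvider(*source.Provider)
    if err != nil {
        return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
    }
    sched = sc
case source.Policy != nil:
    // Path 2: a legacy Policy (file or ConfigMap) that lists predicates
    // and priorities explicitly; loaded, then turned into plugins.
    policy := &schedulerapi.Policy{}
    // ... initPolicyFromFile / initPolicyFromConfigMap ...
    sc, err := configurator.createFromConfig(*policy)
    if err != nil {
        return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
    }
    sched = sc
default:
    return nil, fmt.Errorf("unsupported algorithm source: %v", source)
}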

2.2 Run() – starting the main logic

The component first parses command-line arguments and applies default values, then executes the Run method, which starts the event broadcaster, health checks, and HTTP server, launches all informers, and waits for their caches to sync before handing control to sched.Run().

func Run(ctx context.Context, cc *schedulerserverconfig.CompletedConfig, sched *scheduler.Scheduler) error {
    // register the config with Configz, start the event broadcaster,
    // start the healthz and metrics HTTP servers, start all informers
    // and wait for cache sync, then call sched.Run(ctx), directly or
    // via leader election when it is enabled
}

sched.Run() ultimately blocks in wait.UntilWithContext, calling sched.scheduleOne again and again until the context is cancelled.

2.3 sched.Run() – listening and scheduling

The Run method starts the SchedulingQueue, then calls wait.UntilWithContext(ctx, sched.scheduleOne, 0), and finally closes the queue.

func (sched *Scheduler) Run(ctx context.Context) {
    sched.SchedulingQueue.Run()                      // start the queue's background flush goroutines
    wait.UntilWithContext(ctx, sched.scheduleOne, 0) // schedule one Pod per iteration until ctx is done
    sched.SchedulingQueue.Close()                    // unblock any pending NextPod() so the loop can exit
}

The SchedulingQueue is really three queues: activeQ holds Pods ready to be scheduled, backoffQ holds Pods waiting out an exponential back-off after a failed attempt, and unschedulableQ holds Pods that could not be placed; Pods move back into activeQ once they become eligible for another try.
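
Those retries are driven by two background goroutines started by the queue's own Run method; lightly abridged from pkg/scheduler/internal/queue/scheduling_queue.go (v1.21):

// Abridged from pkg/scheduler/internal/queue/scheduling_queue.go (v1.21).
func (p *PriorityQueue) Run() {
    // Every 1s: move Pods whose back-off has expired from backoffQ to activeQ.
    go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
    // Every 30s: move Pods stuck too long in unschedulableQ back for retry.
    go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
}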

2.4 scheduleOne() – pod assignment flow

scheduleOne pops one Pod from the queue, checks that it still needs scheduling (it may have been deleted or already assigned), looks up the scheduling framework profile matching its schedulerName, and invokes the scheduling algorithm (which runs the PreFilter and Filter plugins) to obtain a candidate Node.

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    podInfo := sched.NextPod() // blocks until a Pod is available
    // podInfo (or its Pod) is nil when the scheduling queue has been closed
    if podInfo == nil || podInfo.Pod == nil {
        return
    }
    // ... select the framework profile, run sched.Algorithm.Schedule(),
    // then assume and bind the Pod (see below) ...
}

It then "assumes" the Pod onto the selected Node in the scheduler cache, runs the Reserve and Permit plugins, and in a separate goroutine waits on Permit, runs the PreBind plugins, performs the actual bind (a POST to the Pod's binding subresource on the apiserver), and finally runs the PostBind plugins.
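
Lightly abridged, the tail of scheduleOne() in v1.21 (error handling, metrics, and the un-reserve cleanup paths trimmed):

// Abridged from scheduleOne() in pkg/scheduler/scheduler.go (v1.21).
// Assume the Pod onto the chosen Node in the cache so the next scheduling
// cycle sees its resource usage immediately, before the bind completes.
if err := sched.assume(assumedPod, scheduleResult.SuggestedHost); err != nil {
    return
}
// Reserve resources, then ask Permit plugins to approve or delay binding.
if sts := fwk.RunReservePluginsReserve(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
    return
}
if sts := fwk.RunPermitPlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() && sts.Code() != framework.Wait {
    return // a Wait status is fine; WaitOnPermit blocks on it below
}
// Bind asynchronously so scheduleOne can return and pick up the next Pod.
go func() {
    fwk.WaitOnPermit(bindingCycleCtx, assumedPod)
    fwk.RunPreBindPlugins(bindingCycleCtx, state, assumedPod, scheduleResult.SuggestedHost)
    sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
    // PostBind plugins run only after a successful bind (informational hooks).
    fwk.RunPostBindPlugins(bindingCycleCtx, assumedPod, scheduleResult.SuggestedHost)
}()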

2.5 sched.Algorithm.Schedule() – selecting a Node

The Schedule method snapshots the node cache, runs predicate filtering to obtain the feasible Nodes, returns immediately if exactly one remains, and otherwise runs the priority phase to score the candidates and pick the highest-scoring Node.

func (g *genericScheduler) Schedule(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (ScheduleResult, error) {
    // 1. snapshot the node cache
    // 2. findNodesThatFitPod (Predicates)
    // 3. if exactly one feasible Node remains, return it without scoring
    // 4. prioritizeNodes (Priorities), then selectHost
}

2.6 Pre‑selection algorithm

Pre‑selection is implemented by findNodesThatFitPod, which runs the PreFilter plugins once, then the Filter plugins in parallel across Nodes, and finally any extender filters. To keep large clusters fast, the search stops after numFeasibleNodesToFind feasible Nodes, a cut-off that adapts to cluster size and the percentageOfNodesToScore setting.
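
The adaptive cut-off is worth quoting; lightly abridged from v1.21, where minFeasibleNodesToFind is 100 and minFeasibleNodesPercentageToFind is 5:

// Abridged from pkg/scheduler/core/generic_scheduler.go (v1.21).
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) (numNodes int32) {
    // Small clusters, or an explicit 100%, are always searched fully.
    if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore >= 100 {
        return numAllNodes
    }
    // Default: start at 50% and shrink as the cluster grows, never below 5%.
    adaptivePercentage := g.percentageOfNodesToScore
    if adaptivePercentage <= 0 {
        basePercentageOfNodesToScore := int32(50)
        adaptivePercentage = basePercentageOfNodesToScore - numAllNodes/125
        if adaptivePercentage < minFeasibleNodesPercentageToFind {
            adaptivePercentage = minFeasibleNodesPercentageToFind
        }
    }
    numNodes = numAllNodes * adaptivePercentage / 100
    if numNodes < minFeasibleNodesToFind {
        return minFeasibleNodesToFind
    }
    return numNodes
}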

2.7 Prioritization algorithm

Prioritization runs the PreScore plugins, then the Score plugins; each plugin scores every Node on a 0-100 scale (NormalizeScore hooks map raw values into that range), the weighted scores are summed per Node, and extender scores are added when extenders are configured.
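
Once every Score plugin and its NormalizeScore hook have run, prioritizeNodes sums the per-plugin scores into a single total per Node; simplified from v1.21:

// Simplified from prioritizeNodes in pkg/scheduler/core/generic_scheduler.go
// (v1.21). scoresMap maps plugin -> per-Node scores; totals feed selectHost.
result := make(framework.NodeScoreList, 0, len(nodes))
for i := range nodes {
    result = append(result, framework.NodeScore{Name: nodes[i].Name, Score: 0})
    for j := range scoresMap {
        result[i].Score += scoresMap[j][i].Score
    }
}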

2.8 Selecting the host

selectHost picks the Node with the highest total score; if multiple Nodes share the top score, one of them is chosen at random.

func (g *genericScheduler) selectHost(nodeScoreList framework.NodeScoreList) (string, error) {
    // find max score, random tie‑break
}
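
Expanded, the tie-break is a reservoir sample: each Node that ties the running maximum replaces the current candidate with probability 1/cntOfMaxScore, so every top scorer is equally likely to win. Lightly abridged from v1.21:

// Abridged from pkg/scheduler/core/generic_scheduler.go (v1.21).
func (g *genericScheduler) selectHost(nodeScoreList framework.NodeScoreList) (string, error) {
    if len(nodeScoreList) == 0 {
        return "", fmt.Errorf("empty priorityList")
    }
    maxScore := nodeScoreList[0].Score
    selected := nodeScoreList[0].Name
    cntOfMaxScore := 1
    for _, ns := range nodeScoreList[1:] {
        if ns.Score > maxScore {
            maxScore = ns.Score
            selected = ns.Name
            cntOfMaxScore = 1
        } else if ns.Score == maxScore {
            cntOfMaxScore++
            if rand.Intn(cntOfMaxScore) == 0 {
                // Replace the candidate with probability 1/cntOfMaxScore.
                selected = ns.Name
            }
        }
    }
    return selected, nil
}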

3. Summary

The kube‑scheduler is a core component that bridges the control plane and worker nodes, performing a two‑stage process of filtering (Predicates) and scoring (Priorities) to bind Pods to Nodes. Understanding its initialization, main loop, and key functions such as New, Run, scheduleOne, and the scheduling algorithm provides insight into Kubernetes’ internal scheduling mechanics.

Tags: cloud-native, scheduler, source code, kube-scheduler, Predicates, Priorities
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a well-known Chinese high-end IT training brand that has trained tens of thousands of students, with graduates earning monthly salaries of 12K+ RMB. It offers courses in Linux cloud operations, Python full-stack development, automation, data analysis, AI, and Go high-concurrency architecture, and maintains talent partnerships with numerous internet firms.