
Understanding the Kubernetes Scheduler: Architecture, Scoring Phases, and Custom Plugin Development

This article explains the inner workings of the Kubernetes scheduler, details its pre‑score, score, and normalize‑score phases, discusses zone handling and resource fragmentation, and demonstrates how to extend the scheduler using the Scheduling Framework with a custom Go plugin.

政采云技术

The Kubernetes scheduler acts as the brain that decides where each pod should run: its only visible output is filling in pod.spec.nodeName, but behind that single field it runs sophisticated algorithms to pick the most suitable node for each workload scenario.

It operates through two main control loops: the first watches the API server (backed by etcd) for unscheduled pods and enqueues them; the second dequeues a pod, applies Predicates to filter out infeasible nodes, runs Priorities to score the remaining nodes from 0 to 100, and finally binds the pod by writing the winning node into its nodeName.

To keep replicas of the same controller spread across nodes, the SelectorSpread plugin scores nodes in three stages. In the PreScore stage it gathers the selectors of the pod's owning controllers and stores them in CycleState for later use.

func (pl *SelectorSpread) PreScore(ctx context.Context, cycleState *framework.CycleState, pod *v1.Pod, nodes []*v1.Node) *framework.Status {
    if skipSelectorSpread(pod) {
        return nil
    }
    // Merge the selectors of the services, ReplicationControllers,
    // ReplicaSets, and StatefulSets that own this pod into one selector.
    selector := helper.DefaultSelector(
        pod,
        pl.services,
        pl.replicationControllers,
        pl.replicaSets,
        pl.statefulSets,
    )
    // Stash the selector in CycleState so Score can reuse it for every node.
    state := &preScoreState{selector: selector}
    cycleState.Write(preScoreStateKey, state)
    return nil
}

During the Score stage, the plugin counts matching pods on each node; each matching pod adds one point, and the total becomes the node's raw score.

func (pl *SelectorSpread) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    if skipSelectorSpread(pod) {
        return 0, nil
    }
    // Retrieve the selector that PreScore stored in CycleState.
    c, err := state.Read(preScoreStateKey)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("Error reading %q from cycleState: %v", preScoreStateKey, err))
    }
    s, ok := c.(*preScoreState)
    if !ok {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("%+v convert to preScoreState error", c))
    }
    nodeInfo, err := pl.sharedLister.NodeInfos().Get(nodeName)
    if err != nil {
        return 0, framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", nodeName, err))
    }
    // The raw score is the number of pods on this node matching the selector.
    count := countMatchingPods(pod.Namespace, s.selector, nodeInfo)
    return int64(count), nil
}

The NormalizeScore stage adjusts these raw scores to a final 0‑100 range, also taking zone information into account to further spread pods across failure domains.

func (pl *SelectorSpread) NormalizeScore(ctx context.Context, state *framework.CycleState, pod *v1.Pod, scores framework.NodeScoreList) *framework.Status {
    if skipSelectorSpread(pod) {
        return nil
    }
    // aggregate scores per zone
    countsByZone := make(map[string]int64, 10)
    var maxCountByZone int64
    var maxCountByNodeName int64
    for i := range scores {
        if scores[i].Score > maxCountByNodeName {
            maxCountByNodeName = scores[i].Score
        }
        nodeInfo, err := pl.sharedLister.NodeInfos().Get(scores[i].Name)
        if err != nil {
            return framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", scores[i].Name, err))
        }
        zoneID := utilnode.GetZoneKey(nodeInfo.Node())
        if zoneID == "" {
            continue
        }
        countsByZone[zoneID] += scores[i].Score
    }
    for _, v := range countsByZone {
        if v > maxCountByZone {
            maxCountByZone = v
        }
    }
    haveZones := len(countsByZone) != 0
    maxNodeScoreF := float64(framework.MaxNodeScore)
    for i := range scores {
        fScore := maxNodeScoreF
        if maxCountByNodeName > 0 {
            fScore = maxNodeScoreF * (float64(maxCountByNodeName-scores[i].Score) / float64(maxCountByNodeName))
        }
        if haveZones {
            nodeInfo, err := pl.sharedLister.NodeInfos().Get(scores[i].Name)
            if err != nil {
                return framework.NewStatus(framework.Error, fmt.Sprintf("getting node %q from Snapshot: %v", scores[i].Name, err))
            }
            zoneID := utilnode.GetZoneKey(nodeInfo.Node())
            if zoneID != "" {
                zoneScore := maxNodeScoreF
                if maxCountByZone > 0 {
                    zoneScore = maxNodeScoreF * (float64(maxCountByZone-countsByZone[zoneID]) / float64(maxCountByZone))
                }
                fScore = fScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore
            }
        }
        scores[i].Score = int64(fScore)
    }
    return nil
}

A zone is identified by node labels such as failure-domain.beta.kubernetes.io/zone and failure-domain.beta.kubernetes.io/region (with topology.kubernetes.io/zone and topology.kubernetes.io/region as their stable successors); the function GetZoneKey builds a composite key from these labels.

func GetZoneKey(node *v1.Node) string {
    labels := node.Labels
    if labels == nil {
        return ""
    }
    zone, ok := labels[v1.LabelZoneFailureDomain]
    if !ok {
        zone, _ = labels[v1.LabelZoneFailureDomainStable]
    }
    region, ok := labels[v1.LabelZoneRegion]
    if !ok {
        region, _ = labels[v1.LabelZoneRegionStable]
    }
    if region == "" && zone == "" {
        return ""
    }
    return region + ":\x00:" + zone
}

Resource fragmentation occurs when many nodes are each left with scraps of free resources too small to satisfy larger pod requests; the article suggests mitigating this with a bin-packing strategy that fills one node completely before scheduling onto the next.

To extend the scheduler, the modern approach is the Scheduling Framework, which exposes twelve extension points: QueueSort, PreFilter, Filter, PreScore, Score, NormalizeScore, Reserve, Permit, PreBind, Bind, PostBind, and Unreserve.

The article walks through a minimal demo plugin written in Go for kube-scheduler 1.19, showing the implementation of the Name, Score, and ScoreExtensions methods, the registration of the plugin in the scheduler command, and a sample configuration file that enables the plugin at the score extension point.

package demo

import (
    "context"

    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    framework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"
)

// Name is the registered plugin name, referenced from the scheduler config.
const Name = "demo-plugin"

type demoPlugin struct{}

// NewDemoPlugin is the factory passed to app.WithPlugin. In 1.19 the factory
// takes a runtime.Object; the demo ignores both its arguments.
func NewDemoPlugin(_ runtime.Object, _ framework.FrameworkHandle) (framework.Plugin, error) {
    return &demoPlugin{}, nil
}

func (d *demoPlugin) Name() string { return Name }

// Score gives every node the same score, so the demo never changes the
// outcome; it only proves the extension point is wired up.
func (d *demoPlugin) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
    return 100, nil
}

// ScoreExtensions returns nil because the plugin needs no NormalizeScore step.
func (d *demoPlugin) ScoreExtensions() framework.ScoreExtensions { return nil }
The plugin is then registered with the scheduler command in a separate main package:

package main

import (
    "math/rand"
    "os"
    "time"

    "k8s.io/component-base/logs"
    "k8s.io/kubernetes/cmd/kube-scheduler/app"

    "git.cai-inc.com/devops/zcy-scheduler/pkg/demo"
)

func main() {
    rand.Seed(time.Now().UnixNano())
    // Register the factory under the name the configuration file refers to.
    command := app.NewSchedulerCommand(
        app.WithPlugin("demo-plugin", demo.NewDemoPlugin),
    )
    logs.InitLogs()
    defer logs.FlushLogs()
    if err := command.Execute(); err != nil {
        os.Exit(1)
    }
}
Finally, a KubeSchedulerConfiguration enables the plugin at the score extension point under a dedicated profile:

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: "/Users/zcy/newgo/zcy-scheduler/scheduler.conf"
profiles:
- schedulerName: zcy-scheduler
  plugins:
    score:
      enabled:
      - name: demo-plugin
      disabled:
      - name: "*"
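With the binary running under this configuration, a workload opts in by setting spec.schedulerName to the profile's name; for example (hypothetical pod):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  schedulerName: zcy-scheduler   # handled by the custom profile, not default-scheduler
  containers:
  - name: app
    image: nginx:1.21
```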

The article concludes that understanding the scheduler’s two‑loop architecture, its mechanisms for spreading pod replicas and handling resource fragmentation, and mastering the Scheduling Framework are essential for building custom scheduling solutions in Kubernetes.


Tags: cloud-native, Kubernetes, Go, Scheduler, Custom Plugin, Scheduling Framework
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
