How Modern Schedulers Work: From Linux Kernels to Go Runtime and Kubernetes
This article traces the design principles, evolution, and implementation details of schedulers in operating systems, the Go runtime, and Kubernetes, covering task and resource allocation, cooperative versus preemptive scheduling, work‑stealing, priority algorithms, and the latest scheduling-framework extensions.
Introduction
The term scheduler refers to any component that assigns work (tasks, threads, pods, etc.) to limited resources (CPU, memory, nodes). Efficient scheduling is crucial for maximizing resource utilization, minimizing cost, and meeting quality‑of‑service goals in both system software and cloud platforms.
1. Scheduling System Types
1.1 Long‑Term, Mid‑Term, and Short‑Term Schedulers (Operating System)
Operating systems classify schedulers into three types:
Long‑term (job) scheduler decides which jobs enter the ready queue, balancing I/O‑bound and CPU‑bound processes.
Mid‑term scheduler swaps out low‑priority or memory‑heavy processes to free resources.
Short‑term (CPU) scheduler selects a process from the ready queue for execution; it runs far more frequently than the other two.
2. Design and Evolution of Linux Schedulers
2.1 Initial Scheduler (v0.01‑v2.4)
The earliest Linux scheduler consisted of a schedule() function with only a few dozen lines of code, handling up to 64 tasks stored in a static array. It selected the task with the highest counter (time slice) and performed a context switch.
void schedule(void) {
    int i, next, c;
    struct task_struct **p;

    /* wake up interruptible tasks whose alarm expired or that have pending signals */
    for (p = &LAST_TASK; p > &FIRST_TASK; --p) { ... }

    while (1) {
        /* pick the runnable task with the largest remaining time slice */
        c = -1; next = 0; i = NR_TASKS; p = &task[NR_TASKS];
        while (--i) {
            if (!*--p) continue;
            if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
                c = (*p)->counter, next = i;
        }
        if (c) break;
        /* all slices exhausted: recharge every task's counter */
        for (p = &LAST_TASK; p > &FIRST_TASK; --p)
            if (*p) (*p)->counter = ((*p)->counter >> 1) + (*p)->priority;
    }
    switch_to(next);
}
2.2 O(n) Scheduler (v2.4‑v2.6)
This version traversed the entire run‑queue to compute a weight for each task, resulting in O(n) complexity and noticeable latency for interactive workloads.
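To see why this hurts at scale, here is a minimal Go sketch of the idea (illustrative only, not kernel code; Task, goodness, and pickNext are invented names): every scheduling decision recomputes a weight for every runnable task, so each decision costs O(n).

package main

import "fmt"

// Task is an illustrative stand-in for the kernel's task_struct.
type Task struct {
    Name     string
    Counter  int // remaining time slice
    Priority int
}

// goodness mimics the O(n) scheduler's per-task weight; schedule()
// recomputes it for every runnable task on every call.
func goodness(t *Task) int {
    if t.Counter == 0 {
        return 0
    }
    return t.Counter + t.Priority
}

// pickNext walks the entire run queue: O(n) per scheduling decision.
func pickNext(runqueue []*Task) *Task {
    var best *Task
    bestWeight := -1
    for _, t := range runqueue {
        if w := goodness(t); w > bestWeight {
            bestWeight, best = w, t
        }
    }
    return best
}

func main() {
    rq := []*Task{{"editor", 3, 20}, {"compiler", 6, 15}, {"daemon", 1, 5}}
    fmt.Println("next:", pickNext(rq).Name)
}

With thousands of runnable tasks, this scan runs on every context switch, which is exactly the latency problem the O(1) scheduler set out to remove.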
2.3 O(1) Scheduler (v2.6.0‑v2.6.22)
Introduced per‑CPU run queues and priority arrays, achieving constant‑time scheduling. The data structures include runqueue and prio_array with bitmap‑based fast lookup.
struct runqueue {
    prio_array_t *active, *expired, *arrays[2]; /* active/expired are swapped when all slices expire */
    ...
};

struct prio_array {
    unsigned int nr_active;            /* number of queued tasks */
    unsigned long bitmap[BITMAP_SIZE]; /* one bit per non-empty priority level */
    struct list_head queue[MAX_PRIO];  /* one FIFO list per priority level */
};
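The constant-time behavior comes from the bitmap: locating the highest non-empty priority is a find-first-set-bit operation rather than a scan over tasks. A toy Go illustration of that lookup (prioArray and its methods are simplified stand-ins, not kernel code):

package main

import (
    "fmt"
    "math/bits"
)

const maxPrio = 140 // 100 real-time + 40 normal priorities, as in the O(1) scheduler

// prioArray is a toy analogue of the kernel's prio_array: one FIFO queue
// per priority plus a bitmap marking which queues are non-empty.
type prioArray struct {
    bitmap [3]uint64 // ceil(140/64) words
    queue  [maxPrio][]string
}

func (a *prioArray) enqueue(prio int, task string) {
    a.queue[prio] = append(a.queue[prio], task)
    a.bitmap[prio/64] |= 1 << (prio % 64)
}

// dequeue finds the highest-priority task in constant time: a fixed number
// of find-first-set operations, independent of how many tasks are queued.
func (a *prioArray) dequeue() (string, bool) {
    for word, b := range a.bitmap {
        if b == 0 {
            continue
        }
        prio := word*64 + bits.TrailingZeros64(b)
        task := a.queue[prio][0]
        a.queue[prio] = a.queue[prio][1:]
        if len(a.queue[prio]) == 0 {
            a.bitmap[word] &^= 1 << (prio % 64)
        }
        return task, true
    }
    return "", false
}

func main() {
    var a prioArray
    a.enqueue(120, "batch job")
    a.enqueue(10, "real-time task")
    next, _ := a.dequeue()
    fmt.Println("next:", next) // lower number = higher priority
}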
2.4 Completely Fair Scheduler (CFS) (v2.6.23‑present)
CFS keeps tasks ordered by virtual runtime in a red‑black tree (rooted in cfs_rq), guaranteeing fairness. The scheduler picks the left‑most node (the most under‑served task) and updates its vruntime on each slice.
struct cfs_rq {
    struct load_weight load;       /* total weight of all queued tasks */
    unsigned long nr_running;
    s64 fair_clock;                /* per-queue virtual clock */
    struct rb_root tasks_timeline; /* red-black tree ordered by vruntime */
    struct rb_node *rb_leftmost;   /* cached left-most (next-to-run) node */
    ...
};
The scheduling steps are: disable preemption, select the next task from the red‑black tree, perform context_switch, then re‑enable preemption.
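A toy Go model of the fairness rule may help (a linear scan stands in for the red-black tree, and the weight math is simplified): the entity with the smallest vruntime runs next, and vruntime grows inversely with weight, so heavier tasks receive proportionally more CPU.

package main

import "fmt"

// entity is a toy stand-in for the kernel's sched_entity.
type entity struct {
    name     string
    weight   int64 // higher weight = larger CPU share
    vruntime int64 // virtual runtime, arbitrary units
}

const nice0Weight = 1024 // NICE_0_LOAD: the reference weight in CFS

// pickNext returns the most under-served entity, the one with the smallest
// vruntime. CFS keeps entities in a red-black tree so this is simply the
// leftmost node; a linear scan stands in for that here.
func pickNext(rq []*entity) *entity {
    best := rq[0]
    for _, e := range rq[1:] {
        if e.vruntime < best.vruntime {
            best = e
        }
    }
    return best
}

// account charges a slice of real runtime scaled inversely by weight,
// mirroring vruntime += delta_exec * NICE_0_LOAD / weight.
func account(e *entity, deltaExec int64) {
    e.vruntime += deltaExec * nice0Weight / e.weight
}

func main() {
    rq := []*entity{
        {name: "heavy", weight: 2048}, // twice the reference weight
        {name: "light", weight: 1024}, // nice 0
    }
    for tick := 0; tick < 6; tick++ {
        e := pickNext(rq)
        account(e, 10) // run for 10 time units
        fmt.Println("ran:", e.name, "vruntime:", e.vruntime)
    }
}

Over these six ticks the entity with twice the weight is selected roughly twice as often, which is the proportional-share behavior CFS guarantees.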
3. Go Language Scheduler
3.1 Single‑Threaded Scheduler (0.x)
Early Go used a simple scheduler written in C, with a single OS thread (M) and goroutine (G) structures. The scheduler() function saved the current M state, selected the next G, and switched to it.
static void scheduler(void) {
    G* gp;
    lock(&sched);
    if (gosave(&m->sched)) {
        /* Jumped back here via gogo(): a goroutine yielded, so re-acquire
           the scheduler lock and park the outgoing G. */
        lock(&sched);
        gp = m->curg;
        switch (gp->status) {
        case Grunnable:
        case Grunning:
            gp->status = Grunnable;
            gput(gp); /* put it back on the global run queue */
            break;
        }
        notewakeup(&gp->stopped);
    }
    /* Pick the next runnable G (releases the scheduler lock) and run it. */
    gp = nextgandunlock();
    noteclear(&gp->stopped);
    gp->status = Grunning;
    m->curg = gp;
    gogo(&gp->sched);
}
3.2 Multi‑Threaded Scheduler (1.0)
Go 1.0 added multiple OS threads (M), capped by GOMAXPROCS. The core logic remained similar, but all Ms shared a single global run queue guarded by one scheduler lock, leading to contention.
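A hedged sketch of the bottleneck (globalSched is an invented toy, not the runtime's types): with one shared run queue behind one lock, every M serializes on the same mutex to obtain work, so adding threads adds contention rather than throughput.

package main

import (
    "fmt"
    "sync"
)

// globalSched models the Go 1.0 runtime's single scheduler lock: every M
// contends on the same mutex to take a G from one shared run queue.
type globalSched struct {
    mu   sync.Mutex
    runq []int // goroutine IDs, standing in for G structs
}

func (s *globalSched) get() (int, bool) {
    s.mu.Lock() // every thread serializes here
    defer s.mu.Unlock()
    if len(s.runq) == 0 {
        return 0, false
    }
    g := s.runq[0]
    s.runq = s.runq[1:]
    return g, true
}

func main() {
    s := &globalSched{runq: []int{1, 2, 3, 4}}
    var wg sync.WaitGroup
    for m := 0; m < 4; m++ { // four Ms all contend on the one lock
        wg.Add(1)
        go func(m int) {
            defer wg.Done()
            for {
                g, ok := s.get()
                if !ok {
                    return
                }
                fmt.Printf("M%d runs G%d\n", m, g)
            }
        }(m)
    }
    wg.Wait()
}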
3.3 Work‑Stealing Scheduler (1.1)
Introduced the G‑M‑P model: goroutine (G), OS thread (M), and processor (P). Each P has a local run queue, and idle Ps steal work from other Ps' queues. The scheduler loop now looks like:
static void schedule(void) {
    G *gp;
top:
    if (runtime.gcwaiting) { /* stop this M if a GC is pending */
        gcstopm();
        goto top;
    }
    gp = runqget(m->p);      /* fast path: the P's local run queue */
    if (gp == nil)
        gp = findrunnable(); /* slow path: global queue, netpoll, stealing */
    execute(gp);
}
The P structure:
struct P {
    Lock;
    uint32 status; // Pidle, Prunning, ...
    M *m;          // back-link to M (nil if idle)
    G **runq;      // local run queue
    int32 runqhead, runqtail, runqsize;
    G *gfree;      // cache of dead Gs for reuse
    int32 gfreecnt;
};
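The stealing step can be sketched in ordinary Go (the real runtime uses lock-free ring buffers and atomic operations; p, runqget, and runqsteal here are simplified stand-ins). The key policy is to steal half of the victim's queue, which spreads work quickly while keeping steal attempts rare:

package main

import "fmt"

// p is a toy processor with a local run queue of goroutine IDs.
type p struct {
    id   int
    runq []int
}

// runqget pops from the local queue first: the fast path.
func (pp *p) runqget() (int, bool) {
    if len(pp.runq) == 0 {
        return 0, false
    }
    g := pp.runq[0]
    pp.runq = pp.runq[1:]
    return g, true
}

// runqsteal takes half of a victim's queue, mirroring the runtime's
// steal-half policy: run one stolen G now, keep the rest locally.
func (pp *p) runqsteal(victim *p) (int, bool) {
    n := len(victim.runq) / 2
    if n == 0 {
        return 0, false
    }
    stolen := victim.runq[:n]
    victim.runq = victim.runq[n:]
    pp.runq = append(pp.runq, stolen[1:]...)
    return stolen[0], true
}

func main() {
    busy := &p{id: 0, runq: []int{1, 2, 3, 4}}
    idle := &p{id: 1}
    if g, ok := idle.runqget(); !ok { // local queue empty...
        g, _ = idle.runqsteal(busy) // ...so steal from a peer
        fmt.Printf("P%d stole G%d plus %d more\n", idle.id, g, len(idle.runq))
    } else {
        fmt.Println("ran local G", g)
    }
}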
3.4 Cooperative Preemptive Scheduler (1.2‑1.13)
Added preemption points via the stack‑guard check in function prologues. The runtime marks a G with StackPreempt when a GC needs to stop it or it has been running too long; the next function call trips the stack check and yields the goroutine to the scheduler.
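The mechanism itself lives in compiler-inserted stack checks, but its shape can be shown as a loose Go analogy (preemptRequested stands in for the stackguard0 = StackPreempt trick; none of these names are the runtime's): a flag set asynchronously is only honored at safe points, so a loop that makes no function calls is never preempted.

package main

import (
    "fmt"
    "sync/atomic"
)

var preemptRequested atomic.Bool // stands in for g.stackguard0 = stackPreempt

// checkPreempt models the stack-growth check in every function prologue:
// it is the only place a cooperative preemption can actually happen.
func checkPreempt() {
    if preemptRequested.Load() {
        fmt.Println("yielding at a safe point")
        preemptRequested.Store(false)
        // the real runtime would re-queue the G and call schedule() here
    }
}

func work() {
    for i := 0; i < 3; i++ {
        checkPreempt() // a function call = a preemption opportunity
        // A loop body with no calls never reaches a safe point, which is
        // exactly the gap the Go 1.14 signal-based design closes.
    }
}

func main() {
    preemptRequested.Store(true)
    work()
}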
3.5 Asynchronous Signal‑Based Preemptive Scheduler (1.14‑present)
Implemented true asynchronous preemption using the SIGURG signal. The runtime sends SIGURG to the thread running the target goroutine; the signal handler saves the register state and runtime.asyncPreempt moves the G into the _Gpreempted state, after which the scheduler picks another G.
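The change is observable from ordinary Go code. The program below can hang on Go 1.13, because the tight loop contains no function calls and therefore no cooperative safe points; on Go 1.14 and later, signal-based preemption lets the main goroutine run again:

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    runtime.GOMAXPROCS(1) // one P: the busy loop can monopolize it

    go func() {
        for {
            // tight loop: no calls, no cooperative preemption points
        }
    }()

    time.Sleep(100 * time.Millisecond) // yields, letting the loop start
    // Reaching this line requires the runtime to preempt the loop.
    fmt.Println("preempted the tight loop; main is running again")
}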
3.6 NUMA Scheduler Proposal
A design proposal suggests partitioning global resources (run‑queues, memory allocators, network pollers) per NUMA node to improve locality, though it remains unimplemented.
4. Kubernetes Scheduler
4.1 Predicate‑and‑Priority Scheduler (v1.0‑v1.14)
Kubernetes originally used a two‑stage approach: predicates filter out unsuitable nodes, then priority functions score the remaining nodes. Priority scoring employed a Map‑Reduce pattern.
type FitPredicate func(pod *v1.Pod, meta PredicateMetadata, nodeInfo *schedulernodeinfo.NodeInfo) (bool, []PredicateFailureReason, error)

type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulernodeinfo.NodeInfo) (schedulerapi.HostPriority, error)

type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo, result schedulerapi.HostPriorityList) error

The scheduling flow (sketched in code after this list):
List all nodes.
Run findNodesThatFit (parallel predicates) to obtain filteredNodes.
Run PrioritizeNodes (parallel map functions, then reduce) to produce a priorityList.
Select the node with the highest score.
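A condensed Go sketch of the two stages (node, pod, and the function types are heavily simplified stand-ins for the FitPredicate/PriorityMapFunction signatures above; the example predicate and priority are invented):

package main

import "fmt"

// node and pod are simplified stand-ins for v1.Node / v1.Pod.
type node struct {
    name         string
    freeMilliCPU int64
}
type pod struct{ requestMilliCPU int64 }

// A predicate answers a hard yes/no question, like the real PodFitsResources.
type predicate func(p pod, n node) bool

// A priority maps one node to a score; a reduce step can then normalize
// scores across nodes (the Map-Reduce pattern described above).
type priority func(p pod, n node) int64

func schedule(p pod, nodes []node, preds []predicate, prios []priority) (string, bool) {
    // Stage 1: filter out unsuitable nodes (findNodesThatFit).
    var feasible []node
    for _, n := range nodes {
        fits := true
        for _, pred := range preds {
            if !pred(p, n) {
                fits = false
                break
            }
        }
        if fits {
            feasible = append(feasible, n)
        }
    }
    // Stage 2: score the survivors and pick the best (PrioritizeNodes).
    bestScore := int64(-1)
    best := ""
    for _, n := range feasible {
        var score int64
        for _, prio := range prios {
            score += prio(p, n)
        }
        if score > bestScore {
            bestScore, best = score, n.name
        }
    }
    return best, best != ""
}

func main() {
    fitsCPU := func(p pod, n node) bool { return n.freeMilliCPU >= p.requestMilliCPU }
    leastAllocated := func(p pod, n node) int64 { return n.freeMilliCPU - p.requestMilliCPU }

    nodes := []node{{"node-a", 500}, {"node-b", 4000}, {"node-c", 100}}
    host, ok := schedule(pod{requestMilliCPU: 250}, nodes,
        []predicate{fitsCPU}, []priority{leastAllocated})
    fmt.Println(host, ok) // node-b: passes the filter with the most headroom
}

The real PrioritizeNodes additionally runs the map functions in parallel across nodes and lets a reduce function normalize the scores, but the filter-then-score shape is the same.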
4.2 Scheduling Framework (v1.15‑present)
The framework refactors the scheduler into a pipeline of extension points: QueueSort, PreFilter, Filter, PostFilter, Score, Reserve, Permit, PreBind, Bind, PostBind, and Unreserve. The core loop Scheduler.scheduleOne executes these points sequentially, allowing plugins to influence each stage.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    fwk := sched.Framework
    podInfo := sched.NextPod()
    pod := podInfo.Pod
    state := framework.NewCycleState()

    // Filtering and scoring happen inside Schedule (error handling elided).
    scheduleResult, _ := sched.Algorithm.Schedule(ctx, state, pod)
    assumedPod := podInfo.DeepCopy().Pod

    // Reserve resources on the suggested host.
    if sts := fwk.RunReservePlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost); !sts.IsSuccess() {
        return
    }

    // Bind phase runs in a separate goroutine so the next pod can be scheduled.
    go func() {
        fwk.RunPermitPlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost)
        sched.bindVolumes(assumedPod)
        fwk.RunPreBindPlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost)
        if err := sched.bind(ctx, assumedPod, scheduleResult.SuggestedHost, state); err != nil {
            fwk.RunUnreservePlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost)
        } else {
            fwk.RunPostBindPlugins(ctx, state, assumedPod, scheduleResult.SuggestedHost)
        }
    }()
}
The framework enables custom plugins for filtering, scoring, and binding, making the scheduler highly extensible.
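To make the extension points concrete, here is a toy pipeline in plain Go. These interfaces are deliberately not the real framework plugin API (which lives under pkg/scheduler/framework and also threads a CycleState through every call); they only mirror its shape in miniature:

package main

import "fmt"

type pod struct{ name string }

// filterPlugin and scorePlugin mirror two of the framework's extension
// points; the real interfaces are richer and return Status values.
type filterPlugin interface {
    Filter(p pod, node string) bool
}
type scorePlugin interface {
    Score(p pod, node string) int64
}

type fwk struct {
    filters []filterPlugin
    scorers []scorePlugin
}

// scheduleOne runs the extension points in order, as the real
// Scheduler.scheduleOne does: Filter, then Score, then pick the winner.
func (f *fwk) scheduleOne(p pod, nodes []string) (string, bool) {
    best, bestScore := "", int64(-1)
    for _, n := range nodes {
        ok := true
        for _, fp := range f.filters {
            if !fp.Filter(p, n) {
                ok = false
                break
            }
        }
        if !ok {
            continue
        }
        var score int64
        for _, sp := range f.scorers {
            score += sp.Score(p, n)
        }
        if score > bestScore {
            best, bestScore = n, score
        }
    }
    return best, best != ""
}

// zonePlugin is a hypothetical custom plugin: it filters out a quarantined
// node and gives a bonus to a preferred one.
type zonePlugin struct{ preferred string }

func (z zonePlugin) Filter(p pod, node string) bool { return node != "quarantined" }
func (z zonePlugin) Score(p pod, node string) int64 {
    if node == z.preferred {
        return 100
    }
    return 0
}

func main() {
    z := zonePlugin{preferred: "node-b"}
    f := &fwk{filters: []filterPlugin{z}, scorers: []scorePlugin{z}}
    host, ok := f.scheduleOne(pod{name: "web"}, []string{"node-a", "node-b", "quarantined"})
    fmt.Println(host, ok)
}

A real plugin registers itself with the framework at scheduler start-up and may implement any subset of the extension points.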
5. Comparison and Conclusions
All three schedulers share common goals—efficient resource allocation and fairness—but differ in granularity and context. Linux and Go operate at the system level, focusing on low‑latency, high‑throughput task dispatch, while Kubernetes works at the application level, emphasizing policy‑driven placement and extensibility.
Key observations:
Early versions of each scheduler were simple and cooperative; over time they adopted work‑stealing and preemptive mechanisms.
Linux introduced scheduling classes to support diverse workloads; Go added G‑M‑P and signal‑based preemption; Kubernetes moved from static predicate/priority to a plugin‑based framework.
Future work includes NUMA‑aware scheduling for Go and multi‑scheduler extensions for Kubernetes.
References
Linux kernel source files: sched.h, sched.c, sched_fair.c
Go runtime source: runtime/sched.go, runtime/proc.go
Kubernetes scheduler code: pkg/scheduler/framework, pkg/scheduler/apis
Further reading: "Understanding the Linux 2.6.8.1 CPU Scheduler", "CFS Scheduler", "Scalable work stealing", "Scheduling Framework #624"