
An Overview of Linux Task Scheduling: History, Concepts, and Implementation

This article surveys the evolution, core concepts, and concrete implementation details of Linux task scheduling. Using kernel code excerpts, it illustrates the key data structures, scheduling entities, run‑queue management, and the CFS algorithm, and explains how the scheduler balances interactive and background workloads.

Tencent Database Technology

Abstract: This article introduces the development history and basic principles of Linux task scheduling, illustrated with kernel code excerpts. Over the years, kernel developers have sought methods that maximize resource utilization for high‑load background tasks while preserving good interactive performance on desktop systems, yet no perfect solution exists. Related topics such as databases and language runtimes are touched on along the way.

Scheduling Entities

A process usually contains one or more threads. Unlike user‑space threads, kernel threads have no user‑space address space and live entirely in kernel space. Kernel threads maintain system operation and can be started or stopped only by the kernel itself. The kernel provides interfaces for managing them, for example:

/* Simple interface for creating and stopping kernel threads without mess. */
…
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data), void *data, int node, const char namefmt[], ...);

#define kthread_create(threadfn, data, namefmt, arg...) \
      kthread_create_on_node(threadfn, data, -1, namefmt, ##arg)

In Linux, the scheduling unit is the kernel thread, and the terms “thread” and “process” are often used interchangeably.

Each task in Linux is described by a C struct . Historically this structure is large and complex, containing identifiers, parent/child relationships, registers, resource descriptors, and more. Linux uses a circular doubly‑linked list to manage tasks.

struct task_struct { 
  volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ 
  void *stack; 
  … 
  unsigned int flags; /* per process flags */ 
  … 
  struct mm_struct *mm; 
  … 
  pid_t pid; 
  pid_t tgid; 
  … 
  struct task_struct __rcu *real_parent; /* real parent process */ 
  struct list_head children; /* list of my children */ 
  struct list_head sibling; /* linkage in my parent's children list */
  ...
}

Because kernel stacks are small, the kernel keeps a lightweight thread_info structure at the bottom of each kernel stack, which points back to the full task_struct:

struct thread_info {
  struct task_struct *task; /* main task structure */
  struct exec_domain *exec_domain; /* execution domain */
  __u32 flags; /* low level flags */
  __u32 status; /* thread synchronous flags */
  __u32 cpu; /* current CPU */
  int saved_preempt_count;
  mm_segment_t addr_limit;
  struct restart_block restart_block;
  void __user *sysenter_return;
  unsigned int uaccess_err:1; /* uaccess failed */
};

Iterating over a list of thousands of tasks would be costly, so Linux provides the current macro to obtain the currently executing task. On x86, the current thread info is retrieved from the kernel stack:

static inline struct thread_info *current_thread_info(void)
{ 
  struct thread_info *ti; 
  ti = (void *)(this_cpu_read_stable(kernel_stack) + 
                 KERNEL_STACK_OFFSET - THREAD_SIZE); 
  return ti;
}

After obtaining thread_info, the current task is reached through current = current_thread_info()->task. The kernel stack layout changed after 4.1, and from 4.9 onward (with THREAD_INFO_IN_TASK) the task pointer is no longer obtained through thread_info.

Task structures are allocated from the SLAB allocator, which maintains separate caches (slabs) for frequently used object types such as task_struct , inodes, and memory descriptors. Although the SLAB family of allocators has drawn criticism as a bottleneck on some modern hardware, it remains a core part of Linux memory management.

Linux also implements a classic doubly‑linked list where the list node is embedded in the containing object:

struct list_head {
  struct list_head *next, *prev;
};

struct task {
  int num; /* number of the task to do */
  bool is_awesome; /* is it something awesome? */
  struct list_head mylist; /* the list of all tasks */
};

Common list‑operation macros are defined as:

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) \
  struct list_head name = LIST_HEAD_INIT(name)
static inline void INIT_LIST_HEAD(struct list_head *list)
{
  list->next = list;
  list->prev = list;
}

Typical usage:

struct task *some_task;
/* … allocate some_task … */
some_task->num = 1;
some_task->is_awesome = false;
INIT_LIST_HEAD(&some_task->mylist);

Linux provides O(1) list primitives such as list_add , list_add_tail , and list_del :

static inline void list_add(struct list_head *new,
                            struct list_head *head);
static inline void list_add_tail(struct list_head *new,
                                  struct list_head *head);
static inline void list_del(struct list_head *entry);

Iterating over the list can be done with:

struct list_head *h;
struct task *t;
list_for_each(h, &task_list) {
  t = list_entry(h, struct task, mylist);
  if (t->num == 1) {
    /* do something */
  }
}

A more convenient macro is list_for_each_entry(pos, head, member) , which directly yields the containing object.

Scheduling Algorithms – Concepts

Linux is a classic multitasking system: many tasks can be runnable concurrently, though not necessarily in parallel. The scheduler decides, at each timer tick (HZ, commonly 100–1000 per second depending on configuration), which task should run on which CPU for the next time slice.

Tasks are classified as interactive or non‑interactive. Non‑interactive tasks receive longer time slices but are scheduled less frequently, while interactive tasks are scheduled more often with shorter slices to preserve responsiveness.

The task structure relevant to scheduling contains fields such as:

struct task_struct {
  int prio, static_prio, normal_prio;
  unsigned int rt_priority;
  const struct sched_class *sched_class;
  struct sched_entity se;
  struct sched_rt_entity rt;
  …
  unsigned int policy;
  cpumask_t cpus_allowed;
  …
};

Key fields:

prio          → effective priority (lower value = higher priority)
policy        → scheduling policy (e.g., SCHED_NORMAL)
cpus_allowed  → CPUs on which the task may run (affinity)

Priority influences how often a task is chosen. The default nice value is 0, on a scale from –20 (highest priority) to +19 (lowest). Unprivileged users can only increase their own nice value. This article focuses on normal (CFS) scheduling; vanilla Linux does not provide hard real‑time guarantees, while soft real‑time workloads are served by the real‑time policies such as SCHED_FIFO and SCHED_RR.

Scheduling Algorithms – Implementation

Relevant source files in the Linux tree:

core.c – core scheduler code

fair.c – implementation of the Completely Fair Scheduler (CFS)

rt.c – real‑time scheduling code

idle_task.c – idle‑task scheduling

All scheduling policies share a common interface defined by struct sched_class :

struct sched_class {
  const struct sched_class *next;
  void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
  void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
  void (*yield_task) (struct rq *rq);
  …
  void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
  struct task_struct *(*pick_next_task) (struct rq *rq, struct task_struct *prev);
  …
  void (*set_curr_task) (struct rq *rq);
  void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
  void (*task_fork) (struct task_struct *p);
  void (*task_dead) (struct task_struct *p);
  …
};

Task‑state transitions are handled by functions such as enqueue_task , dequeue_task , yield_task , pick_next_task , and task_tick .

The scheduler operates on a per‑CPU run‑queue ( struct rq ) that holds all runnable tasks for that CPU. Each CPU has its own run‑queue, and tasks may migrate between queues for load‑balancing.

/* This is the main, per-CPU runqueue data structure. */
struct rq {
  …
  unsigned int nr_running;   /* number of runnable tasks */
  unsigned long cpu_load[CPU_LOAD_IDX_MAX];
  struct load_weight load;   /* aggregate load */
  struct cfs_rq cfs;         /* CFS‑specific data */
  struct rt_rq rt;           /* RT‑specific data */
  struct task_struct *curr, *idle, …;
  u64 clock;
  int cpu;                    /* CPU identifier */
  …
};

nr_running – current runnable task count
cpu_load   – load information for the CPU
curr       – pointer to the currently executing task
cfs        – CFS‑related data

The scheduler is driven by the timer interrupt handler scheduler_tick() , which runs at HZ frequency. The tick handler updates the run‑queue clock, calls the current task's task_tick method, and updates CPU load statistics; when the current task has used up its share, the tick sets the need‑resched flag and schedule() runs at the next opportunity, such as a return to user space or a preemption point.

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
  int cpu = smp_processor_id();
  struct rq *rq = cpu_rq(cpu);
  struct task_struct *curr = rq->curr;
  …
  update_rq_clock(rq);
  curr->sched_class->task_tick(rq, curr, 0);
  update_cpu_load_active(rq);
  …
}

Depending on the task's scheduling policy, the appropriate task_tick implementation is invoked (e.g., CFS calls task_tick_fair() ). When a task blocks (e.g., waiting for I/O), it changes state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE , and schedule() selects the next runnable task. If a higher‑priority task becomes runnable, it may preempt the current task immediately.

The next article will continue to dissect the Linux scheduler code, trace its historical evolution, and discuss practical implications for database workloads.

Conclusion

This article introduced basic concepts of Linux task scheduling and presented relevant kernel code snippets. Scheduling is a vast and complex topic; future posts will explore deeper implementation details and illustrate real‑world scenarios such as database systems, providing test data and conclusions.

Tencent Database Technology Team supports internal services like WeChat Red Packets, lottery, and Data Bank, and provides external cloud database products such as CDB, CTSDB, CKV, and CMongo. The team focuses on enhancing database kernel features, improving performance, ensuring stability, and sharing production knowledge.