Mastering Linux Kernel Threads: Core Mechanisms and Scheduling
This article explains Linux kernel threads from basic concepts to deep internals, covering their data structures, creation, execution flow, scheduling strategies, context‑switch overhead, synchronization primitives, interrupt handling, and a practical kswapd memory‑reclaim case study, providing concrete code examples and step‑by‑step analysis.
1. Introduction to Kernel Threads
Kernel threads are execution units that the kernel creates for its own background work, such as memory reclaim (kswapd), deferred work (kworker), and softirq processing (ksoftirqd). Unlike user-space threads, they run entirely in kernel space and have no user address space, so a bug in one directly affects system stability and performance.
1.1 Relationship Between Threads and Processes
A process is an independent resource‑allocation unit with its own address space, code, data, and file descriptors. A thread is a CPU‑scheduling unit inside a process that shares these resources. Multiple threads in a process can run concurrently, similar to employees sharing a workspace.
1.2 Unique Identity of Kernel Threads
Each kernel thread is represented by a task_struct (the thread’s “ID card”). Important fields include:
pid : unique identifier.
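state : current state such as TASK_RUNNING, TASK_INTERRUPTIBLE, or TASK_UNINTERRUPTIBLE (the field was renamed __state in kernel 5.14).
mm : pointer to the memory-management structure; for kernel threads this is NULL because they have no user address space. While running, a kernel thread borrows the previous task's address space through active_mm.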
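Kernel threads can also be recognised from user space: they carry the PF_KTHREAD flag, which shows up in the flags field of /proc/<pid>/stat. The following is a minimal sketch, assuming the field order documented in proc(5) and the PF_KTHREAD value 0x00200000 from include/linux/sched.h. The helper names stat_line_is_kthread and pid_is_kthread are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PF_KTHREAD 0x00200000UL  /* value from include/linux/sched.h */

/*
 * Parse one line in /proc/<pid>/stat format and report whether the task
 * is a kernel thread. Returns 1 (kernel thread), 0 (not), -1 (parse error).
 */
int stat_line_is_kthread(const char *line)
{
    const char *p;
    char state;
    int ppid, pgrp, session, tty_nr, tpgid;
    unsigned long flags;

    assert(line != NULL);

    /* The comm field may contain spaces and ')', so skip past the last ')'. */
    p = strrchr(line, ')');
    if (p == NULL)
        return -1;

    /* Fields 3..9: state ppid pgrp session tty_nr tpgid flags */
    if (sscanf(p + 1, " %c %d %d %d %d %d %lu",
               &state, &ppid, &pgrp, &session, &tty_nr, &tpgid, &flags) != 7)
        return -1;

    return (flags & PF_KTHREAD) ? 1 : 0;
}

/* Convenience wrapper that reads /proc/<pid>/stat for a live task. */
int pid_is_kthread(int pid)
{
    char path[64], buf[1024];
    FILE *f;
    int ret = -1;

    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof(buf), f) != NULL)
        ret = stat_line_is_kthread(buf);
    fclose(f);
    return ret;
}
```

On a typical system PID 2 is kthreadd, so pid_is_kthread(2) would be expected to report a kernel thread, though containers may hide that PID.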
1.3 Kernel Thread Stack
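Each kernel thread has a fixed-size stack whose size depends on the architecture (8 KB on 32-bit x86, 16 KB on x86-64). The stack stores local variables, function arguments, and return addresses. During an interrupt, the interrupted register state is also saved here, although some architectures (x86-64 among them) then run the handler itself on a separate per-CPU interrupt stack.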
1.4 Context Switch Basics
When the scheduler switches from one thread to another, it saves the current thread's registers, program counter, and stack pointer, then loads the next thread's context. The switch costs time directly (saving and restoring state) and indirectly (the incoming thread often finds the CPU caches filled with another thread's data).
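The save-and-restore idea can be tried out in user space with the glibc ucontext API, where swapcontext() stores the current registers and stack pointer and loads another set. This is only an analogy, not the kernel's switch_to(). The names worker, record, and run_switch_demo are invented for the sketch, and the 64 KB stack size is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)   /* arbitrary size for the second stack */

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[STACK_SIZE];
static int trace[4];             /* order in which the steps executed */
static int trace_len;

static void record(int step)
{
    assert(trace_len < 4);
    trace[trace_len++] = step;
}

static void worker(void)
{
    record(2);
    swapcontext(&worker_ctx, &main_ctx);  /* save worker state, resume main */
    record(4);
    /* Returning from here resumes uc_link, i.e. main_ctx. */
}

/* Switches main -> worker -> main -> worker -> main; returns steps recorded. */
int run_switch_demo(void)
{
    trace_len = 0;
    if (getcontext(&worker_ctx) != 0)
        return -1;
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof(worker_stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    record(1);
    swapcontext(&main_ctx, &worker_ctx);  /* save main state, run worker */
    record(3);
    swapcontext(&main_ctx, &worker_ctx);  /* resume worker where it stopped */
    return trace_len;
}
```

Each swapcontext() call is one context switch in miniature: the worker continues exactly after the point where it last gave up the CPU, on its own stack.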
2. Kernel Thread Internals
2.1 Key Data Structures
Three structures form the backbone of a kernel thread:
struct thread_info: a small, architecture-specific structure. On older kernels it sits at the bottom of the kernel stack and holds a pointer to the associated task_struct; on architectures that enable CONFIG_THREAD_INFO_IN_TASK (x86 since 4.9) it is embedded in task_struct instead.
struct task_struct: the full descriptor holding pid, tgid, stack pointer, memory-management pointers, file system info, etc.
Kernel stack: the memory area used for function calls and local data.
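Placing thread_info at the bottom of the stack is what let older kernels find the current task from nothing but the stack pointer: because the stack is aligned to its own size, masking off the low bits of any in-stack address yields the base. Below is a userspace model of that legacy layout. THREAD_SIZE, task_model, thread_info_model, and both helper functions are stand-ins chosen for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192UL   /* assumed stack size; must be a power of two */

struct task_model {          /* stand-in for struct task_struct */
    int pid;
};

struct thread_info_model {   /* stand-in for the legacy struct thread_info */
    struct task_model *task;
};

/* Allocate a THREAD_SIZE-aligned stack and place thread_info at its base. */
void *alloc_stack(struct task_model *task)
{
    void *stack;

    assert(task != NULL);
    stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
    if (stack != NULL)
        ((struct thread_info_model *)stack)->task = task;
    return stack;
}

/* Legacy current_thread_info() idea: mask the low bits off a stack address. */
struct thread_info_model *thread_info_from_sp(const void *sp)
{
    uintptr_t base = (uintptr_t)sp & ~(uintptr_t)(THREAD_SIZE - 1);

    return (struct thread_info_model *)base;
}
```

Any address inside the same stack maps back to the same thread_info, which is why the scheme needs no per-CPU variable or register to locate the current task.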
2.2 Thread Creation Workflow
Kernel thread creation is orchestrated by the kthreadd daemon. The steps are:
Call __kthread_create_on_node() to enqueue a creation request on kthread_create_list and wake kthreadd.
kthreadd picks up the request and invokes create_kthread(), which calls kernel_thread() and, through it, kernel_clone() (named _do_fork() before 5.10) to allocate a task_struct and a kernel stack.
The new thread's task_struct is initialized with state, priority, and a pointer to its entry function.
Compared with process creation, kernel threads do not duplicate a user address space, making the creation path lighter.
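#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/err.h>

static struct task_struct *my_kthread_task;

/* Thread body: loops until kthread_stop() is called on this task. */
static int my_kthread_func(void *data)
{
    while (!kthread_should_stop()) {
        printk(KERN_INFO "Kernel thread running\n");
        msleep(1000);
    }
    printk(KERN_INFO "Kernel thread exiting\n");
    return 0;
}

static int __init kthread_demo_init(void)
{
    /* kthread_run() creates the thread and wakes it immediately. */
    my_kthread_task = kthread_run(my_kthread_func, NULL, "my_demo_kthread");
    if (IS_ERR(my_kthread_task)) {
        printk(KERN_ERR "Kernel thread creation failed\n");
        return PTR_ERR(my_kthread_task);   /* propagate the real error code */
    }
    printk(KERN_INFO "Kernel thread created successfully\n");
    return 0;
}

static void __exit kthread_demo_exit(void)
{
    /* Stop the thread and wait for it to exit before the module is unloaded. */
    kthread_stop(my_kthread_task);
}

module_init(kthread_demo_init);
module_exit(kthread_demo_exit);
MODULE_LICENSE("GPL");
2.3 Thread Execution Flow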
After creation, the thread enters the common entry kthread(), which performs generic initialization and then calls the user‑provided entry function (e.g., my_kthread_func). The thread can be in one of several states:
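Running : actively executing on a CPU (TASK_RUNNING).
Ready : on a run queue waiting for a CPU; Linux reports this as TASK_RUNNING too.
Sleeping (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE): waiting for an event.
Parked (TASK_PARKED): a special sleep state requested via kthread_park() and ended with kthread_unpark(); the thread itself must check kthread_should_park() and call kthread_parkme().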
2.4 Scheduling Strategies
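Normal tasks, including most kernel threads, are handled by the Completely Fair Scheduler (CFS, merged in 2.6.23 and superseded by EEVDF in 6.6). Two mechanisms matter here:
Fair sharing for normal tasks (SCHED_NORMAL): CFS has no fixed time slice. It tracks each task's virtual runtime (vruntime), weighted by the nice value (-20 to 19), and runs the task that has received the least.
Priority-based preemptive scheduling for real-time tasks (SCHED_FIFO and SCHED_RR): real-time priorities range from 1 to 99 (higher number = higher priority), and a runnable real-time task always preempts normal tasks. SCHED_RR additionally rotates tasks of equal priority round-robin.
On multiprocessor systems the scheduler also balances load across CPUs while honouring each task's CPU affinity.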
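For normal tasks, fairness in CFS comes from virtual runtime: CPU time is charged to a task scaled by the weight of its nice value, and the task with the smallest accumulated vruntime runs next. The sketch below models that rule in user space. The weight table is copied from the kernel's sched_prio_to_weight[]; everything else (task_model, vruntime_delta, account_runtime, pick_next) is a simplified stand-in, and the plain division replaces the fixed-point arithmetic the kernel actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_LOAD 1024

/* Copy of the kernel's sched_prio_to_weight[] table (nice -20 .. 19). */
static const int prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

struct task_model {
    int nice;
    uint64_t vruntime;   /* weighted runtime in nanoseconds */
};

/* Simplified: the kernel multiplies by a precomputed inverse weight instead. */
uint64_t vruntime_delta(uint64_t delta_exec_ns, int nice)
{
    assert(nice >= -20 && nice <= 19);
    return delta_exec_ns * NICE_0_LOAD / prio_to_weight[nice + 20];
}

/* Charge real execution time to a task's virtual runtime. */
void account_runtime(struct task_model *t, uint64_t delta_exec_ns)
{
    t->vruntime += vruntime_delta(delta_exec_ns, t->nice);
}

/* Linear scan; CFS keeps tasks in a red-black tree and takes the leftmost. */
struct task_model *pick_next(struct task_model *tasks, size_t n)
{
    struct task_model *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}
```

A task at nice 5 accumulates vruntime roughly three times faster than a task at nice 0 for the same real runtime, so it is picked less often without ever being starved.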
2.5 Context‑Switch Implementation
The core entry point is schedule(), which calls __schedule(). The latter selects the next runnable thread via pick_next_task(), which for CFS chooses the task with the smallest vruntime. The actual switch is performed by context_switch(), which swaps address spaces (a no‑op for kernel threads) and registers via switch_to().
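/* Simplified from kernel/sched/core.c (around v5.10);
 * work submission, RCU handling and statistics are omitted. */
asmlinkage __visible void __sched schedule(void)
{
    do {
        preempt_disable();
        __schedule(false);
        sched_preempt_enable_no_resched();
    } while (need_resched());
}

static void __sched notrace __schedule(bool preempt)
{
    struct task_struct *prev, *next;
    struct rq_flags rf;
    struct rq *rq;

    rq = cpu_rq(smp_processor_id());
    prev = rq->curr;

    local_irq_disable();
    rq_lock(rq, &rf);

    next = pick_next_task(rq, prev, &rf);   /* CFS: smallest vruntime */
    clear_tsk_need_resched(prev);

    if (likely(prev != next))
        rq = context_switch(rq, prev, next, &rf);   /* also drops the rq lock */
    else
        rq_unlock_irq(rq, &rf);
}
3. Synchronization and Interrupt Handling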
When multiple kernel threads share resources, Linux provides mutexes, spinlocks, semaphores, and wait queues or completions, which play the role that condition variables play in user space. A mutex puts the waiting thread to sleep until the lock is released, while a spinlock busy-waits, which suits very short critical sections and contexts that must not sleep. Semaphores allow a limited number of concurrent holders, and wait queues wake waiting threads when a specific condition becomes true.
3.1 Interrupt Interaction
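A hardware interrupt preempts the current thread, saves its context, and jumps to an interrupt handler. To keep handlers short, Linux defers heavy work to softirqs, tasklets, workqueues, or threaded interrupt handlers. Softirqs and tasklets normally run in softirq context on return from the interrupt and are handed to the per-CPU ksoftirqd kernel thread only under sustained load; workqueue items and threaded handlers always run in kernel threads (kworker and irq/<n>-<name>).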
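The split between a short handler and deferred processing can be pictured as a producer that only records events and a consumer that does the slow work later. The following single-threaded model illustrates that division of labour. The queue layout, its capacity, and the names irq_top_half, bottom_half, and sum_handler are all assumptions made for the example; real kernel code would need locking or per-CPU data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN 8   /* arbitrary capacity for the sketch (power of two) */

struct event_queue {
    int events[QUEUE_LEN];
    size_t head, tail;        /* head: next to read, tail: next to write */
    unsigned long dropped;    /* events lost because the queue was full */
};

/* "Top half": record the event and return immediately. */
bool irq_top_half(struct event_queue *q, int event)
{
    assert(q != NULL);
    if (q->tail - q->head == QUEUE_LEN) {
        q->dropped++;
        return false;
    }
    q->events[q->tail++ % QUEUE_LEN] = event;
    return true;
}

/* "Bottom half": drain the queue later, where slow work is acceptable. */
size_t bottom_half(struct event_queue *q, void (*handle)(int))
{
    size_t n = 0;

    assert(q != NULL && handle != NULL);
    while (q->head != q->tail) {
        handle(q->events[q->head++ % QUEUE_LEN]);
        n++;
    }
    return n;
}

/* Example handler: accumulates event values so the result can be inspected. */
static int handled_sum;

void sum_handler(int event)
{
    handled_sum += event;
}
```

The top half does a bounded amount of work regardless of how expensive the handler is, which is the property that keeps interrupt latency low.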
4. Kernel Thread Scheduling Management
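The scheduler aims to maximize CPU utilization while preserving fairness and responsiveness. Many kernel threads, however, are woken by subsystem-specific conditions rather than by the scheduler itself. The memory manager, for example, keeps three watermarks per memory zone (WMARK_MIN, WMARK_LOW, WMARK_HIGH) and uses them to decide when to wake the background reclaim thread kswapd and when to let it sleep again.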
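The watermark logic behaves like a thermostat with hysteresis: background reclaim starts when free pages drop below WMARK_LOW and stops only once they are back at WMARK_HIGH. The kernel also defines a third mark, WMARK_MIN, below which allocating tasks generally have to reclaim memory themselves. The model below illustrates this. zone_model and both functions are simplified stand-ins, not kernel code, and the real decision involves more inputs (allocation order, lowmem reserves, watermark boosting).

```c
#include <assert.h>
#include <stdbool.h>

struct zone_model {
    unsigned long free;                          /* free pages in the zone */
    unsigned long wmark_min, wmark_low, wmark_high;
    bool kswapd_running;
};

/* Apply the wake/sleep rule once; returns whether kswapd is active afterwards. */
bool update_kswapd(struct zone_model *z)
{
    assert(z->wmark_min <= z->wmark_low && z->wmark_low <= z->wmark_high);

    if (!z->kswapd_running && z->free < z->wmark_low)
        z->kswapd_running = true;        /* wake: free pages below LOW */
    else if (z->kswapd_running && z->free >= z->wmark_high)
        z->kswapd_running = false;       /* sleep: back at HIGH */
    return z->kswapd_running;
}

/* Below WMARK_MIN the allocating task reclaims on its own (direct reclaim). */
bool needs_direct_reclaim(const struct zone_model *z)
{
    return z->free < z->wmark_min;
}
```

The gap between the low and high marks keeps kswapd from flapping on and off when free memory hovers around a single threshold.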
4.1 kswapd Wake‑up Logic
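If free memory falls below WMARK_LOW, kswapd should be awakened to reclaim pages, and it keeps reclaiming until free memory is back at WMARK_HIGH. The function zone_watermark_ok() determines whether a zone still has enough free pages to satisfy an allocation of the given order without dropping below the given watermark.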
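/* Simplified illustration; the real check in mm/page_alloc.c also accounts
 * for lowmem reserves, allocation flags, and per-order free lists. */
static int zone_watermark_ok(struct zone *zone, unsigned int order,
                             unsigned long mark, int classzone_idx,
                             int alloc_flags)
{
    /* NR_FREE_PAGES already includes free CMA pages; do not add them again. */
    unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

    /* An order-N request consumes (1 << N) contiguous pages. */
    if (free >= mark + (1UL << order) - 1)
        return 1;
    return 0;
}
5. Common Sleep Issues in Kernel Threads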
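Calling msleep() puts the thread into TASK_UNINTERRUPTIBLE until a timer expires. Waiting for an event, for example with wait_event_interruptible(), sets a sleeping state, places the thread on a wait queue, and then calls schedule(). Over-long sleeps can cause latency spikes, sleeping while holding a spinlock is a bug, and blocking while holding other locks can lead to deadlocks.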
5.1 Example of Improper Sleep Duration
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static int monitor_function(void *data)
{
    while (!kthread_should_stop()) {
        /* check file changes */
        msleep(100);   /* polling interval; may be too long for a responsive monitor */
    }
    return 0;
}
Reducing the sleep to 10 ms improves responsiveness; increasing it saves CPU when latency is not critical. For delays shorter than about 20 ms, usleep_range() is more precise than msleep().
5.2 Sleep‑Induced Deadlock Example
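#include <linux/kthread.h>
#include <linux/semaphore.h>
#include <linux/delay.h>

/* Both semaphores are assumed to be initialised elsewhere with sema_init(&semN, 1). */
static struct semaphore sem1, sem2;

static int thread1_function(void *data)
{
    while (!kthread_should_stop()) {
        down(&sem1);
        /* ... */
        down(&sem2);   /* deadlock if thread2 holds sem2 and waits for sem1 */
        up(&sem2);
        up(&sem1);
        msleep(100);
    }
    return 0;
}
The fix is to acquire semaphores in a consistent global order (e.g., always sem1 then sem2).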
6. Case Study: kswapd Memory‑Reclaim Failure on an Embedded ARM System
An embedded Linux 5.10 system experienced frequent “memory allocation stall” messages (order:2, mode:0x2040dc0). Investigation showed that kswapd remained asleep even when free memory dropped below WMARK_LOW, causing direct reclaim and process‑level blocking for several seconds.
Root cause: a custom kernel modification altered zone_watermark_ok() and relaxed the watermark check, preventing kswapd from being triggered.
Fix: restore the original condition (free ≥ mark + (1 << order) ‑ 1). After recompiling and flashing the kernel, kswapd wakes correctly, memory levels rise back to WMARK_HIGH, and the direct‑reclaim logs disappear.
To watch free memory and the kswapd thread while verifying the fix:
# watch -n 1 "cat /proc/meminfo | grep MemFree; ps aux | grep kswapd"
This demonstrates how a subtle change in a core memory-management function can cascade into system-wide latency issues, and how the combination of /proc diagnostics and source-code review can pinpoint the defect.
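MemFree alone does not show how close a zone is to its watermarks; /proc/zoneinfo lists the free page count next to the min, low, and high marks for each zone. A small parser along the following lines could automate the comparison. It assumes the line layout seen on kernels around 5.10 ("pages free", then "min", "low", "high"), which should be checked on the target system, and it handles the lines of one zone at a time. zone_marks and both functions are names invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>

struct zone_marks {
    unsigned long free, min, low, high;
    int found;   /* bitmask of fields seen: 1=free 2=min 4=low 8=high */
};

/* Feed one line of /proc/zoneinfo; stores the value if the line carries one. */
void parse_zoneinfo_line(const char *line, struct zone_marks *m)
{
    unsigned long v;

    assert(line != NULL && m != NULL);
    if (sscanf(line, " pages free %lu", &v) == 1) {
        m->free = v;
        m->found |= 1;
    } else if (sscanf(line, " min %lu", &v) == 1) {
        m->min = v;
        m->found |= 2;
    } else if (sscanf(line, " low %lu", &v) == 1) {
        m->low = v;
        m->found |= 4;
    } else if (sscanf(line, " high %lu", &v) == 1) {
        m->high = v;
        m->found |= 8;
    }
}

/* True when all four fields were seen and free pages are below WMARK_LOW. */
int below_low_watermark(const struct zone_marks *m)
{
    return m->found == 15 && m->free < m->low;
}
```

If a zone stays below its low mark while kswapd remains asleep, that points at the wake-up path, as it did in this case.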
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
