Process Management & Scheduling (Part 0): Essential Kernel Structures
This article introduces the core Linux kernel data structures involved in process management and scheduling—task_struct, sched_entity, rq, and sched_avg—explaining their key fields, relationships, and how they enable the kernel to track process state, timing, memory, and load‑balancing decisions.
The series aims to dissect Linux process management and scheduling by first covering the prerequisite kernel structures. It focuses on four central structs: task_struct, sched_entity, rq, and sched_avg, each of which encapsulates specific aspects of a process’s lifecycle and the scheduler’s operation.
task_struct
task_struct is the Linux PCB (Process Control Block). In Linux 6.5 its definition spans over 800 lines and stores all information about a process, including identifiers, scheduling data, memory descriptors, file descriptors, and runtime statistics. The most important fields are:
struct task_struct {
    /* State, kernel stack, and flags */
    unsigned int __state;                 // task state (TASK_RUNNING, TASK_INTERRUPTIBLE, ...)
    void *stack;                          // kernel stack pointer
    refcount_t usage;                     // reference count
    unsigned int flags;                   // PF_* flags
    /* Scheduling */
    int prio;                             // effective scheduling priority
    int static_prio;                      // priority derived from the nice value
    int normal_prio;                      // priority derived from static_prio and policy
    unsigned int rt_priority;             // real-time priority
    struct sched_entity se;               // normal scheduling entity
    struct sched_rt_entity rt;            // real-time entity
    struct sched_dl_entity dl;            // deadline entity
    const struct sched_class *sched_class;
    unsigned int policy;                  // scheduling policy
    cpumask_t cpus_mask;                  // CPU affinity mask
    /* Exit handling */
    int exit_state;                       // EXIT_ZOMBIE or EXIT_DEAD once the task has exited
    int exit_code;                        // exit code reported to the parent
    int exit_signal;                      // signal sent to parent on exit
    int pdeath_signal;                    // signal sent to this task when its parent dies
    /* Statistics */
    unsigned long nvcsw;                  // voluntary context switches
    unsigned long nivcsw;                 // involuntary context switches
    u64 start_time;                       // monotonic time at task start (ns)
    /* ... many other fields omitted ... */
};
Most fields are placed between randomized_struct_fields_start and randomized_struct_fields_end, markers for compile-time structure layout randomization, which mitigates memory-corruption attacks.
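The relationship between the priority fields above can be illustrated with a small user-space sketch. The constants mirror the kernel's priority header (MAX_RT_PRIO is 100 and the default priority is 120), but the enum and the function names are hypothetical and exist only for this illustration; the kernel's own helpers also handle cases such as priority inheritance that are left out here.

```c
/* Constants mirroring the kernel's priority definitions. */
#define MAX_RT_PRIO   100
#define NICE_WIDTH    40
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2)   /* 120 */

/* Hypothetical policy tags for this sketch (not the kernel's SCHED_* values). */
enum demo_policy { DEMO_NORMAL, DEMO_RT, DEMO_DEADLINE };

/* static_prio for a nice value in [-20, 19]; the result lies in [100, 139]. */
static int nice_to_static_prio(int nice)
{
    return DEFAULT_PRIO + nice;
}

/*
 * normal_prio as a function of policy. A lower number means a higher
 * priority: deadline tasks sit below zero, real-time tasks occupy 0..98,
 * and normal tasks keep their static_prio.
 */
static int demo_normal_prio(enum demo_policy policy, int static_prio,
                            unsigned int rt_priority)
{
    switch (policy) {
    case DEMO_DEADLINE:
        return -1;                                   /* MAX_DL_PRIO - 1 */
    case DEMO_RT:
        return MAX_RT_PRIO - 1 - (int)rt_priority;   /* 1..99 maps to 98..0 */
    default:
        return static_prio;
    }
}
```

For example, nice_to_static_prio(0) yields 120, and a real-time task with rt_priority 99 gets a normal_prio of 0, the highest real-time priority.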
sched_entity
sched_entity represents the smallest schedulable unit: either a single task or a task group. It holds the load weight, the run-queue tree node, the group list node, and timing information such as virtual runtime and execution statistics.
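How the load weight and the virtual runtime interact can be shown with a simplified user-space model. The weight table matches the kernel's sched_prio_to_weight values, but everything prefixed demo_ is hypothetical. Two deliberate simplifications: the kernel uses a fixed-point inverse multiplication rather than a plain division, and it keeps entities in a red-black tree rather than scanning an array.

```c
#include <stddef.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u

/* Weights for nice -20..19, as in the kernel's sched_prio_to_weight table. */
static const uint32_t prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

/* Hypothetical, stripped-down stand-in for struct sched_entity. */
struct demo_entity {
    uint64_t weight;            /* corresponds to load.weight */
    uint64_t vruntime;          /* weighted virtual runtime (ns) */
    uint64_t sum_exec_runtime;  /* real execution time (ns) */
};

static uint32_t nice_to_weight(int nice)
{
    return prio_to_weight[nice + 20];
}

/* Charge delta_exec_ns of real CPU time to the entity. */
static void demo_account(struct demo_entity *se, uint64_t delta_exec_ns)
{
    se->sum_exec_runtime += delta_exec_ns;
    /* Heavier entities accumulate virtual time more slowly. */
    se->vruntime += delta_exec_ns * NICE_0_LOAD / se->weight;
}

/* Pick the entity with the smallest vruntime (linear scan, not an rbtree). */
static struct demo_entity *demo_pick_next(struct demo_entity *ents, size_t n)
{
    struct demo_entity *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (!best || ents[i].vruntime < best->vruntime)
            best = &ents[i];
    return best;
}
```

With this model, a nice 0 entity that runs for 1000 ns advances its vruntime by 1000, while a nice 5 entity (weight 335) advances by roughly three times as much for the same real time, so it is picked less often.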
struct sched_entity {
    struct load_weight load;              // weight influencing scheduling decisions
    struct rb_node run_node;              // node in the red-black tree of the runqueue
    struct list_head group_node;          // list node for grouping entities
    unsigned int on_rq;                   // whether the entity is on a runqueue
    u64 exec_start;                       // runqueue clock timestamp when execution last started (ns)
    u64 sum_exec_runtime;                 // total execution time (real time, ns)
    u64 vruntime;                         // virtual runtime used for CFS fairness
    u64 prev_sum_exec_runtime;            // sum_exec_runtime when the entity was last picked to run
    u64 nr_migrations;                    // number of migrations
#ifdef CONFIG_FAIR_GROUP_SCHED
    int depth;                            // depth in scheduling hierarchy
    struct sched_entity *parent;          // parent entity in a group
    struct cfs_rq *cfs_rq;                // CFS runqueue this entity is queued on
    struct cfs_rq *my_q;                  // runqueue owned by this entity/group
    unsigned long runnable_weight;        // cached number of runnable entities in a group
#endif
#ifdef CONFIG_SMP
    struct sched_avg avg;                 // load average for the entity
#endif
};
rq (runqueue)
The rq struct describes a CPU’s generic runqueue, containing basic counters, pointers to the currently running task, idle task, and the three scheduler‑specific queues (CFS, real‑time, deadline). It also stores load‑balancing data such as CPU capacity and balance callbacks.
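The way a per-CPU runqueue combines its class-specific queues can be sketched with a toy model. It relies on one fact about the kernel, namely that deadline tasks are considered before real-time tasks, which are considered before CFS tasks, with the idle task as the fallback. The demo_ types and field names are hypothetical, and the stop class is left out for brevity.

```c
/* Hypothetical per-class queue: only a counter in this sketch. */
struct demo_class_rq {
    unsigned int nr_running;
};

/* Hypothetical stand-in for struct rq with its three embedded queues. */
struct demo_rq {
    unsigned int nr_running;      /* total runnable tasks on this CPU */
    struct demo_class_rq dl;      /* deadline queue */
    struct demo_class_rq rt;      /* real-time queue */
    struct demo_class_rq cfs;     /* CFS queue */
};

enum demo_class { DEMO_PICK_DL, DEMO_PICK_RT, DEMO_PICK_CFS, DEMO_PICK_IDLE };

/* Account one runnable task of the given class on the runqueue. */
static void demo_enqueue(struct demo_rq *rq, enum demo_class c)
{
    switch (c) {
    case DEMO_PICK_DL:  rq->dl.nr_running++;  break;
    case DEMO_PICK_RT:  rq->rt.nr_running++;  break;
    case DEMO_PICK_CFS: rq->cfs.nr_running++; break;
    default:            return;   /* the idle task is never enqueued */
    }
    rq->nr_running++;
}

/* Walk the classes from highest to lowest priority. */
static enum demo_class demo_pick_class(const struct demo_rq *rq)
{
    if (rq->dl.nr_running)
        return DEMO_PICK_DL;
    if (rq->rt.nr_running)
        return DEMO_PICK_RT;
    if (rq->cfs.nr_running)
        return DEMO_PICK_CFS;
    return DEMO_PICK_IDLE;        /* nothing runnable: run the idle task */
}
```

An empty demo_rq selects the idle class; once a CFS task and then a real-time task are enqueued, the real-time class wins. The kernel's own definition, abridged, follows.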
struct rq {
    raw_spinlock_t __lock;                // protects the runqueue
    unsigned int nr_running;              // number of runnable tasks
#ifdef CONFIG_NUMA_BALANCING
    unsigned int nr_numa_running;
    unsigned int nr_preferred_running;
    unsigned int numa_migrate_on;
#endif
    u64 nr_switches;                      // context-switch count
    unsigned int nr_uninterruptible;      // this CPU's contribution to the global uninterruptible-task count
    struct task_struct __rcu *curr;       // currently running task
    struct task_struct *idle;             // idle task for this CPU
    struct task_struct *stop;             // stop task
    u64 clock;                            // runqueue clock
    struct cfs_rq cfs;                    // CFS runqueue
    struct rt_rq rt;                      // real-time runqueue
    struct dl_rq dl;                      // deadline runqueue
    /* Load-balancing fields */
    struct root_domain *rd;
    struct sched_domain __rcu *sd;
    unsigned long cpu_capacity;
    unsigned long cpu_capacity_orig;
    unsigned char idle_balance;
    int active_balance;
    int cpu;                              // CPU this runqueue belongs to
    int online;                           // CPU online state
    /* ... other fields omitted ... */
};
sched_avg
sched_avg aggregates load information for both scheduling entities and runqueues. It provides the last update time, the load, runnable, and utilization sums, and the averaged values derived from them that the load-balancing algorithm consumes.
struct sched_avg {
    u64 last_update_time;                 // last time the metrics were refreshed
    u64 load_sum;                         // accumulated load (decayed over time)
    u64 runnable_sum;                     // accumulated runnable time (decayed over time)
    u32 util_sum;                         // accumulated running time (decayed over time)
    u32 period_contrib;                   // leftover time not forming a full period
    unsigned long load_avg;               // averaged load for the entity/queue (weight-scaled)
    unsigned long runnable_avg;           // averaged runnable time (reflects CPU pressure)
    unsigned long util_avg;               // averaged CPU utilization (running time, independent of weight)
    struct util_est util_est;             // utilization estimator
};
These structures together are only the tip of the iceberg of the Linux scheduler: task_struct links a process to its memory descriptor (mm) and file tables, sched_entity tracks its scheduling state, rq organizes runnable entities per CPU, and sched_avg supplies the metrics for load-balancing decisions.
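To make the decayed sums in sched_avg more concrete, here is a floating-point sketch of the geometric decay behind them. It assumes the commonly documented parameters of the kernel's per-entity load tracking: time is split into periods of 1024 microseconds, and each past period is weighted by a factor y chosen so that y^32 = 0.5. The kernel itself uses integer arithmetic with lookup tables; the demo_ names and the normalized load_avg in [0, 1] are simplifications made for this illustration.

```c
/* Decay factor per period, chosen so that y^32 is approximately 0.5. */
#define DEMO_PELT_Y     0.97857206
/* Contribution of one fully runnable period (1024 microseconds). */
#define DEMO_PERIOD_US  1024.0

/* Hypothetical, floating-point stand-in for the load part of sched_avg. */
struct demo_avg {
    double load_sum;   /* decayed sum of runnable time */
    double load_avg;   /* load_sum normalized to [0, 1] in this sketch */
};

/* Limit of the geometric series for an always-runnable entity. */
static double demo_max_sum(void)
{
    return DEMO_PERIOD_US / (1.0 - DEMO_PELT_Y);
}

/*
 * Advance the average by a number of whole periods during which the
 * entity was either runnable the entire time or not runnable at all.
 */
static void demo_update(struct demo_avg *sa, unsigned int periods, int runnable)
{
    for (unsigned int i = 0; i < periods; i++) {
        /* Age the history by one period, then add the new contribution. */
        sa->load_sum = sa->load_sum * DEMO_PELT_Y
                     + (runnable ? DEMO_PERIOD_US : 0.0);
    }
    sa->load_avg = sa->load_sum / demo_max_sum();
}
```

Starting from zero, a long runnable stretch drives load_avg towards 1, and 32 idle periods afterwards cut load_sum roughly in half, which is the intended half-life of the signal.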
The article concludes that understanding these core structs is essential before diving deeper into the kernel’s scheduling algorithms and eBPF‑based runtime analysis.
