Deep Dive into Linux cgroup CPU Subsystem and Container CPU Bandwidth Control
This article explains how Linux cgroup’s CPU controller works, covering the creation of cgroups, the kernel structures involved, how CPU time limits are configured via cfs_period_us and cfs_quota_us, how processes are attached to cgroups, and the scheduling mechanisms that enforce bandwidth limits in containers.
1. cgroup CPU Subsystem
Under Linux, cgroups (control groups) provide fine‑grained resource control for CPU, memory, and other resources. The CPU controller is implemented through the cpu and cpuacct subsystems, which limit CPU usage by execution time rather than by binding processes to specific logical cores.
You can list the supported subsystems on your machine with:
$ lssubsys -a
cpuset
cpu,cpuacct
...

The cpu subsystem controls CPU usage via execution time, while cpuset binds processes to specific logical CPUs. cgroupfs is mounted at /sys/fs/cgroup; modifying files under this virtual filesystem changes resource limits.
To limit a process to the equivalent of two logical CPUs of execution time, create a cgroup and set the following files:
# cd /sys/fs/cgroup/cpu,cpuacct
# mkdir test
# cd test
# echo 100000 > cpu.cfs_period_us   # 100 ms period
# echo 200000 > cpu.cfs_quota_us    # 200 ms quota (allows two CPUs)
# echo $pid > cgroup.procs

cfs_period_us defines the length of a scheduling period, and cfs_quota_us defines the total CPU time a cgroup may consume within that period. Setting the quota to twice the period limits the cgroup to the equivalent of two CPUs.
Docker uses the cgroupfs driver by default; you can verify this with:
# docker info | grep cgroup
Cgroup Driver: cgroupfs

2. Relationship Between Kernel Objects, Processes, and cgroups
Creating a directory under /sys/fs/cgroup/cpu,cpuacct creates a struct cgroup object. Adding a process PID to cgroup.procs links the process’s task_struct to that cgroup.
2.1 cgroup Kernel Object
Each cgroup contains an array of cgroup_subsys_state pointers, one for each subsystem (cpu, memory, etc.). The actual CPU-specific control data lives in struct task_group, which embeds a cgroup_subsys_state as its first member.
2.2 Process and cgroup Subsystems
A Linux process can be associated with multiple subsystems. The task_struct contains a pointer to a css_set , which holds an array of cgroup_subsys_state pointers, establishing a many‑to‑many relationship between processes and cgroups.
// include/linux/sched.h
struct task_struct {
...
struct css_set __rcu *cgroups;
...
};
// include/linux/cgroup-defs.h
struct css_set {
...
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
};

2.3 Kernel Object Relationship Diagram
2.4 CPU Subsystem Details
The CPU controller’s core object is struct task_group . Important fields include:
// kernel/sched/sched.h
struct task_group {
struct cgroup_subsys_state css;
...
struct sched_entity **se; // one per CPU
struct cfs_rq **cfs_rq; // one per CPU
struct cfs_bandwidth cfs_bandwidth; // bandwidth limits
...
};

Each task_group forms a node in a tree rooted at root_task_group. The scheduler first selects a task_group, then picks a runnable entity from its cfs_rq queue.
3. Implementation of the CPU Subsystem
The three steps to enforce a CPU limit are:
Create a cgroup directory.
Write limit values to cpu.cfs_period_us and cpu.cfs_quota_us .
Add the target process PID to cgroup.procs .
3.1 Creating a cgroup Object
Directory creation triggers cgroup_mkdir , which calls css_create to allocate subsystem‑specific objects (e.g., task_group for the CPU subsystem).
// kernel/cgroup/cgroup.c
static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
.mkdir = cgroup_mkdir,
.rmdir = cgroup_rmdir,
...
};

The call chain is:

vfs_mkdir → kernfs_iop_mkdir → cgroup_mkdir → cgroup_apply_control_enable → css_create → cpu_cgroup_css_alloc

3.2 Setting CPU Limits
Writing to cfs_quota_us and cfs_period_us invokes cpu_cfs_quota_write_s64 and cpu_cfs_period_write_u64 , which ultimately call tg_set_cfs_bandwidth to store the values in the cfs_bandwidth structure of the task_group .
// kernel/sched/core.c
static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) {
struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
cfs_b->period = ns_to_ktime(period);
cfs_b->quota = quota;
...
}

3.3 Adding a Process to a cgroup
Writing a PID to cgroup.procs calls cgroup_procs_write , which invokes cgroup_attach_task → cgroup_migrate → cgroup_migrate_execute . For the CPU subsystem, the attach method is cpu_cgroup_attach, which finally calls sched_move_task to move the task into the new task_group .
// kernel/sched/core.c
void sched_move_task(struct task_struct *tsk) {
    struct rq_flags rf;
    struct rq *rq = task_rq_lock(tsk, &rf);
    bool queued = task_on_rq_queued(tsk);
    if (queued)
        dequeue_task(rq, tsk, DEQUEUE_SAVE);
    sched_change_group(tsk, TASK_MOVE_GROUP); /* switch to the new task_group */
    if (queued)
        enqueue_task(rq, tsk, ENQUEUE_RESTORE);
    ...
    task_rq_unlock(rq, tsk, &rf);
}

4. CPU Bandwidth Enforcement During Scheduling
On every scheduler tick and task switch (e.g. in pick_next_task_fair), CFS updates the runtime of the current cfs_rq via update_curr, which subtracts the elapsed execution time from runtime_remaining. When the remaining time is exhausted, account_cfs_rq_runtime calls assign_cfs_rq_runtime to request more time from the group's cfs_bandwidth pool. If no time is available, check_cfs_rq_runtime throttles the entire cfs_rq using throttle_cfs_rq.
4.1 Updating and Requesting Runtime
// kernel/sched/fair.c
static void update_curr(struct cfs_rq *cfs_rq) {
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec = now - curr->exec_start;
    curr->exec_start = now;
    ...
    account_cfs_rq_runtime(cfs_rq, delta_exec);
}
static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {
cfs_rq->runtime_remaining -= delta_exec;
if (likely(cfs_rq->runtime_remaining > 0))
return;
if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
resched_curr(rq_of(cfs_rq));
}

The request size is calculated as:

min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

where sched_cfs_bandwidth_slice() reads the sysctl kernel.sched_cfs_bandwidth_slice_us (default 5000 µs).
4.2 Throttling When Bandwidth Is Exhausted
// kernel/sched/fair.c
static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {
if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
return false;
throttle_cfs_rq(cfs_rq);
return true;
}
static void throttle_cfs_rq(struct cfs_rq *cfs_rq) {
    struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
    /* dequeue the whole chain of parent entities */
    for_each_sched_entity(se) {
        struct cfs_rq *qcfs_rq = cfs_rq_of(se);
        dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
        ...
    }
    cfs_rq->throttled = 1;
    cfs_rq->throttled_clock = rq_clock(rq_of(cfs_rq));
    start_cfs_bandwidth(&cfs_rq->tg->cfs_bandwidth);
}
}

5. Periodic Allocation of Runtime
The cfs_bandwidth structure contains two high-resolution timers: period_timer (the period defaults to 100 ms) and slack_timer. The period timer periodically refills cfs_bandwidth.runtime with the configured quota, while the slack timer redistributes surplus runtime that dequeued tasks have returned to the pool, so throttled queues can run again before the next period boundary.
// kernel/sched/sched.h
struct cfs_bandwidth {
ktime_t period;
u64 quota;
u64 runtime;
struct hrtimer period_timer;
struct hrtimer slack_timer;
...
};

Initialization sets the callbacks:
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {
cfs_b->runtime = 0;
cfs_b->quota = RUNTIME_INF;
cfs_b->period = ns_to_ktime(default_cfs_period());
hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
cfs_b->period_timer.function = sched_cfs_period_timer;
hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
cfs_b->slack_timer.function = sched_cfs_slack_timer;
}

When the period timer fires, __refill_cfs_bandwidth_runtime replenishes the runtime:
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b) {
if (cfs_b->quota != RUNTIME_INF)
cfs_b->runtime = cfs_b->quota;
}

The slack timer is started when a throttled queue returns unused time; it runs after a short 5 ms delay so that throttled queues can be unblocked earlier than the next period boundary.
6. Summary
Linux cgroups implement CPU bandwidth control by converting logical CPU shares into execution‑time quotas stored in cfs_bandwidth . A cgroup’s cpu.cfs_quota_us and cpu.cfs_period_us files define the maximum CPU time per period. The kernel creates corresponding task_group objects, attaches processes via cgroup.procs , and the Completely Fair Scheduler (CFS) enforces the limits by tracking runtime_remaining in each cfs_rq . When the runtime is exhausted, the queue is throttled; periodic and slack timers later replenish or return unused time, allowing throttled tasks to run again. Understanding both the usage percentage and the throttle count/time is essential for accurate container CPU performance analysis.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.