
Inside Linux Cgroup CPU Subsystem: How Containers Get CPU Time Controlled

This article provides a detailed, code‑driven explanation of how Linux cgroup’s CPU subsystem manages container CPU usage, covering cgroup creation, limit configuration, kernel object relationships, scheduler integration, bandwidth enforcement, and the role of period and slack timers.

Liangxu Linux

Why containers need a special look at CPU

Most modern services run inside containers, but the "CPU cores" shown to a container are not physical cores. Understanding container CPU performance therefore requires a deep dive into Linux’s cgroup CPU subsystem.

cgroup’s CPU subsystem at a glance

cgroup (control groups) lets the kernel enforce fine‑grained resource limits such as CPU and memory. On a Linux host you can list supported subsystems with:

$ lssubsys -a
cpuset
cpu,cpuacct
...

The cpu controller limits CPU time, while cpuset assigns specific logical CPUs. The interface is exposed through the virtual filesystem mounted at /sys/fs/cgroup; in the cgroup v1 layout used throughout this article, each controller hierarchy appears there as a directory.

Creating a cgroup and setting limits

To cap a process at the equivalent of two logical CPUs' worth of CPU time, create a cgroup and write to its control files:

# cd /sys/fs/cgroup/cpu,cpuacct
# mkdir test
# cd test
# echo 100000 > cpu.cfs_period_us   # 100 ms period
# echo 200000 > cpu.cfs_quota_us    # allow 200 ms of CPU per period (≈2 cores)
# echo $PID > cgroup.procs
cpu.cfs_period_us defines the length of a scheduling period, and cpu.cfs_quota_us defines how much CPU time the group may consume within that period.
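The relationship between the two values is simple arithmetic: quota divided by period gives the number of CPUs' worth of time the group may use. The following is a minimal sketch of that calculation; the function name cfs_cpu_limit is hypothetical, and treating a negative quota as "no limit" is an assumption that mirrors the -1 convention of cpu.cfs_quota_us.

```c
#include <stdint.h>

/* Hypothetical helper: CPUs' worth of time a group may use per period.
 * A negative quota is treated as "no limit" (assumption, mirrors -1). */
double cfs_cpu_limit(int64_t quota_us, uint64_t period_us)
{
    if (quota_us < 0 || period_us == 0)
        return -1.0;    /* unlimited, or an unusable period */
    return (double)quota_us / (double)period_us;
}
```

With the values written above, cfs_cpu_limit(200000, 100000) yields 2.0, and a quota of 50000 against the same period yields 0.5.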

Kernel objects behind the scenes

Several kernel structures link a cgroup, its controllers, and the tasks that belong to it:

// include/linux/cgroup-defs.h
struct cgroup {
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    ...
};

// include/linux/sched.h
struct task_struct {
    struct css_set __rcu *cgroups;
    ...
};

// include/linux/cgroup-defs.h
struct css_set {
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    ...
};

Each cgroup holds an array of pointers to cgroup_subsys_state objects, one per enabled controller. The CPU controller's state is represented by a task_group, which embeds a cgroup_subsys_state and contains per-CPU scheduling entities, per-CPU runqueues (cfs_rq), and a cfs_bandwidth structure that stores the period, quota, and runtime accounting.
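Because the controller state is embedded in the larger structure, code that holds only a cgroup_subsys_state pointer can recover the enclosing task_group by pointer arithmetic. The sketch below shows that pattern in user space. The struct and macro names ending in _sketch are hypothetical stand-ins, not the kernel definitions, and the fields are illustrative only.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; fields are illustrative. */
struct css_sketch {
    int id;
};

struct task_group_sketch {
    struct css_sketch css;   /* embedded controller state */
    long long quota_ns;      /* stands in for the cfs_bandwidth data */
};

/* User-space equivalent of the container_of() idiom. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the enclosing task group from a pointer to its embedded css. */
struct task_group_sketch *css_to_tg(struct css_sketch *css)
{
    return container_of_sketch(css, struct task_group_sketch, css);
}
```

The design choice here is that generic cgroup code can pass around one pointer type while each controller keeps its private data next to it.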

Figure: cgroup object diagram

Step‑by‑step implementation of the CPU controller

1. Creating the cgroup object

The VFS call vfs_mkdir eventually reaches cgroup_mkdir, which allocates a struct cgroup and creates the corresponding directory entry in the virtual filesystem. During this process css_create is invoked for each subsystem, allocating a task_group for the CPU controller via cpu_cgroup_css_alloc.

// kernel/cgroup/cgroup.c
static struct kernfs_syscall_ops cgroup_kf_syscall_ops = {
    .mkdir = cgroup_mkdir,
    .rmdir = cgroup_rmdir,
    ...
};

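int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode) {
    /* abridged: `parent` is the struct cgroup that owns parent_kn */
    struct cgroup *cgrp = cgroup_create(parent);
    struct kernfs_node *kn = kernfs_create_dir(parent_kn, name, mode, cgrp);
    cgrp->kn = kn;   /* link the new cgroup to its directory node */
    ...
}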

2. Configuring CPU limits

Writing to cpu.cfs_quota_us and cpu.cfs_period_us triggers the handlers cpu_cfs_quota_write_s64 and cpu_cfs_period_write_u64. Both eventually call tg_set_cfs_bandwidth, which stores the values in the cfs_bandwidth object of the associated task_group.

// kernel/sched/core.c
static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota) {
    struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
    cfs_b->period = ns_to_ktime(period);
    cfs_b->quota  = quota;
    ...
}
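Before storing the values, a setter of this kind has to convert microseconds to nanoseconds and reject out-of-range input. The following user-space sketch illustrates that step. The names bandwidth_sketch and set_bandwidth_sketch are hypothetical, and the 1 ms and 1 s bounds are assumptions modelled on the limits commonly found in kernel sources; they may differ between kernel versions.

```c
#include <errno.h>
#include <stdint.h>

#define NSEC_PER_USEC_SKETCH 1000ULL
#define MIN_PERIOD_NS 1000000ULL       /* 1 ms, assumed lower bound */
#define MAX_PERIOD_NS 1000000000ULL    /* 1 s, assumed upper bound */

struct bandwidth_sketch {
    uint64_t period_ns;
    uint64_t quota_ns;
};

/* Validate and store a period/quota pair given in microseconds.
 * Returns 0 on success or -EINVAL if a value is out of range.
 * Very large inputs that would overflow the multiplication are not handled. */
int set_bandwidth_sketch(struct bandwidth_sketch *b,
                         uint64_t period_us, uint64_t quota_us)
{
    uint64_t period_ns = period_us * NSEC_PER_USEC_SKETCH;
    uint64_t quota_ns = quota_us * NSEC_PER_USEC_SKETCH;

    if (period_ns < MIN_PERIOD_NS || period_ns > MAX_PERIOD_NS)
        return -EINVAL;
    if (quota_ns < MIN_PERIOD_NS)
        return -EINVAL;

    b->period_ns = period_ns;
    b->quota_ns = quota_ns;
    return 0;
}
```

Under these assumed bounds, a 100000 µs period with a 200000 µs quota is accepted, while a 500 µs period is rejected.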

3. Attaching a process to the cgroup

Writing a PID to cgroup.procs invokes cgroup_procs_write, which calls cgroup_attach_task → cgroup_migrate → cgroup_migrate_execute. For the CPU controller the attach callback is cpu_cgroup_attach, which finally calls sched_move_task to move the task into the new task_group and its runqueue.

// kernel/sched/core.c
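void sched_move_task(struct task_struct *tsk) {
    struct rq_flags rf;
    struct rq *rq = task_rq_lock(tsk, &rf);
    int queued = task_on_rq_queued(tsk);

    if (queued)                      /* take the task off its old runqueue */
        dequeue_task(rq, tsk, ...);
    sched_change_group(tsk, TASK_MOVE_GROUP);
    if (queued)                      /* requeue it under the new task_group */
        enqueue_task(rq, tsk, ...);
    task_rq_unlock(rq, tsk, &rf);
}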
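From user space, the result of the move can be checked in /proc/<pid>/cgroup, where each line has the form hierarchy-ID:controller-list:path. The sketch below parses one such line; the function name cgroup_path_for is hypothetical, and the sample line in the usage note is illustrative rather than taken from a real system.

```c
#include <stdio.h>
#include <string.h>

/* Parse one /proc/<pid>/cgroup line of the form "ID:controllers:path".
 * Returns 1 and copies the path into out if `controller` is listed,
 * otherwise returns 0. */
int cgroup_path_for(const char *line, const char *controller,
                    char *out, size_t out_len)
{
    const char *first = strchr(line, ':');
    const char *second = first ? strchr(first + 1, ':') : NULL;
    size_t clen = strlen(controller);
    const char *p;

    if (!second || out_len == 0)
        return 0;

    /* Walk the comma-separated controller list between the two colons. */
    for (p = first + 1; p < second; ) {
        const char *end = memchr(p, ',', (size_t)(second - p));
        if (!end)
            end = second;
        if ((size_t)(end - p) == clen && strncmp(p, controller, clen) == 0) {
            snprintf(out, out_len, "%s", second + 1);
            out[strcspn(out, "\n")] = '\0';   /* drop the trailing newline */
            return 1;
        }
        p = end + 1;
    }
    return 0;
}
```

For a line such as "4:cpu,cpuacct:/test", asking for the cpu controller returns the path /test, which matches the directory created earlier.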

How the Completely Fair Scheduler enforces the limits

On scheduler ticks and other scheduling events the CFS core calls update_curr, which subtracts the elapsed execution time from cfs_rq->runtime_remaining. If the value drops to zero or below, assign_cfs_rq_runtime pulls additional time from the group's cfs_bandwidth. If no time is available, check_cfs_rq_runtime triggers throttle_cfs_rq, which dequeues the group's scheduling entities so that its tasks are no longer picked to run on that CPU.

// kernel/sched/fair.c
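static void update_curr(struct cfs_rq *cfs_rq) {
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta = now - curr->exec_start;   /* time run since the last update */
    curr->exec_start = now;
    ...
    account_cfs_rq_runtime(cfs_rq, delta);
}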

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta) {
    cfs_rq->runtime_remaining -= delta;
    if (cfs_rq->runtime_remaining > 0)
        return;
    if (!assign_cfs_rq_runtime(cfs_rq) && cfs_rq->curr)
        resched_curr(rq_of(cfs_rq));
}

static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {
    if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
        return false;
    throttle_cfs_rq(cfs_rq);
    return true;
}

The throttling routine extracts each sched_entity from its cfs_rq, marks the runqueue as throttled, records the timestamp, and starts the high‑resolution timer that will later unthrottle the group.
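The accounting flow described above can be modelled in user space with two budgets: a group-wide pool and a per-CPU local budget that is topped up one slice at a time. This is a simplified sketch, not the kernel code. The _sketch types and the function names are hypothetical, and the 5 ms slice is an assumed default.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLICE_NS 5000000LL   /* 5 ms transfer size, an assumed default */

struct pool_sketch {         /* group-wide budget (cfs_bandwidth role) */
    int64_t runtime;
};

struct rq_sketch {           /* per-CPU budget (cfs_rq role) */
    int64_t runtime_remaining;
    bool throttled;
};

/* Top the local budget up to one slice, limited by what the pool holds.
 * Returns true if the runqueue has budget left afterwards. */
static bool assign_runtime(struct pool_sketch *pool, struct rq_sketch *rq)
{
    int64_t want = SLICE_NS - rq->runtime_remaining;
    int64_t got = want < pool->runtime ? want : pool->runtime;

    pool->runtime -= got;
    rq->runtime_remaining += got;
    return rq->runtime_remaining > 0;
}

/* Charge delta ns of execution; throttle when no budget can be obtained. */
void account_runtime(struct pool_sketch *pool, struct rq_sketch *rq,
                     int64_t delta)
{
    rq->runtime_remaining -= delta;
    if (rq->runtime_remaining > 0)
        return;
    if (!assign_runtime(pool, rq))
        rq->throttled = true;
}
```

Handing out budget in slices keeps most accounting local to one CPU, so the shared pool only needs to be touched when a local budget runs dry.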

Timers that replenish CPU time

The cfs_bandwidth structure contains two hrtimers:

period_timer – fires every period (typically 100 ms) and refills cfs_b->runtime with the configured quota.

slack_timer – a short‑interval (≈5 ms) timer that wakes up throttled groups when enough runtime has been returned to the global pool.

// kernel/sched/fair.c
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {
    cfs_b->runtime = 0;
    cfs_b->quota   = RUNTIME_INF;
    cfs_b->period  = ns_to_ktime(default_cfs_period());
    hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
    cfs_b->period_timer.function = sched_cfs_period_timer;
    hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    cfs_b->slack_timer.function = sched_cfs_slack_timer;
}

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b) {
    hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
    hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b) {
    cfs_b->slack_started = true;
    hrtimer_start(&cfs_b->slack_timer,
                  ns_to_ktime(cfs_bandwidth_slack_period),
                  HRTIMER_MODE_REL);
}

When period_timer fires, __refill_cfs_bandwidth_runtime refills cfs_b->runtime with the configured quota, effectively "paying a salary" to the task group. The slack_timer runs after a runqueue has returned unused runtime to the group's pool, allowing the kernel to unthrottle the group sooner than the next period.
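The two replenishment paths can be sketched as a pair of small functions. This is a user-space model under stated assumptions: the refill is modelled as a plain reset to the quota (newer kernels may additionally allow a burst allowance, which is ignored here), and the 1 ms reserve kept on the local runqueue is an assumed value. All names are hypothetical.

```c
#include <stdint.h>

#define MIN_RQ_RUNTIME_NS 1000000LL   /* 1 ms kept locally, assumed value */

struct bw_sketch {
    int64_t quota;     /* configured budget per period, in ns */
    int64_t runtime;   /* budget currently left in the group pool */
};

/* Period timer path: reset the pool to the full quota. */
void refill_on_period(struct bw_sketch *bw)
{
    bw->runtime = bw->quota;
}

/* Slack path: when a runqueue goes idle, hand back all but a small
 * reserve to the pool. Returns the amount returned, in ns. */
int64_t return_slack(struct bw_sketch *bw, int64_t *rq_remaining)
{
    int64_t slack = *rq_remaining - MIN_RQ_RUNTIME_NS;

    if (slack <= 0)
        return 0;
    bw->runtime += slack;
    *rq_remaining -= slack;
    return slack;
}
```

In this model, runtime handed back through return_slack is what the slack timer would later redistribute to throttled runqueues.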

Putting it all together

To limit a container’s CPU you perform three steps:

1. Create a cgroup directory under /sys/fs/cgroup/cpu,cpuacct (the kernel creates the cgroup and task_group objects).

2. Write the desired cpu.cfs_period_us and cpu.cfs_quota_us values; these are stored in the group's cfs_bandwidth.

3. Write the container's process IDs to cgroup.procs, which moves the tasks into the new task_group and subjects them to the runtime accounting described above.

The scheduler then deducts runtime as the group's tasks run, throttles the group when its budget is exhausted, and restores execution rights when the period or slack timer replenishes the budget. Because the period defaults to 100 ms, a container that consumes its quota early is throttled for the remainder of the period, which is why monitoring both CPU usage and the throttle counters (nr_throttled and throttled_time in cpu.stat) is important for performance tuning.
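In cgroup v1 the throttle counters are exposed in the group's cpu.stat file as nr_periods, nr_throttled, and throttled_time (in nanoseconds). A minimal sketch of turning that text into a throttle ratio follows. The struct and function names are hypothetical, and the input is passed as a string so the example does not depend on a mounted cgroup filesystem.

```c
#include <stdio.h>
#include <string.h>

struct cpu_stat_sketch {
    unsigned long long nr_periods;
    unsigned long long nr_throttled;
    unsigned long long throttled_time;   /* nanoseconds in cgroup v1 */
};

/* Parse the text of a v1 cpu.stat file; returns the number of known
 * fields that were found. Unknown keys are ignored. */
int parse_cpu_stat(const char *text, struct cpu_stat_sketch *st)
{
    const char *line = text;
    int found = 0;

    memset(st, 0, sizeof *st);
    while (line && *line) {
        char key[32];
        unsigned long long val;

        if (sscanf(line, "%31s %llu", key, &val) == 2) {
            if (strcmp(key, "nr_periods") == 0) {
                st->nr_periods = val; found++;
            } else if (strcmp(key, "nr_throttled") == 0) {
                st->nr_throttled = val; found++;
            } else if (strcmp(key, "throttled_time") == 0) {
                st->throttled_time = val; found++;
            }
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line)
            line++;
    }
    return found;
}

/* Fraction of elapsed periods in which the group was throttled. */
double throttle_ratio(const struct cpu_stat_sketch *st)
{
    if (st->nr_periods == 0)
        return 0.0;
    return (double)st->nr_throttled / (double)st->nr_periods;
}
```

A ratio that stays well above zero suggests the quota is too small for the workload, even when average CPU usage looks moderate.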

Conclusion

Linux cgroup’s CPU controller does not allocate physical cores to containers; instead it grants a bounded amount of CPU execution time per scheduling period. The kernel’s CFS scheduler, together with the cfs_bandwidth data structure and high‑resolution timers, enforces these limits, providing a deterministic way to control container CPU consumption.

