Understanding Linux CGroup Internals: Key Structures and Resource Control
This article walks through the Linux 2.6.25 CGroup implementation by examining core kernel structures such as cgroup, cgroup_subsys_state, mem_cgroup, css_set, and cgroup_subsys, explaining how they form hierarchical resource control, how mounting and task attachment work, and how memory limits are enforced.
cgroup Structure
The cgroup struct describes a control group that manages resource usage for a set of processes. Its fields include flags, a reference count, sibling/children list heads for building a hierarchy, a parent pointer, a dentry for the virtual filesystem, an array of subsystem state pointers, a root pointer, a top‑cgroup pointer, and lists for CSS sets and release handling.
struct cgroup {
    unsigned long flags; /* "unsigned long" so bitops work */
    atomic_t count;
    struct list_head sibling; /* my parent's children */
    struct list_head children; /* my children */
    struct cgroup *parent; /* my parent */
    struct dentry *dentry; /* cgroup fs entry */
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
    struct cgroupfs_root *root;
    struct cgroup *top_cgroup;
    struct list_head css_sets;
    struct list_head release_list;
};
Each field serves a specific purpose: flags holds the cgroup's state bits; count is a reference count of the cgroup's users (a nonzero value means the cgroup is busy, but it is not necessarily the number of tasks in it); sibling and children build the tree; parent points one level up; dentry is the cgroup's directory in the virtual filesystem; and subsys holds one pointer per subsystem to that subsystem's state for this cgroup.
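The tree links described above can be modeled in user space. The following is a minimal sketch, not kernel code: the demo_ names are hypothetical, and plain pointers stand in for the kernel's circular struct list_head lists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup that keeps only the tree links. */
struct demo_cgroup {
    struct demo_cgroup *parent;       /* NULL for the top cgroup */
    struct demo_cgroup *first_child;  /* head of this cgroup's children */
    struct demo_cgroup *next_sibling; /* next entry in the parent's children */
};

/* Link child under parent by pushing it onto the parent's child list. */
static void demo_cgroup_add_child(struct demo_cgroup *parent,
                                  struct demo_cgroup *child)
{
    assert(parent != NULL && child != NULL);
    child->parent = parent;
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}

/* Number of parent hops from cg up to the top cgroup. */
static int demo_cgroup_depth(const struct demo_cgroup *cg)
{
    int depth = 0;

    while (cg->parent) {
        depth++;
        cg = cg->parent;
    }
    return depth;
}

/* Return 1 if ancestor is cg itself or lies on cg's parent chain, else 0. */
static int demo_cgroup_is_descendant(const struct demo_cgroup *cg,
                                     const struct demo_cgroup *ancestor)
{
    for (; cg; cg = cg->parent)
        if (cg == ancestor)
            return 1;
    return 0;
}

/* Count the direct children of cg by walking the sibling chain. */
static int demo_cgroup_child_count(const struct demo_cgroup *cg)
{
    const struct demo_cgroup *child;
    int n = 0;

    for (child = cg->first_child; child; child = child->next_sibling)
        n++;
    return n;
}
```

Creating a directory inside a mounted hierarchy corresponds to the add-child step here; walking parent pointers is how a limit set higher in the tree could be found from a leaf.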
cgroup_subsys_state Structure
Each subsystem attaches a cgroup_subsys_state object to a cgroup to hold its own accounting data. The struct contains a back‑pointer to the owning cgroup, a reference counter, and flags.
struct cgroup_subsys_state {
    struct cgroup *cgroup;
    atomic_t refcnt;
    unsigned long flags;
};
mem_cgroup Structure
The memory subsystem defines mem_cgroup, which embeds a cgroup_subsys_state as its first field, followed by counters and statistics used for memory accounting.
struct mem_cgroup {
    struct cgroup_subsys_state css; /* note: first field */
    struct res_counter res;
    struct mem_cgroup_lru_info info;
    int prev_priority;
    struct mem_cgroup_stat stat;
};
css_set Structure
A css_set holds one cgroup_subsys_state pointer per subsystem. Together these pointers describe every cgroup a task belongs to across the mounted hierarchies, and tasks with identical membership share a single css_set.
struct css_set {
    struct kref ref;
    struct list_head list;
    struct list_head tasks;
    struct list_head cg_links;
    struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
};
Its fields include a reference counter (ref), a list head for linking all css_set objects, a task list, a list of cgroup links, and an array of pointers to each subsystem's state.
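Because a css_set is reference counted and identified by its array of state pointers, attaching a task amounts to a find-or-create lookup. The sketch below illustrates that idea in user space under simplifying assumptions: every demo_ name is hypothetical, a plain int replaces struct kref, a singly linked list replaces the kernel's lists, and there is no locking.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_SUBSYS_COUNT 2

/* Hypothetical stand-in for cgroup_subsys_state. */
struct demo_css { int id; };

/* Hypothetical stand-in for css_set: refcount plus per-subsystem pointers. */
struct demo_css_set {
    int refcount;
    struct demo_css *subsys[DEMO_SUBSYS_COUNT];
    struct demo_css_set *next;
};

struct demo_css_set_table { struct demo_css_set *head; };

/* Return an existing set with the same pointers (taking a reference),
 * or allocate a new one. Returns NULL if allocation fails. */
static struct demo_css_set *demo_get_css_set(struct demo_css_set_table *table,
                                             struct demo_css **wanted)
{
    struct demo_css_set *set;
    int i;

    for (set = table->head; set; set = set->next) {
        for (i = 0; i < DEMO_SUBSYS_COUNT; i++)
            if (set->subsys[i] != wanted[i])
                break;
        if (i == DEMO_SUBSYS_COUNT) {
            set->refcount++;
            return set;
        }
    }

    set = malloc(sizeof(*set));
    if (!set)
        return NULL;
    set->refcount = 1;
    for (i = 0; i < DEMO_SUBSYS_COUNT; i++)
        set->subsys[i] = wanted[i];
    set->next = table->head;
    table->head = set;
    return set;
}

/* Drop one reference; unlink and free the set when the last one goes. */
static void demo_put_css_set(struct demo_css_set_table *table,
                             struct demo_css_set *set)
{
    struct demo_css_set **link;

    assert(set->refcount > 0);
    if (--set->refcount > 0)
        return;
    for (link = &table->head; *link; link = &(*link)->next) {
        if (*link == set) {
            *link = set->next;
            break;
        }
    }
    free(set);
}
```

Two tasks in the same cgroups end up holding the same demo_css_set pointer, so the table only grows with the number of distinct membership combinations rather than with the number of tasks.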
task_struct Integration
The kernel's task_struct contains two fields that connect a task to cgroups: a pointer to its current css_set (cgroups) and a list head (cg_list) used to chain the task into the css_set's task list.
struct task_struct {
    ...
    struct css_set *cgroups;
    struct list_head cg_list;
    ...
};
cgroup_subsys Structure
Each subsystem (e.g., memory, cpu) implements a cgroup_subsys with function pointers for lifecycle operations such as create, destroy, can_attach, attach, fork, and exit. It also stores metadata such as the subsystem ID, the active, disabled, and early_init flags, the name, and a pointer to the hierarchy root (cgroupfs_root) the subsystem is bound to.
struct cgroup_subsys {
    struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
                                          struct cgroup *cgrp);
    void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
    void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
    int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                      struct task_struct *tsk);
    void (*attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
                   struct cgroup *old_cgrp, struct task_struct *tsk);
    void (*fork)(struct cgroup_subsys *ss, struct task_struct *task);
    void (*exit)(struct cgroup_subsys *ss, struct task_struct *task);
    int (*populate)(struct cgroup_subsys *ss, struct cgroup *cgrp);
    void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
    void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
    int subsys_id;
    int active;
    int disabled;
    int early_init;
    const char *name;
    struct cgroupfs_root *root;
    struct list_head sibling;
    void *private;
};
cgroupfs_root Structure
The mount point for a cgroup hierarchy is described by cgroupfs_root. It stores the superblock, subsystem bitmaps, a list of attached subsystems, the top‑cgroup, counters, and flags.
struct cgroupfs_root {
    struct super_block *sb;
    unsigned long subsys_bits;
    unsigned long actual_subsys_bits;
    struct list_head subsys_list;
    struct cgroup top_cgroup;
    int number_of_cgroups;
    struct list_head root_list;
    unsigned long flags;
    char release_agent_path[PATH_MAX];
};
Mounting a CGroup Hierarchy
To enable cgroup functionality, the hierarchy must be mounted, e.g.:
$ mount -t cgroup -o memory memory /sys/fs/cgroup/memory
The kernel calls cgroup_get_sb(), which allocates a cgroupfs_root, binds the requested subsystems via rebind_subsystems(), and creates the top‑cgroup directory.
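Before subsystems can be bound, the comma-separated -o string has to become a bitmask like the subsys_bits field of cgroupfs_root. The sketch below shows one way that translation could look in user space. It is an illustration only: the function name, the three-entry subsystem table, and the bit positions are assumptions, and the kernel's own option parsing is more involved.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical subsystem table; the array index doubles as the bit position. */
static const char *const demo_subsys_names[] = { "cpuset", "cpu", "memory" };
#define DEMO_SUBSYS_COUNT \
    (sizeof(demo_subsys_names) / sizeof(demo_subsys_names[0]))

/* Parse a comma-separated option string such as "cpu,memory".
 * On success store the bitmask in *bits and return 0.
 * Return -1 for an unknown or empty name, or if no subsystem was named. */
static int demo_parse_subsys_opts(const char *opts, unsigned long *bits)
{
    unsigned long mask = 0;
    const char *p = opts;

    assert(opts != NULL && bits != NULL);
    while (*p) {
        size_t len = strcspn(p, ","); /* length of the current token */
        size_t i;
        int found = 0;

        for (i = 0; i < DEMO_SUBSYS_COUNT; i++) {
            if (strlen(demo_subsys_names[i]) == len &&
                strncmp(p, demo_subsys_names[i], len) == 0) {
                mask |= 1UL << i;
                found = 1;
                break;
            }
        }
        if (!found)
            return -1;
        p += len;
        if (*p == ',')
            p++; /* skip the separator */
    }
    if (mask == 0)
        return -1;
    *bits = mask;
    return 0;
}
```

With this table, the option string from the mount command above ("memory") would set only bit 2, which a binding step could then compare against the subsystems already attached elsewhere.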
Adding Tasks to a CGroup
Writing a PID to a cgroup’s tasks file triggers attach_task_by_pid(), which resolves the PID to a task_struct and then calls cgroup_attach_task(). The latter finds or creates a suitable css_set, updates the task’s cgroups pointer, links the task into the css_set’s task list, and finally invokes each subsystem’s attach() callback.
static int attach_task_by_pid(struct cgroup *cgrp, char *pidbuf) {
    pid_t pid;
    struct task_struct *tsk;
    int ret;
    if (sscanf(pidbuf, "%d", &pid) != 1)
        return -EIO;
    if (pid) {
        tsk = find_task_by_vpid(pid);
        if (!tsk || tsk->flags & PF_EXITING)
            return -ESRCH;
    } else {
        tsk = current;
    }
    ret = cgroup_attach_task(cgrp, tsk);
    return ret;
}
Enforcing Memory Limits
Writing a byte limit to memory.limit_in_bytes updates the mem_cgroup’s res.limit field via mem_cgroup_write():
static ssize_t mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
                                struct file *file, const char __user *userbuf,
                                size_t nbytes, loff_t *ppos) {
    return res_counter_write(&mem_cgroup_from_cont(cont)->res,
                             cft->private, userbuf, nbytes, ppos,
                             mem_cgroup_write_strategy);
}
When a process faults in anonymous memory, the kernel’s do_anonymous_page() calls mem_cgroup_charge(), which hands off to mem_cgroup_charge_common(). That function charges one page against the res_counter. If the charge would exceed the limit, it tries to reclaim pages from the cgroup and retries; once the retries are exhausted, it calls mem_cgroup_out_of_memory() to handle the out-of-memory condition within the group.
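static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
                                    gfp_t gfp_mask, enum charge_type ctype) {
    struct mem_cgroup *mem;
    unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
    mem = rcu_dereference(mm->mem_cgroup);
    /* res_counter_charge() returns nonzero when the charge exceeds the limit */
    while (res_counter_charge(&mem->res, PAGE_SIZE)) {
        /* callers that cannot sleep give up immediately */
        if (!(gfp_mask & __GFP_WAIT))
            goto out;
        /* reclaim made progress: retry the charge */
        if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
            continue;
        /* usage dropped below the limit in the meantime: retry the charge */
        if (res_counter_check_under_limit(&mem->res))
            continue;
        /* out of retries: invoke the cgroup's OOM handling */
        if (!nr_retries--) {
            mem_cgroup_out_of_memory(mem, gfp_mask);
            goto out;
        }
        ...
    }
    ...
}
Through this chain of structures and callbacks, the Linux kernel provides a flexible, hierarchical mechanism for grouping processes and enforcing resource limits such as memory usage.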
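To see the pieces working together, the charge path can be modeled end to end in user space. This is a rough sketch under stated assumptions, not the kernel implementation: all demo_ names, the retry count of 5, the 4096-byte page size, and the "reclaimable" bookkeeping are invented for illustration. It does show two ideas from the article: recovering the enclosing memory group from its embedded state object, and the charge, reclaim, retry, OOM sequence.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL
#define DEMO_RECLAIM_RETRIES 5 /* assumed value, for illustration only */

struct demo_css { int unused; };

struct demo_res_counter {
    unsigned long usage; /* bytes currently charged */
    unsigned long limit; /* maximum bytes allowed */
};

struct demo_mem_cgroup {
    struct demo_css css;         /* embedded state, first field as in mem_cgroup */
    struct demo_res_counter res;
    unsigned long reclaimable;   /* bytes a reclaim pass can still free (model) */
    int oom_events;              /* how often the OOM path was taken (model) */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Charge val bytes; return 0 on success, -1 if the limit would be exceeded. */
static int demo_res_counter_charge(struct demo_res_counter *rc, unsigned long val)
{
    if (rc->usage + val > rc->limit)
        return -1;
    rc->usage += val;
    return 0;
}

/* Free up to one page of reclaimable memory; return the bytes freed. */
static unsigned long demo_reclaim(struct demo_mem_cgroup *mem)
{
    unsigned long freed = mem->reclaimable < DEMO_PAGE_SIZE
                              ? mem->reclaimable : DEMO_PAGE_SIZE;

    if (freed > mem->res.usage)
        freed = mem->res.usage;
    mem->reclaimable -= freed;
    mem->res.usage -= freed;
    return freed;
}

/* Charge one page to the group owning css; return 0 on success, -1 on failure. */
static int demo_charge_page(struct demo_css *css, int may_wait)
{
    struct demo_mem_cgroup *mem =
        demo_container_of(css, struct demo_mem_cgroup, css);
    int retries = DEMO_RECLAIM_RETRIES;

    while (demo_res_counter_charge(&mem->res, DEMO_PAGE_SIZE)) {
        if (!may_wait)
            return -1;         /* caller cannot wait for reclaim */
        if (demo_reclaim(mem))
            continue;          /* progress made: retry the charge */
        if (!retries--) {
            mem->oom_events++; /* stand-in for the OOM handler */
            return -1;
        }
    }
    return 0;
}
```

With a limit of two pages and one reclaimable page, the first two charges succeed outright, the third succeeds after reclaim, and the fourth exhausts its retries and records an OOM event.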
Liangxu Linux
