Understanding cgroup and namespace in Linux for Cloud‑Native Containers

This article explains the role of Linux cgroup and namespace technologies in providing resource isolation and security for containers, traces their historical development from early chroot mechanisms to modern Docker and Kubernetes, and details cgroup architecture, core files, migration, delegation, and practical usage examples.

Cloud Native Technology Community

Dec 2, 2021

Container and virtualization technologies achieve resource isolation through Linux kernel features called cgroup and namespace. cgroup manages resource allocation and limits, while namespace provides isolated views of global resources for processes.
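To make the namespace half of this concrete, here is a minimal Go sketch, assuming a Linux host with /proc mounted. Each entry under /proc/self/ns is a symbolic link whose target looks like net:[4026531992]; two processes share a namespace when their link targets match. The helper name parseNsLink is hypothetical and is not taken from the article.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseNsLink splits a /proc/<pid>/ns link target such as "net:[4026531992]"
// into the namespace type and its inode number. (Hypothetical helper.)
func parseNsLink(target string) (kind string, inode uint64, err error) {
	kind, rest, ok := strings.Cut(target, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err = strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	return kind, inode, err
}

func main() {
	const nsDir = "/proc/self/ns" // assumption: Linux with procfs mounted
	entries, err := os.ReadDir(nsDir)
	if err != nil {
		fmt.Println("cannot list namespaces:", err)
		return
	}
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(nsDir, e.Name()))
		if err != nil {
			continue
		}
		if kind, inode, err := parseNsLink(target); err == nil {
			fmt.Printf("%-18s inode %d\n", kind, inode)
		}
	}
}
```

Running it inside and outside a container should print different inode numbers for the namespaces the container runtime has unshared.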

The article reviews the rapid growth of cloud‑native/container ecosystems, starting from Unix V7's chroot jail in 1979, through Docker (2013) and Kubernetes (2014), highlighting why resource distribution and security have become hot topics.

It discusses the security shortcomings of chroot (root can escape) and Linux VServer (vulnerable to "chroot‑again" attacks), illustrating with a Go code example that demonstrates chroot behavior for normal users versus sudo:

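package main

import (
    "log"
    "os"
    "syscall"
)

// getWd logs and returns the current working directory.
func getWd() string {
    path, err := os.Getwd()
    if err != nil {
        log.Println(err)
    }
    log.Println(path)
    return path
}

func main() {
    // Keep a handle on the real root so we can return to it after chroot.
    realRoot, err := os.Open("/")
    if err != nil {
        log.Fatalf("[ Error ] - /: %v\n", err)
    }
    defer realRoot.Close() // only defer Close once Open is known to have succeeded

    // Make the current directory the new root; this needs elevated privileges.
    path := getWd()
    if err = syscall.Chroot(path); err != nil {
        log.Fatalf("[ Error ] - chroot: %v\n", err)
    }
    getWd()

    // Changing directory through the saved handle steps outside the jail...
    if err = realRoot.Chdir(); err != nil {
        log.Fatalf("[ Error ] - chdir(): %v", err)
    }
    getWd()

    // ...and a second chroot(".") restores the original root.
    if err = syscall.Chroot("."); err != nil {
        log.Fatalf("[ Error ] - chroot back: %v", err)
    }
    getWd()
}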

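Run as a normal user, the program fails at the first chroot call with "operation not permitted"; run with sudo, it succeeds, because changing the root directory requires elevated privileges (the CAP_SYS_CHROOT capability). In the privileged run, the directory handle opened before the first chroot lets the process change back to the real root and chroot again, which is exactly the escape that makes chroot on its own a weak security boundary.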

The article then outlines modern container advantages such as lightweight nature, isolation, standardization via images, DevOps/GitOps support, and improved reliability.

It defines cgroup as a Linux kernel feature introduced in 2008 (kernel v2.6.24) that limits CPU, memory, network, and disk I/O for a group of processes. Two major versions exist: cgroup v1 and v2, with the article focusing on v2.

cgroup consists of a core hierarchy and controllers. Core files include cgroup.type, cgroup.procs, cgroup.controllers, cgroup.subtree_control, cgroup.events, cgroup.threads, cgroup.max.descendants, cgroup.max.depth, cgroup.stat, cgroup.freeze, and cgroup.kill. Each file controls specific aspects such as type, process membership, available controllers, subtree activation, event flags, thread IDs, limits on descendants, depth, statistics, freezing, and killing of groups.

Each process belongs to exactly one cgroup. It can be moved by writing its PID to the target cgroup's cgroup.procs file. Migration is relatively expensive, and stateful resources such as memory already charged to the process do not move with it, so it is rarely used.

cgroups form a tree in which resource limits flow from parent to children. The article illustrates with diagrams how CPU and memory limits set on a parent, cgroup1, bound its children cgroup2 through cgroup4, and explains that child cgroups cannot compete with their parent for resources: together they can only use what the parent itself has been granted.
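One way to reason about this top-down flow is that the effective limit for a cgroup is the tightest limit found on the path from the root down to it. The Go sketch below computes that for memory.max values; the function name effectiveMax and the sample values are hypothetical, and the value "max" is treated as "no limit at this level".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// effectiveMax takes the memory.max values of a cgroup and its ancestors
// (in any order) and returns the smallest numeric limit among them.
// limited is false when every level reports "max".
func effectiveMax(chain []string) (limit uint64, limited bool, err error) {
	for _, v := range chain {
		v = strings.TrimSpace(v)
		if v == "max" {
			continue
		}
		n, err := strconv.ParseUint(v, 10, 64)
		if err != nil {
			return 0, false, err
		}
		if !limited || n < limit {
			limit, limited = n, true
		}
	}
	return limit, limited, nil
}

func main() {
	// Hypothetical chain: parent limited to 2 GiB, child unlimited,
	// grandchild configured with 4 GiB.
	chain := []string{"2147483648", "max", "4294967296"}
	limit, limited, err := effectiveMax(chain)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if !limited {
		fmt.Println("no limit on this path")
		return
	}
	fmt.Printf("effective memory.max: %d bytes\n", limit)
}
```

In this example the grandchild's own 4 GiB setting has no practical effect, because the 2 GiB limit on its ancestor is reached first.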

The cgroup v2 filesystem accepts mount options such as nsdelegate, memory_localevents, and memory_recursiveprot. Delegation grants a user write access to a cgroup's directory and to its cgroup.procs, cgroup.threads, and cgroup.subtree_control files, enabling that user to create sub-hierarchies under the delegated node.

The resource distribution model includes weights, limits, protections, and allocations, providing functions such as resource limiting, prioritization, auditing, and process control.
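Of these, weights are the least intuitive, so a short worked example may help. Under contention, a parent's resource is split among its active children in proportion to their weights. The Go sketch below does that arithmetic; shares is a hypothetical helper and the weights are made-up sample values.

```go
package main

import "fmt"

// shares returns, for each sibling weight, the fraction of the parent's
// resource that sibling receives when all siblings are competing.
func shares(weights []uint64) []float64 {
	var total uint64
	for _, w := range weights {
		total += w
	}
	out := make([]float64, len(weights))
	if total == 0 {
		return out // no active siblings, nothing to distribute
	}
	for i, w := range weights {
		out[i] = float64(w) / float64(total)
	}
	return out
}

func main() {
	// Hypothetical siblings with cpu.weight 100 and 300.
	for i, s := range shares([]uint64{100, 300}) {
		fmt.Printf("sibling %d gets %.2f of the parent's CPU time\n", i, s)
	}
}
```

With weights 100 and 300 the split is 0.25 and 0.75. Unlike a limit, a weight only matters while siblings actually compete; an otherwise idle system lets either sibling use more.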

Differences between cgroup v1 and v2 are highlighted: v2 drops support for multiple hierarchies, removes the tasks file (cgroup.procs is used instead) and the cgroup.clone_children file, and uses the root cgroup.controllers file in place of /proc/cgroups. v1's problems with mounting, hierarchy management, and the inability to unmount subsystems are also discussed.

Finally, a Docker example shows how container CPU and memory limits map to cgroup files cpu.max and memory.max:

➜  ~ docker run --rm -d --cpus=2 --memory=2g --name=2c2g redis:alpine
...container id...
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max
200000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
2147483648

➜  ~ docker run --rm -d --cpus=0.5 --memory=0.5g --name=0.5c0.5g redis:alpine
...container id...
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/cpu.max
50000 100000
➜  ~ cat /sys/fs/cgroup/system.slice/docker-`docker ps -lq --no-trunc`.scope/memory.max
536870912

These commands demonstrate that Docker configures the underlying cgroup files to enforce the specified resource quotas.
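The two numbers in cpu.max are a quota and a period in microseconds, so the CPU count is simply quota divided by period. The Go sketch below parses the values shown above; parseCPUMax is a hypothetical helper, and a quota of "max" is treated as "no limit".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMax converts the content of a cpu.max file ("<quota> <period>")
// into a CPU count. limited is false when the quota is "max".
func parseCPUMax(s string) (cpus float64, limited bool, err error) {
	fields := strings.Fields(s)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content %q", s)
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil || period <= 0 {
		return 0, false, fmt.Errorf("bad period in %q", s)
	}
	if fields[0] == "max" {
		return 0, false, nil
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return 0, false, err
	}
	return quota / period, true, nil
}

func main() {
	// The two values read back from the containers above.
	for _, s := range []string{"200000 100000", "50000 100000"} {
		cpus, limited, err := parseCPUMax(s)
		if err != nil || !limited {
			fmt.Printf("%q: no usable limit (%v)\n", s, err)
			continue
		}
		fmt.Printf("%q -> %.1f CPUs\n", s, cpus)
	}
}
```

The memory values need no conversion beyond units: 2147483648 bytes is 2 GiB and 536870912 bytes is 0.5 GiB, matching the --memory flags.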

