
Unveiling Kubernetes CPU Overcommit: Why Requests and Limits Matter

This article explains how Kubernetes uses both requests and limits to manage container CPU resources, detailing the underlying Linux cgroup mechanisms, the concept of over‑selling CPU cores, and practical guidance for SREs configuring optimal request and limit values.


1. Container Cloud Stack

Kubernetes (K8s) is the dominant container orchestration platform, built on Linux kernel features such as cgroups, namespaces, and rootfs. CPU resource control for containers relies on cgroups.

2. Container CPU Resource Control

2.1 CPU Bandwidth Limiting

Linux uses the cpu cgroup to enforce CPU limits. For cgroup v1, the limit can be set via the cgroup filesystem:

# cd /sys/fs/cgroup/cpu,cpuacct
# mkdir test
# cd test
# echo 100000 > cpu.cfs_period_us   # period: 100 ms
# echo 200000 > cpu.cfs_quota_us    # quota: 200 ms
# echo $pid > cgroup.procs

cpu.cfs_period_us defines the scheduling period, while cpu.cfs_quota_us defines the maximum CPU time the cgroup may consume within that period; a quota of 200 ms per 100 ms period therefore limits the cgroup to 200/100 = 2 cores' worth of CPU time.
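To make the arithmetic concrete, here is a minimal Go sketch that applies the same two-core cap programmatically; it assumes the test cgroup created above on a cgroup v1 host and must run as root:

package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	// Reuse the "test" cgroup from the shell example above (cgroup v1).
	base := "/sys/fs/cgroup/cpu,cpuacct/test"
	period := 100000 // 100 ms scheduling period
	quota := 200000  // 200 ms of runtime per period => a 2-core cap

	write := func(name, value string) {
		if err := os.WriteFile(base+"/"+name, []byte(value), 0644); err != nil {
			panic(err)
		}
	}
	write("cpu.cfs_period_us", strconv.Itoa(period))
	write("cpu.cfs_quota_us", strconv.Itoa(quota))
	write("cgroup.procs", strconv.Itoa(os.Getpid())) // move this process in

	fmt.Printf("capped at %.1f cores\n", float64(quota)/float64(period))
}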

The kernel scheduler periodically assigns CPU time to each task group via a period timer. The callback chain is:

sched_cfs_period_timer
 -> do_sched_cfs_period_timer
    -> __refill_cfs_bandwidth_runtime

The core function that refills the runtime is:

/* Called at each period boundary: restore the group's runtime to its full quota. */
void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
    if (cfs_b->quota != RUNTIME_INF)
        cfs_b->runtime = cfs_b->quota;
}

During scheduling, update_curr updates the runtime remaining for each cfs_rq:

/* in update_curr(): charge the elapsed execution time against the group's quota */
u64 now = rq_clock_task(rq_of(cfs_rq));
u64 delta_exec = now - curr->exec_start;
...
account_cfs_rq_runtime(cfs_rq, delta_exec);

If a cgroup exhausts its allocated runtime, the kernel throttles it by removing its task group from the runqueue.

static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
    if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
        return false;
    ...
    throttle_cfs_rq(cfs_rq);
    return true;
}
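From user space, throttling is observable in the cgroup's cpu.stat file, which accumulates nr_periods, nr_throttled, and throttled_time. A minimal Go sketch, again assuming the test cgroup from section 2.1:

package main

import (
	"fmt"
	"os"
)

func main() {
	// nr_throttled counts periods in which the group exhausted its quota;
	// throttled_time is the total time (ns) it spent dequeued.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu,cpuacct/test/cpu.stat")
	if err != nil {
		panic(err)
	}
	fmt.Print(string(data))
}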

2.2 CPU Weight Distribution

Beyond hard limits, the kernel also distributes CPU time proportionally using weights. In cgroup v1 this is configured via cpu.shares, and in cgroup v2 via cpu.weight or cpu.weight.nice.
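As a quick illustration, doubling a group's weight means writing 2048 to cpu.shares on v1 (default 1024) or 200 to cpu.weight on v2 (default 100). A sketch assuming the v1 test cgroup from above:

package main

import "os"

func main() {
	// cgroup v1: 2048 shares = twice the default weight relative to siblings.
	// On a cgroup v2 host the equivalent is writing "200" to .../cpu.weight.
	err := os.WriteFile("/sys/fs/cgroup/cpu,cpuacct/test/cpu.shares", []byte("2048"), 0644)
	if err != nil {
		panic(err)
	}
}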

The kernel stores the weight in the task group's scheduling entity:

struct task_group {
    ...
    /* per-CPU arrays: one scheduling entity and one runqueue per CPU */
    struct sched_entity **se;
    struct cfs_rq       **cfs_rq;
    unsigned long        shares;
};
struct sched_entity {
    struct load_weight load;   /* weight derived from shares */
    ...
};
struct load_weight {
    unsigned long weight;
};

The virtual runtime is scaled by weight:

vruntime = (actual_runtime * ((NICE_0_LOAD * 2^32) / weight)) >> 32

Higher weight yields slower vruntime growth, so the scheduler grants that group more CPU time. For example, on an 8‑core machine with containers A, B, and C having weights 512, 1024, and 2048, CPU time under contention is divided in the ratio 1:2:4; under the Kubernetes convention of 1024 shares per core, those weights correspond to guaranteed shares of 0.5, 1, and 2 cores.
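A short Go sketch of that fixed-point arithmetic (NICE_0_LOAD is 1024 in the kernel) makes the inverse relationship visible: charging the same 10 ms of real runtime costs the heavier groups proportionally less vruntime:

package main

import "fmt"

const nice0Load = 1024 // NICE_0_LOAD: the load weight of a nice-0 task

// scaledDelta mirrors the formula above:
// delta_vruntime = (actual_runtime * ((NICE_0_LOAD * 2^32) / weight)) >> 32
func scaledDelta(deltaExecNs, weight uint64) uint64 {
	invWeight := (nice0Load << 32) / weight // precomputed inverse weight
	return (deltaExecNs * invWeight) >> 32
}

func main() {
	for _, w := range []uint64{512, 1024, 2048} {
		fmt.Printf("weight %4d: 10 ms of runtime -> %d ns of vruntime\n",
			w, scaledDelta(10_000_000, w))
	}
}

Running it prints 20 ms, 10 ms, and 5 ms of vruntime respectively, which is why the 2048-weight group gets picked to run four times as often as the 512-weight one.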

3. Requests and Limits Semantics in K8s

A typical pod definition shows how limits and requests are expressed:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "1"
      requests:
        cpu: "0.5"
Limits are enforced via cgroup period and quota, defining a hard ceiling. Requests are translated into weight values (1024 shares per core in cgroup v1, a weight of roughly 39 per core in cgroup v2) and guarantee that the requested CPU is available even under contention.
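The conversions are plain integer arithmetic. The sketch below mirrors the kubelet's MilliCPUToShares and CpuSharesToCpuWeight helpers (the real code applies some extra clamping at the extremes):

package main

import "fmt"

const (
	sharesPerCPU = 1024 // cgroup v1: one core of request = 1024 shares
	minShares    = 2    // kernel-imposed minimum
)

func milliCPUToShares(milliCPU int64) uint64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / 1000
	if shares < minShares {
		return minShares
	}
	return uint64(shares)
}

// cgroup v2 maps the shares range [2, 262144] onto the weight range [1, 10000].
func sharesToWeight(shares uint64) uint64 {
	return (shares-2)*9999/262142 + 1
}

func main() {
	for _, m := range []int64{500, 1000, 2000} { // 0.5, 1, and 2 cores
		s := milliCPUToShares(m)
		fmt.Printf("request %4dm -> v1 shares %4d -> v2 weight %d\n",
			m, s, sharesToWeight(s))
	}
}

For a 1-core request this yields 1024 shares and a cgroup v2 weight of 39, matching the figures above.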

The K8s scheduler only places pods on a node while the sum of their requests stays within the node's allocatable CPU, whereas the sum of limits may exceed the physical core count, enabling over‑selling.

Thus a container’s usable CPU lies in the interval [requests, limits]:

Requests provide a guaranteed minimum.

If the node is idle, the container may exceed its request up to the limit.

Example: on an 8‑core host, four identical containers each have request = 2 cores (cpu.shares = 2048) and limit = 3 cores (quota 300 ms per 100 ms period). Total requests equal 8 cores, but total limits sum to 12 cores, illustrating over‑commit.
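Working that example through the same arithmetic (the kubelet's MilliCPUToQuota helper, with its default 100 ms period) confirms both the per-container settings and the over-committed totals:

package main

import "fmt"

const periodUs = 100000 // kubelet's default CFS period: 100 ms

// milliCPUToQuota mirrors the kubelet helper of the same name.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * periodUs / 1000
}

func main() {
	const containers, requestM, limitM = 4, 2000, 3000
	fmt.Printf("per container: %d shares, quota %d us per %d us period\n",
		requestM*1024/1000, milliCPUToQuota(limitM), periodUs)
	fmt.Printf("node totals: requests %d cores (fills the host), limits %d cores (over-committed)\n",
		containers*requestM/1000, containers*limitM/1000)
}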

4. Summary

Kubernetes uses two fields for CPU control:

Limits set a hard upper bound to prevent a container from monopolizing CPU.

Requests allocate weight‑based shares that are guaranteed even when the node is busy.

By configuring requests to stay within the physical core count and setting limits at a reasonable multiple (e.g., 1.5×), SREs can safely over‑sell CPU resources while maintaining performance guarantees.

For a container advertised as 12 cores on an 8‑core host, the manifest would be:

apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: cpu-example
spec:
  containers:
  - name: cpu-demo-ctr
    image: vish/stress
    resources:
      limits:
        cpu: "12"
      requests:
        cpu: "8"

Even though the limit exceeds the physical core count (so the 12-core quota can never actually be exhausted on this host), the container may consume all idle CPU when other workloads are quiet; under contention, its weight still guarantees the 8 requested cores.

Tags: cloud-native, Kubernetes, Overcommit, cgroups, cpu-limits
Written by

Refining Core Development Skills

Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
