Mastering Kubernetes Node Allocatable: Reserve Resources to Prevent Cluster Failures
Learn how Kubernetes distinguishes compressible (CPU) and non‑compressible (memory, storage) resources, why default kubelet settings can cause resource contention, and how to use the Node Allocatable feature—configuring kube‑reserved, system‑reserved, and eviction thresholds—to safely reserve resources for system daemons and avoid cluster instability.
Background
Kubernetes classifies system resources into compressible resources (CPU) and non‑compressible resources (memory, storage). By default, the kubelet does not reserve any resources, allowing all node resources to be used by Pods, which can lead to contention and OOM situations under heavy load.
Resource Reservation
When Pods compete with system daemons and Kubernetes components for resources, the node can hit OOM kills; the evicted Pods then reschedule onto other nodes, which can snowball into a cluster-wide cascading failure (a "cluster avalanche"). The Node Allocatable feature reserves resources for system processes and Kubernetes components to prevent this.
Node Allocatable
The kubelet provides a Node Allocatable feature that reserves CPU, memory, and ephemeral storage for system daemons and Kubernetes components. The node's capacity breaks down as follows:
Node Capacity: total hardware resources of the node.
kube‑reserved: resources reserved for Kubernetes system processes (kubelet, container runtime, etc.).
system‑reserved: resources reserved for Linux system daemons.
eviction‑threshold: memory or storage thresholds that trigger hard eviction.
allocatable: resources available for Pods, used by the scheduler.
<code># Node Allocatable calculation formula:
allocatable = NodeCapacity - [kube-reserved] - [system-reserved] - [eviction-threshold]
</code>
Parameter Meaning and Configuration
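Before looking at the individual flags, the formula can be sanity-checked with plain shell arithmetic. All capacity and reservation values below are hypothetical, chosen only to make the subtraction easy to follow:

```shell
# Hypothetical node with 16Gi of memory; all values in MiB.
capacity=16384        # NodeCapacity: 16Gi
kube_reserved=1024    # --kube-reserved=memory=1Gi
system_reserved=2048  # --system-reserved=memory=2Gi
eviction_hard=100     # --eviction-hard=memory.available<100Mi

# allocatable = NodeCapacity - kube-reserved - system-reserved - eviction-threshold
allocatable=$((capacity - kube_reserved - system_reserved - eviction_hard))
echo "Allocatable memory: ${allocatable}Mi"
# prints "Allocatable memory: 13212Mi"
```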
--enforce-node-allocatable: defaults to pods; set it to pods,kube-reserved,system-reserved to also enforce the reservations for system components.
--cgroups-per-qos: enables QoS- and pod-level cgroups (enabled by default).
--cgroup-driver: selects the cgroup driver (cgroupfs by default; systemd is the alternative).
--kube-reserved: resources reserved for Kubernetes components, e.g. cpu=2000m,memory=8Gi,ephemeral-storage=16Gi.
--kube-reserved-cgroup: cgroup path for the kube-reserved resources; the cgroup must exist beforehand.
--system-reserved: resources reserved for Linux system daemons, e.g. cpu=2000m,memory=4Gi,ephemeral-storage=8Gi.
--system-reserved-cgroup: cgroup path for the system-reserved resources; the cgroup must exist beforehand.
--eviction-hard: hard-eviction thresholds for memory and ephemeral storage; the reserved headroom below these thresholds is excluded from Allocatable.
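For example, a typical hard-eviction setting covers memory plus the node and image filesystems. The threshold values here are illustrative, not recommendations; tune them to your node sizes:

```shell
# Illustrative hard-eviction thresholds (memory.available, nodefs.available,
# and imagefs.available are the kubelet's standard eviction signals):
--eviction-hard=memory.available<500Mi,nodefs.available<10%,imagefs.available<10%
```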
Configuration and Verification
To enforce cgroup‑level limits for Pods, system processes, and Kubernetes components, add the following parameters to the kubelet startup command:
<code># Add to kubelet startup parameters:
--enforce-node-allocatable=pods,kube-reserved,system-reserved \
--cgroup-driver=cgroupfs \
--kube-reserved=cpu=1,memory=1Gi,ephemeral-storage=10Gi \
--kube-reserved-cgroup=/system.slice/kubelet.service \
--system-reserved=cpu=1,memory=2Gi,ephemeral-storage=10Gi \
--system-reserved-cgroup=/system.slice
</code>
Ensure the cpuset cgroup exists for the reserved cgroup paths; otherwise the kubelet will fail to start, because its Exists check requires every whitelisted controller (including cpuset) to be present:
<code>// Exists checks if all subsystem cgroups already exist
func (m *cgroupManagerImpl) Exists(name CgroupName) bool {
// Get map of all cgroup paths on the system for the particular cgroup
cgroupPaths := m.buildCgroupPaths(name)
// whitelist of controllers we care about
whitelistControllers := sets.NewString("cpu", "cpuacct", "cpuset", "memory", "systemd")
for controller, path := range cgroupPaths {
if !whitelistControllers.Has(controller) {
continue
}
if !libcontainercgroups.PathExists(path) {
return false
}
}
return true
}
</code>
<code># Manually create the missing cpuset cgroups:
sudo mkdir -p /sys/fs/cgroup/cpuset/system.slice
sudo mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
</code>
After restarting the kubelet, verify that the node's reported Allocatable matches the calculated value and that the cgroup limits were applied (e.g. /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes), confirming that resource reservation works as expected.
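One way to cross-check the numbers (a sketch; the paths assume cgroup v1, and the node size and flag values are hypothetical): the kubepods cgroup memory limit should equal NodeCapacity minus kube-reserved and system-reserved, while the Allocatable shown by `kubectl describe node` additionally subtracts the hard-eviction threshold:

```shell
# Hypothetical 16Gi node with the reservations used earlier in this article.
GiB=$((1024 * 1024 * 1024))
MiB=$((1024 * 1024))
capacity=$((16 * GiB))
kube_reserved=$((1 * GiB))             # --kube-reserved=memory=1Gi
system_reserved=$((2 * GiB))           # --system-reserved=memory=2Gi
eviction_hard=$((100 * MiB))           # --eviction-hard=memory.available<100Mi

# The kubepods cgroup limit excludes only the reservations; compare against:
#   cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
kubepods_limit=$((capacity - kube_reserved - system_reserved))

# The scheduler-visible Allocatable also excludes the eviction threshold;
# compare against: kubectl describe node <node>
allocatable=$((kubepods_limit - eviction_hard))

echo "expected kubepods limit: $kubepods_limit"
echo "expected allocatable:    $allocatable"
```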
Best Practices
In production, limit resources for Pods, Kubernetes system components, and Linux system processes simultaneously to prevent any single class from overwhelming the node.
For system‑level Pods created by DaemonSets, set the QoS class to Guaranteed to ensure stable resource guarantees.
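A Pod gets the Guaranteed QoS class when every container sets resource limits equal to its requests. A minimal DaemonSet sketch (the name, image, and resource values are illustrative, not a real workload):

```shell
# Requests == limits for every container yields the Guaranteed QoS class.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent          # hypothetical system-level agent
spec:
  selector:
    matchLabels: {app: node-agent}
  template:
    metadata:
      labels: {app: node-agent}
    spec:
      containers:
      - name: agent
        image: example.com/node-agent:latest   # placeholder image
        resources:
          requests: {cpu: 100m, memory: 128Mi}
          limits:   {cpu: 100m, memory: 128Mi}
EOF
```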
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.