Mastering Kubernetes Node Allocatable: Reserve Resources to Prevent Cluster Failures
Learn how Kubernetes distinguishes compressible (CPU) and non‑compressible (memory, storage) resources, why default kubelet settings can cause resource contention, and how to use the Node Allocatable feature—configuring kube‑reserved, system‑reserved, and eviction thresholds—to safely reserve resources for system daemons and avoid cluster instability.
Background
Kubernetes classifies system resources into compressible resources (CPU) and non‑compressible resources (memory, storage). By default, the kubelet does not reserve any resources, allowing all node resources to be used by Pods, which can lead to contention and OOM situations under heavy load.
Resource Reservation
When Pods compete with system daemons and Kubernetes components for resources, the node can hit OOM kills; the evicted Pods then reschedule onto other nodes, which can snowball into a cluster-wide cascading failure (a "cluster avalanche"). The Node Allocatable feature reserves resources for system processes and Kubernetes components to prevent this.
Node Allocatable
The kubelet provides a Node Allocatable feature that reserves CPU, memory, and ephemeral storage for system daemons and Kubernetes components. The node's capacity breaks down as follows:
Node Capacity: total hardware resources of the node.
kube‑reserved: resources reserved for Kubernetes system processes (kubelet, container runtime, etc.).
system‑reserved: resources reserved for Linux system daemons.
eviction‑threshold: memory or storage thresholds that trigger hard eviction.
allocatable: resources available for Pods, used by the scheduler.
<code># Node Allocatable calculation formula:
allocatable = NodeCapacity - [kube-reserved] - [system-reserved] - [eviction-threshold]
</code>
Parameter Meaning and Configuration
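Before looking at the individual flags, the formula can be sanity-checked with plain shell arithmetic. All capacity and reservation values below are hypothetical, chosen only to make the subtraction easy to follow:

```shell
# Hypothetical node with 16Gi of memory; all values in MiB.
capacity=16384        # NodeCapacity: 16Gi
kube_reserved=1024    # --kube-reserved=memory=1Gi
system_reserved=2048  # --system-reserved=memory=2Gi
eviction_hard=100     # --eviction-hard=memory.available<100Mi

# allocatable = NodeCapacity - kube-reserved - system-reserved - eviction-threshold
allocatable=$((capacity - kube_reserved - system_reserved - eviction_hard))
echo "Allocatable memory: ${allocatable}Mi"
# prints "Allocatable memory: 13212Mi"
```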
--enforce-node-allocatable: defaults to pods; set it to pods,kube-reserved,system-reserved to also enforce the reservations for system components.
--cgroups-per-qos: enables QoS- and pod-level cgroups (enabled by default).
--cgroup-driver: selects the cgroup driver (cgroupfs by default; systemd is the alternative).
--kube-reserved: resources reserved for Kubernetes components, e.g. cpu=2000m,memory=8Gi,ephemeral-storage=16Gi.
--kube-reserved-cgroup: cgroup path for the kube-reserved resources; the cgroup must exist beforehand.
--system-reserved: resources reserved for Linux system daemons, e.g. cpu=2000m,memory=4Gi,ephemeral-storage=8Gi.
--system-reserved-cgroup: cgroup path for the system-reserved resources; the cgroup must exist beforehand.
--eviction-hard: hard-eviction thresholds for memory and ephemeral storage; the reserved headroom below these thresholds is excluded from Allocatable.
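For example, a typical hard-eviction setting covers memory plus the node and image filesystems. The threshold values here are illustrative, not recommendations; tune them to your node sizes:

```shell
# Illustrative hard-eviction thresholds (memory.available, nodefs.available,
# and imagefs.available are the kubelet's standard eviction signals):
--eviction-hard=memory.available<500Mi,nodefs.available<10%,imagefs.available<10%
```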
Configuration and Verification
To enforce cgroup‑level limits for Pods, system processes, and Kubernetes components, add the following parameters to the kubelet startup command:
<code># Add to kubelet startup parameters:
--enforce-node-allocatable=pods,kube-reserved,system-reserved \
--cgroup-driver=cgroupfs \
--kube-reserved=cpu=1,memory=1Gi,ephemeral-storage=10Gi \
--kube-reserved-cgroup=/system.slice/kubelet.service \
--system-reserved=cpu=1,memory=2Gi,ephemeral-storage=10Gi \
--system-reserved-cgroup=/system.slice
</code>
Ensure the cpuset cgroup exists for the reserved cgroup paths; otherwise the kubelet will fail to start, because its Exists check requires every whitelisted controller (including cpuset) to be present:
<code>// Exists checks if all subsystem cgroups already exist
func (m *cgroupManagerImpl) Exists(name CgroupName) bool {
// Get map of all cgroup paths on the system for the particular cgroup
cgroupPaths := m.buildCgroupPaths(name)
// whitelist of controllers we care about
whitelistControllers := sets.NewString("cpu", "cpuacct", "cpuset", "memory", "systemd")
for controller, path := range cgroupPaths {
if !whitelistControllers.Has(controller) {
continue
}
if !libcontainercgroups.PathExists(path) {
return false
}
}
return true
}
</code>
<code># Manually create the missing cpuset cgroups:
sudo mkdir -p /sys/fs/cgroup/cpuset/system.slice
sudo mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
</code>
After restarting the kubelet, verify that the node's reported Allocatable matches the calculated value and that the cgroup limits were applied (e.g. /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes), confirming that resource reservation works as expected.
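One way to cross-check the numbers (a sketch; the paths assume cgroup v1, and the node size and flag values are hypothetical): the kubepods cgroup memory limit should equal NodeCapacity minus kube-reserved and system-reserved, while the Allocatable shown by `kubectl describe node` additionally subtracts the hard-eviction threshold:

```shell
# Hypothetical 16Gi node with the reservations used earlier in this article.
GiB=$((1024 * 1024 * 1024))
MiB=$((1024 * 1024))
capacity=$((16 * GiB))
kube_reserved=$((1 * GiB))             # --kube-reserved=memory=1Gi
system_reserved=$((2 * GiB))           # --system-reserved=memory=2Gi
eviction_hard=$((100 * MiB))           # --eviction-hard=memory.available<100Mi

# The kubepods cgroup limit excludes only the reservations; compare against:
#   cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
kubepods_limit=$((capacity - kube_reserved - system_reserved))

# The scheduler-visible Allocatable also excludes the eviction threshold;
# compare against: kubectl describe node <node>
allocatable=$((kubepods_limit - eviction_hard))

echo "expected kubepods limit: $kubepods_limit"
echo "expected allocatable:    $allocatable"
```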
Best Practices
In production, limit resources for Pods, Kubernetes system components, and Linux system processes simultaneously to prevent any single class from overwhelming the node.
For system‑level Pods created by DaemonSets, set the QoS class to Guaranteed to ensure stable resource guarantees.
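A Pod gets the Guaranteed QoS class when every container sets resource limits equal to its requests. A minimal DaemonSet sketch (the name, image, and resource values are illustrative, not a real workload):

```shell
# Requests == limits for every container yields the Guaranteed QoS class.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent          # hypothetical system-level agent
spec:
  selector:
    matchLabels: {app: node-agent}
  template:
    metadata:
      labels: {app: node-agent}
    spec:
      containers:
      - name: agent
        image: example.com/node-agent:latest   # placeholder image
        resources:
          requests: {cpu: 100m, memory: 128Mi}
          limits:   {cpu: 100m, memory: 128Mi}
EOF
```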
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.