Prevent Kubernetes Cluster Collapse: Master Node Allocatable & Resource Reservations
This article explains how Kubernetes schedules pods against a node's total capacity, why missing resource reservations can cause node failures and cluster-wide avalanches, and how to configure Node Allocatable, kube-reserved, system-reserved, and eviction settings step by step to keep the cluster stable.
Node Allocatable
Kubernetes schedules pods according to a node's total resource capacity, allowing pods to use all available resources by default. Without reserving resources for system daemons, these processes compete with pods, leading to resource shortages.
In production, an uncontrolled pod can consume 100% of a node's CPU, starving the kubelet and apiserver and causing the node to become NotReady. By default, pods on a NotReady node are evicted after five minutes and rescheduled elsewhere, which can overload another node and trigger a cascading "cluster avalanche" in which nodes become NotReady one after another.
To avoid this, configure resource reservations using the kubelet's Node Allocatable feature, which reserves compute resources for system daemons.
Environment: Kubernetes v1.22.1, container runtime containerd, cgroup driver systemd.
Understanding Allocatable Resources
The Allocatable value represents the amount of CPU, memory, and ephemeral-storage that pods can request. It is shown alongside Capacity when running:
<code>kubectl describe node <node-name></code>
Typical output:
<code>Capacity:
cpu: 4
memory: 7990056Ki
pods: 110
Allocatable:
cpu: 4
memory: 7887656Ki
pods: 110</code>
When no reservations are set, Capacity and Allocatable are nearly identical. The relationship is:
<code>Node Allocatable Resource = Node Capacity - kube-reserved - system-reserved - eviction-threshold</code>
The sum of pod requests on a node must not exceed its Allocatable value.
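Plugging in numbers makes the formula concrete. The sketch below assumes the reservations used later in this article (1Gi kube-reserved, 1Gi system-reserved, a 300Mi memory.available eviction threshold) against the 7990056Ki capacity shown above:

```shell
# Illustrative allocatable-memory calculation; all values in KiB
capacity=7990056                  # node memory capacity
kube_reserved=$((1024 * 1024))    # 1Gi reserved for Kubernetes daemons
system_reserved=$((1024 * 1024))  # 1Gi reserved for OS daemons
eviction_hard=$((300 * 1024))     # 300Mi hard eviction threshold
allocatable=$((capacity - kube_reserved - system_reserved - eviction_hard))
echo "${allocatable}Ki"           # prints 5585704Ki
```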
Configuring Resource Reservations
Reserve resources for the system using kubelet flags:
<code>--enforce-node-allocatable=pods
--kube-reserved=memory=...
--system-reserved=memory=...
--eviction-hard=...</code>
For a specific node (e.g., node2), edit /var/lib/kubelet/config.yaml:
<code>apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
- pods
kubeReserved:
cpu: 500m
memory: 1Gi
ephemeral-storage: 1Gi
systemReserved:
memory: 1Gi
evictionHard:
memory.available: "300Mi"
nodefs.available: "10%"</code>
After restarting the kubelet (systemctl restart kubelet), re-run kubectl describe node to confirm the reduced Allocatable values match the reservation calculation:
<code>Allocatable CPU: 3500m (Capacity 4 - 500m kube-reserved)
Allocatable memory: 5585704Ki (Capacity 7990056Ki - 1Gi kube-reserved - 1Gi system-reserved - 300Mi eviction-hard)</code>
Eviction vs OOM
Eviction is kubelet‑driven pod removal; OOM is cgroup‑triggered process kill.
Eviction thresholds (e.g., --eviction-hard=memory.available<20%) trigger pod eviction when host memory usage exceeds 80%, but they do not change the cgroup limit /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes, which equals capacity - kube-reserved - system-reserved.
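The distinction is easy to check with arithmetic: the kubepods limit subtracts only the two reservations, never the eviction threshold. Using the example node's numbers (a sketch; actual values vary per node):

```shell
# kubepods memory limit = capacity - kube-reserved - system-reserved, in bytes
capacity_bytes=$((7990056 * 1024))          # 7990056Ki node capacity
reserved_bytes=$((2 * 1024 * 1024 * 1024))  # 1Gi kube-reserved + 1Gi system-reserved
limit=$((capacity_bytes - reserved_bytes))
echo "$limit"   # prints 6034333696
# Compare with the live value on a cgroup v1 node:
# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
```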
Under memory pressure, Kubernetes evicts pods roughly in QoS order: BestEffort pods (no requests or limits) first, then Burstable pods (requests lower than limits), and finally Guaranteed pods (requests equal to limits).
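As an illustration of the QoS classes this ordering rests on, a hypothetical pod spec whose requests equal its limits lands in the Guaranteed class and is evicted last:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo    # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:              # limits equal requests -> Guaranteed QoS
        cpu: 500m
        memory: 256Mi
```

Omitting the resources section entirely would instead make the pod BestEffort, the first class considered for eviction.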
EnforceNodeAllocatable Details
The flag --enforce-node-allocatable accepts a comma-separated list drawn from none, pods, system-reserved, and kube-reserved. Setting it to pods enforces the Allocatable constraint on pods. Adding kube-reserved or system-reserved additionally requires naming the corresponding cgroups via --kube-reserved-cgroup or --system-reserved-cgroup.
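As a sketch, enforcing kube-reserved in addition to pods means pointing the kubelet at an existing cgroup; the /kube.slice name below is illustrative, and the cgroup must be created on the node beforehand:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
- pods
- kube-reserved                 # also enforce limits on the kube-reserved cgroup
kubeReserved:
  cpu: 500m
  memory: 1Gi
kubeReservedCgroup: /kube.slice # illustrative; must already exist on the node
```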
For most users, enabling enforce-node-allocatable=pods and reserving appropriate kube-reserved and system-reserved resources is sufficient to keep nodes reliable without deep cgroup tuning.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.