Prevent Kubernetes Cluster Collapse: Master Node Allocatable & Resource Reservations
This article explains how Kubernetes schedules pods against a node's total capacity, why missing resource reservations can cause node failures and cluster-wide avalanches, and how to configure Node Allocatable, kube-reserved, system-reserved, and eviction settings step by step to keep the cluster stable.
Node Allocatable
Kubernetes schedules pods according to a node's total resource capacity, allowing pods to use all available resources by default. Without reserving resources for system daemons, these processes compete with pods, leading to resource shortages.
In production, an uncontrolled pod can consume 100% of a node's CPU, starving the kubelet and apiserver and causing the node to become NotReady. By default, pods on a NotReady node are evicted after five minutes and rescheduled elsewhere, which can overload another node and trigger a cascading "cluster avalanche" in which nodes become NotReady one after another.
To avoid this, configure resource reservations using the kubelet's Node Allocatable feature, which reserves compute resources for system daemons.
Environment: Kubernetes v1.22.1, container runtime containerd, cgroup driver systemd.
Understanding Allocatable Resources
The Allocatable value represents the amount of CPU, memory, and ephemeral-storage that pods can request. It is shown alongside Capacity when running:
<code>kubectl describe node <node-name></code>
Typical output:
<code>Capacity:
cpu: 4
memory: 7990056Ki
pods: 110
Allocatable:
cpu: 4
memory: 7887656Ki
pods: 110</code>
When no reservations are set, Capacity and Allocatable are nearly identical. The relationship is:
<code>Node Allocatable Resource = Node Capacity - kube-reserved - system-reserved - eviction-threshold</code>
The sum of pod requests on a node must not exceed its Allocatable value.
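Plugging in numbers makes the formula concrete. The sketch below assumes the reservations used later in this article (1Gi kube-reserved, 1Gi system-reserved, a 300Mi memory.available eviction threshold) against the 7990056Ki capacity shown above:

```shell
# Illustrative allocatable-memory calculation; all values in KiB
capacity=7990056                  # node memory capacity
kube_reserved=$((1024 * 1024))    # 1Gi reserved for Kubernetes daemons
system_reserved=$((1024 * 1024))  # 1Gi reserved for OS daemons
eviction_hard=$((300 * 1024))     # 300Mi hard eviction threshold
allocatable=$((capacity - kube_reserved - system_reserved - eviction_hard))
echo "${allocatable}Ki"           # prints 5585704Ki
```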
Configuring Resource Reservations
Reserve resources for the system using kubelet flags:
<code>--enforce-node-allocatable=pods
--kube-reserved=memory=...
--system-reserved=memory=...
--eviction-hard=...</code>
For a specific node (e.g., node2), edit /var/lib/kubelet/config.yaml:
<code>apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
- pods
kubeReserved:
cpu: 500m
memory: 1Gi
ephemeral-storage: 1Gi
systemReserved:
memory: 1Gi
evictionHard:
memory.available: "300Mi"
nodefs.available: "10%"</code>
After restarting the kubelet (systemctl restart kubelet), re-run kubectl describe node to confirm the reduced Allocatable values match the reservation calculation:
<code>Allocatable CPU: 3500m (Capacity 4 - 500m kube-reserved)
Allocatable memory: 5585704Ki (Capacity 7990056Ki - 1Gi kube-reserved - 1Gi system-reserved - 300Mi eviction-hard)</code>
Eviction vs OOM
Eviction is kubelet‑driven pod removal; OOM is cgroup‑triggered process kill.
Eviction thresholds (e.g., --eviction-hard=memory.available<20%) trigger pod eviction when host memory usage exceeds 80%, but they do not change the cgroup limit /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes, which equals capacity - kube-reserved - system-reserved.
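The distinction is easy to check with arithmetic: the kubepods limit subtracts only the two reservations, never the eviction threshold. Using the example node's numbers (a sketch; actual values vary per node):

```shell
# kubepods memory limit = capacity - kube-reserved - system-reserved, in bytes
capacity_bytes=$((7990056 * 1024))          # 7990056Ki node capacity
reserved_bytes=$((2 * 1024 * 1024 * 1024))  # 1Gi kube-reserved + 1Gi system-reserved
limit=$((capacity_bytes - reserved_bytes))
echo "$limit"   # prints 6034333696
# Compare with the live value on a cgroup v1 node:
# cat /sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes
```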
Under memory pressure, Kubernetes evicts pods roughly in QoS order: BestEffort pods (no requests or limits) first, then Burstable pods (requests lower than limits), and finally Guaranteed pods (requests equal to limits).
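As an illustration of the QoS classes this ordering rests on, a hypothetical pod spec whose requests equal its limits lands in the Guaranteed class and is evicted last:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo    # hypothetical name for illustration
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:              # limits equal requests -> Guaranteed QoS
        cpu: 500m
        memory: 256Mi
```

Omitting the resources section entirely would instead make the pod BestEffort, the first class considered for eviction.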
EnforceNodeAllocatable Details
The flag --enforce-node-allocatable accepts a comma-separated list drawn from none, pods, system-reserved, and kube-reserved. Setting it to pods enforces the Allocatable constraint on pods. Adding kube-reserved or system-reserved additionally requires naming the corresponding cgroups via --kube-reserved-cgroup or --system-reserved-cgroup.
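As a sketch, enforcing kube-reserved in addition to pods means pointing the kubelet at an existing cgroup; the /kube.slice name below is illustrative, and the cgroup must be created on the node beforehand:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
enforceNodeAllocatable:
- pods
- kube-reserved                 # also enforce limits on the kube-reserved cgroup
kubeReserved:
  cpu: 500m
  memory: 1Gi
kubeReservedCgroup: /kube.slice # illustrative; must already exist on the node
```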
For most users, enabling enforce-node-allocatable=pods and reserving appropriate kube-reserved and system-reserved resources is sufficient to keep nodes reliable without deep cgroup tuning.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.