Zero‑Downtime Kubernetes Node Maintenance: Complete SOP for Adding and Removing Nodes
This guide presents a step‑by‑step SOP for safely decommissioning and provisioning Kubernetes nodes in production. It covers lifecycle labeling, RBAC safeguards, draining procedures, validation checks, handling of StatefulSets and local storage, automation tips, and real‑world incident examples, with the aim of zero downtime and zero data loss.
Overview
The goal is zero business interruption, zero data loss, and zero uncontrolled risk during any node maintenance, scaling, replacement, or migration in a production Kubernetes cluster. Decommissioning a node is treated as a "move" of its workloads rather than a shutdown, and maintenance is treated as a cluster‑level release.
1. Node Lifecycle Model
active : participates in scheduling normally.
cordon : prevents new Pods from being scheduled on the node.
draining : Pods are being evicted and migrated.
retired : the node is retired and must not be reused.
deleted : the node object is removed from the cluster.
Label management commands:
kubectl label node node-1 lifecycle=active
kubectl label node node-1 lifecycle=draining
kubectl label node node-1 lifecycle=retired
2. RBAC Permission Model (Prevent Accidental Deletion)
Separate privileges in production environments so that ordinary operators cannot delete nodes. Example ClusterRole that allows only safe operations on nodes:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-maintainer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "patch", "update"]
Enforce three‑person segregation of duties or two‑person review for cordon, drain, and delete‑node actions.
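To make the role effective, bind it to the group of operators who run maintenance. A minimal sketch, assuming a hypothetical node-operators group:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-maintainer-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-maintainer
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: node-operators   # hypothetical group; substitute your own identity mapping
Note that kubectl drain additionally needs permissions on Pods and the pods/eviction subresource; the role above deliberately covers only Node objects and omits the delete verb.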
3. Full Node Decommission Process (Complete Version)
Planning → Lock → Evict → Verify → Delete → Clean
3.1 Capacity and Load Assessment
kubectl top nodes
kubectl top pods -A
kubectl describe nodes | grep -A5 Allocatable
Check the following resources:
CPU
Memory
GPU / local disk
Whether Horizontal Pod Autoscaler (HPA) would trigger scaling
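The HPA check from the list above can be done with standard kubectl commands, and it also pays to look at PodDisruptionBudgets now, before they stall an eviction mid‑drain:
# Any HPA already near maxReplicas has no headroom to absorb evicted Pods
kubectl get hpa -A
# PDBs with zero allowed disruptions will block the drain in step 3.3
kubectl get pdb -A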
3.2 Lock the Node (Prevent New Pods)
kubectl cordon node-1
kubectl label node node-1 lifecycle=draining
3.3 Evict Pods (Recommended Parameters)
kubectl drain node-1 \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=10m
Parameter meanings:
--ignore-daemonsets : skip eviction of DaemonSet Pods (system Pods).
--delete-emptydir-data : allow deletion of temporary emptyDir data.
--grace-period=60 : give workloads 60 seconds to shut down gracefully.
--timeout=10m : abort the drain if it does not finish within 10 minutes.
Drain respects PodDisruptionBudget, which is the primary safety valve for high availability.
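For reference, a minimal PodDisruptionBudget sketch that keeps at least two replicas of a hypothetical web Deployment available while its Pods are evicted:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # evictions are refused if they would drop ready Pods below 2
  selector:
    matchLabels:
      app: web           # hypothetical label; match your own workload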
3.4 Verify Eviction Completion
kubectl get pods -A --field-selector spec.nodeName=node-1
Only DaemonSet Pods from kube-system should remain.
3.5 Delete the Node Object
kubectl label node node-1 lifecycle=retired
kubectl delete node node-1
(Apply the retired label before deleting: once the Node object is gone, it can no longer be labeled.)
3.6 Physical/Cloud Resource Cleanup
Release the VM on the cloud provider.
Power off and recycle the physical machine.
Remove residual containers, mount points, and CNI bridges.
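A sketch of the on‑host cleanup, assuming the node was joined with kubeadm, runs containerd, and uses a bridge‑based CNI; exact steps vary by runtime and network plugin:
# Reset kubeadm-managed state (kubelet config, certificates, etc.)
sudo kubeadm reset
# Remove residual CNI configuration and the bridge interface
sudo rm -rf /etc/cni/net.d
sudo ip link delete cni0
# List and force-remove any leftover containers
sudo crictl ps -a
sudo crictl rm -fa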
4. StatefulSet & Local Storage Special Process (High‑Risk)
StatefulSet + Network Storage : drain is allowed.
StatefulSet + Local PV : drain is prohibited (the data exists only on that node's disk and cannot move with the Pod).
Database node : must switch primary role before draining.
Kafka / Elasticsearch / Zookeeper : must migrate leader or shard before draining.
Example: Kafka Broker Decommission
# 1. Reassign partitions at the Kafka layer
kafka-reassign-partitions.sh ...
# 2. Verify the broker holds no leaders
kafka-topics.sh --describe
# 3. Perform the Kubernetes decommission steps
kubectl cordon node-kafka-1
kubectl drain node-kafka-1 ...
Skipping the Kafka‑level migration raises the data‑loss probability to >50%.
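Before running the drain, it is worth confirming at the Kafka layer that nothing is under‑replicated; a sketch using standard Kafka tooling (the broker address is a placeholder):
# Should print nothing: partitions missing replicas would be at risk during the drain
kafka-topics.sh --bootstrap-server kafka:9092 --describe --under-replicated-partitions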
5. Node Addition Standard Process
Prepare → Join → Verify → Enable
5.1 System Preparation
Consistent OS version across the cluster.
Disable swap.
Open required network ports (e.g., 6443, 10250, 10255).
Install containerd, kubelet, and kubeadm.
5.2 Join the Cluster
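If the original bootstrap token has expired (they are short‑lived by default), generate a fresh join command on a control‑plane node first; this is standard kubeadm behavior:
# Prints a ready-to-run kubeadm join command with a new token and CA cert hash
kubeadm token create --print-join-command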
sudo kubeadm join 10.0.0.10:6443 \
--token abcdef.123456 \
--discovery-token-ca-cert-hash sha256:xxxx
5.3 Readiness Checks
kubectl get nodes
kubectl describe node node-new
Required node conditions (all must be as shown):
Ready : True
DiskPressure : False
MemoryPressure : False
PIDPressure : False
NetworkUnavailable : False
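These conditions can also be verified non‑interactively, which is useful in provisioning scripts; a minimal sketch with standard kubectl:
# Block until the node reports Ready, or fail after 2 minutes
kubectl wait --for=condition=Ready node/node-new --timeout=2m
# Print every condition as type=status on one line
kubectl get node node-new -o jsonpath='{range .status.conditions[*]}{.type}={.status} {end}'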
5.4 Core Component Verification
kubectl get pods -n kube-system -o wide | grep node-new
Ensure that the CNI plugin and kube-proxy are running on the new node.
5.5 Real‑World Business Validation
kubectl run test --image=nginx --restart=Never
kubectl get pod -o wide
Pod is scheduled successfully.
Associated Service is reachable.
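To exercise the Service path against the new node specifically, a sketch along these lines can help; the Pod and Service names are illustrative, and pinning via nodeName deliberately bypasses the scheduler:
# Pin a test Pod to the new node, expose it, and curl it from inside the cluster
kubectl run test-node-new --image=nginx --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-new"}}'
kubectl expose pod test-node-new --port=80 --name=test-node-new
kubectl run curl-check --image=curlimages/curl --restart=Never --rm -i \
  -- curl -s -o /dev/null -w '%{http_code}\n' http://test-node-new
A 200 here confirms scheduling, CNI networking, kube-proxy, and DNS all work on the new node. Delete the test Pod and Service afterwards.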
6. Maintenance Window & Node Disruption Budget
Simultaneous node decommission should not exceed 10 % of the total nodes.
Core nodes (e.g., control‑plane or etcd) must be decommissioned one at a time.
Interval between two drain operations should be at least 5 minutes to allow the cluster to stabilize.
Node maintenance is a release; it requires a defined window, canary testing, and a rollback plan.
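These window rules are easy to encode in a wrapper script; a minimal bash sketch that drains a list of nodes one at a time with a stabilization pause (node list and interval are placeholders):
#!/usr/bin/env bash
set -euo pipefail
NODES="node-1 node-2 node-3"   # never more than 10% of the cluster per window
for node in $NODES; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --grace-period=60 --timeout=10m
  echo "Drained $node; waiting 5 minutes for the cluster to stabilize..."
  sleep 300
done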
7. Observability & Rollback
Operational monitoring commands (run during a drain):
kubectl get pods -A | grep Pending
kubectl get events -A --sort-by=.lastTimestamp
kubectl get hpa -A
Business‑side metrics to watch:
QPS
Error rate
Latency
Restart count
If an anomaly is detected, stop the drain, analyze scheduling failures, and restore capacity:
# Example manual rollback steps
kubectl uncordon node-1 # allow scheduling again
# Re‑schedule pending Pods or scale up additional nodes as needed
8. Real Incident Cases
Case 1: Direct delete node
All Pods disappear instantly.
Service outage lasts ~30 seconds.
Data loss occurs.
Root cause: missing cordon and drain steps.
Case 2: Drain a Node with Local PV
MySQL data directory is lost.
Data is irrecoverable.
Root cause: treating a node with local storage as stateless.
Case 3: Overly Strict PodDisruptionBudget
minAvailable: 3
replicas: 3
All nodes become unmaintainable because no Pod can be evicted.
Fix by reducing minAvailable to a value that still preserves quorum, e.g., 2:
minAvailable: 2
9. Automation Directions (Advanced)
Cluster Autoscaler
Karpenter
Node pools with built‑in disruption budgets
Platform‑level SOP buttonization for one‑click decommission/provision
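As one concrete building block: when the Cluster Autoscaler manages node removal, individual nodes can be shielded from automatic scale‑down with its standard annotation, so that only your SOP, not the autoscaler, decommissions them:
# Tell the Cluster Autoscaler to never scale this node down
kubectl annotate node node-1 cluster-autoscaler.kubernetes.io/scale-down-disabled=true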
10. Final Summary Rules
Decommission: Lock → Evict → Demolish → Clean foundation
Provision: Lay foundation → Connect utilities → Verify safety → Allow occupancy
