
Zero‑Downtime Kubernetes Node Maintenance: Complete SOP for Adding and Removing Nodes

This guide presents a step-by-step SOP for safely provisioning and decommissioning Kubernetes nodes in production, covering lifecycle labeling, RBAC safeguards, draining procedures, validation checks, handling of StatefulSets and local storage, automation tips, and real-world incident examples, with the goal of zero downtime and zero data loss.

Ray's Galactic Tech

Overview

The goal is to achieve zero business interruption, zero data loss, and zero uncontrolled risk when performing any node maintenance, scaling, replacement, or migration in a production Kubernetes cluster. A node is treated as a "move" rather than a shutdown, and maintenance is considered a cluster‑level release.

1. Node Lifecycle Model

active : participates in scheduling normally.

cordon : prevents new Pods from being scheduled on the node.

draining : Pods are being evicted and migrated.

retired : the node is retired and must not be reused.

deleted : the node object is removed from the cluster.

Label management commands:

kubectl label node node-1 lifecycle=active
# --overwrite is required when changing an existing label value
kubectl label node node-1 lifecycle=draining --overwrite
kubectl label node node-1 lifecycle=retired --overwrite

2. RBAC Permission Model (Prevent Accidental Deletion)

Separate privileges in production environments so that ordinary operators cannot delete nodes. Example ClusterRole that allows only safe operations on nodes:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: node-maintainer
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch", "patch", "update"]

Enforce separation of duties (three-role segregation or at least two-person review) for cordon, drain, and node-delete actions.
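To put the ClusterRole above into effect, it must be bound to the operators allowed to perform maintenance. The binding below is a sketch; the `node-operators` group name is an assumption and should be replaced with a group from your identity provider:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-maintainer-binding
subjects:
# Hypothetical operator group; adjust to your identity provider.
- kind: Group
  name: node-operators
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: node-maintainer
  apiGroup: rbac.authorization.k8s.io
```

Because the ClusterRole omits the delete verb, members of this group can cordon and label nodes but cannot remove Node objects.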

3. Full Node Decommission Process (Complete Version)

Planning → Lock → Evict → Verify → Delete → Clean

3.1 Capacity and Load Assessment

kubectl top nodes
kubectl top pods -A
kubectl describe nodes | grep -A5 Allocatable

Check the following resources:

CPU

Memory

GPU / local disk

Whether Horizontal Pod Autoscaler (HPA) would trigger scaling
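As a rough sketch of the headroom check above, the helper below (an illustration, not part of kubectl) reads "name cpu%" lines, for example from `kubectl top nodes` piped through `awk '{print $1, $3}'`, and projects the average CPU on the remaining nodes if one node's load is redistributed. The 80% ceiling is an assumed safety margin:

```shell
# Hypothetical helper: can we drain $1 without overloading the rest?
# stdin: "node-name cpu%" lines; exits non-zero if projected avg > 80%.
can_drain() {
  awk -v drop="$1" '
    { gsub(/%/, "", $2) }                 # strip the % sign
    $1 == drop { moved = $2; next }       # load that must be absorbed
    { total += $2; n++ }                  # load on remaining nodes
    END {
      if (n == 0) { print "no remaining nodes"; exit 1 }
      avg = (total + moved) / n
      printf "projected avg CPU after drain: %.1f%%\n", avg
      exit (avg > 80 ? 1 : 0)
    }'
}
```

Usage (hedged, assuming `kubectl top nodes` column layout): `kubectl top nodes --no-headers | awk '{print $1, $3}' | can_drain node-1`.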

3.2 Lock the Node (Prevent New Pods)

kubectl cordon node-1
kubectl label node node-1 lifecycle=draining

3.3 Evict Pods (Recommended Parameters)

kubectl drain node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=10m

Parameter meanings:

--ignore-daemonsets : skip eviction of DaemonSet Pods (system Pods).

--delete-emptydir-data : allow deletion of temporary emptyDir data.

--grace-period=60 : give workloads 60 seconds to shut down gracefully.

--timeout=10m : abort the drain if it does not finish within 10 minutes.

Drain respects PodDisruptionBudget, which is the primary safety valve for high availability.
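For reference, a minimal PodDisruptionBudget might look like the manifest below; the `web` app name, label, and budget are illustrative assumptions for a 3-replica Deployment:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  # With 3 replicas, at most one Pod may be evicted at a time.
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```

During a drain, evictions that would drop `web` below 2 ready replicas are refused and retried, which is what keeps the service available.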

3.4 Verify Eviction Completion

kubectl get pods -A --field-selector spec.nodeName=node-1

Only DaemonSet-managed Pods (typically from kube-system) should remain.
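This check can be scripted. The filter below is a sketch, not a kubectl feature: it reads "namespace pod owner-kind" lines, prints any Pod not owned by a DaemonSet, and exits non-zero if one is found:

```shell
# Hypothetical gate before node deletion: fail if any non-DaemonSet Pod
# is still on the node. stdin: "namespace pod-name owner-kind" lines,
# e.g. from:
#   kubectl get pods -A --field-selector spec.nodeName=node-1 \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind \
#     --no-headers
non_daemonset_pods() {
  awk '$3 != "DaemonSet" { print $1 "/" $2; found = 1 }
       END { exit found ? 1 : 0 }'
}
```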

3.5 Delete the Node Object

# Label first: once the Node object is deleted, it can no longer be labeled.
kubectl label node node-1 lifecycle=retired --overwrite
kubectl delete node node-1

3.6 Physical/Cloud Resource Cleanup

Release the VM on the cloud provider.

Power off and recycle the physical machine.

Remove residual containers, mount points, and CNI bridges.
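The in-cluster steps of sections 3.2 through 3.5 can be sketched as one wrapper script. The `DRY_RUN` mode is an assumption added here so the command sequence can be reviewed before touching a cluster; this is an illustration of the SOP, not an official tool:

```shell
# Hypothetical decommission wrapper. DRY_RUN=1 only prints the commands.
decommission_node() {
  node=$1
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }
  run kubectl cordon "$node"
  run kubectl label node "$node" lifecycle=draining --overwrite
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
      --grace-period=60 --timeout=10m
  # Label before delete: a deleted Node object cannot be labeled.
  run kubectl label node "$node" lifecycle=retired --overwrite
  run kubectl delete node "$node"
}
```

Usage: `DRY_RUN=1 decommission_node node-1` to review, then rerun without `DRY_RUN` during the maintenance window. A production version should also run the eviction verification of section 3.4 between drain and delete.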

4. StatefulSet & Local Storage Special Process (High‑Risk)

StatefulSet + Network Storage : drain is allowed.

StatefulSet + Local PV : drain is prohibited.

Database node : must switch primary role before draining.

Kafka / Elasticsearch / Zookeeper : must migrate leader or shard before draining.

Example: Kafka Broker Decommission

# 1. Reassign partitions at the Kafka layer
kafka-reassign-partitions.sh ...

# 2. Verify the broker holds no leaders
kafka-topics.sh --describe

# 3. Perform the Kubernetes decommission steps
kubectl cordon node-kafka-1
kubectl drain node-kafka-1 ...
Skipping the Kafka-level migration carries a high risk of data loss: partitions whose replicas live only on the decommissioned broker are lost with it.

5. Node Addition Standard Process

Prepare → Join → Verify → Enable

5.1 System Preparation

Consistent OS version across the cluster.

Disable swap.

Open required network ports (e.g., 6443 for the API server and 10250 for the kubelet; the read-only kubelet port 10255 is deprecated and usually disabled).

Install containerd, kubelet, and kubeadm.

5.2 Join the Cluster

sudo kubeadm join 10.0.0.10:6443 \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:xxxx
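A quick pre-join sanity check can catch copy-paste errors before kubeadm does. The helper below is an assumption, not part of kubeadm; it only validates the documented bootstrap token format (`[a-z0-9]{6}.[a-z0-9]{16}`):

```shell
# Hypothetical pre-flight check: does $1 look like a kubeadm bootstrap
# token ("abcdef.0123456789abcdef" shape)?
valid_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}
```

Usage: `valid_token "$TOKEN" || echo "malformed token"`. On the control plane, `kubeadm token create --print-join-command` regenerates a complete, correctly formatted join command.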

5.3 Readiness Checks

kubectl get nodes
kubectl describe node node-new

Required node conditions (all must be as shown):

Ready : True

DiskPressure : False

MemoryPressure : False

PIDPressure : False

NetworkUnavailable : False
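The condition checklist above can be gated in a script. The filter below is a sketch: it reads "ConditionType Status" lines (for example, parsed out of `kubectl describe node`) and fails unless Ready is True and every pressure/network condition is False:

```shell
# Hypothetical readiness gate for a new node.
# stdin: "ConditionType Status" lines; exit 0 only if all are healthy.
node_healthy() {
  awk '
    $1 == "Ready" && $2 != "True" { bad = 1 }
    ($1 ~ /Pressure$/ || $1 == "NetworkUnavailable") && $2 != "False" { bad = 1 }
    END { exit bad ? 1 : 0 }'
}
```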

5.4 Core Component Verification

kubectl get pods -n kube-system -o wide | grep node-new

Ensure that the CNI plugin and kube-proxy are running on the new node.

5.5 Real‑World Business Validation

kubectl run test --image=nginx --restart=Never
kubectl get pod test -o wide
# Optionally pin the test Pod to the new node (nodeName bypasses the scheduler):
# kubectl run test-pinned --image=nginx --restart=Never \
#   --overrides='{"apiVersion":"v1","spec":{"nodeName":"node-new"}}'

Pod is scheduled successfully.

Associated Service is reachable.

6. Maintenance Window & Node Disruption Budget

Simultaneous node decommission should not exceed 10% of the total nodes.

Core nodes (e.g., control‑plane or etcd) must be decommissioned one at a time.

Interval between two drain operations should be at least 5 minutes to allow the cluster to stabilize.

Node maintenance is a release; it requires a defined window, canary testing, and a rollback plan.
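The 10% rule above is easy to encode. The helper below is an illustrative sketch that caps concurrent drains at one tenth of the cluster (never less than one node):

```shell
# Hypothetical cap on concurrent drains: floor(total / 10), minimum 1.
max_concurrent_drains() {
  total=$1
  n=$((total / 10))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}
```

For example, a 35-node cluster allows 3 nodes in drain at once, while clusters under 10 nodes allow exactly one.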

7. Observability & Rollback

Operational monitoring commands (run during a drain):

kubectl get pods -A | grep Pending
kubectl get events -A --sort-by=.lastTimestamp
kubectl get hpa -A

Business‑side metrics to watch:

QPS

Error rate

Latency

Restart count

If an anomaly is detected, stop the drain, analyze scheduling failures, and restore capacity:

# Example manual rollback steps
kubectl uncordon node-1   # allow scheduling again
# Re‑schedule pending Pods or scale up additional nodes as needed

8. Real Incident Cases

Case 1: Direct delete node

All Pods disappear instantly.

Service outage lasts ~30 seconds.

Data loss occurs.

Root cause: missing cordon and drain steps.

Case 2: Drain a Node with Local PV

MySQL data directory is lost.

Data is irrecoverable.

Root cause: treating a node with local storage as stateless.

Case 3: Overly Strict PodDisruptionBudget

# PodDisruptionBudget
minAvailable: 3
# Deployment
replicas: 3

All nodes become unmaintainable because no pod can be evicted.

Fix by reducing minAvailable to a value that still preserves quorum, e.g., 2:

minAvailable: 2

9. Automation Directions (Advanced)

Cluster Autoscaler

Karpenter

Node pools with built‑in disruption budgets

Platform‑level SOP buttonization for one‑click decommission/provision

10. Final Summary Rules

Decommission: Lock → Evict → Demolish → Clean foundation
Provision: Lay foundation → Connect utilities → Verify safety → Allow occupancy
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: SOP, RBAC, StatefulSet, Node Maintenance, Production Ops
Written by Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!