Safely Shut Down and Restart Your Kubernetes Cluster
This guide walks you through the essential steps, precautions, and commands needed to safely drain nodes, back up critical resources, shut down a Kubernetes cluster, and reliably bring it back online while avoiding common pitfalls.
Introduction
When maintaining a Kubernetes cluster, you may need to temporarily shut down or restart it for maintenance. This article explains how to safely shut down a K8s cluster and how to bring it back up.
Routine Node Maintenance
Shutting down a K8s cluster is risky, and you must understand the consequences before you do it. Back up your applications, custom resource definitions (CRDs), and etcd first, and only then proceed with the shutdown or restart. In most cases it is better to drain just the node that needs maintenance rather than shutting down the whole cluster; the drain procedure is shown below.
First, identify the node you want to take offline. List all nodes with:
$ kubectl get nodes
Then tell Kubernetes to drain the node:
$ kubectl drain <node name>
Once the command returns without error, you can safely take the node offline (or delete the VM on the cloud platform). If the node stays in the cluster during maintenance, run the following after the maintenance is finished:
$ kubectl uncordon <node name>
This tells Kubernetes that it can resume scheduling new Pods onto the node.
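In practice, a bare drain often fails because of DaemonSet-managed Pods (which cannot be evicted) or Pods that use emptyDir volumes. Here is a minimal sketch using the kubectl flags for those two cases; the node name mars-k8s2 is taken from the example output later in this article:
# Skip DaemonSet Pods and allow eviction of Pods with emptyDir volumes
# (their local data is deleted when the Pod leaves the node).
$ kubectl drain mars-k8s2 --ignore-daemonsets --delete-emptydir-data
# Verify the node is cordoned: STATUS should show Ready,SchedulingDisabled.
$ kubectl get node mars-k8s2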
Preparation Before Shutting Down the Cluster
Backup is the most critical preparation step, because it is what lets you restore your applications afterwards. Create a checklist and verify each item before proceeding; example commands for the items follow the list.
SSH password‑less login is configured between hosts
Application data is backed up
Custom resource definitions (CRDs) are backed up
Etcd data is backed up
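A minimal sketch of how the items above might be verified or produced. The host name, file paths, and certificate locations are assumptions for a kubeadm-style cluster; adjust them to your environment. Application data backup is application-specific (database dumps, volume snapshots) and is not shown here.
# Distribute an SSH key so password-less login works (assumed user and host).
$ ssh-copy-id root@mars-k8s2
# Export all custom resource definitions to a file.
$ kubectl get crd -o yaml > crd-backup.yaml
# Snapshot etcd; the certificate paths are kubeadm defaults.
$ ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot.db \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
# Confirm the snapshot is readable.
$ ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot.db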
Shutting Down the Kubernetes Cluster
Before shutting down, follow the recommended backup steps so you can restore the cluster and applications if any issues arise. The method described here can shut down the cluster smoothly, but data corruption is still possible.
First, obtain the list of node names:
k8snodes=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')
Then shut down the nodes one by one, or run the following script to shut them down automatically:
for node in $k8snodes
do
  echo "==== Shut down $node ===="
  ssh $node sudo shutdown -h 1
done
Note: SSH password-less login must be set up between hosts.
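The loop above makes no distinction between node roles. A safer variant, sketched below for a kubeadm-labeled cluster (the control-plane nodes carry the node-role.kubernetes.io/control-plane label, as in the example output later in this article), powers off the workers first and the control plane last, so etcd and the API server stay up until the end; the two-minute pause is an arbitrary assumption:
# Workers: every node without the control-plane role label.
workers=$(kubectl get nodes --no-headers \
    -l '!node-role.kubernetes.io/control-plane' \
    -o custom-columns=NAME:.metadata.name)
# Control-plane nodes: shut these down last.
masters=$(kubectl get nodes --no-headers \
    -l 'node-role.kubernetes.io/control-plane' \
    -o custom-columns=NAME:.metadata.name)
for node in $workers
do
  echo "==== Shut down worker $node ===="
  ssh $node sudo shutdown -h 1
done
# Give the workers time to power off before stopping the control plane.
sleep 120
for node in $masters
do
  echo "==== Shut down control-plane node $node ===="
  ssh $node sudo shutdown -h 1
done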
After shutting down the nodes, you can proceed with other cluster‑dependent maintenance tasks.
Restarting the Kubernetes Cluster
After a restart, verify the status of all nodes and core components to ensure everything is ready.
$ kubectl get nodes -o wide
NAME        STATUS   ROLES                  AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
mars-k8s1   Ready    control-plane,master   17d   v1.21.0   172.16.60.60   <none>        Ubuntu 20.04.1 LTS   5.11.0-40-generic   docker://20.10.10
mars-k8s2   Ready    <none>                 17d   v1.21.0   172.16.60.61   <none>        Ubuntu 20.04.1 LTS   5.11.0-40-generic   docker://20.10.10
mars-k8s3   Ready    <none>                 17d   v1.21.0   172.16.60.62   <none>        Ubuntu 20.04.1 LTS   5.11.0-40-generic   docker://20.10.10
$ kubectl get svc -n kube-system
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns         ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   17d
metrics-server   ClusterIP   10.111.227.248   <none>        443/TCP                  17d
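Seeing kube-dns listed does not prove that name resolution works end to end. A quick check runs a throwaway busybox Pod (the image tag is an assumption) and resolves the kubernetes service:
$ kubectl run dnstest --rm -it --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default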
$ kubectl get pod -n kube-system
NAME                                READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-h7jqc            1/1     Running   2          17d
coredns-558bd4d5db-wj4bn            1/1     Running   2          17d
etcd-mars-k8s1                      1/1     Running   2          17d
kube-apiserver-mars-k8s1            1/1     Running   3          17d
kube-controller-manager-mars-k8s1   1/1     Running   2          17d
kube-flannel-ds-677dg               1/1     Running   2          17d
kube-flannel-ds-bxhx6               1/1     Running   3          17d
kube-flannel-ds-r5pqf               1/1     Running   2          17d
kube-proxy-6w52h                    1/1     Running   2          17d
kube-proxy-p8zfp                    1/1     Running   2          17d
kube-proxy-v8t7j                    1/1     Running   2          17d
kube-scheduler-mars-k8s1            1/1     Running   2          17d
metrics-server-5f9459b95c-dtzbf     1/1     Running   2          17d
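If a node is slow to come back after power-on, a small wait loop can hold the verification until everything is Ready; this is a minimal sketch, not part of the original procedure:
# Block until every node reports the Ready condition; retry if the wait times out.
until kubectl wait --for=condition=Ready nodes --all --timeout=60s
do
  echo "Not all nodes are Ready yet, retrying..."
  sleep 10
done
You can also ask the API server for its own health-check report:
$ kubectl get --raw='/readyz?verbose'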
Kubernetes Cluster Restart Pitfalls Guide
Operations always involve an element of luck; I have handled data disaster recovery for clients across multiple regions. Always back up, and keep more than one copy whenever possible.
Even though many clusters restart without issue, unexpected problems can render a cluster unusable. Common failure scenarios include:
Etcd data corruption or node failure during shutdown, especially on bare‑metal nodes.
Network errors requiring thorough checks of all cluster dependencies with monitoring tools.
Application issues where the cluster is up but services are not reachable, necessitating a restore from backup to meet the recovery time objective (RTO); an etcd restore sketch follows this list.
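If etcd itself is the damaged component, the snapshot taken during preparation can be restored. A minimal sketch, reusing the snapshot path assumed earlier in this article and the kubeadm default manifest location; on a real cluster, consult the etcd documentation before restoring, especially for multi-member clusters:
# Restore the snapshot into a fresh data directory (assumed path).
$ ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-snapshot.db \
    --data-dir=/var/lib/etcd-restored
# Point etcd at the restored directory, e.g. by editing the static Pod
# manifest /etc/kubernetes/manifests/etcd.yaml, then restart the kubelet.
$ sudo systemctl restart kubelet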
Source: https://zhuanlan.zhihu.com/p/581228732
