
Safely Shut Down and Restart Your Kubernetes Cluster

This guide walks you through the essential steps, precautions, and commands needed to safely drain nodes, back up critical resources, shut down a Kubernetes cluster, and reliably bring it back online while avoiding common pitfalls.

MaGe Linux Operations

Introduction

When maintaining a Kubernetes cluster, you may need to temporarily shut down or restart it for maintenance. This article explains how to safely shut down a K8s cluster and how to bring it back up.

Routine Node Maintenance

Shutting down a K8s cluster is risky; you must understand the consequences. First back up applications, custom resources (CRDs), and etcd, then proceed with shutdown or restart. In most cases, it is recommended to drain a maintenance node instead of restarting the whole cluster. The drain command is provided below.

First, identify the node you want to take offline. List all nodes:

$ kubectl get nodes

Then tell Kubernetes to drain the node, which evicts its Pods and marks it unschedulable:

$ kubectl drain <node-name>

If the command returns without error, you can take the node offline (or delete the VM on the cloud platform). Once maintenance is done and the node has rejoined the cluster, make it schedulable again:

$ kubectl uncordon <node-name>

After that, Kubernetes will resume scheduling new Pods on the node.
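In practice, a bare `kubectl drain` often refuses to proceed when the node runs DaemonSet-managed Pods or Pods with emptyDir volumes. A commonly used invocation looks like the following sketch (the node name is a placeholder; flags are standard kubectl drain options):

```shell
# Evict all Pods from the node and mark it unschedulable.
# --ignore-daemonsets: DaemonSet Pods cannot be evicted and would otherwise block the drain
# --delete-emptydir-data: allow evicting Pods that use emptyDir (their local data is lost)
# --timeout: give up rather than hang forever on a stuck eviction
kubectl drain mars-k8s2 --ignore-daemonsets --delete-emptydir-data --timeout=120s

# After maintenance, make the node schedulable again:
kubectl uncordon mars-k8s2
```

These commands require access to a live cluster, so treat them as a template rather than something to paste blindly.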

Preparation Before Shutting Down the Cluster

Backup is the most critical preparation step to ensure applications can be restored. Create a checklist and verify each item before proceeding.

SSH password‑less login is configured between hosts

Application data is backed up

Custom resource definitions (CRDs) are backed up

Etcd data is backed up
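As a concrete sketch of the last two checklist items, the following shows one common way to export CRDs and snapshot etcd. The file paths, the endpoint, and the certificate locations are assumptions based on kubeadm defaults; adjust them for your environment:

```shell
# Export all CustomResourceDefinitions to a YAML file
kubectl get crds -o yaml > crds-backup.yaml

# Snapshot etcd (kubeadm-default certificate paths assumed)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable before relying on it
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db
```

Note that this backs up the CRD schemas, not the custom resource instances themselves; export those per resource type if your applications depend on them.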

Shutting Down the Kubernetes Cluster

Before shutting down, follow the recommended backup steps so you can restore the cluster and applications if any issues arise. The method described here can shut down the cluster smoothly, but data corruption is still possible.

First, obtain the list of node names. Note that `kubectl get nodes -o name` prefixes each entry with `node/`, which would break the ssh call below, so use jsonpath instead:

k8snodes=$(kubectl get nodes -o jsonpath='{.items[*].metadata.name}')

Then shut down the nodes one by one, or run the following script to shut them down automatically:

for node in $k8snodes
do
    echo "==== Shutting down $node ===="
    ssh "$node" sudo shutdown -h 1
done
Note: SSH password‑less login must be set up between hosts, and each node's hostname must be resolvable from the machine running the script.
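The loop above shuts nodes down in whatever order the API server returned them. A safer convention is workers first, control plane last, so the control plane observes the workers going away cleanly. The dry run below prints the commands in that order; the node names are hypothetical, and in practice you would derive the two lists from node labels (e.g. `kubectl get nodes -l 'node-role.kubernetes.io/control-plane'`):

```shell
# Dry run: print shutdown commands, workers first, control plane last.
# Node names are placeholders taken from the example cluster in this article.
workers="mars-k8s2 mars-k8s3"
control_plane="mars-k8s1"

for node in $workers $control_plane; do
    # Replace echo with the real ssh invocation once the order looks right
    echo "ssh $node sudo shutdown -h 1"
done
```

Printing the commands first is cheap insurance: a wrong node list is much easier to spot on the terminal than after the machines have powered off.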

After shutting down the nodes, you can proceed with other cluster‑dependent maintenance tasks.

Restarting the Kubernetes Cluster

After a restart, verify the status of all nodes and core components to ensure everything is ready.

$ kubectl get nodes -o wide
NAME        STATUS   ROLES                     AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE            KERNEL-VERSION      CONTAINER-RUNTIME
mars-k8s1   Ready    control-plane,master      17d   v1.21.0   172.16.60.60   <none>        Ubuntu 20.04.1 LTS  5.11.0-40-generic    docker://20.10.10
mars-k8s2   Ready    <none>                    17d   v1.21.0   172.16.60.61   <none>        Ubuntu 20.04.1 LTS  5.11.0-40-generic    docker://20.10.10
mars-k8s3   Ready    <none>                    17d   v1.21.0   172.16.60.62   <none>        Ubuntu 20.04.1 LTS  5.11.0-40-generic    docker://20.10.10

$ kubectl get svc -n kube-system
NAME             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
kube-dns         ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   17d
metrics-server   ClusterIP   10.111.227.248   <none>        443/TCP                  17d

$ kubectl get pod -n kube-system
NAME                                READY   STATUS    RESTARTS   AGE
coredns-558bd4d5db-h7jqc            1/1     Running   2          17d
coredns-558bd4d5db-wj4bn            1/1     Running   2          17d
etcd-mars-k8s1                      1/1     Running   2          17d
kube-apiserver-mars-k8s1            1/1     Running   3          17d
kube-controller-manager-mars-k8s1   1/1     Running   2          17d
kube-flannel-ds-677dg               1/1     Running   2          17d
kube-flannel-ds-bxhx6               1/1     Running   3          17d
kube-flannel-ds-r5pqf               1/1     Running   2          17d
kube-proxy-6w52h                    1/1     Running   2          17d
kube-proxy-p8zfp                    1/1     Running   2          17d
kube-proxy-v8t7j                    1/1     Running   2          17d
kube-scheduler-mars-k8s1            1/1     Running   2          17d
metrics-server-5f9459b95c-dtzbf     1/1     Running   2          17d
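Components can take a few minutes to settle after power-on, so rather than eyeballing the output once, a small polling loop (assuming kubectl access from your workstation) can confirm every node is Ready before you declare the restart done:

```shell
# Poll until every node reports Ready (STATUS is column 2 of `kubectl get nodes`).
# If the API server is still coming up, total stays 0 and the loop keeps waiting.
while true; do
    total=$(kubectl get nodes --no-headers 2>/dev/null | wc -l)
    not_ready=$(kubectl get nodes --no-headers 2>/dev/null | awk '$2 != "Ready"' | wc -l)
    if [ "$total" -gt 0 ] && [ "$not_ready" -eq 0 ]; then
        echo "all $total nodes Ready"
        break
    fi
    echo "waiting: $not_ready of $total nodes not Ready yet..."
    sleep 5
done
```

This only checks node readiness; follow it with the `kubectl get pod -n kube-system` check above to confirm the core components themselves are running.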

Kubernetes Cluster Restart Pitfalls Guide

Operations work always involves an element of luck; I have supported data disaster recovery for clients across multiple regions. Always back up, and keep more than one copy if possible.

Even though many clusters restart without issue, unexpected problems can render a cluster unusable. Common failure scenarios include:

Etcd data corruption or node failure during shutdown, especially on bare‑metal nodes.

Network errors requiring thorough checks of all cluster dependencies with monitoring tools.

Application issues where the cluster itself comes up but services are not reachable; restoring from backup may be the only way to meet your recovery time objective (RTO).
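For the first failure scenario, a snapshot taken during preparation is what saves you. The following is a hedged sketch of restoring etcd on a kubeadm-style single-member cluster; the snapshot path, data directories, and manifest location are assumptions based on kubeadm defaults:

```shell
# Stop the static-pod control plane by moving its manifests aside
# (kubelet removes the Pods when the manifests disappear)
sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.bak

# Restore the snapshot into a fresh data directory
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restored

# Swap the restored data into place, keeping the corrupt copy for forensics
sudo mv /var/lib/etcd /var/lib/etcd.bak
sudo mv /var/lib/etcd-restored /var/lib/etcd

# Bring the control plane back
sudo mv /etc/kubernetes/manifests.bak /etc/kubernetes/manifests
```

Multi-member etcd clusters need extra restore flags for the member name and peer URLs, so consult your distribution's disaster-recovery documentation before attempting this on a production HA cluster.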

Source: https://zhuanlan.zhihu.com/p/581228732

