How to Keep Your Kubernetes Nodes and Pods Stable: Essential Ops Practices
This guide walks through essential Kubernetes operations—from node kernel upgrades and Docker daemon tuning to pod resource limits, scheduling policies, health probes, logging standards, and comprehensive monitoring—providing practical commands and configurations to keep clusters stable and observable.
As Kubernetes matures, more companies deploy applications on it, but deployment is only the first step; ensuring stable, reliable operation of nodes and pods is essential.
Node
Nodes, whether physical machines or cloud hosts, are the platform Kubernetes runs on. At this layer the operator's main job is to prevent node-level failures before they affect workloads.
Key node maintenance tasks include kernel upgrades, software updates, Docker daemon configuration, kubelet parameter tuning, system log management, and security hardening.
Kernel Upgrade
CentOS 7 ships with kernel 3.10, which has known bugs that affect Kubernetes workloads. Upgrade to a newer long-term kernel (e.g., 5.4.86 from ELRepo) or switch to a distribution with a newer kernel, such as Ubuntu.
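<code># Download and install the long-term kernel package from ELRepo
wget https://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
# List the boot entries and confirm the new kernel is present
grep menuentry /boot/grub2/grub.cfg
# Make the new kernel the default boot entry, then verify the setting
grub2-set-default 'CentOS Linux (5.4.86-1.el7.elrepo.x86_64) 7 (Core)'
grub2-editenv list
# Regenerate the GRUB configuration and reboot into the new kernel
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
</code>
Software Update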
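Patch packages that have high-severity vulnerabilities, and verify compatibility with the Kubernetes and Docker versions in use before rolling the update out to every node.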
Docker Daemon Configuration
<code>cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "10"
  },
  "bip": "169.254.123.1/24",
  "oom-score-adjust": -1000,
  "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"],
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "live-restore": true
}
EOF
</code>
Kubelet Parameter Optimization
<code>cat > /etc/systemd/system/kubelet.service <<EOF
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpu/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuacct/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/systemd/system.slice/kubelet.service
ExecStart=/usr/bin/kubelet \
--enforce-node-allocatable=pods,kube-reserved \
--kube-reserved-cgroup=/system.slice/kubelet.service \
--kube-reserved=cpu=200m,memory=250Mi \
--eviction-hard=memory.available<5%,nodefs.available<10%,imagefs.available<10% \
--eviction-soft=memory.available<10%,nodefs.available<15%,imagefs.available<15% \
--eviction-soft-grace-period=memory.available=2m,nodefs.available=2m,imagefs.available=2m \
--eviction-max-pod-grace-period=30 \
--eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi,imagefs.available=500Mi
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
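</code>
These flags reserve CPU and memory for Kubernetes system daemons such as the kubelet, and they set eviction thresholds so that pods are evicted before the node runs out of memory or disk space. Together they reduce the risk of node crashes.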
Log Management
System logs should be centrally backed up (e.g., via rsyslog) to enable forensic analysis after incidents.
Security Hardening
SSH password expiration policy
Password complexity policy
SSH login attempt limits
System idle timeout
History record configuration
Pod
Pods are the smallest schedulable unit in Kubernetes, so their stability directly affects the applications they run. Key considerations include resource limits, scheduling policies, graceful upgrades, probes, protection strategies, logging standards, log collection and analysis, and alerting.
Resource Limits
Define limits and requests to avoid resource over‑commitment. Use Guaranteed QoS for critical workloads and Burstable for typical workloads.
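<code>resources:
  limits:
    memory: "200Mi"
    cpu: "700m"
  requests:
    memory: "200Mi"
    cpu: "700m"
</code>
Because requests equal limits, the example above falls into the Guaranteed class. Do not run workloads as BestEffort (no requests or limits set): such pods are among the first to be evicted when a node comes under resource pressure.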
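One way to keep pods out of BestEffort, sketched here under assumptions, is a namespace-level LimitRange that fills in default requests and limits for containers that declare none. The object name, the `demo` namespace, and all numeric values are placeholders rather than recommendations from the original setup.

<code>apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources   # hypothetical name
  namespace: demo           # hypothetical namespace
spec:
  limits:
  - type: Container
    # Applied as requests when a container sets none
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    # Applied as limits when a container sets none
    default:
      cpu: "500m"
      memory: "256Mi"
</code>

Containers that rely on these defaults end up Burstable, since the default requests are lower than the default limits.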
Scheduling Policies
Node affinity, taints & tolerations, and pod anti‑affinity ensure pods run on appropriate nodes and improve high availability.
<code>affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference: {}
      weight: 100
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: env
          operator: In
          values:
          - uat
</code>
<code>tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
</code>
<code>affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - store
      topologyKey: "kubernetes.io/hostname"
</code>
Graceful Upgrade
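Use a preStop hook to delay container termination so that in-flight requests can drain before the process is stopped. If the application registers itself in a service registry such as Nacos, deregister it in the same hook.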
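A hook that deregisters the pod by its own address needs that address in the container environment, and the pod's termination grace period has to be longer than the hook's sleep. The fragment below is a sketch of one way to arrange this, assuming the hook reads a `POD_IP` variable: it injects the variable through the downward API and sets `terminationGracePeriodSeconds`. The container name, image, and the 45-second value are illustrative assumptions.

<code>spec:
  # Must exceed the preStop duration plus the application's own shutdown time
  terminationGracePeriodSeconds: 45        # assumed value
  containers:
  - name: app                              # hypothetical container name
    image: registry.example.com/app:1.0    # hypothetical image
    env:
    # Expose the pod IP so a preStop command can reference ${POD_IP}
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
</code>

The preStop hook itself is declared on the same container, next to `env`.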
<code>lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 15
</code>
<code>lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      # The URL is double-quoted for the shell so that "&" is not treated as a
      # background operator and ${POD_IP} is still expanded.
      - 'curl -X DELETE "your_nacos_ip:8848/nacos/v1/ns/instance?serviceName=nacos.test.1&ip=${POD_IP}&port=8880&clusterName=DEFAULT" && sleep 15'
</code>
Probes
Configure liveness, readiness, and (optionally) startup probes to let kubelet assess pod health.
<code>readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
</code>
<code>startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 10
  initialDelaySeconds: 10
  periodSeconds: 10
</code>
Protection Strategy
Use a PodDisruptionBudget to limit how many pods of an application can be unavailable at once during voluntary disruptions such as node drains.
<code># policy/v1 is available since Kubernetes 1.21; policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
</code>
Logging
Standardize log levels, formats, encoding, and output paths. Collect logs via a node‑side logging agent for stdout logs or a sidecar container for non‑stdout logs.
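For the sidecar case, a minimal sketch could look like the pod below: the application writes to a file on a shared `emptyDir` volume, and a small sidecar streams that file to its own stdout so the node-side agent can collect it like any other container log. The pod name, application image, and log path are assumptions made for illustration.

<code>apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar               # hypothetical name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0    # hypothetical image, assumed to write /var/log/app/app.log
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  # Sidecar streams the log file to stdout, where the node agent picks it up
  - name: log-tailer
    image: busybox:1.36
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  # Shared scratch volume; its contents are lost when the pod is deleted
  - name: app-logs
    emptyDir: {}
</code>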
Alerting
Define precise log keywords to avoid noisy alerts and ensure alerts correspond to actionable issues.
Monitoring
Effective monitoring provides observability for clusters and applications, enabling rapid issue detection and resolution.
Cluster Monitoring
Prometheus is commonly used to monitor Kubernetes clusters. Track key metrics such as node CPU and memory usage and pod status.
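As a rough illustration, a Prometheus rule file covering node memory and pod status might look like the sketch below. It assumes node_exporter and kube-state-metrics are already being scraped. The group name, alert names, and thresholds are placeholders to tune per cluster.

<code>groups:
- name: cluster-basics                 # hypothetical group name
  rules:
  # Node memory: less than 10% available for 5 minutes (node_exporter metrics)
  - alert: NodeMemoryLow
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.instance }} has less than 10% memory available"
  # Pod status: a container restarted more than 3 times in 15 minutes (kube-state-metrics)
  - alert: PodRestartingOften
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
</code>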
Application Monitoring
Expose application metrics in Prometheus format, optionally using Java agents to export JVM metrics.
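For a Java service, one possible arrangement is to load the Prometheus JMX exporter as a Java agent and advertise the metrics port on the pod. The Deployment fragment below is only a sketch: the jar and config paths, the port 9404, and the image are assumptions, and the `prometheus.io/*` annotations are a convention that works only when the Prometheus scrape configuration is set up to relabel on them.

<code>spec:
  template:
    metadata:
      annotations:
        # Honored only if the Prometheus scrape config relabels on these annotations
        prometheus.io/scrape: "true"
        prometheus.io/port: "9404"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: app                              # hypothetical container name
        image: registry.example.com/app:1.0    # hypothetical image that bundles the agent jar
        env:
        # Load the JMX exporter agent; the format is <jar>=<port>:<config file>
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/opt/jmx_prometheus_javaagent.jar=9404:/opt/jmx-config.yaml"
        ports:
        - name: metrics
          containerPort: 9404
</code>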
Event Monitoring
Track Warning and Normal events; tools like kube‑eventer can forward events to notification channels.
Link Monitoring
Use distributed tracing tools (e.g., SkyWalking) to visualize inter‑service call chains.
Alert Notification
Select unique, problem‑reflecting metrics for alerts, classify severity, and choose appropriate notification channels.
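Severity-based routing can be sketched with an Alertmanager configuration like the one below, assuming alerts carry a `severity` label and Alertmanager 0.22 or later for the `matchers` syntax. The receiver names and webhook URLs are hypothetical.

<code>route:
  receiver: team-chat                  # default receiver for anything not matched below
  group_by: [alertname, namespace]
  routes:
  # Critical alerts go to the on-call channel and repeat hourly until resolved
  - matchers:
    - severity="critical"
    receiver: oncall-pager
    repeat_interval: 1h
receivers:
- name: team-chat
  webhook_configs:
  - url: http://chat-bridge.example.internal/alerts     # hypothetical endpoint
- name: oncall-pager
  webhook_configs:
  - url: http://pager-bridge.example.internal/alerts    # hypothetical endpoint
</code>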
Conclusion
The practices described constitute essential skills for YAML engineers and are applicable in most enterprises.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.