
How to Keep Your Kubernetes Nodes and Pods Stable: Essential Ops Practices

This guide walks through essential Kubernetes operations—from node kernel upgrades and Docker daemon tuning to pod resource limits, scheduling policies, health probes, logging standards, and comprehensive monitoring—providing practical commands and configurations to keep clusters stable and observable.

Efficient Ops

As Kubernetes matures, more companies deploy applications on it, but deployment is only the first step; ensuring stable, reliable operation of nodes and pods is essential.

Node

Nodes can be physical machines or cloud hosts and are the machines on which Kubernetes runs its workloads. The operator's goal is to prevent node-level anomalies before they affect the pods running there.

Key node maintenance tasks include kernel upgrades, software updates, Docker daemon configuration, kubelet parameter tuning, system log management, and security hardening.

Kernel Upgrade

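CentOS 7 ships kernel 3.10, which has known issues with container workloads (cgroup memory accounting leaks are a commonly cited example). Upgrading to a newer long-term kernel (e.g., 5.4.86 from ELRepo) or switching to a distribution with a newer kernel, such as Ubuntu, is recommended.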

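<code># Download and install the long-term (kernel-lt) 5.4.86 kernel from ELRepo.
wget https://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
# List the boot entries and confirm the new kernel is present.
grep menuentry /boot/grub2/grub.cfg
# Make the new kernel the default entry, then verify the saved default.
grub2-set-default 'CentOS Linux (5.4.86-1.el7.elrepo.x86_64) 7 (Core)'
grub2-editenv list
# Regenerate the GRUB configuration and reboot into the new kernel.
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
</code>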

Software Update

Patch packages with high-severity vulnerabilities promptly, but verify compatibility (for example, on a test node) before rolling the update out to every node.

Docker Daemon Configuration

<code>cat > /etc/docker/daemon.json <<EOF
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "10"
    },
    "bip": "169.254.123.1/24",
    "oom-score-adjust": -1000,
    "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"],
    "storage-driver": "overlay2",
    "storage-opts":["overlay2.override_kernel_check=true"],
    "live-restore": true
}
EOF
</code>
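
As a rough check on the log settings above, the following sketch estimates the worst-case disk space that json-file logs can occupy. The helper name and the figure of 50 containers per node are assumptions for illustration, and the size is kept in the unit written in the config rather than converted to bytes.

```python
import json

# Only the logging part of the daemon.json shown above.
DAEMON_JSON = '{"log-driver": "json-file", "log-opts": {"max-size": "100m", "max-file": "10"}}'


def max_log_size_per_container(config):
    """Return (amount, unit) of the worst-case log size for one container."""
    opts = config["log-opts"]
    size = opts["max-size"]                      # e.g. "100m"
    amount, unit = int(size[:-1]), size[-1]      # split number and unit suffix
    return amount * int(opts["max-file"]), unit  # max-size times rotated files


config = json.loads(DAEMON_JSON)
amount, unit = max_log_size_per_container(config)
containers_per_node = 50  # hypothetical figure, adjust to your nodes
print(f"per container: {amount}{unit}, per node: {amount * containers_per_node}{unit}")
```

With these settings each container can keep up to 10 files of 100m each, so the node's data disk should be sized with that upper bound in mind.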

Kubelet Parameter Optimization

<code>cat > /etc/systemd/system/kubelet.service <<'EOF'
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpu/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuacct/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/systemd/system.slice/kubelet.service
ExecStart=/usr/bin/kubelet \
  --enforce-node-allocatable=pods,kube-reserved \
  --kube-reserved-cgroup=/system.slice/kubelet.service \
  --kube-reserved=cpu=200m,memory=250Mi \
  --eviction-hard=memory.available<5%,nodefs.available<10%,imagefs.available<10% \
  --eviction-soft=memory.available<10%,nodefs.available<15%,imagefs.available<15% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m,imagefs.available=2m \
  --eviction-max-pod-grace-period=30 \
  --eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi,imagefs.available=500Mi
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF
</code>

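These flags reserve CPU and memory for the kubelet and define eviction thresholds, so that pods are evicted before the node itself runs out of resources and crashes.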
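
To see what the reservation means for pods, here is a minimal sketch of the node allocatable calculation (capacity minus reservations minus the hard eviction threshold). The node size of 8000 MiB is a hypothetical value; the other numbers come from the flags above, and the function name is illustrative.

```python
def allocatable_memory_mi(capacity_mi, kube_reserved_mi, eviction_hard_pct,
                          system_reserved_mi=0):
    """Memory (MiB) left for pods: capacity - reservations - hard eviction threshold."""
    # A percentage threshold such as memory.available<5% is taken from capacity.
    eviction_mi = capacity_mi * eviction_hard_pct / 100
    return capacity_mi - kube_reserved_mi - system_reserved_mi - eviction_mi


# Hypothetical node with 8000 MiB of memory, kube-reserved=250Mi, eviction-hard=5%.
print(allocatable_memory_mi(8000, kube_reserved_mi=250, eviction_hard_pct=5))
```

On such a node roughly 7350 MiB would remain schedulable for pods; the rest is held back for the kubelet and as an eviction buffer.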

Log Management

System logs should be centrally backed up (e.g., via rsyslog) to enable forensic analysis after incidents.

Security Hardening

Typical hardening measures include an SSH password expiration policy, a password complexity policy, SSH login attempt limits, a system idle timeout, and shell history record configuration.

Pod

Pods are the smallest scheduling unit; their stability directly impacts applications. Key considerations include resource limits, scheduling policies, graceful upgrades, probes, protection strategies, and log output, collection, analysis, and alerting.

Resource Limits

Define limits and requests to avoid resource over‑commitment. Use Guaranteed QoS for critical workloads and Burstable for typical workloads.

<code>resources:
  limits:
    memory: "200Mi"
    cpu: "700m"
  requests:
    memory: "200Mi"
    cpu: "700m"
</code>

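Do not use BestEffort (pods with no requests or limits), because such pods are the first to be evicted when a node comes under resource pressure.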
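
The three QoS classes follow from how requests and limits are set. The sketch below is a simplified classifier, not the kubelet's implementation: it compares quantities as plain strings (so "1" and "1000m" would count as different) and the function name is illustrative.

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits."""
    resources = ("cpu", "memory")
    any_set = False
    guaranteed = True
    for c in containers:
        limits = {k: v for k, v in c.get("limits", {}).items() if k in resources}
        requests = {k: v for k, v in c.get("requests", {}).items() if k in resources}
        # A missing request defaults to the limit of the same resource.
        requests = {**limits, **requests}
        if limits or requests:
            any_set = True
        for r in resources:
            # Guaranteed needs a limit for cpu and memory, with request == limit.
            if r not in limits or requests[r] != limits[r]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pod = [{"limits": {"memory": "200Mi", "cpu": "700m"},
        "requests": {"memory": "200Mi", "cpu": "700m"}}]
print(qos_class(pod))
```

The pod from the example above has equal requests and limits for both resources, so it is classified as Guaranteed.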

Scheduling Policies

Node affinity, taints & tolerations, and pod anti‑affinity ensure pods run on appropriate nodes and improve high availability.

<code>affinity:
  nodeAffinity:
    # Hard requirement: schedule only onto nodes labeled env=uat.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: env
              operator: In
              values:
                - uat
</code>
<code>tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
</code>
<code>affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - store
      topologyKey: "kubernetes.io/hostname"
</code>
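
How a toleration is matched against a node taint can be sketched as follows. This is a simplified illustration of the matching rules, not scheduler code, and the function name is an assumption.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists matches every taint key.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal: key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


toleration = {"key": "key1", "operator": "Equal", "value": "value1",
              "effect": "NoExecute", "tolerationSeconds": 3600}
taint = {"key": "key1", "value": "value1", "effect": "NoExecute"}
print(tolerates(toleration, taint))
```

With the toleration shown earlier, a pod stays on a node tainted `key1=value1:NoExecute` for up to `tolerationSeconds` (3600 seconds) before it is evicted.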

Graceful Upgrade

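Use a preStop hook to delay termination so that in-flight requests can drain and the pod is removed from load balancing before the container receives SIGTERM. Optionally, deregister the instance from a service registry such as Nacos first.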

<code>lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 15
</code>
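<code>lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        # Quote the URL so the shell does not treat "&" as a background operator.
        # POD_IP must be set in the container's env (for example from status.podIP).
        - 'curl -X DELETE "your_nacos_ip:8848/nacos/v1/ns/instance?serviceName=nacos.test.1&ip=${POD_IP}&port=8880&clusterName=DEFAULT" && sleep 15'
</code>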
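
Time spent in the preStop hook counts against the pod's `terminationGracePeriodSeconds` (30 seconds by default), so the sleep and the application's own shutdown must fit inside it. A minimal check, where the application shutdown times are hypothetical values:

```python
def shutdown_fits(pre_stop_seconds, app_shutdown_seconds, grace_period_seconds=30):
    """True if preStop plus application shutdown finish within the grace period."""
    return pre_stop_seconds + app_shutdown_seconds <= grace_period_seconds


# 15 s preStop sleep as in the hooks above; app shutdown times are assumptions.
print(shutdown_fits(15, 10))  # fits in the default 30 s
print(shutdown_fits(15, 20))  # does not fit; the container would be killed
```

If the second case applies, either shorten the sleep or raise `terminationGracePeriodSeconds` in the pod spec.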

Probes

Configure liveness, readiness, and (optionally) startup probes to let kubelet assess pod health.

<code>readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
</code>
<code>startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 10
  initialDelaySeconds: 10
  periodSeconds: 10
</code>
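
The probe parameters translate into time budgets that are worth calculating before deploying. The sketch below is an approximation that ignores probe timeouts and scheduling jitter; the function names are illustrative.

```python
def startup_budget_seconds(probe):
    """Approximate time a container has to start before it is restarted."""
    return (probe.get("initialDelaySeconds", 0)
            + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10))


def failure_detection_seconds(probe):
    """Approximate time until consecutive failures trigger the probe's action."""
    return probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)


startup = {"failureThreshold": 10, "initialDelaySeconds": 10, "periodSeconds": 10}
liveness = {"failureThreshold": 3, "initialDelaySeconds": 60, "periodSeconds": 10}
print(startup_budget_seconds(startup), failure_detection_seconds(liveness))
```

With the values shown above, the application has roughly 110 seconds to start, and a hung container is restarted about 30 seconds after its liveness probe begins failing.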

Protection Strategy

Use PodDisruptionBudget to control the number of pods that can be voluntarily evicted.

<code># policy/v1 is available since Kubernetes 1.21; policy/v1beta1 was removed in 1.25.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
</code>
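
The effect of `minAvailable` depends on how many healthy replicas exist. A minimal sketch of the arithmetic, with the replica counts as hypothetical inputs:

```python
def disruptions_allowed(healthy_pods, min_available):
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - min_available)


# minAvailable: 2 as in the PodDisruptionBudget above.
for replicas in (2, 3, 5):
    print(replicas, "healthy ->", disruptions_allowed(replicas, 2), "evictable")
```

With only two healthy replicas, no voluntary eviction is allowed, so a node drain will block until another replica becomes healthy. Run at least one replica more than `minAvailable` if drains need to proceed.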

Logging

Standardize log levels, formats, encoding, and output paths. Collect logs via a node‑side logging agent for stdout logs or a sidecar container for non‑stdout logs.

Alerting

Define precise log keywords to avoid noisy alerts and ensure alerts correspond to actionable issues.
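
As an illustration of precise keywords, the sketch below matches specific failure signatures instead of a broad word such as "error". The keywords and log lines are hypothetical examples.

```python
import re

# Hypothetical keywords: specific signatures rather than a generic "error".
ALERT_PATTERN = re.compile(r"\b(OutOfMemoryError|Connection refused)\b")


def should_alert(line):
    """Return True only for log lines that contain an actionable keyword."""
    return bool(ALERT_PATTERN.search(line))


lines = [
    "java.lang.OutOfMemoryError: Java heap space",
    "INFO retry succeeded, error count reset to 0",
]
for line in lines:
    print(should_alert(line), line)
```

The second line contains the word "error" but describes a recovery, so a generic keyword would have produced a noisy alert.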

Monitoring

Effective monitoring provides observability for clusters and applications, enabling rapid issue detection and resolution.

Cluster Monitoring

Prometheus is commonly used to monitor Kubernetes clusters. Key metrics include node and pod CPU and memory usage and pod status.

Application Monitoring

Expose application metrics in Prometheus format, optionally using Java agents to export JVM metrics.
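
For reference, the Prometheus text format is simple enough to render by hand. The sketch below shows the shape of the output. In practice a client library would normally be used; the metric name and values here are made up, and label-value escaping is omitted for brevity.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total", "counter", "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 1027)],
)
print(text)
```

An application would serve this text on an HTTP endpoint (conventionally `/metrics`) for Prometheus to scrape.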

Event Monitoring

Track Warning and Normal events; tools like kube‑eventer can forward events to notification channels.

Trace Monitoring

Use distributed tracing tools (e.g., SkyWalking) to visualize inter‑service call chains.

Alert Notification

Select unique, problem‑reflecting metrics for alerts, classify severity, and choose appropriate notification channels.

Conclusion

The practices described constitute essential skills for YAML engineers and are applicable in most enterprises.
