Mastering Kubernetes: Essential Node & Pod Practices for Stable, Secure Deployments
This article outlines essential Kubernetes operational practices—including node maintenance, kernel upgrades, Docker and kubelet tuning, pod resource limits, scheduling strategies, health probes, logging standards, and monitoring setups—to ensure applications run reliably, securely, and efficiently in production environments.
As Kubernetes matures, more companies deploy applications on it, but containerization is only the first step; ensuring stable, secure operation is essential.
Node
A Node is a physical machine or a cloud host that runs Kubernetes workloads. Node operations focus on preventing anomalies before they reach the Pods.
Key Node tasks include:
Kernel upgrade
Software updates
Docker daemon optimization
Kubelet parameter tuning
Log configuration management
Security hardening
Kernel Upgrade
CentOS 7 ships with kernel 3.10, which has many known bugs when running Kubernetes. Upgrading to a newer kernel (e.g., 5.4) or using a distribution such as Ubuntu is recommended.
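Before and after the upgrade it can help to confirm which kernel a node is actually running. The following is a minimal sketch, assuming GNU `sort` with the `-V` (version sort) option is available; the helper name `kernel_at_least` is hypothetical and not part of any tool.

```shell
# kernel_at_least CURRENT MINIMUM
# Succeeds (exit status 0) when CURRENT is the same as or newer than MINIMUM.
# Assumption: GNU sort with -V (version sort) is installed.
kernel_at_least() {
  # After a version sort the oldest version comes first; if that is the
  # minimum, the current kernel is at least as new as the minimum.
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Compare the running kernel with the 5.4 target mentioned above.
if kernel_at_least "$(uname -r)" "5.4"; then
  echo "kernel OK: $(uname -r)"
else
  echo "kernel older than 5.4: $(uname -r)"
fi
```

Running the same check across all nodes (for example over SSH) gives a quick inventory of which hosts still need the upgrade.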
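<code># Download and install the long-term (kernel-lt) 5.4 kernel package from ELRepo
wget https://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
# List the boot entries and make the new kernel the default
grep menuentry /boot/grub2/grub.cfg
grub2-set-default 'CentOS Linux (5.4.86-1.el7.elrepo.x86_64) 7 (Core)'
grub2-editenv list
# Regenerate the GRUB configuration, then reboot into the new kernel
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot</code>
Software Updates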
Patch packages that carry high-risk vulnerabilities promptly, but verify compatibility before rolling an update out to every node.
Docker Daemon Optimization
<code>cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "10"
  },
  "bip": "169.254.123.1/24",
  "oom-score-adjust": -1000,
  "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"],
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "live-restore": true
}
EOF</code>
Kubelet Parameter Tuning
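The kubelet flags in this section reserve resources for the kubelet and set hard eviction thresholds, and both reduce the memory that Pods can be allocated. As a rough sketch, allocatable memory is node capacity minus the reservations minus the hard eviction threshold. The function name and the 8 GiB node below are hypothetical, and the arithmetic is integer-only, so treat the result as an approximation.

```shell
# allocatable_memory_mib CAPACITY_MIB KUBE_RESERVED_MIB EVICTION_HARD_PERCENT
# Prints the approximate memory (MiB) left for Pods:
#   allocatable = capacity - kube-reserved - hard eviction threshold
# Assumption: no --system-reserved is configured, as in this section.
allocatable_memory_mib() {
  capacity_mib="$1"
  kube_reserved_mib="$2"
  eviction_percent="$3"
  # A percentage threshold is taken relative to the node's capacity.
  eviction_mib=$(( capacity_mib * eviction_percent / 100 ))
  echo $(( capacity_mib - kube_reserved_mib - eviction_mib ))
}

# Hypothetical 8 GiB node with this section's values:
# kube-reserved memory=250Mi, eviction-hard memory.available<5%.
allocatable_memory_mib 8192 250 5
```

The value the kubelet actually reports can be read from the Allocatable section of `kubectl describe node`.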
<code># The quoted 'EOF' keeps the trailing backslashes intact in the unit file.
cat > /etc/systemd/system/kubelet.service <<'EOF'
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
[Service]
# Create the cgroups referenced by --kube-reserved-cgroup before the kubelet starts
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpu/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuacct/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/systemd/system.slice/kubelet.service
# Only the reservation and eviction flags are shown; keep your existing kubelet flags
ExecStart=/usr/bin/kubelet \
  --enforce-node-allocatable=pods,kube-reserved \
  --kube-reserved-cgroup=/system.slice/kubelet.service \
  --kube-reserved=cpu=200m,memory=250Mi \
  --eviction-hard=memory.available<5%,nodefs.available<10%,imagefs.available<10% \
  --eviction-soft=memory.available<10%,nodefs.available<15%,imagefs.available<15% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m,imagefs.available=2m \
  --eviction-max-pod-grace-period=30 \
  --eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi,imagefs.available=500Mi
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
EOF</code>
Log Configuration Management
Use rsyslog or OSS to forward system logs off the node so that they remain available for forensic analysis.
Security Hardening
SSH password expiration policy
Password complexity policy
SSH login attempt limits
Idle session timeout configuration
Shell history recording configuration
Pod
Pods are the smallest scheduling unit; their stability directly affects applications.
Resource Limits
Choose QoS class based on workload importance.
Guaranteed (high‑priority):
<code>resources:
  limits:
    memory: "200Mi"
    cpu: "700m"
  requests:
    memory: "200Mi"
    cpu: "700m"
</code>
Burstable (general):
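<code>resources:
  limits:
    memory: "200Mi"
    cpu: "500m"
  requests:
    memory: "100Mi"
    cpu: "100m"
</code>
Avoid BestEffort (no requests or limits set): these Pods are among the first to be evicted when a node runs short of resources.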
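The way requests and limits map to a QoS class can be sketched as a small decision function. This is a simplified illustration that assumes a single-container Pod and compares the quantities as plain strings (so "0.7" and "700m" would not be treated as equal); the function name `qos_class` is hypothetical.

```shell
# qos_class CPU_REQUEST CPU_LIMIT MEM_REQUEST MEM_LIMIT
# Pass an empty string for any value that is not set in the manifest.
# Simplifications: single container, quantities compared as strings.
qos_class() {
  cpu_req="$1"; cpu_lim="$2"; mem_req="$3"; mem_lim="$4"
  # A missing request defaults to the limit when a limit is set.
  if [ -z "$cpu_req" ]; then cpu_req="$cpu_lim"; fi
  if [ -z "$mem_req" ]; then mem_req="$mem_lim"; fi
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    # Nothing is set at all.
    echo "BestEffort"
  elif [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
       [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    # Both resources have limits, and requests equal limits.
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}

qos_class 700m 700m 200Mi 200Mi   # the first manifest above
qos_class 100m 500m 100Mi 200Mi   # the second manifest above
qos_class "" "" "" ""             # no requests or limits
```

The class Kubernetes actually assigned to a running Pod is recorded in its `status.qosClass` field.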
Scheduling Strategies
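Node affinity, taints and tolerations, and pod anti-affinity control where Pods are placed: node affinity attracts Pods to matching nodes, tolerations let Pods run on tainted nodes, and pod anti-affinity keeps replicas apart.
<code>affinity:
  nodeAffinity:
    # Soft rule: the empty preference is a placeholder with no effect
    # until matchExpressions are added to it.
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference: {}
      weight: 100
    # Hard rule: schedule only onto nodes labeled env=uat.
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: env
          operator: In
          values:
          - uat
</code>
<code>tolerations:
# Tolerate the key1=value1:NoExecute taint for 3600 seconds, then be evicted.
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
</code>
<code>affinity:
  podAntiAffinity:
    # Do not schedule this Pod onto a node that already runs a Pod labeled app=store.
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - store
      topologyKey: "kubernetes.io/hostname"
</code>
Graceful Upgrade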
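Use a preStop hook to delay termination until in-flight traffic has drained, or to deregister the instance from a service registry before the container is stopped.
<code>lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 15
</code>
<code>lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      # The URL is double-quoted so the shell does not treat "&" as a background operator.
      # POD_IP is assumed to be set as a container environment variable (e.g. from status.podIP).
      - 'curl -X DELETE "your_nacos_ip:8848/nacos/v1/ns/instance?serviceName=nacos.test.1&ip=${POD_IP}&port=8880&clusterName=DEFAULT" && sleep 15'
</code>
Probes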
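Configure liveness and readiness probes, and optionally a startup probe for slow-starting applications. Readiness decides whether the Pod receives traffic; liveness decides whether the container is restarted.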
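Probe timing follows from simple arithmetic on the probe fields. The sketch below estimates how long a probe takes to exhaust its failure threshold, and how long a container has to pass its startup probe. It ignores per-probe timeouts, so the figures are approximate; both function names are hypothetical.

```shell
# probe_failure_window PERIOD_SECONDS FAILURE_THRESHOLD
# Rough time for a probe to use up its failure threshold.
probe_failure_window() {
  echo $(( $1 * $2 ))
}

# startup_budget INITIAL_DELAY_SECONDS PERIOD_SECONDS FAILURE_THRESHOLD
# Rough upper bound on startup time before the container is restarted.
startup_budget() {
  echo $(( $1 + $(probe_failure_window "$2" "$3") ))
}

# periodSeconds=10, failureThreshold=3 (liveness/readiness values in this section)
probe_failure_window 10 3
# initialDelaySeconds=10, periodSeconds=10, failureThreshold=10 (startup values)
startup_budget 10 10 10
```

With this section's values, an unhealthy container is detected after roughly 30 seconds of consecutive failures, and a starting container has roughly 110 seconds to become healthy. If an application needs longer to start, raising the startup probe's failureThreshold is usually safer than raising the liveness initialDelaySeconds.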
<code>readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
</code>
<code>startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 10
  initialDelaySeconds: 10
  periodSeconds: 10
</code>
Protection Strategy
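Use a PodDisruptionBudget to limit voluntary disruptions, such as node drains during maintenance.
<code># policy/v1 is available since Kubernetes 1.21; policy/v1beta1 was removed in 1.25.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
</code>
Note: minAvailable and maxUnavailable are mutually exclusive; a PodDisruptionBudget may set only one of them.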
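The effect of minAvailable can be sketched as a subtraction: the number of Pods that may be evicted voluntarily at any moment is the number of healthy Pods minus minAvailable, and never less than zero. The function name below is hypothetical.

```shell
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Prints how many Pods may be evicted voluntarily right now.
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  # A budget cannot go negative; zero means evictions are blocked.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 2   # three healthy replicas, minAvailable: 2
allowed_disruptions 2 2   # two healthy replicas, minAvailable: 2
```

The second case is worth watching for: with only two replicas and minAvailable: 2, no eviction is allowed and a node drain will wait indefinitely. The live value appears in the ALLOWED DISRUPTIONS column of `kubectl get pdb`.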
Logging
Logs cover both business events and exceptions. They should be concise yet informative enough to support monitoring and debugging, and writing them should have minimal impact on application performance.
Log Standards
Use appropriate log levels
Unified output format
Consistent character encoding
Standardized log paths
Standardized naming conventions
Collection
Two main approaches:
Deploy a logging agent on the Node to collect stdout logs.
Run a sidecar container in the Pod to collect application logs.
Analysis
Effective log analysis helps pinpoint issues; services like Alibaba Cloud Log Service provide powerful analysis capabilities.
Alerting
Define precise alert keywords to avoid noise and ensure alerts indicate actionable problems.
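The difference between a broad and a precise keyword can be seen on a few log lines. The sketch below uses made-up sample lines and a hypothetical helper, `count_matches`, that counts lines matching an extended regular expression.

```shell
# Hypothetical sample log lines, for illustration only.
sample_log() {
  cat <<'EOF'
2024-01-01T00:00:01Z INFO  health check finished, 0 errors
2024-01-01T00:00:02Z ERROR java.lang.OutOfMemoryError: Java heap space
2024-01-01T00:00:03Z WARN  retrying request, previous error ignored
2024-01-01T00:00:04Z ERROR Connection refused: db:3306
EOF
}

# count_matches PATTERN
# Counts stdin lines matching PATTERN (extended regex, case-insensitive).
# grep exits non-zero when nothing matches, hence the "|| true".
count_matches() {
  grep -ciE "$1" || true
}

# Broad keyword: also matches harmless INFO and WARN lines.
sample_log | count_matches 'error'
# Precise keywords: matches only the two actionable lines.
sample_log | count_matches 'OutOfMemoryError|Connection refused'
```

On this sample the broad keyword matches all four lines, while the precise pattern matches only the two that call for action.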
Monitoring
Observability across cluster and applications is vital for reliability.
Cluster Monitoring
Prometheus is commonly used to monitor Kubernetes clusters; key metrics include node health, API server latency, etc.
Application Monitoring
Expose application metrics in Prometheus format; javaagent can be used to collect JVM metrics.
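For reference, the Prometheus text exposition format is line-oriented: optional `# HELP` and `# TYPE` comment lines, followed by the sample itself. The sketch below renders one metric in that shape. The function name and the metric `app_requests_total` are hypothetical, and in practice a client library produces this output rather than a script.

```shell
# render_metric NAME TYPE HELP_TEXT VALUE
# Prints one metric in the Prometheus text exposition format.
render_metric() {
  printf '# HELP %s %s\n' "$1" "$3"
  printf '# TYPE %s %s\n' "$1" "$2"
  printf '%s %s\n' "$1" "$4"
}

# Hypothetical counter, as it could appear in a /metrics response.
render_metric app_requests_total counter "Total requests handled." 42
```

Prometheus scrapes this text over HTTP, typically from a `/metrics` endpoint on the application.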
Event Monitoring
Monitor Warning and Normal events using tools like kube-eventer to detect abnormal state transitions.
Link Monitoring
Use tracing tools such as SkyWalking to visualize inter‑service calls and diagnose latency issues.
Alert Notification
Select unique, problem‑reflecting metrics for alerts and classify urgency to ensure timely response.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.