Essential Kubernetes Ops: Node, Pod, Logging, and Monitoring Best Practices
This guide outlines practical steps for maintaining Kubernetes nodes, configuring pods, standardizing logging, and implementing effective monitoring and alerting to ensure stable, secure, and observable workloads in production environments.
1. Node Management
Kubernetes nodes can be physical or cloud hosts; their stability is critical. Basic operations include kernel upgrades, software updates, Docker daemon tuning, kubelet parameter adjustments, log management, and security hardening.
1.1 Kernel Upgrade
CentOS 7 ships with kernel 3.10, which has known Kubernetes bugs; upgrading to a newer kernel (e.g., 5.4) is recommended. Example upgrade commands:
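# Download and install the long-term-support kernel package from ELRepo
wget https://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
# List the boot entries, then make the new kernel the default
grep menuentry /boot/grub2/grub.cfg
grub2-set-default 'CentOS Linux (5.4.86-1.el7.elrepo.x86_64) 7 (Core)'
grub2-editenv list
# Regenerate the GRUB configuration and reboot into the new kernel
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
1.2 Docker Daemon Optimization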
Edit the Docker daemon configuration (by default /etc/docker/daemon.json) to set the cgroup driver, log driver and log size limits, bridge network address, OOM score, registry mirrors, and storage driver, and to enable live-restore:
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m", "max-file": "10"},
  "bip": "169.254.123.1/24",
  "oom-score-adjust": -1000,
  "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"],
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "live-restore": true
}
1.3 Kubelet Parameter Tuning
Create or modify /etc/systemd/system/kubelet.service to reserve resources for Kubernetes system daemons and to set eviction thresholds, so that pods cannot exhaust the node's memory or disk:
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/
[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpu/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuacct/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/systemd/system.slice/kubelet.service
ExecStart=/usr/bin/kubelet \
  --enforce-node-allocatable=pods,kube-reserved \
  --kube-reserved-cgroup=/system.slice/kubelet.service \
  --kube-reserved=cpu=200m,memory=250Mi \
  --eviction-hard=memory.available<5%,nodefs.available<10%,imagefs.available<10% \
  --eviction-soft=memory.available<10%,nodefs.available<15%,imagefs.available<15% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m,imagefs.available=2m \
  --eviction-max-pod-grace-period=30 \
  --eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi,imagefs.available=500Mi
Restart=always
StartLimitInterval=0
RestartSec=10
[Install]
WantedBy=multi-user.target
1.4 Log Management
Use rsyslog to forward system logs to a remote host, or archive them to object storage (OSS), so they remain available for forensic analysis even if the node itself is lost or compromised.
1.5 Security Hardening
Implement common security policies such as password expiration, complexity, SSH login limits, session timeout, and command history restrictions.
2. Pod Configuration
Pods are the smallest scheduling unit; proper resource limits, scheduling policies, graceful upgrades, probes, protection strategies, and disruption budgets ensure reliability.
2.1 Resource Limits
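Define resources.requests and resources.limits based on workload criticality. Setting requests equal to limits for every container gives the pod the Guaranteed QoS class:
resources:
  limits:
    memory: "200Mi"
    cpu: "700m"
  requests:
    memory: "200Mi"
    cpu: "700m"
For less critical workloads, set requests lower than limits (Burstable); avoid BestEffort pods, which declare neither and are typically the first to be evicted under node pressure.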
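The three QoS classes mentioned above can be told apart with a simple rule. The following is a minimal sketch of that rule, not the kubelet's implementation: the function name `qos_class` is hypothetical, only cpu and memory are considered, and quantities are compared as literal strings (so "0.7" and "700m" would be treated as different).

```python
def qos_class(containers):
    """Classify a pod by the resources of its containers (simplified sketch).

    `containers` is a list of dicts shaped like the `containers` entries of a
    pod spec. Assumption: quantities are compared as plain strings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        requests = resources.get("requests", {})
        if limits or requests:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # When a request is omitted, it defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


example = [{
    "resources": {
        "limits": {"memory": "200Mi", "cpu": "700m"},
        "requests": {"memory": "200Mi", "cpu": "700m"},
    }
}]
print(qos_class(example))
```

Running the same check over every workload manifest before deployment is one way to catch pods that would silently end up as BestEffort.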
2.2 Scheduling Strategies
Use node affinity, required or preferred, to pin pods to specific nodes. Example:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: env
          operator: In
          values:
          - uat
Apply tolerations for tainted nodes and pod anti-affinity to spread replicas across hosts.
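To make the matching rule behind a required node affinity concrete, here is a small sketch of how a node's labels could be checked against nodeSelectorTerms: terms are ORed together, and the matchExpressions inside one term are ANDed. It is an illustration only. The function names are hypothetical, matchFields and the Gt/Lt operators are left out, and the real evaluation happens inside the scheduler.

```python
def expression_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key = expr["key"]
    operator = expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def node_matches(labels, node_selector_terms):
    """Return True if any term matches; every expression in a term must hold."""
    for term in node_selector_terms:
        expressions = term.get("matchExpressions", [])
        # An empty term selects nothing, so require at least one expression.
        if expressions and all(expression_matches(labels, e) for e in expressions):
            return True
    return False


terms = [{"matchExpressions": [{"key": "env", "operator": "In", "values": ["uat"]}]}]
print(node_matches({"env": "uat"}, terms))
print(node_matches({"env": "prod"}, terms))
```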
2.3 Graceful Upgrade
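Implement a preStop hook to stop new traffic before the container is terminated, optionally deregistering the instance from a service registry (e.g., Nacos) and then sleeping so in-flight requests can finish. The URL is quoted so the shell does not interpret its & characters, and POD_IP must be available as an environment variable in the container (for example, injected through the Downward API):
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - 'curl -X DELETE "your_nacos_ip:8848/nacos/v1/ns/instance?serviceName=nacos.test.1&ip=${POD_IP}&port=8880&clusterName=DEFAULT" && sleep 15'
2.4 Probe Configuration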
Configure livenessProbe, readinessProbe, and optionally startupProbe (available since v1.16) to let kubelet assess pod health. Example:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2
Startup probe example:
startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 10
  initialDelaySeconds: 10
  periodSeconds: 10
2.5 Protection Strategy
Use a PodDisruptionBudget to limit simultaneous pod evictions. Example:
# policy/v1 is available from Kubernetes v1.21; older clusters use policy/v1beta1
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx
Only one of minAvailable or maxUnavailable may be set at a time.
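The effect of a budget can be reasoned about with simple arithmetic. The sketch below estimates how many voluntary evictions a budget would currently permit. It assumes integer values only (percentages are not handled), and `allowed_disruptions` is a hypothetical helper, not a Kubernetes API.

```python
def allowed_disruptions(healthy, expected, min_available=None, max_unavailable=None):
    """Estimate how many pods may be evicted right now (integer budgets only).

    healthy:  pods currently ready
    expected: replicas the controller wants to run
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    # Never report a negative budget when the workload is already degraded.
    return max(0, healthy - desired_healthy)


# With minAvailable: 2 and three healthy replicas, one eviction is permitted.
print(allowed_disruptions(healthy=3, expected=3, min_available=2))
```

With the pdb-demo budget, a node drain can therefore proceed only one nginx pod at a time while three replicas are healthy, and it blocks entirely once only two remain.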
3. Logging
Logging spans the entire application lifecycle. Adopt standards for log levels, formats, encoding, output paths, and naming to facilitate collection and analysis.
3.1 Log Collection
Two main approaches:
Deploy a logging agent on each node to capture stdout/stderr streams.
Run a sidecar container within the pod to collect non‑standard output logs.
Standard output collection is preferred for simplicity.
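For the standard-output approach, the application only needs to write one well-formed record per line to stdout and leave shipping to the node agent. The following is a minimal sketch using Python's standard logging module. The field names (time, level, logger, message) and the helper `build_logger` are illustrative assumptions, not a format the original prescribes.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to stdout (or a given stream)."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


log = build_logger("order-service")
log.info("order %s accepted", "A-1001")
```

Because every line is self-describing JSON, the same agent configuration can parse logs from all services without per-application rules.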
3.2 Log Analysis
Effective analysis helps pinpoint issues; services like Alibaba Cloud Log Service provide powerful query and visualization capabilities.
3.3 Alerting
Define precise log‑based alert keywords to avoid noisy or false alarms, ensuring alerts are actionable.
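One way to keep keyword alerts precise is to combine the log level with anchored patterns rather than matching a bare substring such as "error". The sketch below illustrates the idea. The pattern names and expressions are hypothetical examples, and a real deployment would express the same rules in the log platform's own query language.

```python
import re

# Hypothetical alert rules: each name maps to a specific, anchored pattern.
ALERT_PATTERNS = {
    "oom": re.compile(r"\bOutOfMemoryError\b"),
    "db_conn": re.compile(r"\bconnection refused\b", re.IGNORECASE),
}

ALERT_LEVELS = ("ERROR", "FATAL")


def matching_alerts(level, message):
    """Return the names of alert rules triggered by one log record."""
    if level not in ALERT_LEVELS:
        return []
    return [name for name, pattern in ALERT_PATTERNS.items() if pattern.search(message)]


print(matching_alerts("ERROR", "java.lang.OutOfMemoryError: Java heap space"))
print(matching_alerts("INFO", "no error found"))
```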
4. Monitoring
Observability requires cluster‑level, application‑level, event, and tracing monitoring, coupled with alert notifications.
4.1 Cluster Monitoring
Prometheus is the de‑facto solution for Kubernetes cluster metrics; monitor CPU, memory, pod health, etc.
4.2 Application Monitoring
Expose custom metrics in Prometheus format from applications; optionally use Java agents or exporters (e.g., jvm‑exporter, redis‑exporter).
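To show what "Prometheus format" means on the wire, the sketch below renders a metric in the text exposition format by hand: a HELP line, a TYPE line, and one sample per label set. It is for illustration only. The metric name and labels are made up, and a real service would normally rely on an official client library rather than a hand-written renderer like `render_metric`.

```python
def escape_label(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family; `samples` is a list of (labels, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            body = ",".join(
                f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [({"method": "GET", "code": "200"}, 3)],
)
print(text, end="")
```

An application would serve this text on an HTTP endpoint (conventionally /metrics) for Prometheus to scrape.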
4.3 Event Monitoring
Track Warning and Normal events using tools like kube-eventer to forward events to notification channels.
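As a rough illustration of what an event forwarder does before notifying anyone, the sketch below filters Warning events and groups them by reason and object. It assumes events shaped like the items returned by `kubectl get events -o json`. The sample data and the helper name `warning_summary` are hypothetical.

```python
from collections import Counter


def warning_summary(events):
    """Count Warning events per (reason, kind, name), most frequent first."""
    counts = Counter()
    for event in events:
        if event.get("type") != "Warning":
            continue
        obj = event.get("involvedObject", {})
        key = (event.get("reason", "Unknown"), obj.get("kind", ""), obj.get("name", ""))
        # `count` is how often the same event was observed; default to 1.
        counts[key] += event.get("count") or 1
    return counts.most_common()


# Hypothetical sample data for illustration.
sample_events = [
    {"type": "Warning", "reason": "FailedScheduling", "count": 3,
     "involvedObject": {"kind": "Pod", "name": "web-2"}},
    {"type": "Normal", "reason": "Pulled", "count": 1,
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
    {"type": "Warning", "reason": "BackOff", "count": 5,
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
]
for key, count in warning_summary(sample_events):
    print(key, count)
```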
4.4 Tracing (Link Monitoring)
Use distributed tracing tools such as SkyWalking to visualize request flows across services.
4.5 Alert Notification
Select unique, problem‑reflecting metrics for alerts, classify severity, and route notifications through appropriate channels to ensure timely response.
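The classify-then-route idea can be sketched in a few lines. The thresholds, metric name, and channel names below are hypothetical placeholders. In a Prometheus setup, the same logic would normally live in alerting rules and Alertmanager routes rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Thresholds for one metric (values are illustrative assumptions)."""
    metric: str
    warning: float
    critical: float


# Hypothetical mapping from severity to notification channels.
ROUTES = {
    "critical": ["phone", "chat"],
    "warning": ["chat"],
}


def classify(rule, value):
    """Return 'critical', 'warning', or None when the value is healthy."""
    if value >= rule.critical:
        return "critical"
    if value >= rule.warning:
        return "warning"
    return None


def notify_targets(rule, value):
    """Return the severity and the channels that should receive the alert."""
    severity = classify(rule, value)
    return severity, ROUTES.get(severity, [])


memory_rule = Rule(metric="node_memory_usage_percent", warning=80, critical=95)
print(notify_targets(memory_rule, 97))
print(notify_targets(memory_rule, 85))
```

Keeping the severity-to-channel mapping in one table makes it easy to review who gets paged for what.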
dbaplus Community
