Essential Kubernetes Ops: Node, Pod, Logging, and Monitoring Best Practices

This guide outlines practical steps for maintaining Kubernetes nodes, configuring pods, standardizing logging, and implementing effective monitoring and alerting to ensure stable, secure, and observable workloads in production environments.

dbaplus Community

Aug 16, 2021

1. Node Management

Kubernetes nodes can be physical or cloud hosts; their stability is critical. Basic operations include kernel upgrades, software updates, Docker daemon tuning, kubelet parameter adjustments, log management, and security hardening.

1.1 Kernel Upgrade

CentOS 7 ships with kernel 3.10, which has known issues with container workloads under Kubernetes; upgrading to a newer long-term kernel (e.g., 5.4 from ELRepo) is recommended. Example upgrade commands, run as root:

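# Download and install the long-term (kernel-lt) 5.4 package from ELRepo
wget https://elrepo.org/linux/kernel/el7/x86_64/RPMS/kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
rpm -ivh kernel-lt-5.4.86-1.el7.elrepo.x86_64.rpm
# List the boot entries and make the new kernel the default
grep menuentry /boot/grub2/grub.cfg
grub2-set-default 'CentOS Linux (5.4.86-1.el7.elrepo.x86_64) 7 (Core)'
# Confirm the saved default, regenerate the GRUB config, and reboot
grub2-editenv list
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot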
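
After the reboot it is worth confirming that the node actually booted the new kernel. The following is a minimal Python sketch of such a check, not part of the original procedure; the helper name kernel_at_least and the 5.4 minimum are assumptions taken from the example above.

```python
import platform
import re


def kernel_at_least(release: str, minimum: tuple = (5, 4)) -> bool:
    """Return True if a kernel release string is at or above `minimum` (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


if __name__ == "__main__":
    # platform.release() reports the running kernel, e.g. "5.4.86-1.el7.elrepo.x86_64"
    release = platform.release()
    print(release, "OK" if kernel_at_least(release) else "kernel too old")
```

Running the script on each node after the upgrade gives a quick pass/fail signal that can be collected by whatever automation drives the rollout.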

1.2 Docker Daemon Optimization

Edit the Docker daemon configuration (by default /etc/docker/daemon.json) to set the cgroup driver, the log driver and log size limits, the bridge IP, the OOM score, registry mirrors, and the storage driver, and to enable live-restore:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m", "max-file": "10"},
  "bip": "169.254.123.1/24",
  "oom-score-adjust": -1000,
  "registry-mirrors": ["https://pqbap4ya.mirror.aliyuncs.com"],
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.override_kernel_check=true"],
  "live-restore": true
}
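
The two log-opts values together bound how much disk a single container's logs can use. The sketch below works that bound out from the configuration above; the count of 50 containers per node is a hypothetical figure, and only the "m" suffix used in the example is handled.

```python
import json

# The logging-related part of the daemon.json shown above.
DAEMON_JSON = """
{
  "log-driver": "json-file",
  "log-opts": {"max-size": "100m", "max-file": "10"}
}
"""


def max_log_mb_per_container(config: dict) -> int:
    """Upper bound on json-file log size per container, in megabytes."""
    opts = config["log-opts"]
    max_size = opts["max-size"]
    if not max_size.endswith("m"):
        # Only the megabyte suffix used in the example is handled in this sketch.
        raise ValueError(f"unsupported max-size unit: {max_size!r}")
    return int(max_size[:-1]) * int(opts["max-file"])


config = json.loads(DAEMON_JSON)
per_container = max_log_mb_per_container(config)
containers_per_node = 50  # hypothetical container count
print(f"{per_container} MB per container, {per_container * containers_per_node} MB per node")
```

A result that approaches the size of the node's data disk suggests lowering max-size or max-file before the eviction thresholds in the next section start firing.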

1.3 Kubelet Parameter Tuning

Create or modify /etc/systemd/system/kubelet.service so that the kubelet reserves CPU and memory for itself and evicts pods before the node runs out of memory or disk, which helps prevent node crashes:

[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/

[Service]
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/pids/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpu/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuacct/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/cpuset/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/memory/system.slice/kubelet.service
ExecStartPre=/usr/bin/mkdir -p /sys/fs/cgroup/systemd/system.slice/kubelet.service
ExecStart=/usr/bin/kubelet \
  --enforce-node-allocatable=pods,kube-reserved \
  --kube-reserved-cgroup=/system.slice/kubelet.service \
  --kube-reserved=cpu=200m,memory=250Mi \
  --eviction-hard=memory.available<5%,nodefs.available<10%,imagefs.available<10% \
  --eviction-soft=memory.available<10%,nodefs.available<15%,imagefs.available<15% \
  --eviction-soft-grace-period=memory.available=2m,nodefs.available=2m,imagefs.available=2m \
  --eviction-max-pod-grace-period=30 \
  --eviction-minimum-reclaim=memory.available=0Mi,nodefs.available=500Mi,imagefs.available=500Mi
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
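
To see what these flags leave for pods, recall that the kubelet advertises allocatable resources as capacity minus the reserved amounts minus the hard eviction threshold. The sketch below applies that arithmetic to the values in the unit file; the node size (4 cores, 16 GiB) is a hypothetical assumption, and no system-reserved value is included because the unit file does not set one.

```python
def allocatable(capacity: float, reserved: float, eviction_hard_pct: float = 0.0) -> float:
    """Capacity minus reserved resources minus the hard eviction threshold."""
    return capacity - reserved - capacity * eviction_hard_pct / 100


# Hypothetical node: 4 CPU cores (4000m) and 16 GiB (16384 Mi) of memory.
cpu_m = allocatable(capacity=4000, reserved=200)       # --kube-reserved=cpu=200m
memory_mi = allocatable(capacity=16384, reserved=250,  # --kube-reserved=memory=250Mi
                        eviction_hard_pct=5)           # memory.available<5%
print(f"allocatable cpu: {cpu_m:.0f}m, memory: {memory_mi:.1f}Mi")
```

The figures reported by kubectl describe node under Allocatable should be close to these numbers; a large gap usually means a flag was not picked up.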

1.4 Log Management

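Forward system logs off the node, for example with rsyslog to a remote log server or by archiving them to object storage such as OSS, so that they remain available for forensic analysis even if the node is lost or compromised.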

1.5 Security Hardening

Implement common security policies such as password expiration, complexity, SSH login limits, session timeout, and command history restrictions.

2. Pod Configuration

Pods are the smallest scheduling unit; proper resource limits, scheduling policies, graceful upgrades, probes, protection strategies, and disruption budgets ensure reliability.

2.1 Resource Limits

Define resources.limits and resources.requests based on workload criticality. Example for a Guaranteed pod, where requests equal limits for both CPU and memory:

resources:
  limits:
    memory: "200Mi"
    cpu: "700m"
  requests:
    memory: "200Mi"
    cpu: "700m"

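For less critical workloads, use the Burstable class (requests set lower than limits); avoid BestEffort (no requests or limits at all), since those pods are typically the first to be killed when the node comes under resource pressure.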
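
How the three QoS classes follow from requests and limits can be sketched as a small function. This is a simplified illustration and not the kubelet's code: it compares quantities as plain strings, whereas Kubernetes compares parsed quantities (so "1" and "1000m" would count as equal there).

```python
def qos_class(containers: list) -> str:
    """Classify a pod from the `resources` dicts of its containers."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits", {})
        # Requests that are not set default to the corresponding limits.
        requests = {**limits, **c.get("requests", {})}
        for resource in ("cpu", "memory"):
            if resource not in limits or requests[resource] != limits[resource]:
                return "Burstable"
    return "Guaranteed"


guaranteed = [{"limits": {"cpu": "700m", "memory": "200Mi"},
               "requests": {"cpu": "700m", "memory": "200Mi"}}]
burstable = [{"limits": {"cpu": "700m", "memory": "200Mi"},
              "requests": {"cpu": "200m", "memory": "100Mi"}}]
print(qos_class(guaranteed), qos_class(burstable), qos_class([{}]))
```

The class Kubernetes actually assigned to a running pod is recorded in its status.qosClass field, which is a convenient way to confirm a manifest behaves as intended.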

2.2 Scheduling Strategies

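Use node affinity to steer pods onto specific nodes: a required rule is a hard constraint that must be met at scheduling time, while a preferred rule is a soft one the scheduler tries to honor. Example of a required rule: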

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: env
          operator: In
          values:
          - uat

Apply tolerations for tainted nodes and pod anti‑affinity to spread replicas across hosts.
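
As a sketch of what that looks like in a pod spec, the snippet below builds the two fragments as a Python dict and prints them as JSON, which Kubernetes accepts alongside YAML. The taint env=uat:NoSchedule and the app=nginx label are hypothetical values chosen to match the other examples in this article.

```python
import json

# Hypothetical taint on the UAT nodes: kubectl taint nodes <node> env=uat:NoSchedule
pod_spec_fragment = {
    "tolerations": [
        {"key": "env", "operator": "Equal", "value": "uat", "effect": "NoSchedule"}
    ],
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Never place two pods labelled app=nginx ...
                    "labelSelector": {"matchLabels": {"app": "nginx"}},
                    # ... on the same host.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    },
}

print(json.dumps(pod_spec_fragment, indent=2))
```

With a required anti-affinity rule, replicas beyond the number of eligible nodes stay Pending; the preferredDuringSchedulingIgnoredDuringExecution variant is the softer choice when that is not acceptable.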

2.3 Graceful Upgrade

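Implement a preStop hook that takes the pod out of rotation before it is terminated, for example by deregistering it from a service registry (e.g., Nacos) and then waiting for in-flight requests to finish. The URL must be quoted, otherwise the shell treats each & as a command separator, and POD_IP must be injected as an environment variable (for example from status.podIP via the Downward API):

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      # Deregister from Nacos, then wait 15 s so in-flight requests can drain
      - 'curl -X DELETE "your_nacos_ip:8848/nacos/v1/ns/instance?serviceName=nacos.test.1&ip=${POD_IP}&port=8880&clusterName=DEFAULT" && sleep 15'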
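
Once the preStop hook returns, the kubelet sends SIGTERM to the container and follows up with SIGKILL when terminationGracePeriodSeconds (30 seconds by default, counted from the start of termination and therefore including the hook) runs out. The application should use that window to finish its work. The following is a minimal sketch of the application side; the function names are illustrative and not part of the original article.

```python
import signal
import threading

shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Mark the process as shutting down; the main loop checks this flag."""
    shutting_down.set()


def install_handler():
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(process_one_request) -> int:
    """Handle requests until SIGTERM arrives, then return so the process can exit."""
    handled = 0
    while not shutting_down.is_set():
        process_one_request()
        handled += 1
    return handled
```

A service would call install_handler() at startup and then serve(...) with its own request-handling callable; when serve returns, it can close connections and exit before the grace period expires.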

2.4 Probe Configuration

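Configure livenessProbe, readinessProbe, and optionally startupProbe (introduced as alpha in v1.16 and enabled by default since v1.18) so that the kubelet can assess container health. In the example below, port: http refers to a container port named http: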

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 40
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 3
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: http
    scheme: HTTP
  initialDelaySeconds: 60
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 2

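Startup probe example. With failureThreshold: 10 and periodSeconds: 10, the container has roughly 100 seconds after the initial delay to start; liveness and readiness checks are held back until the startup probe succeeds: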

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 10
  initialDelaySeconds: 10
  periodSeconds: 10

2.5 Protection Strategy

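Use a PodDisruptionBudget to limit how many pods can be evicted at once by voluntary disruptions such as node drains. Example (policy/v1 is available from v1.21; older clusters use policy/v1beta1, which was removed in v1.25):

apiVersion: policy/v1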
kind: PodDisruptionBudget
metadata:
  name: pdb-demo
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx

minAvailable and maxUnavailable are mutually exclusive; set only one of them in a given PodDisruptionBudget.
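
The effect of either field can be worked out with simple arithmetic: the number of evictions allowed right now is the count of healthy pods minus the number that must stay healthy. The sketch below assumes integer values only (both fields also accept percentages, which are not handled here), and the replica counts are hypothetical.

```python
from typing import Optional


def allowed_disruptions(healthy: int, expected: int,
                        min_available: Optional[int] = None,
                        max_unavailable: Optional[int] = None) -> int:
    """How many pods may be evicted right now under a PodDisruptionBudget."""
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = expected - max_unavailable
    return max(0, healthy - desired_healthy)


# pdb-demo above: minAvailable 2, with a hypothetical 3 nginx replicas all healthy.
print(allowed_disruptions(healthy=3, expected=3, min_available=2))
```

With three healthy replicas the budget permits one eviction at a time; if one replica is already unhealthy, the result drops to zero and a node drain waits until the pod recovers.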

3. Logging

Logging spans the entire application lifecycle. Adopt standards for log levels, formats, encoding, output paths, and naming to facilitate collection and analysis.

3.1 Log Collection

Two main approaches:

Deploy a logging agent on each node to capture stdout/stderr streams.

Run a sidecar container within the pod to collect logs that the application writes to files instead of stdout/stderr.

Standard output collection is preferred for simplicity.
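
For an application to fit the standard-output approach, it only needs to write one well-formed record per line to stdout. A minimal sketch using Python's logging module is shown below; the field names (time, level, logger, message) are an assumed convention, not one prescribed by the article.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry, ensure_ascii=False)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stdout for the node agent to pick up."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Calling build_logger("orders").warning("stock low: %s", "sku-1") then emits a single JSON line that a node-level agent can parse without custom rules.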

3.2 Log Analysis

Log analysis is what turns collected logs into a way to pinpoint issues; services such as Alibaba Cloud Log Service provide query and visualization capabilities for this.

3.3 Alerting

Define log-based alerts on precise keywords or patterns rather than broad ones, so that alerts are actionable and noisy or false alarms are avoided.
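
The difference between a precise pattern and a broad keyword is easy to show. In the sketch below, the rule names and patterns are hypothetical; the point is that a bare keyword like "error" would also fire on a harmless line such as error_count=0, while the anchored patterns do not.

```python
import re

# Hypothetical alert rules: name -> compiled pattern.
ALERT_PATTERNS = {
    "java-oom": re.compile(r"\bjava\.lang\.OutOfMemoryError\b"),
    "error-level": re.compile(r"\blevel=(ERROR|FATAL)\b"),
}


def matching_alerts(line: str) -> list:
    """Return the names of all alert rules that match a log line."""
    return [name for name, pattern in ALERT_PATTERNS.items() if pattern.search(line)]


print(matching_alerts('level=ERROR msg="payment failed"'))
print(matching_alerts("level=INFO error_count=0"))
```

Testing each pattern against a sample of real log lines before enabling it is a cheap way to catch rules that would page someone for nothing.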

4. Monitoring

Observability requires cluster‑level, application‑level, event, and tracing monitoring, coupled with alert notifications.

4.1 Cluster Monitoring

Prometheus is the de facto standard for Kubernetes cluster metrics; use it to monitor node and pod CPU and memory usage, pod health, and similar signals.

4.2 Application Monitoring

Expose custom metrics in Prometheus format from applications; optionally use Java agents or exporters (e.g., jvm‑exporter, redis‑exporter).
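
To make "Prometheus format" concrete, the sketch below renders a counter in the text exposition format by hand. Real applications would normally use a Prometheus client library instead; the metric name, help text, and label values are hypothetical, and label-value escaping is left out for brevity.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric name and label values.
output = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    {(("method", "GET"), ("status", "200")): 42},
)
print(output, end="")
```

Serving this text over HTTP, conventionally at /metrics, is all Prometheus needs in order to scrape the application.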

4.3 Event Monitoring

Track Warning and Normal events using tools like kube-eventer to forward events to notification channels.

4.4 Distributed Tracing

Use distributed tracing tools such as SkyWalking to visualize request flows across services.

4.5 Alert Notification

Alert on metrics that point unambiguously at a problem, classify alerts by severity, and route notifications through channels appropriate to that severity to ensure a timely response.
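
A severity classification and routing table can be as small as the sketch below. The thresholds, severity names, and channels are all hypothetical; in practice this logic usually lives in the alerting system's own configuration rather than in application code.

```python
# Hypothetical notification channels per severity.
ROUTES = {
    "critical": ["phone", "chat"],
    "warning": ["chat"],
}


def severity(value: float, warning: float, critical: float) -> str:
    """Classify a metric value against its warning and critical thresholds."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


def channels_for(level: str) -> list:
    """Channels to notify for a severity; nothing is sent for 'ok'."""
    return ROUTES.get(level, [])


# Example: node memory usage at 97% with thresholds of 80% and 95%.
level = severity(97, warning=80, critical=95)
print(level, channels_for(level))
```

Keeping the most intrusive channel for the highest severity only is what makes those notifications hard to ignore.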
