
10 Kubernetes Ops Pitfalls and How to Avoid Them – Hard‑Earned Lessons

This article shares ten real‑world Kubernetes production pitfalls—ranging from missing resource limits and storage misconfigurations to faulty probes and over‑privileged RBAC—each illustrated with a concrete case, detailed analysis, and actionable mitigation steps to help operators prevent costly outages.

MaGe Linux Operations

Kubernetes Cluster Operations: 10 Pitfalls and How to Avoid Them

As a Kubernetes operator with three years in the trenches, I have stepped into enough pitfalls to circle the globe. I am sharing what that "tuition" bought in the hope that it spares you the same detours.

Preface: Why Write This Article?

During the Double‑11 sale, at 2 am our K8s cluster crashed—200+ Pods restarted and the support phone lines exploded. The post‑mortem revealed a preventable low‑level mistake. That moment taught me that operations are as much about experience as technology.

This article summarizes the ten most common and critical pitfalls we encountered in production, each paired with a real case, thorough analysis, and concrete remediation.

Pitfall 1: Improper Resource Configuration – the "Avalanche Effect"

💥 Real Case

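# Misconfigured example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:latest
        # No resource requests or limits are set!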

Consequence: A memory leak in one Pod consumes node resources, evicting other Pods and causing a cascade failure.

🔧 Mitigation

Enforce resource limits

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Use LimitRange for automatic injection

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container

Ops Insight: Setting resource limits in production is non‑negotiable; monitor usage trends with Prometheus and adjust dynamically.
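Before enforcing limits, it helps to know which running Pods lack them. The following is a minimal sketch, not a definitive audit: the function name `pods_without_memory_limit` is hypothetical, and it assumes that kubectl prints `<none>` in a custom column when no container in the Pod sets the field. A Pod in which only some containers set a limit would not be flagged.

```shell
#!/bin/bash
# Sketch: list Pods that declare no memory limit at all.
# Expected input, one Pod per line: NAMESPACE POD MEMORY_LIMITS
# (assumption: kubectl shows "<none>" when the field is absent)

pods_without_memory_limit() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Intended usage against a live cluster (not executed here):
#   kubectl get pods -A --no-headers \
#     -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,MEM:.spec.containers[*].resources.limits.memory' \
#     | pods_without_memory_limit
```

Keeping the filter separate from the kubectl call makes it easy to try on saved output before pointing it at production.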

Pitfall 2: Storage Volume Mount “Vanishing Magic”

💥 Real Case

After an upgrade, all database Pod data disappeared because the PVC used the wrong storage class.

# Dangerous configuration
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  storageClassName: "standard"  # default storage class, data is not retained!
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

🔧 Mitigation

Specify the storage class explicitly

spec:
  storageClassName: "ssd-retain"  # explicitly choose a storage class that retains data

Set PV reclaim policy

apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  persistentVolumeReclaimPolicy: Retain  # protects the data
  capacity:
    storage: 20Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce

Backup verification script

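#!/bin/bash
# daily-backup-check.sh
# --no-headers keeps the header row from matching grep -v and raising a false alarm
kubectl get pvc -A -o wide --no-headers | grep -v "Bound" && echo "WARNING: unbound PVCs exist!"
kubectl get pv | grep "Released" && echo "WARNING: released PVs exist; data may have been lost!"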
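The reclaim policy can also be checked ahead of time instead of after data is gone. This sketch lists volumes whose policy is `Delete`; the function name `pvs_with_delete_policy` is hypothetical and the two-column input format is an assumption tied to the kubectl command shown in the comments.

```shell
#!/bin/bash
# Sketch: report PersistentVolumes that would be deleted when their claim is removed.
# Expected input, one volume per line: NAME RECLAIM_POLICY

pvs_with_delete_policy() {
  awk '$2 == "Delete" { print $1 }'
}

# Intended usage against a live cluster (not executed here):
#   kubectl get pv --no-headers \
#     -o custom-columns='NAME:.metadata.name,POLICY:.spec.persistentVolumeReclaimPolicy' \
#     | pvs_with_delete_policy
#
# An existing volume can be switched to Retain with:
#   kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```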

Pitfall 3: Image Management “Schrödinger State”

💥 Real Case

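# Problematic configuration
containers:
- name: app
  image: myapp:latest  # latest tag: the deployed version is not deterministic
  imagePullPolicy: Always  # pulls every time; the Pod cannot start if the registry is unreachable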

During a network hiccup, the image pull failed and the service was down for two hours.

🔧 Mitigation

Use explicit version tags

containers:
- name: app
  image: myapp:v1.2.3-20231120  # explicit version tag
  imagePullPolicy: IfNotPresent
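To find workloads that still rely on a mutable tag, image references can be classified with a small helper. This is a sketch under stated assumptions: `is_mutable_image_ref` is a hypothetical name, and it treats a reference as mutable when it ends in `:latest` or carries no tag at all (which resolves to `latest`); digest-pinned references are treated as fixed.

```shell
#!/bin/bash
# Sketch: return 0 (true) when an image reference is not pinned to a version.

is_mutable_image_ref() {
  local ref="$1"
  # A digest pins the exact content, so it is never mutable.
  case "$ref" in *@sha256:*) return 1 ;; esac
  # Inspect only the last path component so that a registry port
  # (host:5000/app) is not mistaken for a tag.
  local name="${ref##*/}"
  case "$name" in
    *:latest) return 0 ;;  # explicit latest tag
    *:*)      return 1 ;;  # some other explicit tag
    *)        return 0 ;;  # no tag, which defaults to latest
  esac
}

# Intended usage against a live cluster (not executed here):
#   kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' \
#     | tr ' ' '\n' | sort -u \
#     | while read -r img; do is_mutable_image_ref "$img" && echo "$img"; done
```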

Deploy a high‑availability image registry

# Configure credentials for more than one image registry
apiVersion: v1
kind: Secret
metadata:
  name: regcred-backup
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-config>
---
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred-primary
      - name: regcred-backup

Pitfall 4: Network Policy “Black Hole”

💥 Real Case

After enabling a NetworkPolicy, services could no longer communicate; the misconfiguration took a whole night to debug.

# Overly strict network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  # No allow rules are configured, so all traffic is blocked!

🔧 Mitigation

Progressive network‑policy rollout

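# Step 1: observe only, do not block
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-netpol
  annotations:
    # Placeholder annotation: core NetworkPolicy has no monitor mode,
    # so audit-only behavior depends on what your CNI plugin supports.
    net.example.com/policy-mode: "monitor"  # start in monitor mode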
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 80

Network‑policy testing tools

#!/bin/bash
# netpol-test.sh
echo "Testing network connectivity..."
kubectl run test-pod --image=nicolaka/netshoot --rm -it -- /bin/bash
# Inside the Pod, test with:
# nc -zv <target-service> <port>

Ops Tip: Use Calico or Cilium visualizers to see policy effects.

Pitfall 5: Liveness/Readiness Probe Misconfiguration “False Kill”

💥 Real Case

# Aggressive probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5   # startup delay too short
  periodSeconds: 5         # check interval too short
  failureThreshold: 1      # restart after a single failure
  timeoutSeconds: 1        # timeout too short

The application needed 30 seconds to start, but the probe began checking after 5 seconds, causing continuous restarts.

🔧 Mitigation

Configure probes with sensible parameters

# Relaxed probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60   # allow enough startup time
  periodSeconds: 30         # moderate check interval
  failureThreshold: 3       # restart only after repeated failures
  timeoutSeconds: 10        # reasonable timeout
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
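The difference between the two liveness configurations can be made concrete with a little arithmetic. The sketch below estimates the earliest moment a container that never answers its probe would be restarted, as the initial delay plus the remaining failed checks times the period. The function name is hypothetical, and the estimate deliberately ignores `timeoutSeconds` and any scheduling jitter in the kubelet, so treat it as a rough lower bound.

```shell
#!/bin/bash
# Sketch: approximate earliest restart time (seconds after container start)
# for a container whose liveness probe never succeeds.

earliest_liveness_restart() {
  local initial_delay="$1" period="$2" failure_threshold="$3"
  echo $(( initial_delay + (failure_threshold - 1) * period ))
}

earliest_liveness_restart 5 5 1     # aggressive settings: prints 5
earliest_liveness_restart 60 30 3   # relaxed settings: prints 120
```

With the aggressive settings the container can be killed about 5 seconds after it starts, well before an application that needs 30 seconds is ready. The relaxed settings leave roughly 120 seconds.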

Pitfall 6: Rolling Update Strategy Causing Service Outage

💥 Real Case

# Dangerous update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%  # half the Pods updated at once
      maxSurge: 0          # no Pods allowed beyond the replica count

The update halved service capacity, leading to a terrible user experience.

🔧 Mitigation

Adopt a conservative rolling‑update policy

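spec:
  minReadySeconds: 30       # a new Pod must stay ready for 30 seconds before the rollout continues
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # at most a quarter of the Pods unavailable
      maxSurge: 25%         # allow temporarily exceeding the replica count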
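To see what these percentages mean for capacity, the sketch below computes the guaranteed minimum of available Pods and the maximum total Pods during a rollout. It relies on the documented rounding rules (a `maxUnavailable` percentage rounds down, a `maxSurge` percentage rounds up). The function names are hypothetical, and the replica count of 10 is an assumption for illustration.

```shell
#!/bin/bash
# Sketch: capacity bounds during a rolling update, given percentages.

# Minimum Pods guaranteed available; integer division rounds down.
min_available_pods() {
  local replicas="$1" max_unavailable_pct="$2"
  echo $(( replicas - replicas * max_unavailable_pct / 100 ))
}

# Maximum old plus new Pods at any moment; adding 99 before dividing rounds up.
max_total_pods() {
  local replicas="$1" max_surge_pct="$2"
  echo $(( replicas + (replicas * max_surge_pct + 99) / 100 ))
}

min_available_pods 10 50   # dangerous strategy: prints 5
min_available_pods 10 25   # conservative strategy: prints 8
max_total_pods 10 25       # conservative strategy: prints 13
```

With 10 replicas, the dangerous strategy can leave only 5 Pods serving traffic, while the conservative one keeps at least 8 and may briefly run 13.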

Pitfall 7: Log Collection “Disk Bomb”

💥 Real Case

An application generated massive DEBUG logs without log rotation, eventually filling the node’s disk and making the node unusable.

🔧 Mitigation

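Configure log collection and filtering

The Fluentd filter below only reduces what is shipped downstream; rotation of container log files on the node itself is governed by the kubelet settings containerLogMaxSize and containerLogMaxFiles.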

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    # Filter logs to reduce storage pressure
    <filter kubernetes.**>
      @type grep
      <exclude>
        key log
        pattern /DEBUG|TRACE/
      </exclude>
    </filter>

Monitor disk usage

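#!/bin/bash
# disk-monitor.sh
# `kubectl top node` reports only CPU and memory, so read root filesystem
# usage from the kubelet Summary API instead (requires jq).
THRESHOLD=85
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  USAGE=$(kubectl get --raw "/api/v1/nodes/${node}/proxy/stats/summary" |
    jq -r '.node.fs | (.usedBytes * 100 / .capacityBytes | floor)')
  if [ "${USAGE:-0}" -gt "$THRESHOLD" ]; then
    echo "WARNING: node $node disk usage is ${USAGE}%, above the ${THRESHOLD}% threshold!"
  fi
done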

Pitfall 8: RBAC Privilege Escalation

💥 Real Case

For convenience, a Pod was granted cluster‑admin rights, which was later flagged by the security team as a serious risk.

# Dangerous configuration
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-app-binding
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin  # far too much privilege!

🔧 Mitigation

Apply the principle of least privilege

# Create a least-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]

Audit RBAC bindings

#!/bin/bash
# rbac-audit.sh
echo "Checking for dangerous ClusterRoleBindings..."
kubectl get clusterrolebinding -o yaml | grep -A 5 -B 5 "cluster-admin"

echo "Checking ServiceAccount permissions..."
kubectl get rolebinding,clusterrolebinding --all-namespaces -o wide
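Grepping YAML shows context but not a clean list of offenders. As a rough alternative, the sketch below prints only the bindings that grant `cluster-admin` to a ServiceAccount. The function name is hypothetical, and the four-column input format is an assumption matched to the kubectl command in the comments (multiple subjects appear comma-separated in one column).

```shell
#!/bin/bash
# Sketch: list ClusterRoleBindings that give cluster-admin to a ServiceAccount.
# Expected input, one binding per line: NAME ROLE SUBJECT_KINDS SUBJECT_NAMES

service_accounts_with_cluster_admin() {
  awk '$2 == "cluster-admin" && $3 ~ /ServiceAccount/ { print $1 }'
}

# Intended usage against a live cluster (not executed here):
#   kubectl get clusterrolebinding --no-headers \
#     -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name,KINDS:.subjects[*].kind,SUBJECTS:.subjects[*].name' \
#     | service_accounts_with_cluster_admin
```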

Pitfall 9: Node Maintenance Single‑Point Failure

💥 Real Case

Rebooting a node for a kernel upgrade unintentionally took down the database master Pod, causing a brief outage.

🔧 Mitigation

Graceful node maintenance workflow

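#!/bin/bash
# node-maintenance.sh
NODE_NAME=$1
if [ -z "$NODE_NAME" ]; then
  echo "Usage: $0 <node-name>" >&2
  exit 1
fi

echo "1. Checking critical Pods on the node..."
kubectl get pods --all-namespaces --field-selector "spec.nodeName=$NODE_NAME" -o wide

echo "2. Marking the node unschedulable..."
kubectl cordon "$NODE_NAME"

echo "3. Waiting for confirmation..."
read -p "Evict Pods from this node? (y/N) " -n 1 -r
echo
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
  echo "Aborted; the node is still cordoned (run 'kubectl uncordon $NODE_NAME' to undo)."
  exit 1
fi

echo "4. Evicting Pods..."
# Stop if the drain fails, for example when a PodDisruptionBudget blocks eviction.
kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --grace-period=300 || exit 1

echo "5. The node is ready for maintenance"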

Pod anti‑affinity to spread critical workloads

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: kubernetes.io/hostname
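Before draining, it is worth checking whether any workload lives entirely on the node in question, since that is exactly the single point of failure described above. This is a minimal sketch: the function name `apps_only_on_node` is hypothetical, and it assumes Pods carry an `app` label as in the anti-affinity example.

```shell
#!/bin/bash
# Sketch: print apps whose every Pod runs on the given node.
# Expected input, one Pod per line: APP_LABEL NODE_NAME

apps_only_on_node() {
  local node="$1"
  awk -v node="$node" '
    $1 != "<none>" { total[$1]++; if ($2 == node) on_node[$1]++ }
    END { for (app in total) if (on_node[app] == total[app]) print app }
  ' | sort
}

# Intended usage against a live cluster (not executed here):
#   kubectl get pods -A --no-headers \
#     -o custom-columns='APP:.metadata.labels.app,NODE:.spec.nodeName' \
#     | apps_only_on_node "$NODE_NAME"
```

Any app this prints would go fully offline during the drain, so it is a candidate for more replicas or the anti-affinity rule shown above.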

Pitfall 10: Alert Fatigue – “The Boy Who Cried Wolf”

💥 Real Case

Overly sensitive alert rules generated hundreds of alerts daily, causing real incidents to be ignored.

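# Overly sensitive alert rule
- alert: HighCPUUsage
  expr: cpu_usage > 50  # cpu_usage is a placeholder metric, in percent
  for: 1m
  labels:
    severity: critical
  # Threshold too low, duration too short, severity too high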

🔧 Mitigation

Reasonable alert severity levels

# Prometheus alerting rules
groups:
- name: kubernetes-apps
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="false"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been not ready for a long time"

Alert deduplication script

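#!/bin/bash
# alert-dedup.sh
# Group similar Warning events to cut noise: count by namespace, reason, and object kind
kubectl get events -A --field-selector type=Warning --no-headers \
  -o custom-columns='NS:.metadata.namespace,REASON:.reason,KIND:.involvedObject.kind' |
  sort | uniq -c | sort -nr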

Operations Best‑Practice Summary

After these hard‑earned lessons, I distilled several golden rules for Kubernetes operations:

🎯 Prevention First

Resource limits are mandatory – be conservative, not aggressive.

Configure probes sensibly – give applications enough startup and response time.

Principle of least privilege – use Role instead of ClusterRole whenever possible.

🔍 Monitoring First

Comprehensive monitoring – cover nodes, Pods, network, storage.

Reasonable alerts – reduce noise, highlight critical issues.

Regular health checks – automate cluster health inspections.

🛡️ Failure Drills

Chaos engineering – intentionally inject failures to test resilience.

Backup verification – regularly test restore procedures.

Incident response playbooks – define detailed handling steps.

📚 Documentation

Operation logs – record every change.

Knowledge base – turn pitfall experiences into docs.

Team training – share best practices regularly.

Kubernetes operations is a continuous learning journey; every pitfall is an opportunity to grow. I hope this article helps those navigating the K8s landscape. If you have similar experiences, feel free to share them in the comments so we can all improve together!

Remember: In production there are no small problems, only major incidents. Every detail can determine system stability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
