10 Kubernetes Ops Pitfalls and How to Avoid Them – Hard‑Earned Lessons
This article shares ten real‑world Kubernetes production pitfalls—ranging from missing resource limits and storage misconfigurations to faulty probes and over‑privileged RBAC—each illustrated with a concrete case, detailed analysis, and actionable mitigation steps to help operators prevent costly outages.
Kubernetes Cluster Operations: 10 Pitfalls and How to Avoid Them
As a Kubernetes operator with three years of experience in the trenches, I have fallen into enough pitfalls to circle the globe. I’m sharing these "tuition fees" in the hope of helping you avoid the same detours.
Preface: Why Write This Article?
At 2 a.m. during the Double‑11 sale, our K8s cluster crashed: more than 200 Pods restarted and the support lines were flooded with calls. The post‑mortem traced the outage to a preventable, basic mistake. That night taught me that operations depends as much on experience as on technology.
This article summarizes the ten most common and critical pitfalls we encountered in production, each paired with a real case, thorough analysis, and concrete remediation.
Pitfall 1: Improper Resource Configuration – the “Avalanche Effect”
💥 Real Case
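# Misconfigured example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
spec:
  replicas: 10
  # selector and Pod labels added so the manifest is accepted by the API server
  selector:
    matchLabels:
      app: web-service
  template:
    metadata:
      labels:
        app: web-service
    spec:
      containers:
      - name: web
        image: nginx:latest
        # No resource requests or limits are set!
Consequence: A memory leak in one Pod consumes the node's memory, other Pods are evicted, and the failure cascades.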
🔧 Mitigation
Enforce resource limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Use LimitRange for automatic injection
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limit-range
spec:
  limits:
  - default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "100m"
      memory: "128Mi"
    type: Container
Ops Insight: Setting resource limits in production is non‑negotiable; monitor usage trends with Prometheus and adjust the values as workloads change.
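To find workloads that are already running without limits, kubectl output can be passed through a small filter. The following is a minimal sketch, not part of the original setup: the function name `find_unlimited` and the three-column layout (namespace, Pod, memory limit) are illustrative assumptions, and it relies on kubectl's custom-columns output printing `<none>` for a missing field.

```shell
# find_unlimited: read "namespace pod memory-limit" rows on stdin and
# print namespace/pod for every row whose limit column is missing.
find_unlimited() {
  awk '$3 == "" || $3 == "<none>" { print $1 "/" $2 }'
}

# Example against a live cluster (assumed column layout):
# kubectl get pods -A --no-headers \
#   -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,MEM:.spec.containers[*].resources.limits.memory' \
#   | find_unlimited
```

This only catches Pods in which no container declares a memory limit; a Pod where some containers have limits and others do not would need a per-container check.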
Pitfall 2: Storage Volume Mount “Vanishing Act”
💥 Real Case
After an upgrade, all database Pod data disappeared because the PVC used the wrong storage class.
# Dangerous configuration
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  storageClassName: "standard" # default storage class; data was not retained!
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
🔧 Mitigation
Specify the storage class explicitly
spec:
  storageClassName: "ssd-retain" # explicitly choose a storage class that retains data
Set PV reclaim policy
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv
spec:
  persistentVolumeReclaimPolicy: Retain # keep the volume and its data after the claim is deleted
  capacity:
    storage: 20Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  # volume source (csi, nfs, ...) omitted here; a real PV needs one
Backup verification script
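#!/bin/bash
# daily-backup-check.sh
# --no-headers keeps the header row from matching grep -v and raising a false warning
kubectl get pvc -A --no-headers | grep -v "Bound" && echo "WARNING: unbound PVCs exist!"
kubectl get pv --no-headers | grep "Released" && echo "WARNING: released PVs exist; data may be lost!"
Pitfall 3: Image Management “Schrödinger State”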
💥 Real Case
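# Problematic configuration
containers:
- name: app
  image: myapp:latest # "latest" tag: the deployed version is not deterministic
  imagePullPolicy: Always # pulls on every start; the Pod cannot start if the registry is unreachable
During a network hiccup, the image pull failed and the service was down for two hours.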
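To gauge how many workloads share this problem, image references can be classified by whether they pin a version. The helper below is a sketch under assumptions: the name `is_floating_tag` is hypothetical, and it treats a reference as floating when it has no tag or the tag `latest`, and as pinned when it carries any other tag or a digest.

```shell
# is_floating_tag IMAGE: succeed (exit 0) when IMAGE has no tag or uses
# ":latest"; fail (exit 1) when it is pinned by another tag or by digest.
is_floating_tag() {
  image=$1
  case $image in
    *@*) return 1 ;;        # pinned by digest, e.g. app@sha256:...
  esac
  name=${image##*/}         # drop registry/path so a registry port is not mistaken for a tag
  case $name in
    *:latest) return 0 ;;
    *:*) return 1 ;;
    *) return 0 ;;          # no tag at all resolves to "latest"
  esac
}

# Example against a live cluster:
# kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' |
#   tr -s '[[:space:]]' '\n' | sort -u |
#   while read -r img; do is_floating_tag "$img" && echo "floating tag: $img"; done
```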
🔧 Mitigation
Use explicit version tags
containers:
- name: app
  image: myapp:v1.2.3-20231120 # explicit version
  imagePullPolicy: IfNotPresent
Deploy a high‑availability image registry
# Credentials for a backup registry
apiVersion: v1
kind: Secret
metadata:
  name: regcred-backup
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded-config>
---
# Deployment fragment: reference the pull secrets for both registries
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred-primary
      - name: regcred-backup
Pitfall 4: Network Policy “Black Hole”
💥 Real Case
After enabling a NetworkPolicy, services could no longer communicate; the misconfiguration took a whole night to debug.
# Overly strict network policy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  # No allow rules are defined, so all traffic is blocked!
🔧 Mitigation
Progressive network‑policy rollout
# Step 1: monitor only, do not block
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-netpol
  annotations:
    net.example.com/policy-mode: "monitor" # placeholder annotation for monitor mode; the standard NetworkPolicy API has no such mode, so this depends on your CNI plugin
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api
    ports:
    - protocol: TCP
      port: 80
Network‑policy testing tools
#!/bin/bash
# netpol-test.sh
echo "Testing network connectivity..."
kubectl run test-pod --image=nicolaka/netshoot --rm -it -- /bin/bash
# Inside the Pod, test with:
# nc -zv <target-service> <port>
Ops Tip: Use the policy visualization tools available for Calico or Cilium to see the effect of each policy.
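Checking targets one by one becomes tedious once a policy covers several services. The sketch below, meant to be run inside the test Pod, walks a list of "host port" pairs and reports which ones are reachable. The function name `check_targets` and the `PROBE_CMD` override are hypothetical; by default it uses `nc -z` with a three-second timeout, which assumes the netcat build in the image supports those flags.

```shell
# check_targets: read "host port" pairs on stdin and report reachability.
# PROBE_CMD can replace the default probe (useful for dry runs).
check_targets() {
  while read -r host port; do
    [ -n "$host" ] || continue
    # Word splitting of PROBE_CMD is intentional; stdin is detached so the
    # probe cannot swallow the remaining input lines.
    if ${PROBE_CMD:-nc -z -w 3} "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "OK $host:$port"
    else
      echo "BLOCKED $host:$port"
    fi
  done
}

# Example (service names are placeholders):
# printf 'api 80\nmysql 3306\n' | check_targets
```

Running the same list before and after applying a policy makes it easy to see exactly which paths the policy closed.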
Pitfall 5: Liveness/Readiness Probe Misconfiguration “False Kill”
💥 Real Case
# Aggressive probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5 # startup delay too short
  periodSeconds: 5 # check interval too short
  failureThreshold: 1 # restart after a single failure
  timeoutSeconds: 1 # timeout too short
The application needed 30 seconds to start, but the probe began checking after 5 seconds, causing continuous restarts.
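The restart loop follows directly from the numbers. As a rough worked example, a container that never answers its liveness probe is restarted about initialDelaySeconds + (failureThreshold - 1) × periodSeconds after it starts; this approximation ignores the probe timeout and the kubelet's scheduling jitter. The function name `earliest_restart` is illustrative.

```shell
# earliest_restart INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Approximate seconds after container start at which a never-healthy
# container is restarted by its liveness probe.
earliest_restart() {
  echo $(( $1 + ($3 - 1) * $2 ))
}

earliest_restart 5 5 1     # aggressive settings from the case above: prints 5
earliest_restart 60 30 3   # a lenient alternative (60 s delay, 30 s period, 3 failures): prints 120
```

With a 30-second startup, the first result (5 seconds) means the container is killed long before it can respond, while the second leaves ample headroom.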
🔧 Mitigation
Configure probes with sensible parameters
# Lenient probe configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 60 # allow enough time to start
  periodSeconds: 30 # moderate check interval
  failureThreshold: 3 # restart only after repeated failures
  timeoutSeconds: 10 # reasonable timeout
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
Pitfall 6: Rolling Update Strategy Causing Service Outage
💥 Real Case
# Risky update strategy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50% # half of the Pods are updated at once
      maxSurge: 0 # no Pods beyond the replica count
The update halved service capacity, leading to a terrible user experience.
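The capacity loss can be worked out ahead of time. Kubernetes rounds a percentage maxUnavailable down and a percentage maxSurge up, so the sketch below computes the minimum number of available Pods and the maximum total Pods during a rollout. The function name `rollout_bounds` and the replica count of 10 are assumptions for illustration.

```shell
# rollout_bounds REPLICAS MAX_UNAVAILABLE_PCT MAX_SURGE_PCT
# Print the lowest available Pod count and the highest total Pod count
# a rolling update may reach with the given percentages.
rollout_bounds() {
  replicas=$1
  unavailable=$(( replicas * $2 / 100 ))        # rounded down
  surge=$(( (replicas * $3 + 99) / 100 ))       # rounded up
  echo "min_available=$(( replicas - unavailable )) max_total=$(( replicas + surge ))"
}

rollout_bounds 10 50 0    # settings from the case above: min_available=5 max_total=10
rollout_bounds 10 25 25   # a more conservative choice: min_available=8 max_total=13
```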
🔧 Mitigation
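Adopt a conservative rolling‑update policy
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25% # at most a quarter of the Pods unavailable
      maxSurge: 25% # allow temporarily exceeding the replica count
  minReadySeconds: 30 # a new Pod must stay ready for 30 seconds before the rollout continues
Pitfall 7: Log Collection “Disk Bomb”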
💥 Real Case
An application generated massive DEBUG logs without log rotation, eventually filling the node’s disk and making the node unusable.
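Disk exhaustion of this kind is easy to detect on the node itself before it becomes an outage. As a minimal sketch, the filter below reads POSIX `df -P` output and prints every mount point above a usage threshold; the function name `over_threshold` is hypothetical, and it would have to run on each node (for example from cron or a DaemonSet), which is an assumption about your setup.

```shell
# over_threshold PERCENT: read `df -P` output on stdin and print
# "mount-point usage%" for every filesystem above PERCENT.
over_threshold() {
  awk -v t="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                  # "90%" -> "90"
    if (use + 0 > t + 0) print $6, use "%"
  }'
}

# Example on a node:
# df -P | over_threshold 85
```

Mount points containing spaces are not handled by this column-based parsing.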
🔧 Mitigation
Configure log collection and filtering
# Note: this config filters what Fluentd ships; rotation of log files on the node
# is controlled by the kubelet (containerLogMaxSize, containerLogMaxFiles)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    # Filter logs to reduce storage pressure
    <filter kubernetes.**>
      @type grep
      <exclude>
        key log
        pattern /DEBUG|TRACE/
      </exclude>
    </filter>
Monitor disk usage
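#!/bin/bash
# disk-monitor.sh
# kubectl top node reports only CPU and memory, so check each node's
# DiskPressure condition instead (set by the kubelet at its eviction thresholds)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' |
while read -r node pressure; do
  if [ "$pressure" = "True" ]; then
    echo "WARNING: node $node reports DiskPressure!"
  fi
done
Pitfall 8: RBAC Privilege Escalation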
💥 Real Case
For convenience, a Pod was granted cluster‑admin rights, which was later flagged by the security team as a serious risk.
# Dangerous configuration
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: my-app-binding
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin # far too much privilege!
🔧 Mitigation
Apply the principle of least privilege
# Create a minimal-privilege Role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","watch","list"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
Audit RBAC bindings
#!/bin/bash
# rbac-audit.sh
echo "Checking for dangerous ClusterRoleBindings..."
kubectl get clusterrolebinding -o yaml | grep -A 5 -B 5 "cluster-admin"
echo "Checking ServiceAccount permissions..."
kubectl get rolebinding,clusterrolebinding --all-namespaces -o wide
Pitfall 9: Node Maintenance Single‑Point Failure
💥 Real Case
Rebooting a node for a kernel upgrade unintentionally took down the database master Pod, causing a brief outage.
🔧 Mitigation
Graceful node maintenance workflow
#!/bin/bash
# node-maintenance.sh
NODE_NAME=$1
echo "1. Checking critical Pods on the node..."
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE_NAME" -o wide
echo "2. Marking the node unschedulable..."
kubectl cordon "$NODE_NAME"
echo "3. Waiting for confirmation..."
read -p "Evict the Pods now? (y/N) " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  echo "4. Evicting Pods..."
  kubectl drain "$NODE_NAME" --ignore-daemonsets --delete-emptydir-data --grace-period=300
  echo "5. Node is ready for maintenance"
else
  # Do not report the node as ready when the drain was declined
  echo "Drain cancelled; the node remains cordoned"
fi
Pod anti‑affinity to spread critical workloads
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - database
        topologyKey: kubernetes.io/hostname
Pitfall 10: Alert Fatigue – “The Boy Who Cried Wolf”
💥 Real Case
Overly sensitive alert rules generated hundreds of alerts daily, causing real incidents to be ignored.
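# Overly sensitive alert rule
- alert: HighCPUUsage
  expr: cpu_usage > 50% # illustrative pseudo-expression, not valid PromQL
  for: 1m
  labels:
    severity: critical
  # Threshold too low, duration too short, severity too high
🔧 Mitigation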
Reasonable alert severity levels
# Prometheus alerting rules
groups:
- name: kubernetes-apps
  rules:
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="false"} == 1
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has not been ready for a long time"
Alert deduplication script
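#!/bin/bash
# alert-dedup.sh
# Group similar warning events to reduce noise
# Columns without headers: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE; count by reason and object
kubectl get events --no-headers --field-selector type=Warning --sort-by='.lastTimestamp' |
  awk '{print $3, $4}' |
  sort | uniq -c | sort -nr
Operations Best‑Practice Summary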
After these hard‑earned lessons, I distilled several golden rules for Kubernetes operations:
🎯 Prevention First
Resource limits are mandatory – be conservative, not aggressive.
Configure probes sensibly – give applications enough startup and response time.
Principle of least privilege – use Role instead of ClusterRole whenever possible.
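The least-privilege rule in the list above can be spot-checked regularly. The filter below is a sketch under assumptions: the function name `cluster_admin_bindings` is hypothetical, and it expects two columns (binding name, role name) as produced by kubectl's custom-columns output.

```shell
# cluster_admin_bindings: read "binding-name role-name" rows on stdin and
# print bindings that grant cluster-admin, skipping names that start with "system:".
cluster_admin_bindings() {
  awk '$2 == "cluster-admin" && $1 !~ /^system:/ { print $1 }'
}

# Example against a live cluster:
# kubectl get clusterrolebinding --no-headers \
#   -o custom-columns='NAME:.metadata.name,ROLE:.roleRef.name' \
#   | cluster_admin_bindings
```

The built-in binding named cluster-admin will appear in the output; anything else on the list deserves a review.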
🔍 Monitoring First
Comprehensive monitoring – cover nodes, Pods, network, storage.
Reasonable alerts – reduce noise, highlight critical issues.
Regular health checks – automate cluster health inspections.
🛡️ Failure Drills
Chaos engineering – intentionally inject failures to test resilience.
Backup verification – regularly test restore procedures.
Incident response playbooks – define detailed handling steps.
📚 Documentation
Operation logs – record every change.
Knowledge base – turn pitfall experiences into docs.
Team training – share best practices regularly.
Kubernetes operations is a continuous learning journey; every pitfall is an opportunity to grow. I hope this article helps those navigating the K8s landscape. If you have similar experiences, feel free to share them in the comments so we can all improve together!
Remember: In production there are no small problems, only major incidents. Every detail can determine system stability.
MaGe Linux Operations
