10 Critical Kubernetes Production Failures I Caused and How to Recover
The article walks through ten real‑world Kubernetes production incidents—from an etcd disk‑full disaster to image‑pull failures—detailing symptoms, root‑cause analysis, step‑by‑step remediation commands, and preventive measures such as monitoring, quota alerts, and configuration best practices.
Overview
Background
Kubernetes is the de facto container‑orchestration standard, yet production problems usually arise from misconfiguration, inadequate resource planning, missing monitoring, procedural gaps, or incomplete understanding of core mechanisms.
Article Structure
The article reviews ten real incidents ordered by severity, from cluster‑wide disasters to application‑level issues, and then provides a preventive checklist.
Cluster‑Level Disasters
Case 1: etcd Disk Full (P0 – Disaster)
Symptoms: Critical alerts fire. The API server times out, many nodes are marked Unknown, new pods cannot be scheduled, and kubectl hangs.
Root cause: The etcd data directory reached 100 % disk usage. A buggy operator created a massive number of Event objects, etcd had no auto-compaction configured, and the disk was only 20 GB.
Resolution steps:
Expand the disk (or clean up other files) to free space.
Check etcd status: systemctl status etcd and journalctl -u etcd -n 100.
Compact etcd up to a specific revision: ETCDCTL_API=3 etcdctl compact <revision>. The current revision is reported by etcdctl endpoint status --write-out=json.
Defragment etcd:
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
If etcd raised a NOSPACE alarm, clear it with etcdctl alarm disarm once space has been reclaimed.
Delete all stale events: kubectl delete events --all -A.
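The compaction and defragmentation steps above can be sketched as a small shell helper. This is a minimal sketch, assuming etcdctl v3 is on the PATH and that the kubeadm certificate paths used in this article apply; the function names are hypothetical. The revision lookup greps the JSON printed by etcdctl endpoint status, and alarm disarm is included because etcd keeps rejecting writes after a NOSPACE alarm until the alarm is cleared.

```shell
# Sketch only: helper names are hypothetical; paths assume a kubeadm layout.

# Run etcdctl with the endpoint and certificates used in this article.
etcdctl_secure() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# Read `etcdctl endpoint status --write-out=json` on stdin and print the
# first revision number found.
extract_revision() {
  grep -o '"revision":[0-9]*' | grep -o '[0-9][0-9]*' | head -n 1
}

# Compact up to the current revision, defragment, then clear any alarm.
etcd_reclaim_space() {
  local rev
  rev=$(etcdctl_secure endpoint status --write-out=json | extract_revision)
  if [ -z "$rev" ]; then
    echo "could not determine the current revision" >&2
    return 1
  fi
  etcdctl_secure compact "$rev" &&
    etcdctl_secure defrag &&
    etcdctl_secure alarm disarm
}
```

Calling etcd_reclaim_space on a control-plane node runs the three steps in order and stops at the first failure. Defragmentation blocks the member while it runs, so on a multi-member cluster it is usually done one member at a time.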
Prevention:
Enable etcd auto-compaction (e.g., --auto-compaction-mode=periodic and --auto-compaction-retention=1h).
Set a backend quota (e.g., --quota-backend-bytes=8589934592 for 8 GB).
Add a Prometheus alert on database size relative to the quota: trigger when etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8 for 5 min.
Lesson: etcd is the cluster's heart. Always monitor its disk usage, configure automatic compaction, and reserve at least 50 GB of disk.
Case 2: API Server OOM (P0 – Disaster)
Symptoms: Critical alerts fire. The API server process restarts repeatedly, many pods become Unknown, and kubectl responds slowly.
Root cause: A script ran kubectl get pods -A -o json every 10 seconds against a 5,000-pod cluster. The huge JSON payloads exhausted API server memory until the OOM killer terminated the process.
Resolution steps:
Enable API server audit logging (--audit-log-path and --audit-policy-file).
Analyze the audit logs to locate the heavy request.
Set resource requests and limits for the API server (e.g., requests.memory: "1Gi", limits.memory: "4Gi").
Configure request throttling: --max-requests-inflight=400 and --max-mutating-requests-inflight=200.
Prevention:
Configure API Priority and Fairness (APF) with a low-priority flow schema for bulk list/watch operations.
Add a Prometheus alert for high API server memory usage (e.g., process_resident_memory_bytes{job="kube-apiserver"} > 3e9; the job label depends on your scrape configuration).
Lesson: Never run unrestricted list queries on a large production cluster; use label/field selectors or pagination.
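As an illustration of that lesson, here is a minimal sketch of two narrower queries that could replace the unrestricted list. The function names and the example selector are hypothetical; the flags (-l, --field-selector, --chunk-size) are standard kubectl get options.

```shell
# Sketch only: function names are hypothetical.

# List pod names in one namespace that match a label selector.
# $1 = namespace, $2 = label selector (for example app=web-app).
list_pods_narrow() {
  kubectl get pods -n "$1" -l "$2" -o name
}

# List pods that are not in the Running phase, cluster-wide,
# fetched from the API server in pages of 200 objects.
list_unhealthy_pods() {
  kubectl get pods -A --field-selector=status.phase!=Running --chunk-size=200 -o name
}
```

For example, list_pods_narrow dev app=web-app returns only the pods of one application, so the response the API server has to build stays small regardless of cluster size.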
Case 3: Certificate Expiration (P0 – Disaster)
Symptoms: All kubectl commands fail with x509: certificate has expired or is not yet valid. Existing pods keep running, but no changes can be made.
Root cause: kubeadm-generated certificates are valid for one year by default and had not been renewed. The cluster was more than a year old and no expiration alerts were configured.
Resolution steps:
Back up the current certificates: cp -r /etc/kubernetes/pki /etc/kubernetes/pki.bak.
Renew all certificates: kubeadm certs renew all.
Restart the control-plane static pods by moving their manifests out of /etc/kubernetes/manifests and back.
Refresh the admin kubeconfig (for example, copy the renewed /etc/kubernetes/admin.conf to ~/.kube/config).
Verify with kubectl get nodes.
Prevention:
Enable certificate-expiration monitoring (e.g., a certificate-expiration alert such as KubeClientCertificateExpiration from kube-prometheus-stack).
Run a custom script that checks openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate and alerts when fewer than 30 days remain.
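The custom expiry check described above could look roughly like the following. It is a sketch under assumptions: the function names are hypothetical, and it uses openssl's -checkend option (which compares the expiry date against "now plus N seconds") instead of parsing the -enddate output. Only the top-level .crt files are scanned; the etcd certificates under pki/etcd would need a second call.

```shell
# Sketch only: function names and defaults are hypothetical.

# Return 0 if the certificate at $1 expires within $2 days, 1 otherwise.
# An unreadable or invalid file is also reported as expiring.
cert_expires_within() {
  # -checkend N exits 0 when the certificate is still valid N seconds from now.
  if openssl x509 -in "$1" -noout -checkend $(( $2 * 86400 )) >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Print a warning for each certificate in a directory that is close to expiry.
# $1 = directory (default /etc/kubernetes/pki), $2 = days (default 30).
check_kubeadm_certs() {
  local pki_dir warn_days crt
  pki_dir=${1:-/etc/kubernetes/pki}
  warn_days=${2:-30}
  for crt in "$pki_dir"/*.crt; do
    [ -f "$crt" ] || continue
    if cert_expires_within "$crt" "$warn_days"; then
      echo "WARNING: $crt expires within $warn_days days"
    fi
  done
}
```

Run from cron, check_kubeadm_certs prints nothing while every certificate has more than 30 days left, so any output can be forwarded to an alerting channel.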
Lesson: After a cluster is created, set up certificate‑expiration alerts and renew certificates at least 30 days before expiry.
Case 4: Node Batch NotReady (P1 – Severe)
Symptoms: Multiple nodes become NotReady within minutes, causing mass pod eviction and service disruption.
Root cause: An application generated thousands of log files per second, and the container runtime (containerd) hit the system file-descriptor limit. containerd crashed, the kubelet on each affected node could no longer reach the runtime, and the node was reported NotReady.
Resolution steps:
Temporarily raise the file-descriptor limit: ulimit -n 1048576 (this affects only the current shell and the processes it starts).
Permanently set limits in /etc/security/limits.conf for all users.
Configure systemd limits for containerd (LimitNOFILE=1048576, LimitNPROC=1048576), because systemd services do not read limits.conf.
Reload systemd and restart containerd.
Clean up the noisy application's logs (e.g., find /var/log/pods -name "*.log" -size +100M -delete).
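The systemd limits step above is typically done with a drop-in file rather than by editing the unit itself. A minimal sketch, assuming the service is named containerd and that the drop-in file name limits.conf is acceptable (both the function name and the file name are hypothetical choices):

```shell
# Sketch only: the function name and drop-in file name are hypothetical.

# Write a systemd drop-in that raises containerd's resource limits.
# $1 = drop-in directory (defaults to the usual location for containerd).
write_containerd_limits() {
  local dropin_dir
  dropin_dir=${1:-/etc/systemd/system/containerd.service.d}
  mkdir -p "$dropin_dir" || return 1
  cat > "$dropin_dir/limits.conf" <<'EOF'
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EOF
}
```

After writing the file, systemctl daemon-reload followed by systemctl restart containerd applies it, and systemctl show containerd -p LimitNOFILE should then report the new value.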
Prevention:
Monitor file-descriptor usage via node_exporter (alert when node_filefd_allocated / node_filefd_maximum > 0.8).
Configure container log rotation in the kubelet (--container-log-max-size=100Mi, --container-log-max-files=5).
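The same ratio that the node_exporter alert uses can be checked by hand on a node. This sketch reads /proc/sys/fs/file-nr, whose three fields are the allocated handles, the unused handles, and the system maximum; the function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical.

# Print system-wide file-descriptor usage as an integer percentage.
# $1 = path to a file-nr style file (default /proc/sys/fs/file-nr).
fd_usage_percent() {
  awk '{ printf "%d\n", ($1 * 100) / $3 }' "${1:-/proc/sys/fs/file-nr}"
}
```

A result of 80 or more corresponds to the alert threshold above.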
Lesson: System‑level resource limits (ulimit) are easy to overlook; configure them during node provisioning.
Case 5: Misconfigured Pod Disruption Budget (P1 – Severe)
Symptoms: kubectl drain hangs because every eviction is blocked.
Root cause: The PDB set minAvailable: 3 while the Deployment also had 3 replicas, leaving zero allowed disruptions.
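The arithmetic behind the blocked drain is simple: allowed disruptions equal the healthy replica count minus minAvailable. The sketch below reproduces it, assuming all replicas are healthy and that a percentage is rounded up to a whole pod, as the Kubernetes documentation describes. The function name is hypothetical.

```shell
# Sketch only: the function name is hypothetical.

# Print how many voluntary disruptions a PDB allows.
# $1 = healthy replicas, $2 = minAvailable (integer or percentage like "50%").
pdb_allowed_disruptions() {
  local replicas min pct allowed
  replicas=$1
  case $2 in
    *%)
      pct=${2%\%}
      # A percentage is rounded up to the next whole pod.
      min=$(( (replicas * pct + 99) / 100 ))
      ;;
    *)
      min=$2
      ;;
  esac
  allowed=$(( replicas - min ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}
```

With 3 replicas, pdb_allowed_disruptions 3 3 prints 0 (the incident), while pdb_allowed_disruptions 3 50% prints 1. On a live cluster the same number is exposed in the PDB's status.disruptionsAllowed field.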
Resolution steps:
Temporarily increase the replica count (kubectl scale deployment web-app --replicas=4).
Patch the PDB to a lower minAvailable (e.g., kubectl patch pdb web-app-pdb -p '{"spec":{"minAvailable":2}}').
As a last resort, delete the PDB.
Correct PDB configuration (example YAML):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
spec:
  # Prefer a percentage or maxUnavailable
  minAvailable: "50%"
  # or maxUnavailable: 1
  selector:
    matchLabels:
      app: web-app
Lesson: Do not set minAvailable equal to the replica count; use a percentage or maxUnavailable instead.
Case 6: ResourceQuota Exhaustion (P2 – Medium)
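Symptoms: New pods are never created. Because quota is enforced at admission time, the owning ReplicaSet reports FailedCreate events with an "exceeded quota" message (scheduler events such as Insufficient cpu or Insufficient memory would instead point to node capacity).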
Root cause: The namespace's ResourceQuota limits (cpu, memory) were already fully consumed.
Resolution steps:
Inspect current usage: kubectl describe resourcequota -n dev.
Identify high-consumption pods: kubectl top pods -n dev shows live usage, while a custom-columns listing of each pod's requests and limits shows what actually counts against the quota.
Either clean up unused resources or increase the quota: kubectl patch resourcequota dev-quota -n dev -p '{"spec":{"hard":{"limits.cpu":"16","limits.memory":"32Gi"}}}'.
Lesson: After setting a ResourceQuota, add monitoring and alert when usage exceeds 80 %.
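For the 80 % threshold in that lesson, the used and hard values reported by the quota have to be compared in the same unit. The sketch below does this for CPU only, converting quantities such as 500m or 1.5 to millicores; memory suffixes are deliberately not handled, and the function names are hypothetical.

```shell
# Sketch only: function names are hypothetical; memory quantities are not handled.

# Convert a Kubernetes CPU quantity ("500m", "2", "1.5") to millicores.
cpu_to_millicores() {
  awk -v q="$1" 'BEGIN {
    if (q ~ /m$/) { sub(/m$/, "", q); printf "%d\n", q }
    else          { printf "%d\n", q * 1000 + 0.5 }
  }'
}

# Print CPU quota usage as an integer percentage of the hard limit.
# $1 = used quantity, $2 = hard quantity.
quota_cpu_percent() {
  local used hard
  used=$(cpu_to_millicores "$1")
  hard=$(cpu_to_millicores "$2")
  if [ "$hard" -le 0 ]; then
    echo "hard limit must be positive" >&2
    return 1
  fi
  echo $(( used * 100 / hard ))
}
```

The inputs should be obtainable with a jsonpath query such as kubectl get resourcequota dev-quota -n dev -o jsonpath='{.status.used.limits\.cpu}' (and the matching .status.hard path); treat that exact expression as an assumption to verify against your kubectl version.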
Case 7: Rolling Update Stuck (P2 – Medium)
Symptoms: kubectl rollout status deployment/api-server shows that new replicas never become ready; old pods remain running.
Root cause: The new version contained a bug that made the readiness probe return HTTP 500, so the new pods never became ready and the old pods stayed in service.
Resolution steps:
Roll back to the previous version: kubectl rollout undo deployment/api-server.
Verify the rollback with kubectl rollout status, then fix the code and redeploy.
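In a deployment pipeline, the wait-then-roll-back sequence above can be automated. A minimal sketch, assuming a hypothetical wrapper name and a 300-second default timeout; kubectl rollout status exits non-zero when the timeout elapses before the rollout completes.

```shell
# Sketch only: the function name and default timeout are hypothetical.

# Wait for a Deployment rollout and undo it if it does not finish in time.
# $1 = deployment name, $2 = namespace, $3 = timeout (default 300s).
rollout_or_undo() {
  if kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-300s}"; then
    echo "rollout of $1 succeeded"
  else
    echo "rollout of $1 failed, rolling back" >&2
    kubectl rollout undo "deployment/$1" -n "$2"
    return 1
  fi
}
```

The function returns 1 after a rollback so that the calling pipeline still marks the release as failed.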
Prevention:
Configure a reasonable progressDeadlineSeconds (e.g., 600 s) and a rolling-update strategy with maxSurge: 1 and maxUnavailable: 0, so old pods are removed only after new ones are ready.
Lesson: Verify health‑check configurations in a test environment before releasing.
Case 8: ConfigMap Hot‑Update Pitfall (P2 – Medium)
Symptoms: After a ConfigMap is edited, the running application continues to use the old configuration.
Root cause: Pods do not automatically reload ConfigMap changes. Values injected as environment variables are fixed at container start and never change. Files mounted as a volume are eventually updated (except for subPath mounts), but the application must watch for changes and reload them.
Solution options:
Manually restart the deployment: kubectl rollout restart deployment/app.
Deploy the stakater/Reloader controller and annotate the deployment (reloader.stakater.com/auto="true") to trigger automatic restarts on ConfigMap changes.
Lesson: Updating a ConfigMap does not restart pods; use a sidecar or reloader controller, or restart manually.
Case 9: HPA Oscillation Storm (P2 – Medium)
Symptoms: The number of pod replicas jumps dramatically (e.g., 3 → 10 → 3 → 8) within minutes.
Root cause: The HPA thresholds were too sensitive, and the default scaling behavior caused rapid up-and-down cycles.
Solution (example HPA spec; metadata, scaleTargetRef, replica bounds, and metrics omitted):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
Lesson: Configure the behavior field to control scaling speed and avoid flapping.
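The flapping is easier to see with the replica formula from the HPA documentation: desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within the controller's tolerance (10 % by default). The sketch below evaluates it with integer arithmetic; the function name is hypothetical and the tolerance check is an approximation of the controller's behavior.

```shell
# Sketch only: the function name is hypothetical; the 10% tolerance band is
# approximated with integer arithmetic.

# Print the replica count the HPA would request before behavior policies apply.
# $1 = current replicas, $2 = current metric value, $3 = target metric value.
hpa_desired_replicas() {
  local ratio_pct
  ratio_pct=$(( $2 * 100 / $3 ))
  # Inside the tolerance band the replica count is left alone.
  if [ "$ratio_pct" -ge 90 ] && [ "$ratio_pct" -le 110 ]; then
    echo "$1"
    return 0
  fi
  # desired = ceil(current * currentMetric / targetMetric)
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}
```

With a low target, a short spike is enough to triple the replica count (hpa_desired_replicas 3 90 30 prints 9), and the metric dropping afterwards pulls it straight back down. The behavior policies in the spec limit how fast each of those jumps may be applied.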
Case 10: Image Pull Failure Avalanche (P2 – Medium)
Symptoms: Many pods enter ImagePullBackOff or ErrImagePull with errors such as "503 Service Unavailable".
Root cause: The internal image registry was unavailable, causing all pull attempts to fail.
Solution:
Set imagePullPolicy: IfNotPresent for images that are already cached on nodes.
Configure multiple registry mirrors or enable high‑availability for the registry (e.g., Harbor replication).
Lesson: An image registry is a single point of failure; provide HA or backup registries.
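A quick way to tell whether a registry or one of its mirrors is up is to probe the /v2/ base endpoint of the registry HTTP API, which answers 200 (or 401 when authentication is required) on a healthy registry. The sketch below assumes curl is available; the function names and any URLs passed in are hypothetical.

```shell
# Sketch only: function names are hypothetical.

# Return 0 if the registry at base URL $1 answers on its /v2/ endpoint.
registry_reachable() {
  local code
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1/v2/")
  case $code in
    200|401) return 0 ;;
    *)       return 1 ;;
  esac
}

# Print the first reachable registry from a list of base URLs.
first_reachable_registry() {
  local url
  for url in "$@"; do
    if registry_reachable "$url"; then
      echo "$url"
      return 0
    fi
  done
  return 1
}
```

A probe like this can back a simple availability alert, or let a deployment script pick a mirror before it starts a rollout.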
Preventive Checklist
Essential Monitoring & Alerts
etcd disk usage > 80 %.
API server memory > 80 %.
Certificate expiration < 30 days.
Node NotReady for > 5 min.
Pod restart count > 5 per hour.
ResourceQuota usage > 80 %.
Daily Operational Checks (script example)
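#!/bin/bash
echo "=== Cluster Health Check ==="
# Node status: print nodes whose STATUS is anything other than plain "Ready"
# (grep -v "Ready" would also hide NotReady nodes)
echo "--- Nodes ---"
kubectl get nodes --no-headers | awk '$2 != "Ready"'
# System pods that are neither Running nor Completed
echo "--- System Components ---"
kubectl get pods -n kube-system --no-headers | grep -Ev "Running|Completed"
# Certificate check (kubeadm only)
kubeadm certs check-expiration 2>/dev/null || echo "Non-kubeadm cluster"
# etcd health (componentstatuses is deprecated since v1.19 and may return nothing)
echo "--- etcd Status ---"
kubectl get componentstatuses 2>/dev/null
# Resource usage (requires metrics-server)
echo "--- Resource Usage ---"
kubectl top nodes
Key Takeaways (10 Lessons)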
etcd: enable auto‑compaction and monitor disk space.
API server: limit large queries and configure APF.
Certificates: set expiration alerts and renew regularly.
Nodes: enforce system resource limits to avoid crashes.
PDB: use percentage or maxUnavailable instead of matching replica count.
ResourceQuota: monitor usage and adjust before hitting limits.
Rolling updates: test health checks thoroughly before production rollout.
ConfigMap: restart pods or use a reloader to apply changes.
HPA: configure stable scaling behavior.
Image registry: provide high‑availability or fallback mirrors.
Conclusion
Each case above covers a concrete mistake, the investigation, the commands used for remediation, and specific preventive actions. By internalizing these lessons and adopting the checklist, operators can substantially reduce the risk of catastrophic outages in Kubernetes production environments.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
