Master Kubernetes Troubleshooting: From Pod Crashes to Network Failures
This comprehensive guide walks you through Kubernetes fault diagnosis by covering core components, classifying six major failure types, presenting a three‑step troubleshooting methodology, and detailing six real‑world case studies with commands, manifests, monitoring setups and preventive best practices.
Introduction
Kubernetes is the de‑facto standard for container orchestration, but its complex architecture introduces operational challenges. Industry estimates commonly attribute over 60% of production incidents to misconfiguration or resource‑management errors. This guide presents a systematic methodology for diagnosing issues across the pod lifecycle, networking, storage, and scheduling, illustrated with six real‑world cases.
Technical Background
Kubernetes Architecture Overview
Control‑plane components
kube-apiserver: entry point for all REST requests.
etcd: single source of truth for cluster state.
kube-scheduler: makes pod‑placement decisions.
kube-controller-manager: runs built‑in controllers such as ReplicaSet and Deployment.
cloud-controller-manager: integrates cloud‑provider APIs.
Node components
kubelet: agent that manages pod lifecycle on each node.
kube-proxy: maintains network rules and implements Service abstraction.
Container runtime: Docker, containerd, CRI‑O, etc.
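A quick first health pass over these components looks roughly like the following (assuming cluster‑admin access; systemd unit names vary by distribution and runtime):
# Control-plane components typically run as static pods in kube-system
kubectl get pods -n kube-system -o wide
# Node agents run as systemd services on each node
systemctl status kubelet
systemctl status containerd   # or docker / crio, depending on the runtime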
Common Failure Types
Pod status anomalies – Pending, CrashLoopBackOff, ImagePullBackOff, Error.
Resource scheduling problems – insufficient resources, affinity conflicts, taint‑toleration mismatches.
Network communication failures – Service unreachable, DNS resolution errors, cross‑node connectivity issues.
Storage mounting problems – PVC binding failures, mount timeouts, permission errors.
Node‑level faults – NotReady, disk pressure, memory pressure.
Configuration errors – YAML syntax, RBAC issues, improper resource limits.
Fault‑Diagnosis Methodology
Three‑step approach
Step 1 – Information Collection
# List pods with wide output
kubectl get pods -o wide
kubectl describe pod POD_NAME
kubectl get events --sort-by='.lastTimestamp'
Step 2 – Log Analysis
# View container logs
kubectl logs POD_NAME
kubectl logs POD_NAME -c CONTAINER_NAME
kubectl logs POD_NAME --previous # previous crashed container logs
Step 3 – Deep Diagnosis
# Exec into the container
kubectl exec -it POD_NAME -- /bin/sh
# Inspect node status
kubectl describe node NODE_NAME
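For routine triage, the three steps can be chained into a single quick look at one pod; a minimal sketch (script name and output trimming are illustrative):
#!/bin/bash
# triage.sh – quick first pass over one pod: events, current logs, last crash
POD="$1"; NS="${2:-default}"
kubectl describe pod "$POD" -n "$NS" | tail -20
kubectl logs "$POD" -n "$NS" --tail=50 --all-containers=true
kubectl logs "$POD" -n "$NS" --previous --tail=50 2>/dev/null || echo "no previous container"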
Core Content
Pod Lifecycle and Status
Pod statuses (the five phases plus common waiting reasons such as CrashLoopBackOff) and typical causes:
Status | Meaning | Common Causes
------ | ------- | -------------
Pending | Scheduling or resource wait | Insufficient resources, image pull, volume not ready
Running | Normal operation | -
Succeeded | Job/CronJob completed | -
Failed | Execution failure | Container exit code non‑zero
Unknown | Unable to obtain status | Node communication failure
CrashLoopBackOff | Repeated crashes | Application start failure, failed health checks
ImagePullBackOff | Image pull failure | Image not found, authentication error, network issue
Useful inspection commands:
# Detailed pod status
kubectl get pod POD_NAME -o yaml | grep -A 10 status
# Container restart count
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Container readiness
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[*].ready}'
Common Troubleshooting Commands
Basic information
# List all pods across namespaces
kubectl get pods -A
# Wide view with IP, node, start time
kubectl get pods -o wide -n NAMESPACE
# Full YAML of a pod
kubectl get pod POD_NAME -o yaml
# Most used debugging command
kubectl describe pod POD_NAME -n NAMESPACE
Log inspection
# Tail last 100 lines
kubectl logs POD_NAME --tail=100
# Follow logs (like tail -f)
kubectl logs -f POD_NAME
# Specific container in multi‑container pod
kubectl logs POD_NAME -c CONTAINER_NAME
# All containers
kubectl logs POD_NAME --all-containers=true
# Previous (crashed) container logs
kubectl logs POD_NAME --previous
# Add timestamps
kubectl logs POD_NAME --timestamps=true
# Logs from the last hour
kubectl logs POD_NAME --since=1h
Event view
# Cluster‑wide events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Namespace‑specific events
kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'
# Events related to a specific pod
kubectl get events --field-selector involvedObject.name=POD_NAME
# Warning‑level events only
kubectl get events --field-selector type=Warning
Resource usage
# Node resource usage (requires metrics‑server)
kubectl top nodes
# Pod resource usage
kubectl top pods -n NAMESPACE
# Specific pod resource usage (including containers)
kubectl top pod POD_NAME --containers
Log Analysis Techniques
Check container exit codes:
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
0 – normal exit
1 – application error
137 – SIGKILL (128 + 9), typically OOMKilled
143 – SIGTERM (128 + 15), graceful stop
255 – exit code out of range
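Codes above 128 encode the terminating signal as 128 + signal number (137 = 128 + 9 = SIGKILL); a small sketch to decode them:
EXIT_CODE=$(kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
if [ "$EXIT_CODE" -gt 128 ]; then
  # kill -l N prints the signal name for signal number N
  echo "killed by signal $((EXIT_CODE - 128)): $(kill -l $((EXIT_CODE - 128)))"
fi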
Detect OOMKilled:
# Search for OOMKilled in pod description
kubectl describe pod POD_NAME | grep -i "OOMKilled"
# Verify termination reason
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Aggregate logs across pods:
# View logs of multiple pods
kubectl logs -l app=nginx --tail=50
# Use stern (recommended)
stern POD_PREFIX -n NAMESPACE
Resource Limits and Scheduling Issues
Example resource request/limit manifest:
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
Commands to investigate scheduling failures:
# Show why a pod cannot be scheduled
kubectl describe pod POD_NAME | grep -A 5 "Events"
# Show node available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Show node labels and taints
kubectl get nodes --show-labels
kubectl describe node NODE_NAME | grep Taints
Typical failure messages:
# Insufficient memory
0/3 nodes are available: 3 Insufficient memory.
# Node selector mismatch
0/3 nodes are available: 3 node(s) didn't match node selector.
# Taint‑toleration mismatch
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
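When a taint is intentional and the pod genuinely belongs on those nodes, add a matching toleration to the pod spec; a sketch assuming a hypothetical dedicated=gpu:NoSchedule taint:
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"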
Storage and Persistence Problems
PVC status commands:
# List PVCs
kubectl get pvc -n NAMESPACE
# Detailed PVC info
kubectl describe pvc PVC_NAME
# List PVs
kubectl get pv
# Detailed PV info
kubectl describe pv PV_NAME
Example PVC manifest (ReadWriteOnce):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
To switch to a multi‑writer volume (if supported):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
  - ReadWriteMany # RWX
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client
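Note that accessModes cannot be edited on an existing PVC: the claim has to be recreated. A sketch, assuming the manifest above is saved as mysql-data-rwx.yaml (hypothetical filename) and the data is already backed up or replicated elsewhere:
kubectl delete pvc mysql-data
kubectl apply -f mysql-data-rwx.yaml
kubectl get pvc mysql-data -w   # wait for status Bound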
Network Troubleshooting Tools and Methods
Service connectivity tests:
# Service details
kubectl get svc -o wide
kubectl describe svc SERVICE_NAME
# Endpoints
kubectl get endpoints SERVICE_NAME
# Test from inside a pod
kubectl exec -it POD_NAME -- curl SERVICE_NAME:PORT
kubectl exec -it POD_NAME -- nslookup SERVICE_NAME
# Cross‑namespace access
kubectl exec -it POD_NAME -- curl SERVICE_NAME.NAMESPACE.svc.cluster.local
NetworkPolicy inspection:
# List and describe policies
kubectl get networkpolicies -n NAMESPACE
kubectl describe networkpolicy POLICY_NAME
DNS troubleshooting:
# CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from a pod
kubectl exec -it POD_NAME -- nslookup kubernetes.default
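If the application image lacks nslookup, a throwaway test pod works just as well:
kubectl run -it --rm dns-test --image=busybox:1.36 --restart=Never -- nslookup kubernetes.default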
Practical Cases
Case 1 – CrashLoopBackOff
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
webapp-deployment-7d8f9c 0/1 CrashLoopBackOff 5 3m
Investigation
Describe pod – notice BackOff warning.
Check logs – panic due to database connection refusal.
Inspect exit code – 2 (a non‑zero, application‑level error).
Root cause : Application cannot reach MySQL service.
Solution
Verify MySQL service and endpoints.
# Service
kubectl get svc mysql-service
# Endpoints
kubectl get endpoints mysql-service
Update the Deployment to use the correct service name and add health probes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.0
        env:
        - name: DB_HOST
          value: "mysql-service"
        - name: DB_RETRY_INTERVAL
          value: "5"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
Apply changes and confirm pods run.
# Apply
kubectl apply -f webapp-deployment.yaml
# Watch pods
kubectl get pods -w
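kubectl rollout status gives a cleaner pass/fail signal than watching raw pod listings, and a bad revision can be rolled back:
kubectl rollout status deployment/webapp-deployment
kubectl rollout undo deployment/webapp-deployment   # if the new revision misbehaves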
Case 2 – ImagePullBackOff
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-5d7f8b 0/1 ImagePullBackOff 0 2m
Investigation
Describe pod – error shows pull access denied for private registry.
Solution
Create a Docker registry secret.
kubectl create secret docker-registry harbor-secret \
--docker-server=harbor.company.com \
--docker-username=admin \
--docker-password=Harbor12345 \
--docker-email=EMAIL -n default
Reference the secret in the Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
      - name: harbor-secret
      containers:
      - name: nginx
        image: harbor.company.com/prod/nginx:v2.0
        ports:
        - containerPort: 80
Patch the default ServiceAccount to use the secret.
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}'
After applying, the pod reaches Running state.
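A quick check that the secret is actually attached before re-testing the pods:
kubectl get serviceaccount default -o jsonpath='{.imagePullSecrets[*].name}'
kubectl get pods -l app=nginx -w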
Case 3 – Service Unreachable (Network Fault)
Symptom
# From a client pod
kubectl exec -it client-pod -- curl backend-service:8080
curl: (6) Could not resolve host: backend-service
Investigation
Service has no endpoints.
kubectl get svc backend-service -o wide
kubectl describe svc backend-service
Pod labels do not match Service selector (app=backend vs app=backend-app).
Resolution
Fix Service selector to match pod labels.
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-app # corrected
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
Or adjust pod labels to app=backend.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-deploy
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: backend:v1.0
        ports:
        - containerPort: 8080
Verification shows endpoints populated and curl succeeds.
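For example:
kubectl get endpoints backend-service              # should now list pod IPs
kubectl exec -it client-pod -- curl -s backend-service:8080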
Case 4 – Node NotReady
Symptom
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-2 NotReady worker 30d v1.24.0
Investigation
Describe node – DiskPressure true, container runtime not ready, network plugin not ready.
Check kubelet and containerd services; containerd is failed.
Disk usage > 95% on root and containerd directory.
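Commands along these lines confirm the findings on the node (paths assume a containerd runtime):
df -h / /var/lib/containerd
du -sh /var/lib/containerd/* 2>/dev/null | sort -rh | head -5
journalctl -u containerd --no-pager | tail -20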
Resolution Steps
Clean disk space (prune images, remove old containers, delete old logs).
# Prune unused images
crictl rmi --prune
# Remove stopped containers
crictl rm $(crictl ps -a -q --state=Exited)
# Delete old pod logs
find /var/log/pods -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7dRestart containerd and kubelet.
systemctl restart containerd
systemctl restart kubelet
Verify node status returns to Ready.
kubectl get nodes
kubectl describe node node-2 | grep -A 5 Conditions
Case 5 – Scheduling Failure Due to Resource Shortage
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
java-app-deployment-7f8d 0/1 Pending 0 5m
Investigation
Events show Insufficient cpu/memory on all nodes.
Pod requests 4Gi memory, 2000m CPU.
Node allocated resources indicate no node has enough free memory.
Solutions
Reduce resource requests (recommended).
resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
Scale out the cluster (e.g., add nodes via cloud provider).
# Example for Alibaba Cloud
aliyun cs ScaleOutCluster --ClusterId=c1234567890 --count=2 --worker-instance-types=ecs.g6.2xlarge
Free up resources: delete unused pods, reduce replica counts.
# Find high‑memory pods
kubectl top pods -A --sort-by=memory | head -20
# Delete unnecessary pod
kubectl delete pod UNUSED_POD -n NAMESPACE
# Scale deployment
kubectl scale deployment DEPLOYMENT_NAME --replicas=1
Case 6 – PVC Mount Failure (Multi‑Attach Error)
Symptom
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-statefulset-0 0/1 ContainerCreating 0 3m
Investigation
Describe pod – warning: FailedAttachVolume, volume already attached to another node.
PVC is bound to a RWO volume.
PV shows node affinity to a specific zone.
Resolution Options
Force delete the old pod that still holds the volume.
kubectl delete pod mysql-statefulset-0 --grace-period=0 --force
Manually unmount the volume on the node (use with caution).
# SSH to node
ssh root@node-2
# Find mount point
mount | grep pvc-abc123
# Unmount
umount /var/lib/kubelet/pods/.../volumes/kubernetes.io~aws-ebs/pvc-abc123
Switch the PVC to ReadWriteMany if the storage class supports it.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client
Best Practices
Monitoring and Alerting
Prometheus + Grafana rule example for pod restarts and not‑ready pods:
# alerts.yaml
groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingTooOften
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarts too frequently"
  - alert: PodNotReady
    expr: kube_pod_status_phase{phase!~"Running|Succeeded"} > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} NotReady"
  - alert: NodeDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} under disk pressure"
Log Collection (EFK Stack)
Fluentd DaemonSet example:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
Toolbox
kubectl-debug – install and use for on‑the‑fly debugging.
# Install
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
mv kubectl-debug /usr/local/bin/
# Debug a pod
kubectl debug POD_NAME --agentless --port-forward=true
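Note that the kubectl-debug plugin is an older third‑party tool; recent Kubernetes releases ship a built‑in equivalent based on ephemeral containers:
# Built-in alternative: attach a debug container to a running pod
kubectl debug -it POD_NAME --image=busybox:1.36 --target=CONTAINER_NAME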
stern – tail logs from multiple pods.
# Install
wget https://github.com/stern/stern/releases/download/v1.22.0/stern_1.22.0_linux_amd64.tar.gz
tar -zxvf stern_1.22.0_linux_amd64.tar.gz
mv stern /usr/local/bin/
# Example usage
stern -n production backend-*
stern -l app=nginx
netshoot – a network‑debug pod.
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
Preventive Measures
ResourceQuota to cap CPU, memory, PVC count.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
PodDisruptionBudget to guarantee availability.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: backend
Robust liveness, readiness and startup probes.
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
  - name: app
    image: webapp:v1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30 # allows up to 5 min start time
Daily health‑check script.
#!/bin/bash
# k8s_health_check.sh
echo "=== Node Status ==="
kubectl get nodes -o wide
echo -e "\n=== Abnormal Pods ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== Pods with High Restarts ==="
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.restartCount > 5) | "\(.metadata.namespace)/\(.metadata.name) - Restarts: \(.status.containerStatuses[0].restartCount)"'
echo -e "\n=== Top Resource Consumers ==="
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -10
echo -e "\n=== Recent Warning Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error" | tail -20
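The script can run on a schedule; a sketch assuming it is saved to /usr/local/bin/k8s_health_check.sh (illustrative path):
chmod +x /usr/local/bin/k8s_health_check.sh
# Append a daily 09:00 run to the current crontab
(crontab -l 2>/dev/null; echo "0 9 * * * /usr/local/bin/k8s_health_check.sh >> /var/log/k8s_health.log 2>&1") | crontab -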
Conclusion and Outlook
Kubernetes troubleshooting is a systematic engineering discipline that requires a solid grasp of architecture, hands‑on command proficiency, and proactive monitoring. By mastering the three‑step method, leveraging the presented commands, and applying the best‑practice checklist, operators can quickly locate and resolve issues while building preventive safeguards.
Systematic workflow: information collection → log analysis → deep diagnosis.
Tool mastery: kubectl, describe, logs, events, top, and auxiliary tools like stern and netshoot.
Understanding of underlying mechanisms: pod scheduling, network model, storage binding.
Established monitoring: Prometheus alerts + EFK log pipeline.
Preventive mindset: resource quotas, health probes, PDBs, regular health‑check scripts.
Future trends shaping Kubernetes operations include AIOps for predictive fault detection, eBPF‑based deep observability (Cilium, Pixie), service‑mesh enhancements (Istio, Linkerd), GitOps workflows (Argo CD, Flux) for declarative configuration and automated rollbacks, and edge‑native extensions (KubeEdge) expanding Kubernetes to edge devices.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.